WO2024063907A1 - Modelling causation in machine learning - Google Patents
- Publication number
- WO2024063907A1 (PCT/US2023/031000)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variables
- variable
- graph
- feature vector
- selected variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Definitions
- a neural network comprises a plurality of nodes which are interconnected by links, sometimes referred to as edges.
- the input edges of one or more nodes form the input of the network as a whole, and the output edges of one or more other nodes form the output of the network as a whole, whilst the output edges of various nodes within the network form the input edges to other nodes.
- Each node represents a function of its input edge(s) weighted by a respective weight, the result being output(s) on its output edge(s).
- the weights can be gradually tuned based on a set of training data so as to tend towards a state where the output of the network will output a desired value for a given input.
- the nodes are arranged into layers with at least an input and an output layer.
- a “deep” neural network comprises one or more intermediate or “hidden” layers in between the input layer and the output layer.
- the neural network can take input data and propagate the input data through the layers of the network to generate output data.
- Certain nodes within the network perform operations on the data, and the result of those operations is passed to other nodes, and so on.
- Each node is configured to generate an output by carrying out a function on the values input to that node.
- the inputs to one or more nodes form the input of the neural network
- the outputs of some nodes form the inputs to other nodes
- the outputs of one or more nodes form the output of the network.
- the input to that node is weighted by a respective weight.
- a weight may define the connectivity between a node in a given layer and the nodes in the next layer of the neural network.
- a weight can take the form of a scalar or a probabilistic distribution. When the weights are defined by a distribution, as in a Bayesian model, the neural network can be fully probabilistic and captures the concept of uncertainty.
- the values of the connections between nodes may also be modelled as distributions.
- the distributions may be represented in the form of a set of samples or a set of parameters parameterizing the distribution (e.g. the mean μ and standard deviation σ or variance σ²).
- the network learns by operating on data input at the input layer, and adjusting the weights applied by some or all of the nodes based on the input data.
- each node takes into account the back propagated error and produces a revised set of weights. In this way, the network can be trained to perform its desired operation.
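The forward propagation described above can be sketched in a few lines of Python. This is an illustrative sketch only (the function names and tanh activation are assumptions, not from the disclosure): each layer applies a function to its inputs weighted by respective weights.

```python
import numpy as np

def forward(x, layer_weights):
    """Propagate input data through the layers to generate output data."""
    a = x
    for W in layer_weights:
        a = np.tanh(a @ W)   # weighted inputs, then each node's function
    return a

rng = np.random.default_rng(0)
# input layer (3) -> one hidden layer (4) -> output layer (2): a "deep" network
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
y = forward(np.ones(3), weights)
```

Training would then adjust `weights` based on the back-propagated error, as described above.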
- Training may employ a supervised approach based on a set of labelled training data.
- Other approaches are also possible, such as a reinforcement approach wherein each data point is not initially labelled.
- the learning algorithm begins by guessing the corresponding output for each point, and is then told whether it was correct, gradually tuning the weights with each such piece of feedback.
- Another example is an unsupervised approach where input data points are not labelled at all and the learning algorithm is instead left to infer its own structure in the experience data.
- Other types of machine learning model are also known besides neural networks, for example clustering algorithms, random decision forests, and support vector machines.
- Some machine learning models can be designed to perform causal discovery using observational data or both observational and interventional data. That is, for a set of variables,
- the model when trained can estimate a likely causal graph describing the causal relationships between these variables.
- a simple causal graph could be x1 → x2 → x3, meaning that x1 causes x2 and x2 causes x3 (put another way, x3 is an effect of x2 and x2 is an effect of x1).
- x1 → x2 ← x3 means that x1 and x3 are both causes of x2 (x2 is an effect of x1 and x3).
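The two example graphs above can be sketched as parent maps in Python (an illustrative representation only; the variable names mirror the text, the `parents` helper is hypothetical):

```python
# The chain and collider graphs from the text, as node -> direct-causes maps.
chain = {"x1": [], "x2": ["x1"], "x3": ["x2"]}        # x1 -> x2 -> x3
collider = {"x1": [], "x3": [], "x2": ["x1", "x3"]}   # x1 -> x2 <- x3

def parents(graph, node):
    """Return the direct causes (parents) of a node."""
    return graph[node]
```

This parent-lookup is the same notion as Pa(i) used later in the disclosure.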
- Geffner et al provided an integrated machine learning model that both models the causal relationships between variables and performs treatment effect estimation, by averaging over multiple possible causal graphs sampled from a distribution. This advantageously allows the method to exploit a model that has been trained for causal discovery in order to also estimate treatment effects. The method thus enables “end-to-end” causal inference.
- SUMMARY It is recognized herein that there is still further scope to improve on the DECI model developed by Geffner et al, or the like. In particular, existing models do not model time series data well. The nature of cause-and-effect is such that causes in the past produce effects in the future.
- the present disclosure discloses a model in which, instead of a static graph that contains only edges between variables at a single snapshot in time, the model samples a temporal causal graph which contains causal edges between variables at different time steps.
- a selected variable from among variables of a feature vector
- the method further comprises: D) inputting an input value of each of the identified present and preceding parents into a respective encoder, resulting in a respective embedding of each of the present and preceding parents; E) combining the embeddings of the present and preceding parents, resulting in a combined embedding; and F) inputting the combined embedding into a decoder associated with the selected variable, resulting in a reconstructed value of the selected variable. Augmenting the model to include a temporal graph will advantageously lead to more accurate predictions as it will more accurately model the causal reality of the modelled scenario.
- the model may also be used to predict the best order in which to apply a series of two or more treatments, and/or the timing with which to apply one or more treatments; and/or to predict how long a treatment will take to take effect.
- Another potential issue to take into account when handling time series data is that if the modelled noise is static, then the model may not be optimally specified. For instance consider a scenario where a symptom of a modelled patient, or a sensor reading from a modelled device, remains relatively smooth during some periods, but becomes more erratic during other periods. It would be desirable to take this into account in the modelled noise.
- the presently disclosed model may also include a history dependent noise term, which is generated based on values of variables from past time steps.
- the method may comprise generating a history dependent noise term based on embeddings of the preceding parents; and combining the history dependent noise term with the reconstructed value of the selected variable, resulting in a simulated value of the reconstructed variable.
- Making the simulated noise dependent on history will lead to more optimal predictions, such as estimated treatment effects, as the model will again be more likely to be a more realistic representation of the ground truth.
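One way a history-dependent noise term could look is sketched below. This is an illustrative assumption, not the patent's implementation: the names (`history_dependent_noise`, `scale_weights`) are hypothetical, and the noise scale is here a softplus of a linear function of the combined past-parent embeddings.

```python
import numpy as np

def history_dependent_noise(past_embeddings, scale_weights, rng):
    """Noise whose scale is a learned function of preceding-parent embeddings."""
    h = sum(past_embeddings)                      # combine history embeddings
    scale = np.log1p(np.exp(scale_weights @ h))   # softplus keeps scale > 0
    return float(rng.normal(0.0, scale))

rng = np.random.default_rng(1)
emb = [np.array([0.5, -0.2]), np.array([0.1, 0.4])]   # embeddings of preceding parents
w = np.array([0.3, 0.7])                              # hypothetical learned weights
z = history_dependent_noise(emb, w, rng)
simulated = 1.25 + z   # reconstructed value combined with the noise term
```

During calm periods the learned scale would be small, and larger during erratic periods, matching the motivation above.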
- the disclosed model may also take into account the presence of possible confounders, which have the potential to bias existing models. Consider two variables x 1 and x 2 and a model which is attempting to discover whether there is a causal edge between them, or to predict a treatment effect based on the modelled causation.
- a hidden confounder is a third variable which is not observed and which is a cause of both x1 and x2. This could lead to a false conclusion that x 1 is the cause of x 2 (or vice versa) when in fact there is no causal link (or a weaker causal link) and instead both x1 and x2 are an effect of a common causal variable u12 (a confounder) which is not present in the input data (i.e. not one of the variables of the input feature vector).
- x 1 could measure the presence or absence of a certain condition (e.g. disease) in a subject
- x2 could be a lifestyle factor such as an aspect of the subject’s diet (e.g.
- embodiments of the present disclosure provide a machine learning model which models the possibility of hidden confounders as latent variables. Accordingly, in embodiments B) may further comprise sampling a second causal graph from a second graph distribution, the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector.
- C) further comprises, from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph; and D) further comprises inputting the input value of the parent variable and an input value of the selected variable into an inference network, resulting in a latent value modelling the respective confounder of the confounded pair, and inputting the latent value into a second encoder, resulting in an embedding of the confounder of the confounded pair; and in E) the combining includes combining the embeddings of the present and preceding parents with the embedding of the confounder of the confounded pair, thereby resulting in said combined embedding.
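A minimal sketch of steps D) and E) for a confounded pair, under stated assumptions: the inference network and second encoder are here single tanh layers with toy weights, and all names are hypothetical rather than taken from the disclosure.

```python
import numpy as np

def inference_net(x_parent, x_selected, W):
    """Map a confounded pair's values to a latent value modelling the confounder."""
    return np.tanh(W @ np.array([x_parent, x_selected]))

def confounder_embedding(u, W_enc):
    """Second encoder: embed the latent confounder value."""
    return np.tanh(W_enc @ u)

W = np.array([[0.2, -0.1]])        # inference-net weights (latent dim 1)
W_enc = np.array([[0.5], [0.3]])   # second-encoder weights (embedding dim 2)
u = inference_net(0.8, 1.1, W)             # latent confounder value
e_u = confounder_embedding(u, W_enc)       # its embedding
combined = np.array([0.1, 0.2]) + e_u      # combined with the parent embeddings
```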
- Figure 1 is a schematic block diagram of a system in accordance with embodiments disclosed herein
- Figure 2 is a schematic computation diagram illustrating an example of a machine learning model
- Figure 3 schematically illustrates an example of a causal graph
- Figure 4 is a schematic sketch of an example of a probabilistic distribution
- Figure 5 schematically illustrates another example of a causal graph
- Figure 6 is a schematic computation diagram illustrating a further machine learning model
- Figure 7 is a schematic flowchart of a method of training a model in accordance with the present disclosure
- Figure 8 is a schematic flowchart of a method of making treatment effect estimations using a trained model in accordance with embodiments disclosed herein
- Figure 9 schematically illustrates a causal graph (or part thereof) including a possible confounder between a pair of variables
- Figure 10 is a schematic computation diagram illustrating
- FIG. 1 illustrates an example system according to embodiments of the present disclosure.
- the system comprises a server system 102 of a first party, a network 112, and a client computer 114 of a second party.
- the server system 102 and client computer 114 are both operatively coupled to the network 112 so as to be able to communicate with one another via the network 112.
- the network 112 may take any suitable form and may comprise one or more constituent networks, e.g. a wide area network such as the Internet or a mobile cellular network, a local wired network such as an Ethernet network, or a local wireless network such as a Wi-Fi network, etc.
- the server system 102 comprises processing apparatus comprising one or more processing units, and memory comprising one or more memory units.
- The, or each, processing unit may take any suitable form, e.g. a general purpose processor such as a CPU (central processing unit); or an accelerator processor or application specific processor such as a dedicated AI accelerator processor or a repurposed GPU (graphics processing unit), DSP (digital signal processor), or cryptoprocessor, etc.
- The, or each, memory unit may also take any suitable form, e.g. an EEPROM, SRAM, DRAM or solid state drive (SSD); a magnetic memory such as a magnetic disk or tape; or an optical medium such as an optical disk drive, quartz glass storage or magneto-optical memory; etc.
- processing units and/or memory units may be implemented in the same physical server unit, or different server units in the same rack or different racks, or in different racks in the same data centre or different data centres at different geographical sites.
- these may be networked together using any suitable networking technology such as a server fabric, an Ethernet network, or the Internet, etc.
- Distributed computing techniques are, in themselves, known in the art.
- the memory of the server system 102 is arranged to store a machine learning (ML) model 104, a machine learning algorithm 106, training data 108, and an application programming interface (API) 110.
- the ML model 104, ML algorithm 106 and API 110 are arranged to run on the processing apparatus of the server system 102.
- the ML algorithm 106 is arranged so as, when run, to train the ML model 104 based on the training data 108. Once the model 104 is trained, the ML algorithm 106 may then estimate treatment effects based on the trained model. In some cases, after the ML model 104 has been trained based on an initial portion of training data 108 and been made available for use in treatment effect estimation, training may also continue in an ongoing manner based on further training data 108, e.g. which may be obtained after the initial training.
- the API 110 when run, allows the client computer 114 to submit a request for treatment effect estimation to the ML algorithm 106.
- the ML model 104 is a function of a plurality of variables.
- the request may specify a target variable to be examined, and may supply input values of one or more other variables (including intervened-on values and/or conditioned values).
- the ML algorithm 106 may control the ML model 104 to generate samples of the target variable given the intervened-on and/or conditioned values of the one or more other variables.
- the API 110 returns the result of the requested causal query (the estimated treatment effect) to the client computer 114 via the network 112.
- the API may also allow the client computer to submit some or all of the training data 108 for use in the training.
- FIG. 2 schematically illustrates an example implementation of a machine learning model 104 of a type as disclosed in “Deep End-to-End Causal Inference” (DECI) by Geffner et al.
- the ML model 104 comprises a respective encoder and a respective decoder for each of a plurality of variables xi, where i is an index running from 1 to D, where D > 1.
- the set of variables xi...xD may be referred to as the input feature vector.
- Each variable represents a different property of a subject being modelled.
- the subject may for example be a real-life entity, such as a human or other living being; or an object such as a mechanical, electrical or electronic device or system, e.g. industrial machinery, a vehicle, a communication network, or a computing device etc.; or a piece of software such as a game, operating system software, communications software, networking software, or control software for controlling a vehicle or an industrial processor or machine.
- a property could be an inherent property of the being or object, or an environment of the being or object which may affect the being or object or be affected by the being or object.
- the variables represent different properties of a person or other living being (e.g. animal).
- One or more of the variables may represent a symptom experienced by the living being, e.g.
- One or more of the variables may represent environmental factors to which the subject is exposed, or behavioural factors of the subject, such as whether the subject lives in an area of high pollution (and perhaps a measure of the pollution level), or whether the subject is a smoker (and perhaps how many per day), etc. And/or, one or more of the variables may represent inherent properties of the subject such as a genetic factor. In the example of a device, system or software, one or more of the variables may represent an output state of the device, system or software.
- One or more of the variables may represent an external factor to which the device, system or software is subjected, e.g. humidity, vibration, cosmic radiation, and/or a state of one or more input signals.
- the variables may comprise a control signal, and/or an input image or other sensor data captured from the environment.
- one or more of the variables may represent an internal state of the device, system or software, e.g. an error signal, resource usage, etc.
- One, more or all of the variables may be observed or observable. In some cases, one or more of the variables may be unobserved or unobservable.
- Each respective encoder is arranged to receive an input value of its respective variable xi, and to generate a respective embedding ei (i.e. a latent representation) based on the respective input value.
- an embedding, or latent representation or value, is in itself a known concept. It represents the information in the respective input variable in an abstracted, typically compressed, form which is learned by the respective encoder during training.
- the embedding may be a scalar value, or may be a vector of a dimension (embedding_dim) which is greater than 1.
- a ”value” as referred to herein could be a vector value or a scalar value.
- one of the variables is an image (e.g. a scan of the subject)
- the “value” of this vector variable is the array of pixel values for the image.
- the different variables x i may have a certain causal relationship between them, which may be expressed as a causal graph.
- a causal graph may be described as comprising a plurality of nodes and edges (note that these are not the same thing as the nodes and edges mentioned earlier in the context of a neural network).
- Each node represents a respective one of the variables xi in question.
- the edges are directional and represent causation. I.e.
- x2 causes x1
- x1 causes x3.
- x3 may represent having a respiratory virus
- x 1 may represent a lung condition
- x 2 may represent a genetic predisposition.
- other possible graphs are possible, including those with more variables and other causal relationships.
- this could be written as: G = ((exists1,2, dir1,2), (exists1,3, dir1,3), (exists2,3, dir2,3)), where each element is a binary value indicating the existence or direction of the corresponding edge.
- Other equivalent representations are possible.
- For any given situation, the actual causal graph may not be known.
- a distribution qφ of possible graphs may be expressed in a similar format to G, but with each element comprising a parameter φ (phi) representing a probability instead of a binary value.
- qφ = ((φ_exists1,2, φ_dir1,2), (φ_exists1,3, φ_dir1,3), (φ_exists2,3, φ_dir2,3)) (Or an equivalent representation.)
- the parameter φ_exists1,2 represents the probability that an edge between x1 and x2 exists
- φ_dir1,2 represents the probability that the direction of the possible edge between x1 and x2 is directed from x1 to x2 (rather than vice versa)
- the parameter φ_exists1,3 represents the probability that an edge between x1 and x3 exists; etc.
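Sampling a concrete graph G from such a distribution can be sketched as two Bernoulli draws per pair: one for existence, one for direction. This is an illustrative sketch (the `q_phi` values and helper names are made up, not from the disclosure):

```python
import numpy as np

# (phi_exists, phi_dir) per variable pair; the values are illustrative only.
q_phi = {(1, 2): (0.9, 0.8), (1, 3): (0.2, 0.5), (2, 3): (0.7, 0.1)}

def sample_graph(q_phi, rng):
    """Sample edges: first whether each edge exists, then its direction."""
    edges = []
    for (i, j), (p_exists, p_dir) in q_phi.items():
        if rng.random() < p_exists:                      # does the edge exist?
            edges.append((i, j) if rng.random() < p_dir  # directed i -> j
                         else (j, i))                    # directed j -> i
    return edges

rng = np.random.default_rng(0)
G = sample_graph(q_phi, rng)   # a list of directed edges
```

A full implementation would additionally enforce the DAG constraint mentioned below, e.g. by rejecting cyclic samples or penalizing cycles during training.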
- the ML model 104 further comprises a selector, a combiner 202 and a demultiplexer 204. It will be appreciated that these are schematic representations of functional blocks implemented in software.
- the selector is operable to sample a causal graph G from the distribution qφ. This means selecting a particular graph G (with binary elements) whereby the existence and direction of the edges are determined pseudorandomly according to the corresponding probabilities in the distribution qφ.
- the possible graphs are constrained to being directed acyclic graphs (DAGs), for the sake of practicality and simplicity of modelling.
- the selector also receives a value of the index i for a selected target variable xi.
- the selector selects the respective embeddings ePa(i) generated by the respective encoders of the parents Pa(i) of the node i (variable xi) in the currently sampled graph G, and inputs these into the combiner 202.
- the combiner 202 combines the selected embeddings e Pa(i) into a combined embedding e c .
- the combination is a sum. In general a sum could be a positive or negative sum (a subtraction would be a sum with negative weights).
- the combined (summed) representation thus has the same dimension as a single embedding, e.g. embedding_dim. In alternative implementations however it is not excluded that another form of combination could be used, such as a concatenation.
- the demultiplexer 204 also receives the index i of the currently selected variable xi, and supplies the combined embedding ec into the input of the decoder associated with the currently selected variable xi. This generates a value of a respective noiseless reconstructed version xi' of the respective variable xi based on the combined embedding ec.
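The encode-combine-decode path just described can be sketched as follows. This is a toy illustration (random weights, tanh encoders, linear decoders; all names hypothetical), not the model's actual architecture:

```python
import numpy as np

D, emb_dim = 3, 4
rng = np.random.default_rng(0)
enc_W = [rng.normal(size=(emb_dim,)) for _ in range(D)]  # one encoder per variable
dec_W = [rng.normal(size=(emb_dim,)) for _ in range(D)]  # one decoder per variable

def embed(i, x_i):
    """Encoder for variable i: input value -> embedding e_i."""
    return np.tanh(x_i * enc_W[i])

def reconstruct(i, x, parents):
    """Sum the parent embeddings into e_c, then decode a noiseless x_i'."""
    e_c = sum((embed(j, x[j]) for j in parents), np.zeros(emb_dim))
    return float(dec_W[i] @ e_c)

x = [0.5, 1.0, -0.3]
x3_prime = reconstruct(2, x, parents=[0, 1])  # Pa(i) taken from the sampled graph
```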
- TRAINING Figure 7 schematically represents a method of training the DECI type model 104 based on the training data 108.
- the training data 108 comprises a plurality of training data points.
- Each data point [x1 ... xD] comprises a set of input values, one respective value for each of the variables xi.
- the method processes each of the training data points, at least in an initial portion of the training data 108.
- this causes the selector to select the target variable xi with the currently set value of the index i as the variable to be processed.
- the selector samples a random graph G from the distribution qφ.
- the selector selects the parents Pa(i) of the target variable xi (node i in the graph) and supplies the respective embeddings ePa(i) from the encoders of the selected parents into the combiner 202.
- the combiner 202 combines (e.g. sums) the selected embeddings into the combined embedding ec.
- the demultiplexer 204 selects to supply the combined embedding ec into the decoder of the target variable xi.
- the respective decoder is thus caused to generate a noiseless reconstruction xi' of the selected target variable xi.
- G is a DAG (directed and acyclic) so that it is valid to generate the value of any node in this way.
- the difference x i -x i ’ may be referred to as the residual.
- This is represented schematically by steps S70 and S80 in Figure 7, looping back to S30.
- this “loop” may be a somewhat schematic representation. In practice some or all of the different variables xi for a given data point could be processed in parallel.
- the ML algorithm 106 applies a training function which updates the parameters (e.g. the weights of the encoders and decoders, and the graph distribution parameters φ).
- the training function attempts to update the parameters in such a way as to reduce a measure of overall difference between the set of input values of the set of input variables [x1 ... xD] and the reconstructed version of the set of variables [x1' ... xD'].
- in embodiments the training function is an evidence lower bound (ELBO) function. Training techniques of this kind are, in themselves, known in the art.
- the model 104 may be a probabilistic model in that it assigns a joint probability (likelihood) p(X1, ..., XD) to all the variables X1, ..., XD; where Xi represents the noiseless reconstruction xi' combined with a random noise term zi, e.g. additively: Xi = xi' + zi.
- the above-described process is repeated for each of the data points in the training data 108 (or at least an initial portion of the training data). This is represented schematically in Figure 7 by step S100 and the loop back to S10. Though again, this form of illustration may be somewhat schematized and in some implementations, batches of data points could be reconstructed in parallel, and the model parameters updated based on the batches.
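One pass of the training loop described above can be sketched schematically. This is not the patent's training procedure (which uses an ELBO objective and gradient-based parameter updates); it is a toy sketch with hypothetical helpers that only shows how the per-variable residuals accumulate into a loss:

```python
def training_step(data_point, sample_graph, reconstruct):
    """Sample a graph, reconstruct each variable from its parents, sum residuals.

    A real implementation would then update the parameters by backpropagating
    a loss derived from this (e.g. an ELBO); here we only accumulate it.
    """
    G = sample_graph()                       # sampled causal graph: {i: parents}
    loss = 0.0
    for i, x_i in enumerate(data_point):
        x_i_prime = reconstruct(i, data_point, G.get(i, []))
        loss += (x_i - x_i_prime) ** 2       # squared residual x_i - x_i'
    return loss

graph = {0: [], 1: [0], 2: [1]}                        # fixed toy graph x1 -> x2 -> x3
recon = lambda i, x, pa: sum(0.5 * x[j] for j in pa)   # toy "decoder"
loss = training_step([1.0, 0.6, 0.3], lambda: graph, recon)
```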
- the parameters (e.g. weights)
- the trained ML model 104 is made available to be used for treatment effect estimation (i.e., answering causal queries). In embodiments, this may comprise making the model 104 available via the API 110 to estimate treatment effects / answer causal queries requested by the client computer 114.
- a method of using the ML model 104 to estimate a treatment effect is shown schematically in Figure 8.
- the index i is set to that of a target variable xi whose treatment effect is to be estimated.
- the input values of one or more ”intervened-on” variables are also set to their known values.
- the intervened-on variables are variables whose values are set to some specified value, to represent that the property that they represent has been controlled (the treatment, i.e. an intervention on the modelled property).
- An ”intervened-on” variable could also be referred to as a treated variable or controlled variable.
- the intervened-on variable(s) may represent one or more interventions performed on the subject, and the target variable may represent a possible symptom or condition of the subject (e.g. the presence of a certain disease).
- the intervened-on variable(s) may represent one or more states that are set to defined values, and the target variable may represent a condition or state of the device or software that is being diagnosed.
- the selector samples a graph G pseudorandomly from the distribution qφ according to the probabilities φ.
- some of the learned probabilities could be overridden to predetermined values based on prior knowledge (e.g. in some scenarios a given causality could be ruled out – probability set to zero – or may be known to be unlikely, either based on a priori or empirical knowledge).
- the selector selects the parents Pa(i) of the target variable xi in the sampled graph G, and supplies the respective embeddings ePa(i) from the encoders of the selected parents into the combiner 202 to be combined (e.g. summed) into the combined embedding ec.
- the demultiplexer 204 selects to pass the combined embedding ec into the decoder of the target variable xi, thus causing it to generate a noiseless reconstruction xi'.
- the selector selects the parents of node i in the graph G, so it depends on both i and G.
- the index is set to a particular target node i that one is trying to estimate (as well as selecting a graph G from qφ), and the model 104 outputs a noiseless reconstruction of xi, denoted by xi'.
- the model 104 is used in a slightly different way for treatment effect estimation.
- the target variable xi is now treated as unknown, and the goal is to generate simulated values of the target variable, given the intervened-on variables.
- the simulated value could just be taken as the noiseless reconstructed value xi'.
- however, relying on the noiseless reconstruction xi' of the target variable xi alone may not be preferable, and instead the full interventional distribution may be taken into account.
- this can be characterized by the following sources of uncertainty: 1) the different realizations of the graph G that could happen to have been selected during sampling (even an unlikely one), which can be modelled by simulating different graphs from qφ; and 2) the randomness of the residual noise random variables zi.
- the noise z i could be combined with the noiseless reconstruction x i ’ in other ways in order to obtain Xi, not necessarily additively.
- an average is taken over multiple sampled graphs.
- an average is taken over multiple sampled graphs and residual noise variables zi.
- multiple values of variable Xi are simulated based on different respective sampled graphs G, each time sampled randomly from both q ⁇ and z i
- This is represented schematically in Figure 8 by the loop from step T50 back to T20, to represent that multiple simulated values of xi are determined (for a given target variable xi), each determined based on a different randomly sampled graph G (each time sampled from the distribution qφ) and optionally noises zi.
- the illustration as a loop may be somewhat schematized, and in embodiments some or all of the different simulated values for a given target variable x i , based on the different sampled graphs and sampled residual noises z, may in fact be computed in parallel.
- the average could be taken as a simple mean, median or mode of the different simulated values xi’ or Xi of the variable xi (as simulated with the different sampled graphs and optionally residual noises). In embodiments, such averaging is based on estimating an expectation of the probabilistic distribution of the simulated values.
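The averaging over sampled graphs and residual noises can be sketched as a simple Monte Carlo loop; the `sample_graph`/`simulate_xi` helpers and the independent-Bernoulli-edge model of qφ are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each round samples a graph G from q_phi (here, independent Bernoulli
# edges) and a residual noise z_i, simulates X_i, and the estimate is the
# mean over rounds.
def sample_graph(q, rng):
    return (rng.random(q.shape) < q).astype(int)   # edge j->i with prob q[j,i]

def simulate_xi(x, G, i, rng, noise_scale=0.1):
    parents = np.flatnonzero(G[:, i])
    x_prime = x[parents].sum()                     # stand-in reconstruction
    z = rng.normal(scale=noise_scale)              # residual noise z_i
    return x_prime + z                             # additive-noise variant

q = np.array([[0.0, 0.9], [0.0, 0.0]])             # likely edge x1 -> x2
x = np.array([2.0, 0.0])
samples = [simulate_xi(x, sample_graph(q, rng), 1, rng) for _ in range(2000)]
estimate = float(np.mean(samples))                 # ~ 0.9 * 2.0 = 1.8
```

Because the graph is freshly sampled each round, the estimate averages over both graph uncertainty (the 0.9-probability edge) and the residual noise.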
- Figure 4 schematically illustrates the idea of a probabilistic distribution p(xi).
- the horizontal axis represents the value of the target variable being simulated, x i
- the vertical axis represents the probability that the variable takes that value.
- the distribution may be described by one or more parameters, e.g. a mean μ and standard deviation σ or variance σ² in the case of a Gaussian.
- Other more complex distributions are also possible, which may be parameterized by more than two parameters, e.g. a spline function.
- the probabilistic distribution may be referred to as a function of the target variable xi, i.e. p(xi).
- the target variable xY may model an outcome of a treatment modelled by xT.
- treatment does not necessarily limit to a medical treatment or a treatment of a living being, though those are certainly possible use cases.
- the treatment may comprise applying a signal, repair, debugging action or upgrade to an electronic, electrical or mechanical device or system, or software, where the effect may be some state of the device, system or software which is to be improved by the treatment.
- the actual real-world treatment may be applied in dependence on the estimation (e.g. expectation) of the effect of the modelled treatment, for example on condition that the treatment estimation (e.g. expectation) is above or below a specified threshold or within a specified range.
- One suitable measure of expectation may be referred to as the average treatment effect (ATE).
- the ATE may be expressed as E[p(xY) | do(xT = val1)] − E[p(xY) | do(xT = val2)], or E[p(xY) | do(xT = val1)] − E[p(xY) | do(xT = mean(xT))], where val1 is some treatment, val2 is some other treatment, and “do” represents applying the treatment.
- the ATE is the difference between: a) the expectation of the distribution of the effect xY given the value val1 of a treatment xT and b) the expectation of the distribution of the effect x Y given the value val2 of a treatment x T , or the difference between a) the expectation of the distribution of the effect xY given the value val1 of a treatment xT and b) the expectation of the distribution of the effect xY without applying a value of the treatment xT.
- x1 is the treatment xT
- x2 is the target variable xY.
- x1 is set to its known value.
- one example method is simply to take the mean of the different sampled values xY’ or XY of the target variable xY, based on the different sampled graphs, optionally also including some random noise (e.g. additive noise) in each sample.
- Another option is to fit the sampled values of xY to a predetermined form of distribution, such as shown in Figure 4, e.g. a Gaussian (normal) distribution or spline function (again optionally also including random noise).
- the average may then be determined as the average of the fitted distribution. Note that during the determination of the multiple instances of xi’ it is possible, especially for some distributions, that the same graph G ends up being sampled multiple times, or even every time. However the graph is still being freshly sampled each time, and even if the sampled graphs turn out to be the same, they may still be described as different instances of the sampled graph, in the sense that each is individually sampled anew.
- x4 → x1 → x2 → x3, where again x1 is the treatment xT, and x2 is the target variable xY.
- x1 is set to its known value. Therefore x4 has no effect on the outcome of x2. So in the determination of the expectation, the graph is mutilated to remove the node x4 and the edge x4 → x1.
- fixing the value of the known (controlled) variable x1 means that any effect of the edge from the parent of the known variable x1 is cut off.
- the conditional expectation may be expressed E[xY | do(xT = val1), xC], which could also be expressed E[p(xY) | do(xT = val1), xC].
- Figure 5 illustrates by way of example why estimating the conditional treatment effect is not necessarily straightforward.
- x 3 is the target x Y whose treatment effect will be estimated and x4 is the treatment xT, where the treatment is the cause of the target effect.
- in addition, there is another, unobserved cause x1 of the target effect, and another observable effect x2 of the unobserved cause x1.
- the variable x 2 is to be the observed condition x C .
- the target effect x3 could be some condition or symptom of the subject (e.g. a respiratory problem)
- the treatment x4 could be a possible medical intervention (e.g. taking some drug or vitamin)
- the unobserved cause x 1 may be a genetic factor
- the other observable cause x2 may be some observable physical quality of the subject’s body (e.g. body mass index).
- the unobserved cause could be unobservable, or merely unobserved.
- the ML model 104 may be adapted to include at least one inference network h disposed between at least one observable condition xc (x2 in the example) and at least one unobservable potential cause (x 1 in the example) of the condition x c .
- the inference network h (or individual such networks) may be disposed between the unobserved cause and multiple potential effects (up to all the other variables). This will allow the model to learn which variable(s) may be an effect of the unobserved cause, if the relationship is not known a priori.
- the inference network(s) h may be trained at the training stage simultaneously along with the encoders g e and decoders g d and the parameters of the graph distribution q ⁇ , or alternatively after the rest of the model (see below).
- the inference network h may comprise a neural network, in which case training the inference network comprises tuning the weights of the inference network. Alternatively the use of other forms of machine learning is not excluded for the inference network.
- the inclusion of the inference model makes it possible to estimate a conditional expectation such as E[xY | do(xT), xC].
- the method proceeds as described above with respect to ATE, to obtain multiple different samples of x Y based on multiple respective sampled graphs.
- respective simulated samples x c of the conditional variable are also obtained in the same way based on the respective sampled graphs.
- a predetermined form of function is fitted to the 2D set of samples (x Y , x c ), such as a straight line, a curve, or a probabilistic distribution.
- xc is set to its observed value in the fitted function, and a corresponding value of xY is read out from the fitted function.
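A minimal sketch of this fit-and-read-out procedure, assuming a straight-line fit as the "predetermined form of function" and synthetic (x_Y, x_C) sample pairs (the data-generating mechanism below is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate paired samples (x_Y, x_C) under the intervention, fit a straight
# line, then read out x_Y at the observed value of the condition x_C.
u = rng.normal(size=4000)                           # unobserved common cause
x_C = u + rng.normal(scale=0.1, size=4000)          # observed condition
x_Y = 2.0 * u + rng.normal(scale=0.1, size=4000)    # simulated target samples

slope, intercept = np.polyfit(x_C, x_Y, 1)          # fit x_Y ~ a*x_C + b
x_C_observed = 0.5
cate_estimate = float(slope * x_C_observed + intercept)  # ~ 2.0 * 0.5 = 1.0
```

A curve or a full probabilistic distribution could be fitted in the same way; the straight line is just the simplest choice.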
- Variant I: estimate CATE using the same approach as used to estimate ATE, but performing a re-weighting of the terms inside the expectations such that the condition (that gives CATE its name) is satisfied. This type of approach is known as an importance sampling technique. The weights of the different samples used to compute the expectation are provided by an inference network, which is trained together with the rest of the ML model 104.
- Variant II: after the model 104 of Figure 2 has been trained and a specific CATE query is received (e.g. via the API 110), the inference network h is trained to estimate the effect variable from the conditioning variable.
- the simulation of the target variable takes into account a potential effect of all causal variables across the sampled graph. An example implementation of this is as follows. This may be used in conjunction with any of the ways of averaging discussed above, or others.
- the method of estimating the target variable xY (e.g. the treatment effect) may comprise an inner and an outer loop.
- the simulated values Xi from the previous round or cycle (iteration) of the outer loop become the input values xi of the current iteration of the outer loop, to generate an updated set of values for the simulated variables. This may be repeated one or more further times, and the simulated values will start to converge (i.e. the difference between the input layer and output layer of the model 104 will get smaller each time). If noise is included, it is frozen throughout a given inner loop, then re-sampled each outer loop. The total number of iterations of the outer loop may be predetermined, or the outer loop may be iterated until some convergence criterion is met.
- the outer loop is iterated at least D-1 times, which guarantees convergence without needing to evaluate a convergence criterion.
- This method advantageously allows causal effects to propagate throughout the graph. For example, if x1 causes x2 and x2 causes x3, and an intervention is performed on x1, then the outer loop will be run at least two times to propagate the effect through to x3.
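A minimal sketch of the outer-loop propagation, assuming toy linear mechanisms (the weight matrix `W` and the chain x1 → x2 → x3 are illustrative):

```python
import numpy as np

# Each pass feeds the previous round's simulated values back in as inputs;
# after D-1 passes an intervention on x1 has propagated along x1 -> x2 -> x3.
W = np.array([[0.0, 0.5, 0.0],        # x1 -> x2 with weight 0.5
              [0.0, 0.0, 0.5],        # x2 -> x3 with weight 0.5
              [0.0, 0.0, 0.0]])

def one_pass(x, do_value):
    x_new = x @ W                     # each x_i' = sum of parent contributions
    x_new[0] = do_value               # intervened-on variable stays clamped
    return x_new

x = np.zeros(3)
x[0] = 1.0                            # do(x1 = 1)
for _ in range(W.shape[0] - 1):       # D - 1 = 2 outer-loop iterations
    x = one_pass(x, do_value=1.0)
# After two passes: x2 = 0.5, x3 = 0.25 - the effect has reached x3
```

One pass alone would leave x3 at zero; the second pass is what carries the effect from x2 through to x3, which is why at least D−1 iterations guarantee convergence on a chain of D nodes.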
- the method of using an inner and outer loop is not essential.
- An alternative would be to perform a topological sort of the nodes and propagate effects through in a hierarchical fashion starting from the greatest grandparents or “source nodes” (those which are only causes and not effects).
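A sketch of the topological-sort alternative, using Python's `graphlib` and the same toy linear mechanisms (the node names, weights and `do(x1 = 1)` intervention are illustrative assumptions):

```python
from graphlib import TopologicalSorter

# Order the nodes so that every cause comes before its effects, then compute
# each variable once in that order - a single sweep, no outer loop needed.
edges = {"x2": {"x1"}, "x3": {"x2"}}   # node -> set of its parents (causes)
order = list(TopologicalSorter(edges).static_order())   # x1 before x2 before x3

weights = {("x1", "x2"): 0.5, ("x2", "x3"): 0.5}
values = {"x1": 1.0}                   # do(x1 = 1): source node is clamped
for node in order:
    if node in values:
        continue                       # intervened-on / source values fixed
    values[node] = sum(values[p] * weights[(p, node)] for p in edges[node])
# Single sweep suffices: values == {"x1": 1.0, "x2": 0.5, "x3": 0.25}
```

Starting from the source nodes (those which are only causes and not effects) guarantees that every parent value is already available when a node is computed.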
- OPTIONAL EXTENSION TO INCLUDE LATENT CONFOUNDERS
- The following describes an optional extension to the model to accommodate the possible presence of “confounders”.
- Figure 9 schematically illustrates the issue of the presence of possible “hidden confounders”, which may also be referred to just as confounders.
- a confounder is a variable that is both A) unobserved (i.e. hidden, so not part of the input feature vector); and B) a cause of at least two variables that are among the variables of the input feature vector [x 1 ....x D ].
- the two variables caused by a confounder may be referred to herein as the “confounded” variables or a “confounded pair” of the respective confounder.
- the confounded variables are labelled x1 and x2 and the confounder between them is labelled u12.
- the confounder could be unobservable or merely unobserved.
- the causal graph may be more complex than shown in Figure 9, and confounders may exist between more than one pair of variables of the input feature vector.
- An issue recognized herein is that existing machine learning models do not take into account the presence of possible confounders, which may lead to an erroneous bias in the training of such models or predictions made by such models.
- x1 is a variable measuring the presence or absence of a certain condition (e.g. a health condition)
- x2 is a variable measuring a property of the subject which may or may not affect the possible presence of the condition.
- x2 could measure a lifestyle factor such as an aspect of the subject’s diet (e.g. salt or fat intake, etc.), whether they are a smoker, a dosage of a certain medication they are taking, or an environmental factor such as living in an area of high pollution or near an electricity pylon.
- the confounder u12 could represent a factor in the socioeconomic circumstances of the patient (e.g. annual income, education or family circumstances). Ignoring the confounder (e.g. socioeconomic circumstance) may give the false impression that the lifestyle factor x 2 causes the condition x1, whereas in fact the ground truth is that both x1 and x2 are effects of the socioeconomic circumstance, in which case x2 may not actually be a cause of x1 or may only be a weaker cause than it would otherwise appear.
- the confounder e.g. socioeconomic circumstance
- the presently disclosed extension – which in embodiments may be applied as an extension to the previous work by Geffner et al summarised in the Background section – introduces the modelling of possible hidden confounders into a causal machine learning model by introducing a second causal graph sampled from a second graph distribution.
- the extended model works in a similar manner to the model 104 described in relation to Figure 2 and like elements function in the same manner except where stated otherwise.
- Figure 10 schematically illustrates the extension to the model on the encoder side.
- the decoder side works the same as described in relation to Figure 2.
- the extended model 104’ comprises a respective first encoder g e i for each variable xi (i.e. one for each feature of the input feature vector).
- the extended model 104’ additionally comprises an inference network H, and a respective second encoder g e ij for each of a plurality of pairs of input variables x i , x j in the input feature vector – preferably one for each possible pair (i.e. every combination of two variables from the input vector).
- the inference network H is arranged to encode the respective values of one or more of the input variables x1...xD (e.g. all of them) into a respective latent value uij representing a possible confounder of each pair.
- the inference network H is implemented as one common inference network that encodes all the variables x1...xD together into the latent confounder value uij for a given pair.
- the extended model 104’ further comprises a respective second encoder g e ij for each respective latent value u ij (corresponding to each respective pair of input variables x ij ). Each respective second encoder g e ij is arranged to encode the respective latent value uij of the respective confounder into a respective embedding eij.
- the extended model 104’ comprises a selector analogous to the selector described in relation to the base model of Figure 2, but with additional functionality.
- a causal graph G is sampled from the graph distribution qφ, and a certain variable xi is selected; the selector then selects the parent variables Pai that are parents of the selected variable xi according to the sampled graph G, and inputs these into the combiner 202 (e.g. adder or concatenator).
- the graph distribution used for this is the same as the graph distribution qφ described previously with respect to Figure 2. I.e. it comprises a matrix whose elements represent the probabilities of directed causal edges existing between the different possible pairs of variables in the input vector [x1 ... xD].
- In the context of the extended model 104’, this may also be described as the first graph distribution, or the directed graph distribution; and the graph sampled therefrom may be referred to as the first causal graph or the directed graph.
- the extended model 104’ also samples a second causal graph G 2 from a second graph distribution q2.
- the second graph distribution q2 is a graph distribution representing the probabilities that confounders exist between pairs of variables.
- the presence or absence of each edge in the graph G2 is determined pseudorandomly according to the probability specified in the corresponding element of the distribution q2.
- the element (1,2) in G2 – representing whether or not a confounder u 12 exists between variables x 1 and x 2 – has a 70% chance of being 1 (meaning confounder present) and a 30% chance of being 0 (meaning no confounder present).
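The per-element Bernoulli sampling of G2 can be sketched as follows; the `sample_G2` helper and the upper-triangular matrix encoding are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Each element of q2 gives the probability that a hidden confounder u_ij
# exists between the pair (x_i, x_j); presence or absence is drawn
# pseudorandomly per element on every graph sampling event.
D = 3
q2 = np.zeros((D, D))
q2[0, 1] = 0.7          # 70% chance a confounder u_12 links x1 and x2

def sample_G2(q2, rng):
    return (rng.random(q2.shape) < q2).astype(int)   # 1 = confounder present

count = sum(sample_G2(q2, rng)[0, 1] for _ in range(10000))
frequency = count / 10000               # close to 0.7 over many sampling events
```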
- the value of the confounder, if it exists in the sampled graph G2, is taken from the output of the inference network H.
- H may be configured to model a probabilistic distribution, which may also be described as a third distribution q3, from which the values of uij are sampled.
- These embeddings eij of the selected confounders uij are input into the combiner 202 (e.g. adder or concatenator) along with the embeddings ei of the parents Pa(i) of the selected variable xi.
- the combiner 202 then combines (i.e. sums or concatenates, depending on implementation) all of the embeddings e i of the selected parents and the embeddings e ij of the selected confounders together into a combined embedding eC.
- the combined embedding eC is then input into the decoder g d i associated with the selected variable x i , which decodes the combined embedding to produce a respective reconstructed value xi’ of the input variable xi.
- the pseudorandom noise zi may be added (or included in some other, non-additive way) to create the noisy reconstructed value X i . Either may be used as the simulated value of x i .
- This side of the model 104’ (the decoder side) works the same as in the base model 104, as described previously with respect to Figure 2.
- the inference network H, second encoders g e ij, and selector may be implemented as modules of software in a similar manner to the other components described in relation to the model of Figure 2.
- each of the inference network H and second encoders g e ij may be implemented as a respective neural network, or alternatively other types of constituent ML models such as random forests or clustering algorithms, or any other form of parametric or non-parametric function, are not excluded.
- the core functionality of the extended model 104’ is represented in the flow chart of Figure 11.
- the method selects one of the variables xi from among the features of the input feature vector [x1...xD].
- this variable x i is one of the variables to be reconstructed in order to determine a loss effect.
- the selected variable xi will be the target variable, and the intervened-on variable (i.e. the variable to be treated) will also be selected separately (though it may be null).
- the method samples a first causal graph G1 from the first graph distribution q1 and samples a second causal graph G2 from the second graph distribution q2.
- the method determines which of the other variables [x1 ... xi-1, x i+1 ... x D ] in the feature vector are parents Pa(i) of the selected variable x i in the sampled first graph G1. Note this may involve mutilating the graph as described previously. The method also determines which of these parents Pa(i) share a confounder uij with the selected variable xi in the sampled second graph G 2 .
- the method generates the reconstructed value xi’ of the selected variable xi. This is done by inputting the input values of each of the selected parents Pa(i) (as determined from the first graph G 1 ) into the respective first encoders g e i in order to generate their respective embeddings ei; and inputting at least the input value of the parent in each selected confounded pair xi, xj (as determined from the second graph G2) into the inference network H to generate the respective latent value u ij , and inputting this into the respective second encoder g eij for the respective confounder in order to generate the respective embedding eij for each respective pair of confounded variables.
- the values input into the inference network H to generate the latent value u ij for a confounded pair i,j may also comprise one or more additional input values in addition to the parent of the respective pair.
- these one or more additional input values preferably comprise the observed input value of the selected variable as well.
- the one or more additional input values may comprise input values of one or more other of the variables of the feature vector {x1...xD} (other than the selected variable xi or the respective parent xj of the pair i,j), as one or more of these variables may potentially also comprise information about the possible confounder.
- H can generate uij for the selected variable xi based on what is known about any or all of the other variables. The more that is known, the better H will predict uij.
- H(xj) = Σ over xi of { H(xi, xj) · p(xi) }. So H can generate uij from only a specified value of the parent xj. H can even generate uij given nothing as input, meaning H is just generating uij randomly from a distribution learned at training. When estimating average treatment effects, this is acceptable, since one will marginalize out all the variables x over the entire population anyway.
- At step U40, all the generated embeddings ei, eij for the selected parents and confounders are input into the combiner 202 to be combined (e.g. summed or concatenated) into the combined embedding eC, which is then decoded by the respective decoder g d i for the selected variable xi to produce the respective reconstructed value xi’.
- a noise term z i may be added to produce the simulated value Xi.
- the noiseless reconstructed value xi’ could be used as the simulated value.
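Steps U10-U40 can be sketched end to end with toy linear stand-ins for the first encoders, per-pair second encoders, inference network H and decoders (all weights and helper names here are illustrative assumptions, not the patent's trained networks):

```python
import numpy as np

rng = np.random.default_rng(5)

# Parent embeddings and confounder embeddings are summed into e_C, then
# decoded into the reconstruction x_i'.
D, E = 3, 4
W_enc = rng.normal(size=(D, E))          # first encoders g_e_i
W_enc2 = rng.normal(size=(D, D, E))      # second encoders g_e_ij (per pair)
W_dec = rng.normal(size=(D, E))          # decoders g_d_i
w_H = rng.normal(size=D)                 # inference network H (linear toy)

def reconstruct(x, G1, G2, i):
    parents = np.flatnonzero(G1[:, i])               # parents under G1
    e_C = np.zeros(E)
    for j in parents:
        e_C += x[j] * W_enc[j]                       # parent embedding e_j
        if G2[min(i, j), max(i, j)]:                 # pair shares a confounder
            u_ij = float(w_H @ x)                    # latent value from H
            e_C += u_ij * W_enc2[i, j]               # confounder embedding e_ij
    return float(e_C @ W_dec[i])

x = np.array([1.0, 2.0, 3.0])
G1 = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])     # x1 -> x2
G2 = np.zeros((D, D), dtype=int)
without_conf = reconstruct(x, G1, G2, 1)
G2[0, 1] = 1                                         # confounder u_12 present
with_conf = reconstruct(x, G1, G2, 1)
```

Switching the confounder edge on in G2 adds the extra embedding e_12 to the combined embedding, so the two reconstructions differ.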
- the method could identify the parents Pa(i) from the first graph and begin generating the parents’ embeddings ei before it has begun identifying all of the confounded pairs xi, xj. Or the method could begin generating embeddings for some parents while still selecting other parents from the graph G 1 , etc.
- these will be the values of all the variables of the feature vector if available, but the function H can take any number of observations as input, so if one or more values are unobserved then they are simply not passed through H.
- the method determines a measure of the discrepancy between the input feature vector and the simulated feature vector for the given training data point, and based on this tunes the parameters (e.g. weights) of all the constituent models (e.g. the neural networks) of the ML model 104’, including the first encoders g e i, second encoders g e ij, inference network H, and decoders g d i, as well as the first and second graph distributions q1 and q2.
- this may be done using an ELBO function, or any other suitable learning function.
- Algorithms for comparing an input feature vector with a simulated version in order to tune the parameters of a ML model are, in themselves, known in the art. The method may be repeated over multiple training data points in order to refine the training.
- the model 104’ may be used to estimate whether a likely confounder exists between a given target pair of variables xi, xj (i.e. a pair of variables of interest). This may be done by accessing the trained second graph distribution q2 from memory and reading out the corresponding existence probability for the pair (i, j). If the probability is high (e.g. above a threshold) it may be determined that an unobserved confounder is likely to exist between the two variables xi and xj, whereas conversely, if the probability is low (e.g. below a threshold) it may be determined that such a confounder is not likely to exist between the variables in question.
- a decision may be made as to whether or not to action a treatment on a first, intervened-on one of the two variables xj as a means of affecting a second, targeted one of the two variables xi.
- xj is apparently a cause of xi according to the first graph distribution q1, because the relevant elements – e.g. φ_exists(i,j), φ_dir(i,j) – of the first graph distribution q1 have been read out from memory and indicate a high probability (e.g. above a threshold probability) of a directed edge existing from xj to xi.
- if the corresponding element for the pair (i, j) in the second graph distribution q2 also indicates a high probability (e.g. above a threshold probability) of a hidden confounder existing between xj and xi, then it may be determined that in fact xj should not be treated as a means of affecting target variable xi.
- the trained model 104’ may also be used to estimate the value of such a confounder. This may be done by reading out the value of uij from the output of the inference network after a suitable amount of training.
- the inference network H may model a probabilistic distribution, which could also be described as q 3 , from which the value u ij is sampled.
- the decision as to whether to action the treatment may also be dependent on the value of the estimated confounder. E.g. if a confounder is determined likely to be present but having a weak effect, it may be decided still to action the treatment of xj as a means of affecting target variable xi.
- the trained model 104’ may be used to perform treatment effect estimation in an analogous manner to that already described in relation to Figures 2 and 8, by replacing steps T10-T40 in Figure 8 with steps U10-U40 from Figure 11.
- the target variable is selected as the selected variable xi, and input values of one or more other, intervened-on (i.e. controlled) variables xj is/are set to its/their intended value (i.e. the proposed value for treatment).
- the inference network H can take any number of observations, so if any variables are not intervened-on or observed, their values are simply not passed through H.
- the method of Figure 11 is then used to generate a simulated value xi’ or Xi of the target variable xi. In embodiments this may be repeated over multiple different sampled instances of the first and second causal graphs G1, G2.
- the estimated effect of the treatment may then be determined by averaging over these multiple rounds of graph sampling. E.g. this may comprise determining the average treatment effect (ATE) or any other averaging technique described previously with respect to Figures 2 and 8.
- a decision about whether or not to action the proposed treatment of the (to-be) intervened-on variable or variables x j may be made in dependence on the estimated treatment effect.
- Figure 12 illustrates the concept of a temporal causal graph by way of example.
- the modelled scenario may be divided into a series of two or more time steps, including a current time step t representing a present time, and one or more preceding time steps t-1, t-2, ... representing one or more points in time in the past.
- the nodes of the graph may be thought of as arranged into layers, each layer corresponding to a given time step.
- Each node in the graph represents a particular one of the variables of the input feature vector at a given time step.
- the graph includes nodes representing a plurality of variables x1,t ... xD,t of the feature vector at the present time (up to all the features of the feature vector).
- the graph includes nodes representing a plurality of variables x 1,t-1 ... x D,t-1 at the previous time step. These may include some or all of the same variables x1,t ... xD,t as included at the first time step, i.e. representing the same features of the feature vector, but instantiated at the preceding time step (the values of variables may change over time).
- each further preceding layer before that (t-2 ... t-T), if present, represents variables of the feature vector at earlier times going backward in time.
- Each layer may represent some or all of the same set of variables as included in other layers, but at different steps in time. In the case of more than two time steps, these may or may not be evenly spaced in time. Preferably the time steps are evenly spaced.
- the temporal graph is able to include one or more edges 1204 directed from the nodes representing one or more variables x 1,t-1 ...x D,t-1 ; ... x 1,t-T ... x D,t-T in layers representing one or more preceding steps in time t-1 ... t-T to nodes in the layer representing variables x1,t ... xD,t at the present time t.
- the edges could include an edge from a given variable in a preceding time step to a single other variable in the current time step, or from a given variable in a preceding time step to multiple variables at the present time step.
- the edges could include edges from a plurality of different variables in a given preceding time step, each to a respective set of one or more variables in the present time step, where the sets may or may not include one, some or all of the same variables.
- the edges could include edges from variables in a plurality of different preceding time steps, each to a respective set of one or more variables in the present time step, where again the sets may or may not include one, some or all of the same variables.
- the edges may be any combination of edges from any one or more variables of the feature vector at the present time step or any one or more preceding time steps to any one or more variables of the feature vector at the present time step; excepting only that edges between time steps must be directed forward in time, i.e. from a preceding time step to the present time step.
- the graph formed by the edges from the present time step to the present time step should not contain any directed cycles (i.e. it should form a directed acyclic graph).
- a particular instance of a temporal graph G may be sampled from a temporal graph distribution qφ, where qφ now represents a temporal graph distribution rather than a static graph distribution as specified earlier.
- qφ now represents not only the probabilities of causal edges existing between various possible combinations of variables x1,t ... xD,t at the present time t; but also from each of one or more of the variables x1,t-1...xD,t-1; ... x1,t-T ... xD,t-T at one or more preceding time steps t-1...t-T, each to a respective one, more or all of the variables x1,t ... xD,t at the present time step t.
- qφ includes an entry representing the probability of an edge existing between every possible combination of variables in the preceding and present time-step layers.
- the parameter φ_exists(i,j) represents the probability that a causal edge exists between xi and xj; and φ_dir(i,j) represents the probability that the direction of the possible edge between xi and xj is directed from xi to xj (or vice versa).
- Other equivalent mathematical representations of the same functionality may also be possible. Note this submatrix qφ,t is just for present time to present time, and so does not include the existence of self-edges {1,1}, {2,2} ... {D,D}.
- the other submatrices qφ,t-1 ... qφ,t-T modelling the distribution of edges from historical time steps do not contain entries for direction, as edges from the past, if they exist, must always be directed forward in time.
- each submatrix qφ,t-τ comprises entries {φτ_exists(i,j)}, counting through all modelled possible combinations of i and j.
- the parameter φτ_exists(i,j) represents the probability that an edge exists between xi at past time τ and xj at the present time t.
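One possible (assumed) encoding of the temporal graph distribution is a D×D existence-probability submatrix per lag; this sketch simplifies the present-time submatrix to a directed per-element probability rather than the exists/direction parameterization described above:

```python
import numpy as np

rng = np.random.default_rng(6)

# Lag-0 edges are within the present step (no self-edges); lagged edges
# always point forward in time, so they need no direction entries.
D, T = 3, 2
q_lag0 = np.full((D, D), 0.3)
np.fill_diagonal(q_lag0, 0.0)                         # no self-edges at lag 0
q_lagged = [np.full((D, D), 0.3) for _ in range(T)]   # q_t-1 ... q_t-T

def sample_temporal_graph(q_lag0, q_lagged, rng):
    G0 = (rng.random((D, D)) < q_lag0).astype(int)    # present -> present
    G_past = [(rng.random((D, D)) < q).astype(int) for q in q_lagged]
    return G0, G_past                                 # G_past[k]: t-(k+1) -> t

G0, G_past = sample_temporal_graph(q_lag0, q_lagged, rng)
```

In a full implementation the lag-0 submatrix would additionally need an acyclicity constraint on G0, which this sketch omits.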
- Figure 13 shows schematically an extension of the model of Figure 2 to accommodate the modelling of temporal causality using a temporal graph sampled from a temporal graph distribution.
- the extended model includes embeddings for different time steps.
- Each first encoder g e i,t; g e i,t-τ is arranged to receive an input value of its respective variable xi,t; xi,t-τ at its respective time step, and to encode it into a respective embedding ei,t; ei,t-τ for its respective combination of variable index i and time step.
- the selector receives a selection of a particular selected variable of index i from among the variables (i.e. features) of the input feature vector {x1...xD}.
- the selector also takes as an input a particular sampled graph G from the graph distribution qφ. Remembering that qφ is now a temporal graph distribution, G may now be a temporal graph containing one or more edges from one or more variables at a past time t-τ to one or more variables at the present time t (depending on what edges get sampled from the distribution qφ on any given graph sampling event).
- the edges present in G are determined to exist or not pseudorandomly according to the corresponding probabilities specified in the temporal graph distribution qφ.
- the selector selects the parents Pai that are causes of the selected variable xi in the sampled temporal graph G, including any parents from past time steps t-1 ... t-T.
- the respective embeddings e j,t ; e j,t-τ are generated by the respective encoders g e , and the selector inputs these into the combiner 202 (e.g. adder or concatenator) to be combined (e.g. summed or concatenated) into the combined embedding eC for the selected variable xi. This may also be written eC,i to indicate that it will vary with i.
- the combined embedding eC is input into the respective decoder g d i associated with the selected variable in order to thereby generate a respective reconstructed value xi', in the same way as on the decoder side of Figure 2, except that this is now specifically the reconstructed value of the selected variable xi for the present time step t.
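The selector/combiner/decoder path of Figure 13 can be sketched as follows, assuming an adder as the combiner 202, and with `encode` and `decode` as hypothetical stand-ins for the encoders g e and decoders g d (the real components are neural networks):

```python
def reconstruct(i, values_t, history, G_t, G_lag, encode, decode):
    """Noiseless reconstruction x_i' from the parents selected in G.

    values_t: current-step values {j: x_j,t}; history[tau-1][j] = x_j,t-tau.
    G_t[j][i] = 1 means an instantaneous edge x_j,t -> x_i,t;
    G_lag[tau-1][j][i] = 1 means a lagged edge x_j,t-tau -> x_i,t.
    encode(j, tau, x) stands in for the encoder g^e for (j, t-tau);
    decode(i, e_C) stands in for the decoder g^d_i. The combiner here
    is an adder, i.e. parent embeddings are summed into e_C.
    """
    e_C = 0.0
    for j, xj in values_t.items():                 # instantaneous parents
        if G_t[j][i]:
            e_C += encode(j, 0, xj)
    for tau, step in enumerate(history, start=1):  # lagged parents
        for j, xj in step.items():
            if G_lag[tau - 1][j][i]:
                e_C += encode(j, tau, xj)
    return decode(i, e_C)
```

A trivial identity encoder and a doubling decoder suffice to trace which parents reach the combiner.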
- Using a temporal graph can lead to more accurate predictions of treatment effects or the like since the model 104’’ has the capacity to more accurately reflect the causation of the modelled scenario, which may be temporal in nature.
- the model 104’’ may also be used to predict the best order in which to apply treatments, the best timing in which to apply one or more treatments, and/or to predict the likely timing of an effect of one or more treatments.
- the model 104’’ (once trained) may be used to determine treatments to apply to a data centre or cloud server system, by predicting the effects of one or more server units failing or being taken offline, or having a high load.
- the model 104’’ may be used to determine the order in which to apply controls in an autonomous vehicle or robot, or to apply certain actions in an industrial process or farming (e.g. when to apply pesticides and fertilizer, etc.).
- the reconstructed value may be combined (e.g. summed) with a noise term to generate the simulated value Xi.
- the noiseless reconstructed value xi’ could simply be used as the simulated value without including noise.
- if noise is included, it may be combined with the reconstructed value xi' as an additive noise term, or alternatively other non-additive means of combining a noise term with the reconstructed value xi' may be used.
- a static noise term z i as discussed previously could be used.
- the static noise term is replaced with a history-dependent noise term z i,t .
- the history-dependent noise term z i,t is a noise term that takes into account the values of the preceding parent variables Pai G (<t) of the selected variable xi in the sampled graph G. This means the model will take into account the past smoothness or noisiness of time series data in which at least one variable has different values over time.
- the model 104’’ will be a better specified representation of the real-world scenario being modelled, e.g. taking into account the level of fluctuation in a symptom of a patient, or a sensor reading measuring the state of a device or its environment, or traffic conditions experienced over a network, or a condition of a crop of plants, or so forth.
- Equation 1 is not very practical in terms of modelling as it is too general.
- Equation 1a (also called equation 5 later, in the subsection "Example Implementation: Auto-Regressive DECI") breaks the representation down into a part that is based only on the parents from the present and preceding time steps, but not the noise; and a part that is based only on the parents from preceding time steps and the noise, but not the parents from the present time step.
- Equation 1a provides a good workable trade-off between specificity and generality. Equation 1a may be implemented as shown in Figures 13 and 14.
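The equations themselves are not reproduced legibly above; a reconstruction consistent with the surrounding description (Equation 1 as the fully general SEM, Equation 1a as the decomposition into a noise-free functional part and a history-dependent noise part) would be:

```latex
X^i_t = f_{i,t}\bigl(\mathrm{Pa}_i^G(<t),\ \mathrm{Pa}_i^G(t),\ \epsilon_{t,i}\bigr) \tag{1}

X^i_t = f_i\bigl(\mathrm{Pa}_i^G(t),\ \mathrm{Pa}_i^G(<t)\bigr)
      + g_i\bigl(\mathrm{Pa}_i^G(<t),\ \epsilon_{t,i}\bigr) \tag{1a}
```

In Equation 1a the first term depends only on the instantaneous and lagged parents (no noise), and the second term depends only on the lagged parents and the exogenous noise, matching the decomposition described in the text.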
- the embeddings e i,t-τ of the identified parents Pai G (<t) from any preceding time steps (but not the parents Pai G (t) from the current time step) are selected by a selector 1401 and input into a noise model 1302.
- these are the same embeddings that are used to generate the reconstructed values xi' in Figure 13, generated by the same encoders g e i.
- alternatively, the embeddings used here in the noise model 1302 of Figure 14 could be separate embeddings generated by separate encoder networks from those used in Figure 13.
- This selector 1401 may be implemented by some or all of the same code as the selector of Figure 13 (it will be appreciated that the figures are somewhat schematized and do not necessarily imply that the illustrated functions have to be implemented by separate modules of code).
- These embeddings e i,t-τ are input into the history-dependent noise model 1302, which is configured to generate the history-dependent noise term z i,t in dependence on the input embeddings.
- the noise model comprises a combiner 1402 (e.g. adder or concatenator), a demultiplexer 1404, and a respective decoder g d i associated with each of the possible variables x1 ... xD that may be selected as the selected variable xi.
- the combiner 1402 of the noise model 1302 may share some or all of the same code with the combiner 202 of the wider model 104/104’’ as described previously with respect to Figures 2 and 13.
- the demultiplexer 1404 of the noise model 1302 may share some or all of the same code with the demultiplexer 204 of the wider model 104/104’’ as described previously with respect to Figures 2 and 13.
- the decoders g d i used here in the noise model 1302 of Figure 14 may be different than those used to decode the combined embedding e C in the wider model 104/104’/104’’ as described previously with respect to Figures 2 and 13.
- the xi' decoder would only output a single value, xi'; but in embodiments the output θi that is input to the noise generator 1406 actually has a slightly different form than the output of the decoder that produces xi'.
- θi might be the mean and variance (or standard deviation) of a distribution such as a Gaussian distribution (two values), and the noise generator 1406 would generate a sample from this distribution.
- the noise model 1302 further comprises a noise generator 1406.
- the combiner 1402 combines (e.g. sums or concatenates) the embeddings e i,t-τ of the selected past parents Pai G (<t) into a further combined embedding Ci.
- the demultiplexer 1404 routes this further embedding Ci into the respective decoder g d i associated with the selected variable x i .
- the noise generator 1406 then generates the history-dependent noise term z i,t based on the probabilistic distribution as parameterized by the one or more parameter values.
- the probabilistic distribution could be a Gaussian or a spline function.
- Figure 4 shows an example of a probabilistic distribution such as a Gaussian that may be parameterized by two parameters, a centre point (e.g. mean μ) and a width (e.g. standard deviation σ).
- the noise generator 1406 could generate the distribution parameterized by the parameter(s) θi (the mean μ and standard deviation σ), and then sample the value of the history-dependent noise term z i,t directly from the generated probabilistic distribution.
- alternatively, the noise generator 1406 implements the sampling by taking a sample Ei from a fixed distribution, e.g. a unit distribution such as a unit Gaussian (with mean of 0 and standard deviation of 1), and then applying a transform function F which transforms the combination of the sample Ei from the fixed distribution and the generated parameter(s) θi into a sample z i,t from the parameterized distribution (without having to actually generate the entire parameterized distribution).
- z i,t = F(Ei, θi).
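For a Gaussian noise model this transform is the familiar reparameterization trick: shift and scale a unit-Gaussian sample. A minimal sketch, in which the function name and the tuple form of the decoder output are assumptions for illustration:

```python
import random

def history_dependent_noise(theta_i, rng=None):
    """Generate a noise sample via the transform F(E_i, theta_i).

    theta_i = (mu, sigma): the mean and standard deviation output by
    the noise-model decoder for the selected variable. E_i is drawn
    from the fixed unit Gaussian; F shifts and scales it, yielding a
    sample from N(mu, sigma^2) without materialising the distribution.
    """
    mu, sigma = theta_i
    rng = rng or random.Random()
    E_i = rng.gauss(0.0, 1.0)   # sample E_i from the fixed unit distribution
    return mu + sigma * E_i     # F(E_i, theta_i)
```

Setting the width parameter to zero collapses the distribution onto its mean, so the transform can be sanity-checked deterministically.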
- the various encoders g e i,t ; g e i,t-τ , decoders g d i and noise model 1302 may be implemented as modules of software as part of the model 104'', stored in memory and arranged to run on one or more processors in a similar manner as described previously in relation to Figure 2.
- Figure 15 shows the method of training the extended, temporal model 104’’.
- the method receives a new temporal training data point comprising values of some or all of the variables (features) of the feature vector {x1 ... xD} at two or more time steps. In embodiments it is all of the variables, but alternatively the model could be extended to handle missing data.
- the method selects a selected variable x i with index i from amongst the observed variables of the input feature vector.
- the method selects the instantaneous parents Pai G (t), i.e. from the present time step t, of the selected variable xi.
- the method selects the lagged parents Pai G (<t), i.e. from the one or more preceding time steps t-1 ... t-T, of the selected variable xi.
- the method generates the noiseless reconstructed value xi' of the selected variable xi, by combining the embeddings e j,t ; e j,t-1 ; ... e j,t-T of the selected instantaneous and lagged parents.
- the method also generates the history-dependent noise z i,t .
- the method determines the difference between the simulated value X i of the selected variable and the actual input value x i .
- the simulated value may include the noise term, or in less preferred embodiments the noiseless reconstruction xi’ could be used as the simulated value in step V60.
- the method loops back to step V30 via step V80 where it selects a new one of the observed variables of the feature vector, and repeats the method until reconstructed xi’ values have been generated for the whole feature vector.
- the method then updates the parameters (e.g. weights) of all the constituent models (e.g. NNs), which include at least the encoders g e , decoders g d and the temporal graph distribution qφ.
- the update is done based on evaluating a measure of overall loss between the simulated value Xi or reconstructed value xi' of the feature vector and the input value xi, e.g. using an ELBO function. In embodiments only the noiseless reconstruction is used during training, whereas the simulation with noise is used in treatment effect estimation.
- the method loops back to V10 and repeats with a new training data point, tuning the parameters of the model 104’’ each time.
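The training loop of steps V10-V100 can be summarised in a schematic training step. Everything named here (`model` as a dict of callables, squared error as the loss) is a simplifying assumption for illustration; the text itself describes an ELBO objective over neural encoders and decoders:

```python
def training_step(batch, model, optimiser_step):
    """One pass of steps V10-V100: reconstruct every variable of each
    temporal data point and update parameters on the overall loss.

    `model` bundles hypothetical callables standing in for the figures'
    components: sample_graph() draws G from the graph distribution,
    parents(G, i) gathers instantaneous and lagged parents, and
    reconstruct(i, parents, point) is the noiseless decoder path.
    Only the noiseless reconstruction is used during training (per the
    text); the noisy simulation is used later for treatment effects.
    """
    total_loss = 0.0
    for point in batch:                       # V10: new training data point
        G = model["sample_graph"]()           # V20: sample a temporal graph
        for i in range(len(point["now"])):    # V30/V80: loop over variables
            pa = model["parents"](G, i)       # V40/V45: gather parents
            x_rec = model["reconstruct"](i, pa, point)   # V50: reconstruct
            total_loss += (point["now"][i] - x_rec) ** 2 # V60: difference
    optimiser_step(total_loss)                # V90: update all parameters
    return total_loss
```

A real implementation would backpropagate through the reconstruction; `optimiser_step` merely stands in for the parameter update of step V90.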
- at step V110 the model 104'' is made available to be used for making predictions, e.g. treatment effect estimation.
- the method of using the extended temporal model 104’’ to perform treatment effect estimation is analogous to that described earlier with reference to Figures 2 and 8.
- Figure 16 shows a particular example method of using the model 104'' to estimate the treatment effect that will be experienced at a future point in time.
- the method selects a target variable xA as the selected variable and a future time t’ as a target time.
- A here is the value of the index i for the target variable.
- the method also selects at least one variable xB to be intervened on (i.e. treated).
- the intervened-on variable means the variable for which an intervention (treatment) is proposed, i.e. for which an effect of intervening on that variable is to be estimated.
- B is the value of the index i of the intervened-on variable xB.
- the method also selects a value I of the proposed intervention on xB, and a time t* as the time of the proposed intervention.
- the target time t’ is the time for which the effect on the target variable is to be estimated, i.e. the time of interest.
- the time t* on the other hand is the time at which it is proposed to perform the intervention I on the intervened-on variable xB.
- any treated (i.e. intervened-on) variables xB are set to their fixed, treated values I.
- the method extracts any past observations of the feature vector {x1 ... xD}, i.e. for times t-T ... t-1.
- the reason for reconstructing multiple variables, even though only one target variable xA may ultimately be of interest, is that when iterated over multiple time steps they may have a knock-on effect on the target variable xA even if they are not immediate parents of xA.
- at step W70 the method loops back, via step W75 where t is incremented by one step, to the point of branching to steps W30/W35/W40.
- the method continues repeating through to step W60 until future time t’ is reached. After this the method proceeds to step W80 where, if not finished, it loops back to step W20 and repeats one or more times, resampling the graph at step W20 each time.
- at step W90 an average treatment effect for the target variable xA is determined, averaged over all the sampled graphs.
- the procedure is then repeated one or more times (the loop of W80).
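The averaging over sampled graphs amounts to a Monte Carlo estimate of the treatment effect. A hedged sketch, where `simulate` is a hypothetical callable wrapping one whole roll-forward of steps W20-W70 (sample a temporal graph, simulate to the target time, return the simulated value of the target variable), with the `treated` flag controlling whether the intervention is applied:

```python
def average_treatment_effect(simulate, num_graphs):
    """Estimate the average treatment effect on the target variable.

    For each of num_graphs sampled temporal graphs, roll the model
    forward to the target time once with the intervention applied and
    once without, and average the differences (steps W80/W90).
    """
    diffs = [simulate(treated=True) - simulate(treated=False)
             for _ in range(num_graphs)]
    return sum(diffs) / num_graphs
```

In practice the with- and without-treatment rollouts would reuse the same sampled graph and noise where possible to reduce variance; that refinement is omitted from this sketch.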
- the flow charts may be somewhat schematized and in practice one or more steps and/or loops shown sequentially may in fact be performed in parallel.
- Figure 17 shows explicitly the sub-loop mentioned previously with respect to steps W20-W60 of Figure 16.
- Figure 17 may be considered as a more detailed expansion of Figure 16.
- the method also selects time t* as the time of the intervention and t' as the target future time.
- x A could be the temperature of a target environment or device
- xB could be the thermostat setting
- I the value to which the thermostat is proposed to be set (the proposed treatment)
- time t* is the time of applying the thermostat setting
- time t’ is the time for which it is desired to predict the resulting change in the temperature xA.
- the method samples a temporal graph G from the graph distribution qφ.
- the method determines whether the current time t equals the treatment time t* and the current value of the variable index i equals B, the index of the intervened-on (i.e. treated) variable xB.
- step W30a sets the treated variable xB to the value of the intervention I.
- B could represent a set of one or more index values
- I could represent a set of one or more corresponding intervention values, such that one or more variables i will be set to their corresponding intervened-on values I as the method counts through i via the loop W30b.
- step W27 could check whether there exists an observed value of the current variable i for the current time t, and if so branch to W30a where that variable is set to its observed value.
- step W27 could represent determining whether the current variable i at the present time t has a pre-specified value determined by some other means than reconstruction by the model 104’’ (whether because it is intervened-on or observed) and step W30a may represent setting any such variable to its specified value.
- If however it is determined at step W27 that the value of the present variable xi,t at index i at the present time t is not specified, then the method branches to step W29, where it extracts the values of any instantaneous parent(s) of xi,t from the sampled graph, and step W35a, where it extracts the values of any lagged parents of xi,t according to the sampled graph. The method then proceeds to step W50 where it generates a noiseless reconstructed value x'i,t of the present variable xi,t for the present time t, using means as discussed previously in the description of the model 104''.
- steps W45 and W55 are also performed to generate the history-dependent noise term z i,t as also discussed previously.
- the noise is added to (or otherwise combined with) the reconstructed variable x'i,t to generate the noisy simulated sample Xi,t corresponding to the present variable xi,t.
- the method then proceeds to step W65 where it determines whether the counting index i has reached the value D of the index of the last variable in the feature vector (preferably sorted in topological order). If not the method branches to step W30b where it increments i by 1 and then loops back to step W25 and repeats until it has reconstructed the entire vector (or at least as much of it as is to be reconstructed).
- steps W30a and W35a and the loop via W30b may be considered to correspond to steps W30 and W35 in Figure 16, in that step W30a sets pre-specified (treated or observed) variables to their specified values and step W35a extracts the historical values, while the loop via W30b ensures that a reconstructed value x'<i,t is generated for any unspecified variable that comes before the present variable xi,t in the counting order (e.g. topological order). This is relevant since these variables x'<i,t could have an effect on the present variable.
- at step W70 the method checks whether the present time t has reached the target time t' yet. If not, the method branches to step W75 where t is incremented by 1, and then loops back to step W25 where it repeats with the new value of t. When the method does reach the target time t', it proceeds to step W80 where it determines whether any more graphs are to be sampled. If so, the method loops back to step W20 where it repeats with a newly sampled temporal graph.
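The inner variable loop of one time step (steps W25-W65) can be sketched as follows; `specified`, `order` and `reconstruct` are hypothetical stand-ins for the intervened/observed values, the topological ordering, and the model's decoder path respectively:

```python
def simulate_step(t, specified, values, order, reconstruct):
    """One time step of the loop W25-W65.

    Count through variables i in topological order; any variable whose
    value is pre-specified at time t (intervened-on or observed) is
    fixed (W27 -> W30a), otherwise it is reconstructed from its
    already-simulated parents (W29/W35a/W50).

    `specified` maps (i, t) -> fixed value; `reconstruct(i, t, values)`
    stands in for the model's decoder path and may read earlier entries
    of `values`, which holds values[t][i] for past and current steps.
    """
    values[t] = {}
    for i in order:                       # W25 ... W65, looping via W30b
        if (i, t) in specified:           # W27: value pre-specified?
            values[t][i] = specified[(i, t)]      # W30a: fix it
        else:
            values[t][i] = reconstruct(i, t, values)  # W50: reconstruct
    return values[t]
```

Because the variables are visited in topological order, each reconstruction can safely read the values of any earlier variables at the same time step, as the text notes.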
- the entity may comprise a living being; or a physical object such as an electronic, electrical or mechanical device or system. Alternatively or additionally the object may comprise a piece of software running or to be run on a computer system.
- the modelled real-world entity may comprise a living being, e.g. a human or animal.
- the disclosed model 104, 104’ or 104’’ may be used to determine whether to action a treatment to the being in order to affect a condition of the being (e.g. to cure or alleviate a disease by which the being is afflicted).
- the target variable xi represents a measure of the condition, e.g. a symptom experienced by the living being as a result of the condition
- one or more intervened-on variables x j represent one or more properties of the being or its environment which are susceptible to being controlled as a treatment (e.g. dosage of a drug, or a change in habit or environment, etc.).
- the potential treatment in question could be, for instance, a drug, surgery, or a lifestyle change.
- the living being may comprise a plant or crop of plants, such as an agricultural crop, and the potential treatment could be the timing of sowing seeds, when to water or cover the plant/crop, or whether or when to apply a chemical such as a fertilizer or pesticide and/or in what quantity, etc.
- a decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to treat the target condition may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target condition, and/or an estimated strength of the confounder.
- the decision may be made in dependence on the estimated treatment effect, e.g. ATE.
- the real-world entity may comprise a mechanical, electrical or electronic system or device; and the disclosed model 104, 104’ or 104’’ may be used to determine whether to action a treatment to affect a state of the system or device, such as to repair, maintain or debug the system or device.
- the target variable xi may represent a measure of the state of the system or device, e.g. sensor data measuring wear, operating temperature, operating voltage, output or throughput, etc.
- the one or more intervened-on variables x j may represent one or more properties of the device or its environment susceptible to being treated, e.g.
- a decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to treat the target state may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE.
- the real-world entity being modelled may comprise software that is run, or to be run, on one or more processors in one or more computer devices at one or more locations; and the disclosed model 104, 104’ or 104’’ may be used to determine whether to action a treatment to try to optimize the running of the software.
- the target variable xi may comprise a measure of a current state of the software, e.g. memory or processing resource usage, or latency, etc.
- the one or more intervened-on variables x j may represent any property capable of affecting the running of the software, e.g. input data, or rebalancing of the load across a different combination of processors or devices.
- a decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to optimize the running of the software may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE.
- the real-world entity being modelled may comprise a network, e.g. a mobile cellular network, a private intranet, or part of the internet (e.g. an overlay network such as a VoIP network overlaid on the network).
- the disclosed model 104, 104’ or 104’’ may be used to determine whether to action a treatment to try to optimize the operation of the network.
- the target variable x i may represent any state of the network that it may be wished to improve, e.g. a property of network traffic such as end-to-end delay, jitter, packet loss, error rate, etc.
- the one or more intervened-on variables xj may represent any property capable of affecting the target variable, e.g. balancing of traffic across the network, routing or timing of traffic, encoding scheme used, etc.
- a decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to optimize the network performance may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE.
- the decision(s) may be made adaptively (i.e. dynamically in response to changing conditions), or as part of network planning.
- the real-world entity may comprise an autonomous or semi- autonomous self-locomotive device such as a self-driving vehicle or robot.
- the disclosed model 104, 104’ or 104’’ may be used to determine whether to action a treatment in the form of a control signal to control the motion of the device.
- the target variable x i may represent sensor data providing information on the device’s environment, e.g. image data or other sensor data (such as distance or motion sensor data) providing information on the environment of the device (such as presence of obstacles, location of another object to interact with).
- the one or more intervened- on variables x j may comprise a control signal that can be applied to control the device, e.g. to steer the vehicle or apply brakes or lights, or to move a robot arm in a certain way.
- a decision as to whether to perform a certain control operation (a type of "treatment") represented by the intervened-on variable, as a means to achieve a certain effect on the environment (e.g. avoid an obstacle or interact with another object), may be made in dependence on whether a confounder is estimated to exist between the proposed variable to be controlled and the target outcome, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE.
- EXAMPLE IMPLEMENTATION AUTO-REGRESSIVE DECI
- the above methods assume that the noises are mutually independent and stationary, such that their distributions have independent (and learnable) parameters. However, this may not hold in many scenarios. For example, in an education context, the observational noise of whether a student correctly answers a question should depend on his/her past learning history. If the student correctly answered similar past questions in a consistent way, the observational noise should be small. On the other hand, if the answer history resembles random guessing, the observational noise should be large. This kind of history-dependent noise distribution is ubiquitous in real life and cannot be modelled by the aforementioned SEM.
- a particular embodiment of the present disclosure provides a novel SEM for time-series data based on the framework known as deep end-to-end causal inference (DECI) (Geffner et al, 2022).
- This novel SEM is referred to as AR-DECI (auto-regressive DECI).
- AR-DECI also adopts a Bayesian view of graph learning by using variational inference to approximate the graph posterior, which provides uncertainties over graphs under limited data. Also, we show one can compute the treatment effect estimates of interest by leveraging the fitted AR-DECI, e.g. the conditional average treatment effect (CATE) with time-invariant interventions. Theoretically, we show that the proposed AR-DECI is structure identifiable under assumptions. To achieve this, we provide a general framework for showing structure identifiability with history-dependent noise, of which AR-DECI is a special case. Furthermore, we show AR-DECI unifies several aforementioned approaches based on SEMs and Granger causality. First we will briefly introduce the necessary preliminaries required for building the AR-DECI model.
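The variational approximation of the graph posterior referred to above follows the usual evidence lower bound; in generic form (the exact AR-DECI objective may differ in detail):

```latex
\log p_\theta(X) \;\geq\; \mathbb{E}_{q_\phi(G)}\bigl[\log p_\theta(X \mid G)\bigr] \;-\; \mathrm{KL}\bigl(q_\phi(G)\,\|\,p(G)\bigr)
```

Maximising this bound jointly fits the SEM parameters θ and tightens the variational graph distribution qφ(G) towards the true posterior over graphs.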
- Pa i G (<t) contains the parent nodes specified by G in previous time frames (lagged parents), i.e. parent nodes at times t−1, t−2, ...
- Pa i G (t) contains the parent nodes at the current time t (instantaneous parents)
- ε t,i is the exogenous noise, which is mutually independent of the other variables in the model
- fi,t describes the functional relationships between the parents and child node X i t .
- DECI (Deep End-to-end Causal Inference) also adopts a Bayesian view for graph learning by applying variational inference to approximate the graph posterior p(G | x 1 , ..., x N ).
- DECI supports inference tasks such as estimating the (conditional) average treatment effect, which provides an end-to-end pipeline from observational data to the causal quantities of interest.
- Granger Causality. Granger causality has been extensively used as the underlying principle for identifying causal relationships in time series data.
- time-series i Granger causes j if ∃ l ∈ [1, t] such that X i t−l ∈ Pa j G (<t) and f j,t depends on X i t−l .
- Granger causality is equivalent to causal relations for a directed acyclic graph (DAG) if there are no latent confounders and no instantaneous effects.
- One shortcoming of Granger causality is its inability to handle instantaneous effects, which can arise when the time sampling interval is not frequent enough.
- Vector Auto-regressive Model To overcome the aforementioned issue of Granger causality, another line of research focuses on directly fitting the identifiable SEM to the observational data with instantaneous parents.
- AR-DECI Auto-Regressive Deep End-To-End Causal Inference
- {X t } t=0...T with a set of nodes V and
- G τ,ij = 1 means there is a directed connection X i t−τ → X j t , and 0 means no connection.
- G is written for G 0:K for brevity.
- AR-DECI adopts a Bayesian view of causal discovery, which aims to learn a posterior distribution over graphs instead of inferring a single graph.
- θ is the model parameter.
- Graph Prior. To design a proper graph prior, we aim at three criteria: (1) constrain the graph to be a DAG; (2) favour sparse graphs; (3) support prior knowledge.
- We choose the prior as shown, where h(G) = tr(e G⊙G ) − D is the DAG penalty proposed in prior work, which is 0 if and only if G is a DAG; ⊙ is the Hadamard product; G p is the prior knowledge of the graph; λ s , λ p specify the strength of the graph sparseness and prior knowledge, respectively; and α, ρ characterize the strength of the DAG penalty. Since the connections from a history node to the current one can only follow the direction of the time flow, only the instantaneous connections can violate the DAG constraint.
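The DAG penalty h(G) = tr(e^(G⊙G)) − D can be computed directly. The sketch below evaluates the matrix exponential by its power series, which is adequate for small illustrative graphs (a real implementation would use a library matrix-exponential routine):

```python
def dag_penalty(G):
    """h(G) = tr(exp(G o G)) - D; zero if and only if G is a DAG.

    G is a DxD adjacency list-of-lists; "o" is the Hadamard (entrywise)
    product. The matrix exponential is computed by its truncated power
    series sum_k A^k / k!, which converges quickly for small matrices.
    """
    D = len(G)
    A = [[G[i][j] * G[i][j] for j in range(D)] for i in range(D)]  # G o G
    term = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]  # A^0 = I
    expA = [row[:] for row in term]
    fact = 1.0
    for k in range(1, 20):
        # term becomes A^k; add A^k / k! to the running exponential
        term = [[sum(term[i][m] * A[m][j] for m in range(D)) for j in range(D)]
                for i in range(D)]
        fact *= k
        for i in range(D):
            for j in range(D):
                expA[i][j] += term[i][j] / fact
    return sum(expA[i][i] for i in range(D)) - D
```

For an acyclic adjacency matrix every power A^k with k ≥ D vanishes on the diagonal, so the trace stays at D and the penalty is zero; any cycle contributes positive diagonal mass.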
- the causal graph and the associated causal effects are to be inferred based on observational data, in embodiments solely based on observational data. Therefore, this would appear to require an assumption that all the data provided by users already contains all the information that is needed (i.e. there should not be any unmeasured confounders). However, this assumption is often unrealistic, as the modelled scenario is also impacted by certain variables that cannot be directly measured. Therefore, it would be desirable to provide an effective and theoretically principled strategy for handling latent confounders when performing joint causal discovery & inference. However, the existence of unmeasured confounding poses challenges for causal discovery, since there might exist multiple contradicting causal structures that are compatible with the observations.
- G the adjacency matrix representation
- h(G) = tr(e G⊙G ) − D (3) is the DAG penalty.
- θ, the parameters of our non-linear ANM, are learned using observational data by maximizing (a lower bound of) log p θ (x 1 , ... , x N ).
- the question of causal discovery can be simply answered by the posterior p θ (G | x 1 , ... , x N ).
- ADMGs (acyclic directed mixed graphs)
- an ADMG causal discovery algorithm is able to express search results using (a subset of) the following structures: x → y: x is the cause of y; x ↔ y: indicates the existence of an unmeasured confounder u xy , such that x ← u xy → y; x o→ y: either x → y or x ↔ y; x ←o y: either x ← y or x ↔ y.
- G' is basically a big matrix on both X and U, obtained by concatenating and unpacking G1 and G2.
- G' = M(G1, G2), where M(·) is a pre-known mathematical function.
- G' may also be called the magnified matrix. In short, it is obtained by concatenating and aggregating G 1 and G 2 . For example, for a graph x1 → x2 ↔ x3, the corresponding matrices will be as shown, where the 4th row of G' corresponds to the 4th variable (the additional latent confounder) implied in the bidirected edge x2 ↔ x3.
- the 4th variable should point to both x2 and x3, hence the two 1s in the second and third cells of the 4th row.
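The magnification step for this example can be sketched as follows; `magnify` is a hypothetical name, and the convention G1[i][j] = 1 for x_i → x_j is an assumption of this sketch:

```python
def magnify(G1, G2):
    """Build a magnified matrix G' = M(G1, G2) over X and U.

    G1 is the DxD directed part (G1[i][j] = 1 for x_i -> x_j); G2 is
    the symmetric bidirected part (G2[i][j] = 1 for x_i <-> x_j, i.e.
    a latent confounder u_ij with x_i <- u_ij -> x_j). Each bidirected
    edge contributes one extra row/column: a latent variable that
    points to both endpoints of the edge.
    """
    D = len(G1)
    latents = [(i, j) for i in range(D) for j in range(i + 1, D) if G2[i][j]]
    n = D + len(latents)
    Gm = [[0] * n for _ in range(n)]
    for i in range(D):            # copy the directed part into the corner
        for j in range(D):
            Gm[i][j] = G1[i][j]
    for k, (i, j) in enumerate(latents):
        Gm[D + k][i] = 1          # u_ij -> x_i
        Gm[D + k][j] = 1          # u_ij -> x_j
    return Gm
```

For x1 → x2 ↔ x3 this yields a 4×4 matrix whose 4th row is [0, 1, 1, 0], matching the description above.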
- G the causal graph
- Eq. (5) may be called a magnification-based parameterization, since the original ADMG formulation does not explicitly assume any functional forms.
- p(G') can be implemented by: Parameterization of p θ (x n , u n
- the function f corresponds to the decoders g d described earlier.
- nj are mutually independent noise variables.
- the noise term n has also been referred to as z earlier.
- Lemma 1. Let M denote a set of variables not containing v i , and let s(v i ) denote a linear or nonlinear function of v i . Then the residual of s(v i ) regressed on M cannot be independent of n i . Proof: Assume that s(v i ) − g i (M) ⊥ n i holds; then M must contain at least one descendant of v i , as it must have dependence on the noise n i to cancel the effect of n i in s(v i ). We can express g i (M) as .
- x j is a direct cause of x i and there is no latent confounder.
- MLE recovers ground truth: We can use the exact same proof as in the DECI paper to show that MLE recovers ground truth; we can just replace G with the ADMG corresponding to G' projected onto x.
- D-DECI recovers ground truth: In the infinite data limit, we have Eq. (22). The posterior approximation q(u n
- Statement 1. A computer-implemented method comprising: A) selecting a selected variable from among variables of a feature vector; B) sampling a temporal causal graph from a temporal graph distribution, the temporal graph distribution specifying probabilities of directed causal edges between different ones of the variables of the feature vector at a present time step, and from one of the variables of the feature vector at a preceding time step to one of the variables of the feature vector at the present time step; C) from among the variables of the feature vector, identifying a present parent which is a cause of the selected variable in the present time step according to the temporal causal graph, and identifying a preceding parent which is a cause of the selected variable from the preceding time step according to the temporal
- the method may optionally further comprise features as set out in any of the following Statements.
- Statement 2 The method of Statement 1, wherein: the temporal causal graph distribution sampled in B) specifies probabilities of directed causal edges existing from each of a plurality of variables of the feature vector from one or more preceding time steps, each to a variable of the feature vector at the present time step; and C) comprises, from among the variables of the feature vector, identifying each present parent that is a cause of the selected variable in the present time step according to the temporal causal graph, and identifying each preceding parent variable which is a cause of the selected variable from a preceding time step according to the temporal causal graph.
- Statement 3 The method of Statement 1, wherein: the temporal causal graph distribution sampled in B) specifies probabilities of directed causal edges existing from each of a plurality of variables of the feature vector from one or more preceding time steps, each to a variable of the feature vector at the present time step; and C) comprises, from among the variables of the feature vector, identifying each present parent that is
- the method of Statement 2 further comprising: generating a history dependent noise term based on embeddings of the preceding parents; and combining the history dependent noise term with the reconstructed value of the selected variable, resulting in a simulated value of the reconstructed variable.
- Statement 4. The method of Statement 3, wherein embeddings of the present parents are not input into the noise model.
- Statement 5. The method of Statement 4, wherein the generating of the history dependent noise term comprises: combining the embeddings of the preceding parents, resulting in a further embedding; inputting the further embedding into a decoder associated with the selected variable, resulting in one or more parameter values; and generating the history dependent noise term from the probabilistic distribution as parameterized by the one or more parameter values.
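The history-dependent noise generation of Statements 3-5 can be sketched as below. The embedding dimension, the sum used to combine embeddings, the linear decoder, and the Gaussian noise model are all hypothetical stand-ins; the Statements do not fix these choices.

```python
import numpy as np

rng = np.random.default_rng(2)
E = 8  # embedding dimension (hypothetical)

# Hypothetical embeddings of two preceding parents of the selected
# variable, as would be produced by the encoders.
emb_parents = [rng.normal(size=E), rng.normal(size=E)]

# Combine the embeddings of the preceding parents (here: a sum),
# resulting in a further embedding. Per Statement 4, embeddings of the
# *present* parents are not input into the noise model.
further_emb = np.sum(emb_parents, axis=0)

# Hypothetical decoder for the selected variable: a linear map from the
# further embedding to the two parameters of a Gaussian distribution.
W = rng.normal(size=(2, E)) * 0.1
mu, log_sigma = W @ further_emb

# Generate the history-dependent noise term from the parameterized
# distribution, then combine it with the reconstructed value.
noise = mu + np.exp(log_sigma) * rng.normal()
reconstructed = 0.7            # placeholder reconstructed value
simulated = reconstructed + noise
print(float(simulated))
```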
- B) further comprises sampling a second causal graph from a second graph distribution, the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector;
- C) further comprises, from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph;
- D) further comprises inputting the input value of the parent variable and an input value of the selected variable into an inference network, resulting in a latent value modelling the respective confounder of the confounded pair, and inputting the latent value into a second encoder, resulting in an embedding of the confounder of the confounded pair;
- the combining includes combining the embedding of the present and preceding parents with the embedding of the confounder of the confounded pair, thereby resulting in said combined embedding.
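The confounder-handling steps above can be sketched as follows. The inference network, second encoder, and the additive combination of embeddings are hypothetical minimal stand-ins for the networks the Statements describe.

```python
import numpy as np

rng = np.random.default_rng(3)
E = 8  # embedding dimension (hypothetical)

def inference_net(x_parent, x_selected):
    # Hypothetical inference network for one confounded pair: maps the
    # pair of observed values to a latent value modelling their shared,
    # unobserved confounder.
    return np.tanh(0.5 * x_parent + 0.5 * x_selected)

def second_encoder(u):
    # Hypothetical second encoder: embeds the latent confounder value.
    return u * np.ones(E)

x_parent, x_selected = 1.2, -0.4      # input values of the confounded pair
u = inference_net(x_parent, x_selected)  # latent value of the confounder
emb_conf = second_encoder(u)             # embedding of the confounder

emb_parent = rng.normal(size=E)  # embedding from a first encoder
combined = emb_parent + emb_conf # combined embedding fed to the decoder
print(combined.shape)
```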
- Statement 7 The method of any of Statements 2 to 5, or Statement 6 when dependent on any of Statements 2 to 5, comprising: i) for a given training data point comprising a given combination of input values of the variables of the feature vector at the present time step and each preceding time step, repeating A)-F) over multiple selections, each selection selecting a different one of the variables of the feature vector as the selected variable thereby resulting in a respective reconstructed value, the multiple selections together thereby resulting in a reconstructed version of the training data point comprising the reconstructed values for the training data point; ii) evaluating a measure of difference between the training data point and the reconstructed version; and iii) training model parameters of the encoders, decoders, and temporal graph distributions, based on the evaluated measure.
- Statement 8 The method of Statement 7, comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector.
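The training procedure of Statements 7-8 can be sketched as a reconstruction loop. Here `reconstruct` is a hypothetical stand-in for steps D)-F) (it simply averages the parents' values), and mean squared error is one possible choice for the measure of difference in ii); the Statements do not fix either.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 3  # number of variables (hypothetical)

def reconstruct(x, selected, graph):
    # Hypothetical stand-in for D)-F): predict the selected variable
    # from its parents under the sampled graph (0.0 if it has none).
    parents = np.flatnonzero(graph[selected])
    return x[parents].mean() if parents.size else 0.0

data = rng.normal(size=(16, D))                    # training data points
graph = np.triu(np.ones((D, D)), 1).astype(bool)   # one sampled graph

total = 0.0
for x in data:  # Statement 8: repeat i)-iii) over multiple data points
    # i) repeat over selections, one per variable -> reconstructed version
    recon = np.array([reconstruct(x, i, graph) for i in range(D)])
    total += np.mean((x - recon) ** 2)  # ii) measure of difference
loss = total / len(data)  # iii) a gradient step on this would train the model
print(float(loss))
```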
- Statement 9 The method of Statement 7 or 8, wherein: each selection in i) further comprises a history dependent noise model generating a respective history dependent noise term based on embeddings of the preceding parents; and wherein the training further comprises training model parameters of the history dependent noise model based on the evaluated measure.
- the generating of the history dependent noise term by the history-dependent noise model comprises: combining the embeddings of the preceding parents, resulting in a respective further embedding; inputting the respective further embedding into the decoder associated with the selected variable, resulting in one or more parameter values; and generating the respective history dependent noise term based on the one or more parameter values.
- Statement 11 The method of Statement 10, comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector.
- the method of Statement 2, or any of Statements 3 to 11 when dependent on Statement 2, comprising: I) setting the input value of an intervened-on variable of the feature vector, other than the selected variable, to a specified value; and II) estimating an effect of the intervened-on variable on the selected variable based on the reconstructed value of the selected variable.
- Statement 13 The method of Statement 12, wherein: I) comprises setting the input value of a plurality of intervened-on variables of the feature vector, other than the selected variable, to respective specified values; and II) comprises estimating the effect of the plurality of intervened-on variables based on the reconstructed value of the selected variable.
- each graph-sampling event further comprises: generating a history dependent noise term based on the embeddings of the preceding parents; and combining the history dependent noise term with the reconstructed value of the selected variable, resulting in a simulated value of the reconstructed variable; wherein said estimating based on the reconstructed values in II) comprises: estimating the effect of the intervened-on variable on the selected variable based on the simulated values.
- each round sets a plurality of intervened-on variables and/or observed variables of the feature vector to specified values (and in the case of multiple intervened-on variables, II) comprises determining the average treatment effect of the plurality of intervened-on variables on the selected variable); and/or each round comprises an interior loop of D)-F) repeated over multiple iterations with the same sampled graph but for incrementing values of the present time step with each iteration, wherein each but the last iteration comprises, in addition to the target variable, reconstructing or simulating values of one or more further variables of the feature vector, and feeding back the simulated or reconstructed values as the specified input values of the next iteration.
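The interior rollout loop described above can be sketched as an autoregressive simulation under an intervention. The linear lagged dynamics, the weight matrix, and the choice of intervened-on and target variables are hypothetical; the point is the structure of holding the intervention fixed and feeding simulated values back in.

```python
import numpy as np

D, T = 3, 5  # number of variables and rollout length (hypothetical)

# Hypothetical lagged effects: x_t = lag_weights @ x_{t-1} (+ noise,
# omitted here for clarity).
lag_weights = np.array([[0.5, 0.2, 0.0],
                        [0.0, 0.5, 0.1],
                        [0.3, 0.0, 0.5]])

x = np.zeros(D)
x[0] = 1.0  # I) intervened-on variable set to a specified value

trajectory = [x.copy()]
for t in range(T):            # interior loop: incrementing time steps
    x_next = lag_weights @ x  # reconstruct/simulate all variables
    x_next[0] = 1.0           # hold the intervention fixed each step
    trajectory.append(x_next)
    x = x_next                # feed simulated values back as inputs

effect = trajectory[-1][2]    # II) effect on a target variable at step T
print(float(effect))
```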
- the intervened-on variable models a treatment on a real-world entity or an environment thereof and the target variable models an effect of the treatment applied to the real-world entity, and the method further comprises actioning the treatment on the real-world entity in dependence on the estimated treatment effect.
- Statement 18 The method of Statement 17, wherein one of: the real-world entity comprises a living being, and the treatment comprises a medical treatment to the living being or an environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof; and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.
- a system comprising: processing apparatus comprising one or more processors; and memory comprising one or more memory units, wherein the memory stores: a machine learning model comprising the encoder and decoders of any preceding Statement, and optionally the noise model and/or an inference network; and code arranged to run on the processing apparatus and being configured so as when run to perform the method of any of Statements 1 to 18.
- Statement 20 A computer program embodied on non-transitory computer-readable storage, wherein the computer program comprises instructions configured so as when run on one or more processors to perform the method of any of Statements 1 to 18.
- Statement 1A a computer-implemented method comprising: a) selecting a selected variable from among variables of a feature vector; b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution, the first causal graph modelling causation between variables in the feature vector, and the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector; c) from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph; d) inputting an input value of the parent variable into a first encoder, resulting in a respective embedding of the parent variable; e) inputting at least the input value of the parent variable (and optionally also an input value of the selected variable and/or an input value
- Statement 2A The method of Statement 1A, wherein c) comprises: identifying each of the variables of the feature vector that is a respective parent variable of the selected variable according to the first causal graph, and identifying each respective confounded pair that comprises the selected variable and a respective one of the identified parent variables according to the second causal graph; d) comprises: inputting a respective input value of each of the parent variables into a respective first encoder for the parent, resulting in a respective embedding of each respective parent variable; e) comprises: for each identified confounded pair, inputting the input value of the respective parent variable and the input value of the selected variable into the inference network, thereby resulting in a respective latent value of each of the respective confounders, and inputting each of the latent values into a respective second encoder for the respective latent value, resulting in a respective embedding of each of the respective confounders; f) comprises combining the embeddings of all the identified parent
- Statement 3A The method of Statement 2A, comprising: i) for a given training data point comprising a given combination of input values of the variables of the feature vector, repeating a)-g) over multiple selections, each selection selecting a different one of the variables of the feature vector as the selected variable thereby resulting in a respective reconstructed value, the multiple selections together thereby resulting in a respective reconstructed version of the training data point comprising the reconstructed values for the training data point; ii) evaluating a measure of difference between the training data point and the reconstructed version; and iii) training parameters of the inference network, first and second encoders, decoders, and first and second graph distributions, based on the evaluated measure.
- Statement 4A The method of Statement 2A, comprising: i) for a given training data point comprising a given combination of input values of the variables of the feature vector, repeating a)-g) over multiple selections, each selection selecting a different one of the variables of the feature vector as the
- the method of Statement 3A comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector.
- Statement 5A The method of Statement 4A, comprising, after the training over the multiple training data points, observing the second graph distribution to estimate whether a confounder exists between a target pair of the variables of the feature vector.
- Statement 6A The method of Statement 5A, comprising, after the training over the multiple training data points, observing the latent value of the respective confounder between the target pair of variables, resulting from the inference network, as an estimated value of the respective confounder.
- Statement 7A The method of Statement 3A, comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector.
- the treatment comprises a medical treatment to the living being or the environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof; and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.
- Statement 9A The method of any of Statements 2A to 8A, comprising: I) setting the input value of an intervened-on variable of the feature vector, other than the selected variable, to a specified value; and II) estimating an effect of the intervened-on variable on the selected variable based on the reconstructed value of the selected variable.
- Statement 10A The method of Statement 9A, wherein: I) comprises setting the input value of a plurality of intervened-on variables of the feature vector, other than the selected variable, to respective specified values; and II) comprises estimating the effect of the plurality of intervened-on variables based on the reconstructed value of the selected variable.
- each graph-sampling event sets a plurality of intervened-on values to specified values, the same values each time; and II) comprises determining the average treatment effect of the plurality of intervened-on variables on the selected variable.
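The average treatment effect described above can be sketched as Monte Carlo averaging over graph-sampling rounds. The simulator below is a hypothetical stand-in for one round of steps b)-g); the edge probability, the linear effect, and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
R = 200  # number of graph-sampling rounds (hypothetical)

def simulate_selected(do_value, rng):
    # Hypothetical single round: sample a graph (here one possible edge)
    # and noise, then return a simulated value of the selected variable
    # under do(parent = do_value).
    edge = rng.random() < 0.8        # sampled causal edge present?
    noise = rng.normal(scale=0.1)
    return (1.5 * do_value if edge else 0.0) + noise

# Average treatment effect: intervene with the same specified values in
# each round, averaging over the sampled graphs and noise.
ate = np.mean([simulate_selected(1.0, rng) - simulate_selected(0.0, rng)
               for _ in range(R)])
print(round(float(ate), 2))
```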
- Statement 13A The method of Statement 9A, wherein the intervened-on variable models a treatment on a real-world entity or an environment thereof and the target variable models an effect of the treatment applied to the real-world entity, and the method further comprises actioning the treatment on the real-world entity in dependence on the estimated treatment effect.
- the treatment comprises a medical treatment to the living being or an environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof; and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.
- the inference network comprises a respective constituent inference network for each pair of variables in the feature vector
- e) comprises: for each identified confounded pair, inputting the input value of the respective parent variable and the input value of the selected variable into the respective inference network for the respective confounded pair, thereby resulting in the respective latent value of the respective confounder.
- the inference network may comprise a common inference network operable to encode all the variables of the feature vector together into the respective latent for each pair.
- Statement 16A The method of any of Statements 2A to 15A, wherein each of the first and second encoders and each of the decoders comprises a neural network.
- Statement 18A The method of any of Statements 1A to 16A, wherein the inference network comprises a neural network.
- Statement 18A The method of Statement 9A or any subsequent Statement when dependent thereon, wherein the method is performed on a server system of a first party, the server system comprising one or more server units at one or more sites; and the method further comprises, by the server system of the first party: providing an application programming interface, API, enabling a second party to contact the server system via a network; receiving a request from the second party over the network via the API; in response to the request, determining the estimated treatment effect on the target variable; and returning the estimated treatment effect to the second party over the network via the API.
- a computer program embodied on computer-readable storage wherein the computer program comprises a machine learning model comprising a plurality of first encoders, a plurality of second encoders, a decoder and an inference network; and wherein the computer program further comprises instructions configured so as when run on one or more processors to perform the method of any of Statements 1A to 18A.
- Statement 20A A system comprising: processing apparatus comprising one or more processors; and memory comprising one or more memory units, wherein the memory stores: a machine learning model comprising the first encoders, second encoders, decoders, and inference network of any preceding Statement; and code arranged to run on the processing apparatus and being configured so as when run to perform the method of any of Statements 1A to 18A.
- the first aspect (Statement 1) or any embodiment thereof may be used independently or in conjunction with the second aspect (Statement 1A) or any embodiment thereof.
- the model 104/104’/104’’ and associated computer executable instructions are provided using any computer-readable media that are accessible by the computing equipment 102.
- Computer-readable media include, for example, computer storage media such as memory and communications media.
- Computer storage media include volatile and non-volatile, removable, and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or the like.
- Computer storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), persistent memory, phase change memory, flash memory or other memory technology, Compact Disk Read-Only Memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, shingled disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing apparatus.
- communication media may embody computer readable instructions, data structures, program modules, or the like in a modulated data signal, such as a carrier wave, or other transport mechanism.
- computer storage media do not include communication media. Therefore, a computer storage medium should not be interpreted to be a propagating signal per se. Propagated signals per se are not examples of computer storage media.
- although the computer storage medium is described as being within the computing equipment 102, it will be appreciated by a person skilled in the art that, in some examples, the storage is distributed or located remotely and accessed via a network or other communication link (e.g., using a communication interface). Other variants or use cases of the disclosed techniques may become apparent to the person skilled in the art once given the disclosure herein. The scope of the disclosure is not limited by the described embodiments but only by the accompanying claims.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23777072.2A EP4591217A1 (en) | 2022-09-21 | 2023-08-24 | Modelling causation in machine learning |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263376550P | 2022-09-21 | 2022-09-21 | |
| US63/376,550 | 2022-09-21 | ||
| US17/936,347 US20240104338A1 (en) | 2022-09-21 | 2022-09-28 | Modelling causation in machine learning |
| US17/936,347 | 2022-09-28 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024063907A1 true WO2024063907A1 (en) | 2024-03-28 |
Family
ID=88197357
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/031000 Ceased WO2024063907A1 (en) | 2022-09-21 | 2023-08-24 | Modelling causation in machine learning |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024063907A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118397290A (en) * | 2024-05-10 | 2024-07-26 | 中国科学院深圳先进技术研究院 | Feature enhancement method, device, equipment and storage medium for image recognition |
| CN118446230A (en) * | 2024-07-02 | 2024-08-06 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Method for capturing dynamic causal relationship in emotion support dialogue |
- 2023-08-24: WO PCT/US2023/031000 patent/WO2024063907A1/en, not active (Ceased)
Non-Patent Citations (4)
| Title |
|---|
| GEFFNER ET AL., MICROSOFT RESEARCH, Retrieved from the Internet <URL:https://arxiv.org/pdf/2202.02195.pdf> |
| MATEJ ZEČEVIĆ ET AL: "Relating Graph Neural Networks to Structural Causal Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 October 2021 (2021-10-22), XP091071398 * |
| SINDY LÖWE ET AL: "Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 February 2022 (2022-02-21), XP091146009 * |
| YUAN MENG: "Estimating Granger Causality with Unobserved Confounders via Deep Latent-Variable Recurrent Neural Network", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 September 2019 (2019-09-09), XP081475360 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Yang et al. | Policy representation via diffusion probability model for reinforcement learning | |
| Chen et al. | Diffusion forcing: Next-token prediction meets full-sequence diffusion | |
| Zhang et al. | Generative flow networks for discrete probabilistic modeling | |
| Moreno-Muñoz et al. | Heterogeneous multi-output Gaussian process prediction | |
| Houthooft et al. | Vime: Variational information maximizing exploration | |
| Contaldi et al. | Bayesian network hybrid learning using an elite-guided genetic algorithm | |
| Touati et al. | Randomized value functions via multiplicative normalizing flows | |
| Jordan et al. | An introduction to graphical models | |
| US20230229906A1 (en) | Estimating the effect of an action using a machine learning model | |
| WO2024063907A1 (en) | Modelling causation in machine learning | |
| Abdel-Nasser et al. | Link quality prediction in wireless community networks using deep recurrent neural networks | |
| Liu et al. | Gradient‐Sensitive Optimization for Convolutional Neural Networks | |
| EP4591217A1 (en) | Modelling causation in machine learning | |
| Ma et al. | Meta-Black-Box-Optimization through Offline Q-function Learning | |
| US20240104370A1 (en) | Modelling causation in machine learning | |
| Kim et al. | Symmetric replay training: Enhancing sample efficiency in deep reinforcement learning for combinatorial optimization | |
| WO2024063912A1 (en) | Modelling causation in machine learning | |
| do Carmo Alves et al. | Information-guided planning: an online approach for partially observable problems | |
| Smith et al. | A learning classifier system with mutual-information-based fitness | |
| Li et al. | Enao: Evolutionary neural architecture optimization in the approximate continuous latent space of a deep generative model | |
| Laird et al. | Generative Flow Networks with Parameterized Quantum Circuits | |
| Jarrett et al. | Time-series Generation by Contrastive Imitation | |
| Mannan et al. | Clan: Continuous learning using asynchronous neuroevolution on commodity edge devices | |
| KR102734936B1 (en) | Method and apparatus for conditional data genration using conditional wasserstein generator | |
| Lange | Decoding the surface code using graph neural networks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23777072 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023777072 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2023777072 Country of ref document: EP Effective date: 20250422 |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023777072 Country of ref document: EP |