
WO2025157774A1 - Clinical data analysis - Google Patents

Clinical data analysis

Info

Publication number
WO2025157774A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
variables
clinical
nodes
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2025/051373
Other languages
French (fr)
Inventor
Hervé ISAMBERT
Nikita LAGRANGE
Laurent-Philippe Albou
Jonathan DESPONDS
Florent GUINOT
Nadir SELLA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
F Hoffmann La Roche AG
Hoffmann La Roche Inc
Original Assignee
F Hoffmann La Roche AG
Hoffmann La Roche Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by F Hoffmann La Roche AG, Hoffmann La Roche Inc filed Critical F Hoffmann La Roche AG
Publication of WO2025157774A1 publication Critical patent/WO2025157774A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • the present disclosure relates to methods for providing synthetic clinical data, and in particular to methods of providing synthetic clinical data using directed relationships between variables and a Bayesian inference framework, as well as methods to train clinical variables predictor tools using synthetic clinical data.
  • Related methods, systems and products are described.
  • Background: Electronic healthcare data collection has become ubiquitous in the last two decades and is paramount for the development of new Artificial Intelligence (AI) algorithms and tools that can be deployed in the clinic, for example to improve prognosis, diagnosis, personalised medicine, etc.
  • Solutions have been proposed to generate synthetic medical records. However, this is a very complex task that becomes even more challenging when the preservation of patient privacy is a requirement.
  • the algorithm can use data including a mixture of categorical and continuous variables, and is robust to the presence of missing data.
  • the MIIC algorithm is a causal discovery constraint- based method and therefore the graphs produced contain a mixture of undirected and directed edges, as well as directed cycles. Indeed, the method is designed to uncover relationships between clinical variables to provide clinically relevant insights that may not be immediately apparent when looking at the various clinical variables in isolation. In other words, the method is designed for data exploration and the graphs produced are not suitable or designed for data generation. Nevertheless, the present inventors recognized that the graphs produced by this method represented a promising starting point for data generation, and designed a method to use the information captured in these graphs for this purpose.
  • the DAG is used to iteratively generate data starting from isolated nodes or nodes without parents (by sampling from learned distributions associated with these nodes), then propagating from parents to target nodes by sampling from multivariate conditional probability tables learned from the original data and that capture the conditional dependence between parents and targets (in the case of discrete or discretised data for both parents and targets), or machine learning algorithms (e.g. classifiers or regressors) that can learn dependencies between target variable values and parent variable values from the original data (for target variables with continuous parents or both discrete and continuous parents (mixed type variables)).
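  • As an illustrative sketch only (not the claimed implementation), the iterative generation described above amounts to traversing the DAG in topological order, sampling source and isolated nodes from their learned marginal distributions and each target node from a sampler conditioned on the already generated values of its parents; the sampler interfaces below are hypothetical.

```python
import networkx as nx

def generate_record(dag: nx.DiGraph, marginal_samplers: dict, conditional_samplers: dict) -> dict:
    """Generate one synthetic patient record from a DAG.

    marginal_samplers[node]() -> value, for source/isolated nodes (learned from real data).
    conditional_samplers[node](parent_values) -> value, for target nodes (conditional
    probability table or machine learning model learned from real data).
    """
    record = {}
    for node in nx.topological_sort(dag):           # parents always precede their targets
        parents = list(dag.predecessors(node))
        if not parents:                             # source or isolated node
            record[node] = marginal_samplers[node]()
        else:                                       # target node, conditioned on its parents
            parent_values = {p: record[p] for p in parents}
            record[node] = conditional_samplers[node](parent_values)
    return record
```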
  • Further optional improvements of this method additionally include the provision of a new method to generate a DAG from the graph provided by the causal discovery constraint-based method, which operates by: orienting each undirected edge in a way that minimizes the number of directed cycles, and removing all directed cycles from the graph by iteratively flipping an edge in the longest cycle in the graph, the edge flipped being selected so as to minimize the number of remaining cycles.
  • This approach was found to best exploit the information in the graph, and to generate synthetic data with lower variability compared to known methods of generating a DAG, such as a Depth-First Search (DFS) orientation algorithm.
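  • The following is a rough, illustrative sketch of such an orientation strategy, assuming the partially directed graph is represented as a networkx DiGraph plus a list of undirected edges; it ignores 2-cycles and tie-breaking details that the actual method may handle differently.

```python
import networkx as nx

def count_cycles(g: nx.DiGraph) -> int:
    return sum(1 for _ in nx.simple_cycles(g))

def to_dag(directed: nx.DiGraph, undirected_edges: list) -> nx.DiGraph:
    g = directed.copy()
    # 1) orient each undirected edge in the direction that yields the fewest directed cycles
    for u, v in undirected_edges:
        g.add_edge(u, v)
        forward = count_cycles(g)
        g.remove_edge(u, v)
        g.add_edge(v, u)
        if forward < count_cycles(g):
            g.remove_edge(v, u)
            g.add_edge(u, v)
    # 2) remove remaining cycles: flip the edge of the longest cycle that leaves the fewest cycles
    cycles = list(nx.simple_cycles(g))
    while cycles:
        longest = max(cycles, key=len)
        candidate_edges = list(zip(longest, longest[1:] + longest[:1]))
        best_edge, best_count = None, None
        for u, v in candidate_edges:
            g.remove_edge(u, v)
            g.add_edge(v, u)
            remaining = count_cycles(g)
            if best_count is None or remaining < best_count:
                best_edge, best_count = (u, v), remaining
            g.remove_edge(v, u)                     # undo the trial flip
            g.add_edge(u, v)
        u, v = best_edge
        g.remove_edge(u, v)
        g.add_edge(v, u)
        cycles = list(nx.simple_cycles(g))
    return g
```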
  • Further optional improvements of this method additionally include the use of a discretization scheme for continuous data in the data generation step, which maximises the mutual information between each parent variable and the target variable.
  • a similar concept is used in the MIIC algorithm in order to calculate the multivariate information between variables when performing network inference.
  • the data is not in fact discretised in the MIIC algorithm in the sense that no equivalent discrete data is generated from which multivariate conditional probabilities are learned.
  • the discretization of continuous data is simply used as a tool to enable the calculation of the mutual information terms involving continuous variables.
  • one or more continuous parent variables may be selected for discretization, and the resulting discretised data may be used to obtain a multivariate conditional probability table that is used to sample values for a target variable of the discretised parent variable(s).
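  • As a simple, hedged illustration of such a discretisation, the sketch below searches over quantile-based binnings of a continuous parent variable and keeps the one that maximises the mutual information with a discrete target; the actual method may explore candidate cut points differently.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def best_quantile_binning(parent: pd.Series, target: pd.Series, max_bins: int = 8):
    """Return the discretised parent (bin codes) and the bin count that maximise
    the mutual information between the discretised parent and the (discrete) target."""
    mask = parent.notna() & target.notna()          # ignore missing values for the search
    p, t = parent[mask], target[mask]
    best_codes, best_bins, best_mi = None, None, -np.inf
    for n_bins in range(2, max_bins + 1):
        codes = pd.qcut(p, q=n_bins, labels=False, duplicates="drop")
        mi = mutual_info_score(codes, t)
        if mi > best_mi:
            best_codes, best_bins, best_mi = codes, n_bins, mi
    return best_codes, best_bins
```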
  • a computer implemented method of providing synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, the method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the values of the connected nodes, wherein the edges correspond to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and wherein the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes; and generating synthetic clinical data for a patient using the DAG by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node; and iteratively obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data.
  • the method of the first aspect may have any one or any combination of the following optional features.
  • the machine learning model used to obtain a value for a target node may have been trained using training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes.
  • the machine learning model used to obtain a value for a target node may be configured to predict a value for the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node.
  • Obtaining a value for a target node may comprise training a machine learning model to predict a value of the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node, wherein the training uses training data comprising for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes.
  • the machine learning model may be a regression or classification model.
  • the machine learning model may be a non-linear classification or regression model.
  • the machine learning model may be a tree-based classification or regression model, such as a random forest model.
  • Obtaining a value for a target node using a machine learning model trained on the real clinical data may comprise predicting a value using said machine learning algorithm and adjusting the predicted value to fall within the observed range for the variables in the real clinical data, optionally by setting the predicted value to the nearest boundary of the range when the predicted value is outside of the observed range.
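  • By way of a hedged example of this option, the sketch below trains a random forest regressor on the real data (parent values as inputs, target values as outputs) and clips any out-of-range prediction to the nearest boundary of the observed range; column names are placeholders and parent variables are assumed to be numeric (categorical parents would need encoding).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_target_regressor(real_data: pd.DataFrame, target: str, parents: list):
    """Per-target model: predicts the target variable from its parent variables."""
    rows = real_data.dropna(subset=[target] + parents)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(rows[parents], rows[target])
    observed_range = (rows[target].min(), rows[target].max())
    return model, observed_range

def predict_clipped(model, observed_range, parent_values: pd.DataFrame) -> np.ndarray:
    """Predict and snap out-of-range values to the nearest observed boundary."""
    lo, hi = observed_range
    return np.clip(model.predict(parent_values), lo, hi)
```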
  • Obtaining a value for a target node may comprise: determining that the target node and all parent nodes are associated with discrete variables and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with target nodes and parent nodes; or determining that at least one of the target node and parent nodes is associated with a continuous variable and performing one of: (i) discretising the target or parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or (ii) using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes.
  • the method may comprise discretising a target or parent node that is associated with a continuous variable using a respective discretisation scheme, wherein the discretisation scheme is a discretisation that is specific to the target and parent nodes and that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes.
  • the method may comprise determining that the target node is associated with a continuous variable and fitting a probability density function to the real clinical data for the continuous variable associated with the target node, wherein a separate probability density function is fitted for each combination of values of the discrete or discretised parent nodes.
  • the objective criterion may be the maximisation of the absolute value of a statistical metric of dependence between the variables associated with the target and parent nodes.
  • the statistical metric of dependence may be mutual information.
  • Obtaining a value for a target node may comprise: determining that at least one of the parent nodes is associated with a continuous variable, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; wherein discretising the target or parent nodes that are associated with a continuous variable comprises, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of a statistical metric of dependence between the target and parent nodes, amongst a plurality of values of said statistical metric associated with respective candidate plurality of non-overlapping ranges for the continuous variable.
  • the method may comprise: determining that at least one of the parent nodes is associated with a continuous variable; and when the number or proportion of parent nodes associated with a continuous variable is below a predetermined threshold, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or when the number or proportion of parent nodes associated with a continuous variable is at or above a predetermined threshold, using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes.
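  • The choice between a conditional probability table and a machine learning model can be sketched as a simple dispatch rule, as below; the threshold value is a hypothetical parameter, not one specified by the disclosure.

```python
def choose_sampler_type(target_is_continuous: bool, parent_types: list,
                        continuous_parent_threshold: int = 2) -> str:
    """Return "cpt" to use a multivariate conditional probability table (discretising any
    continuous parents first) or "ml" to use a machine learning model trained on the real data."""
    n_continuous = sum(t == "continuous" for t in parent_types)
    if n_continuous == 0 and not target_is_continuous:
        return "cpt"                                # all-discrete case
    if n_continuous < continuous_parent_threshold:
        return "cpt"                                # few continuous parents: discretise and tabulate
    return "ml"                                     # many continuous parents: use a trained model
```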
  • the step of obtaining a directed acyclic graph may comprise obtaining a network between the variables in the real clinical data using a causal discovery method or a constraint-based causal discovery method, optionally wherein the causal discovery method or constraint-based causal discovery method determines conditional mutual information between variables associated with the nodes in the network.
  • the step of obtaining a DAG may comprise: obtaining a network between the variables in the real clinical data using a constraint-based causal discovery method, wherein the network comprises directed and undirected edges; and obtaining a DAG from the network by: using a depth-first-search orientation algorithm; or selecting the direction of each undirected edge that minimises the number of directed cycles in the network, and removing all directed cycles in the resulting directed graph by iteratively identifying the longest cycle in the graph (the one with the most edges) and changing the direction of the edge that minimizes the number of remaining cycles in the graph.
  • the real clinical data used may comprise data for the plurality of clinical variables for a plurality of patients each representing an independent sample, wherein the number of samples is at least 5 times, at least 6 times, at least 7 times, at least 8 times, at least 9 times, or at least 10 times, preferably at least 10 times larger than the number of clinical variables in the plurality of clinical variables.
  • a method of providing a clinical predictor tool comprising: obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; wherein obtaining the training data comprises obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, obtained using a method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the values of the connected nodes.
  • the step of obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients may be performed using the methods of any embodiment of the first aspect.
  • the clinical predictor model may be a classification or regression model.
  • the training data may comprise real and synthetic clinical data.
  • the diagnosis or prognosis may be selected from a disease severity, a disease subtype, and survival.
  • the diagnosis or prognosis may be survival and the disease may be cancer.
  • a computer-implemented method of analysing clinical data comprising: obtaining synthetic clinical data using the method of any embodiment of the first aspect; and performing one or more of: providing a clinical predictor tool by obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables, wherein the training data comprises said synthetic data; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; combining said synthetic data with the real clinical data used to obtain the synthetic data; outputting said synthetic data to a third party or public data repository; combining said synthetic data with another clinical dataset for the purpose of further analysis, optionally including obtaining a clinical predictor tool;
  • a system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any embodiment of any of the first, second or third aspect, or any method described herein.
  • a non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any of the first, second or third aspect, or any method described herein.
  • Figure 1 is a flowchart illustrating a method for generating synthetic clinical data according to a general embodiment of the disclosure
  • Figure 2 is a flowchart illustrating a method for analysing clinical data according to an embodiment of the disclosure
  • Figure 3 illustrates schematically an exemplary system according to the disclosure
  • Figure 4 shows a MIIC network obtained for the preprocessed METABRIC dataset (see Examples - Benchmark dataset).
  • Figure 5 shows correlation matrices evaluated on 1000 samples of the METABRIC dataset, for each of a plurality of synthetic data generation methods; the correlation for each x,y combination is evaluated as the mean value over all executions. A. Original data, Bayesian tree search, CT-GAN. B. MIIC-SDG (method according to the disclosure), Bayesian hill climbing, TVAE. C. Synthpop, PrivBayes, Random.
  • Figure 6 shows quality metrics evaluated on synthetic data generated from the METABRIC dataset using a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (obtained by sampling from a uniform distribution over the range or levels of each feature) is used as a baseline. A. Mutual information distance between original and synthetic datasets.
  • Figure 7 shows privacy metrics evaluated on synthetic data generated from the METABRIC dataset using a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. A. Identifiability score. B. Membership inference score.
  • Figure 8 shows the results of a feature permutation importance process from a survival random forest model trained to predict overall survival, fitted on a set of 1977 patients from the METABRIC dataset or corresponding synthetic datasets obtained using the methods indicated: MIIC-SDG (method according to the disclosure), Bayesian hill climbing, PrivBayes.
  • C-index is indicated on the y-axis, and the different synthetic data generation methods are on the x-axis.
  • Figure 10 shows quality, privacy and quality-privacy scores (QPS) for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. Mutual Information distance is used as the quality measure and privacy is evaluated using identifiability and membership inference scores. Data from METABRIC.
  • Figure 11 shows quality, privacy and quality-privacy scores (QPS) for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random permutation of all data columns) is used as a baseline. Mutual Information distance is used as the quality measure and privacy is evaluated using identifiability and membership inference scores. Data from the IMVIGOR210 trial (Bladder cancer).
  • Figure 12 shows quality-privacy scores for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. Wasserstein distance, Mutual Information distance and Correlation distance as quality measures and privacy evaluated with identifiability and membership inference scores. Data from METABRIC.
  • Figure 13 shows execution times for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Data from METABRIC. The x-axis indicates the number of samples generated.
  • Figure 14 shows pseudocode for a method of generating a directed acyclic graph from a MIIC network.
  • Figure 15 illustrates schematically the MIIC-SDG pipeline used in the examples of the disclosure.
  • Clinical data refers to data associated with a subject that characterises the subject’s health status.
  • Clinical data for a subject includes values for each of a plurality of clinical variables.
  • Clinical data may include demographic data about the subject (e.g. the subject’s age, gender, ethnicity, etc.) and clinical history data.
  • omics assay data about the subject e.g. data obtained from assays that measure the presence and/or amounts of omic markers
  • data from medical images obtained from the subject e.g. MRI, endoscopy, x-ray, ultrasound, CT scans, etc.; such as e.g. tumour size determined by imaging, tumour laterality determined by imaging
  • histopathology analysis of one or more samples from the subject e.g. cellularity of a tumour, molecular subtype determined by histopathology, etc.
  • An omic assay may be of any scale, from targeted detection of individual omics marker to genome/transcriptome/proteome/epigenome/microbiome wide assays.
  • Omic data includes information derived from omics data, such as e.g. disease molecular subtypes, tumour mutational burden, etc.
  • Clinical data may be obtained from a sample, may be recorded in one or more databases, or may have been previously obtained from a sample and recorded in one or more databases from which it can be obtained for the purpose of performing the methods described herein.
  • a clinical variable is any variable of the type above.
  • the clinical data used may be referred to as “tabular data”, in that the clinical data for a subject contains a single value (which may be a summarised value across a plurality of instances for the same variable) for each of a plurality of clinical variables.
  • the clinical variables may be discrete or continuous. Methods of the present disclosure are advantageously able to handle clinical data comprising a mixture of discrete and continuous clinical variables.
  • Discrete or discretised variables are variables for which all observed values in a data set are selected from a discrete set of values.
  • the discrete set of values may consist of two values (binary variable) or more than two values.
  • a continuous variable is a variable for which observed values in a dataset are on a continuous scale.
  • the continuous scale may be within a specific range, but within this range any value is possible.
  • Continuous variables may be discretised by identifying non-overlapping ranges and assigning all values that fall within a respective non-overlapping range the same value.
  • the value may be a summarised value for the range, e.g. the middle of the range, or the mean or median of the observed values in the range.
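  • For illustration only, the generic discretisation described above (non-overlapping ranges, each value replaced by a summary of its range, here the bin midpoint) can be written as:

```python
import pandas as pd

def discretise_to_midpoints(values: pd.Series, n_bins: int = 5) -> pd.Series:
    """Cut a continuous variable into non-overlapping equal-width ranges and map each
    observation to the midpoint of its range (missing values stay missing)."""
    binned = pd.cut(values, bins=n_bins)
    return binned.map(lambda interval: interval.mid if pd.notna(interval) else pd.NA)
```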
  • Methods of the present disclosure make use of a directed acyclic graph (DAG) that captures relationships between clinical variables, in order to generate synthetic data comprising values for each of the clinical variables represented in the DAG.
  • nodes are each associated with a respective clinical variable.
  • a directed acyclic graph is a graph where each edge is oriented, i.e. each edge connects a source (also referred to as “parent”) node to a target (also referred to as “child”) node, and there are no directed cycles in the graph (i.e. there is no path using the directed edges that starts and finishes on the same node).
  • Edges in a DAG can be associated with weights. Weights can represent strength and/or confidence of relationships. These can be used for example for filtering a DAG, for example by applying a threshold on edge weights (e.g. maintaining only edges with a weight above a predetermined threshold).
  • Edge weights may be ignored when generating data from a particular DAG (which may be a filtered DAG).
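  • A trivial sketch of such weight-based filtering, assuming edge weights are stored as a "weight" attribute on a networkx graph, is shown below.

```python
import networkx as nx

def filter_by_weight(dag: nx.DiGraph, threshold: float) -> nx.DiGraph:
    """Keep only edges whose weight (e.g. strength or confidence) meets the threshold."""
    filtered = nx.DiGraph()
    filtered.add_nodes_from(dag.nodes(data=True))   # keep all nodes; some may become isolated
    for u, v, data in dag.edges(data=True):
        if data.get("weight", 0.0) >= threshold:
            filtered.add_edge(u, v, **data)
    return filtered
```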
  • a DAG is used according to methods of the present disclosure to generate synthetic data using one or both of: multivariate conditional probability tables and machine learning models.
  • a DAG indicates, for each node, the identity of all of the node’s one or more parents.
  • the DAG can be used to identify a plurality of sets of nodes, each set comprising a target node and all of its parents.
  • a multivariate conditional probability table can be obtained that comprises the probability of the target node variable taking each of the values in the discrete set of values associated with the variable, depending on the values of the parent nodes (i.e. depending on which of the discrete set of values associated with the parent node variables are observed).
  • a machine learning model can be obtained that takes as input values for the parent node variables and produces as output a value for the target node.
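  • The sketch below shows one way to estimate such a multivariate conditional probability table from the real data for a discrete target with discrete (or discretised) parents, and to sample a target value given parent values; it assumes every parent combination queried was observed in the real data.

```python
import numpy as np
import pandas as pd

def estimate_cpt(real_data: pd.DataFrame, target: str, parents: list) -> pd.Series:
    """P(target | parents), indexed by (parent values..., target value)."""
    counts = real_data.groupby(parents + [target]).size()
    totals = counts.groupby(level=list(range(len(parents)))).transform("sum")
    return counts / totals

def sample_target(cpt: pd.Series, parent_values: tuple, rng: np.random.Generator):
    """Draw one target value from the conditional distribution for the given parent values."""
    key = parent_values if len(parent_values) > 1 else parent_values[0]
    conditional = cpt.loc[key]                      # distribution over target values
    return rng.choice(conditional.index.to_numpy(), p=conditional.to_numpy())
```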
  • a DAG further comprises one or more source nodes, and optionally one or more isolated nodes. These may be associated with distributions estimated from data for the variable associated with the nodes.
  • the machine learning models are trained by supervised learning, using as training data at least a portion of the original clinical data from which synthetic clinical data is being generated.
  • There are two major types of supervised learning models: classification models and regression models.
  • Classification models aim to classify observations between a plurality of categories. They are typically suitable when the output to be predicted is a discrete category in the training data.
  • Regression models aim to predict the value of a continuous variable and are therefore suitable when the output to be predicted is a continuous value in the training data. For example, regression models may be used to predict continuous variables such as tumour size, tumour mutational burden, etc.
  • a classification model may provide as output a classification label (i.e.
  • An optimality criterion may be the minimisation of a loss function that quantifies the model prediction error based on the observed (ground truth) and predicted values of the predicted variables.
  • Suitable loss functions for use in training machine learning models are known in the art and include the mean squared error (MSE), and the mean absolute error (MAE). Any of these can be used according to the present disclosure.
  • Regularised loss functions are functions that include a loss function as described above (e.g. MSE or MAE), and one or more terms penalizing model complexity in order to reduce the risk of overfitting. Overfitting is a phenomenon that occurs where a machine learning model is trained to very closely reproduce the features of a training data set, resulting in poorer performance on other datasets that do not have the same characteristics (i.e.
  • L1 regularisation also known as “Lasso” in the context of regression
  • L2 regularisation also known as “Ridge” in the context of regression
  • L1 regularisation can be used as a feature selection method as it minimizes the coefficients associated with less informative predictive features.
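  • Written out (as a generic formulation, not a formula taken from the disclosure), for a model with parameters w, predictions ŷ_i(w) and regularisation strength λ:

```latex
\mathrm{MSE}(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i(w)\bigr)^2, \qquad
\mathrm{MAE}(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl|y_i - \hat{y}_i(w)\bigr|,

\mathcal{L}_{\mathrm{L1}}(w) = \mathrm{MSE}(w) + \lambda\sum_j |w_j| \;\;\text{(Lasso)}, \qquad
\mathcal{L}_{\mathrm{L2}}(w) = \mathrm{MSE}(w) + \lambda\sum_j w_j^2 \;\;\text{(Ridge)}.
```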
  • a machine learning model as described herein may be any regression or classification model known in the art.
  • a machine learning model may be selected from: decision trees and variants thereof including regularised and/or gradient boosted decision trees and random forest models, regularised discriminant analysis, logistic regression models, artificial neural networks (ANNs) including multilayer perceptrons (with linear or non-linear activation functions), naïve Bayes classifiers, support vector machines (SVM, using linear or non-linear kernels such as radial basis function) and multivariate adaptive regression splines (MARS).
  • non-linear models may be particularly useful as they are able to capture non-linear relationships between parents and target variables, which are likely to occur in clinical data and which can also be captured in the networks from which parent-target relationships are obtained through the use of mutual information-based network inference.
  • Non-linear models include, for example, decision trees and variants thereof (including in particular random forests and gradient boosted trees), SVM with a non-linear kernel, and ANNs (e.g. multilayer perceptrons) with non-linear activation functions.
  • a machine learning model used to generate synthetic data is a regularized model.
  • the machine learning model is a tree-based model.
  • the machine learning model is a regularized gradient boosted decision tree model.
  • Examples of such models are available in the XGBoost software library (xgboost.ai/). Such models may be referred to as “XGBoost” models, although any other implementation of regularized gradient boosted models may equally be used.
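  • As a hedged example (hyperparameter values are arbitrary and not taken from the disclosure), a regularised gradient boosted tree regressor can be instantiated with the XGBoost Python package as follows; an equivalent classifier (XGBClassifier) would be used for discrete targets.

```python
from xgboost import XGBRegressor

# reg_alpha (L1) and reg_lambda (L2) penalise the boosted trees' leaf weights;
# the values below are illustrative only.
model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    reg_alpha=0.1,    # L1 regularisation
    reg_lambda=1.0,   # L2 regularisation
)
# model.fit(X_parents, y_target) on the real clinical data,
# then model.predict(...) on generated parent values.
```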
  • a machine learning model comprises an ensemble of models whose predictions are combined.
  • a machine learning model may comprise a single model. Random forest models and gradient boosted tree models (such as XGBoost) are ensemble models. Ensemble versions of any models can be constructed. Ensemble models are expected to result in better prediction performance than single models and are therefore preferred in the context of the methods described herein.
  • the machine learning model may be a random forest classifier or regressor.
  • a random forest classifier is a model that comprises an ensemble of decision trees and outputs a class that is the average prediction of the individual trees.
  • a random forest regressor is a model that comprises an ensemble of decision trees and outputs a continuous value that is the average prediction of the individual trees.
  • Each decision tree may be trained on a sampled subset of the training data.
  • Decision trees perform recursive partitioning of a feature space until each leaf (final partition sets) is associated with a single value (for classification) or range of values (for regression) of the target.
  • Gradient boosting is a machine learning method that forms an ensemble of weak prediction models (e.g. decision trees) from which a combined strong prediction is obtained.
  • the algorithm iteratively adds new weak predictors to improve the prediction obtained by combining the outputs of the weak predictors.
  • random forest iteratively trains a set number of trees using random subsets of the training data.
  • the present disclosure also provides methods of obtaining a clinical predictor tool using a dataset comprising synthetic data generated using the methods described herein.
  • a clinical predictor tool is typically a machine learning model.
  • a clinical predictor tool may be a machine learning model configured to predict a diagnosis or prognosis for a subject, based on the values of one or more clinical variables associated with the subject.
  • the machine learning model may have been trained to provide as output a diagnosis or prognosis based on input comprising the values of the one or more clinical variables, using training data comprising, for each of a plurality of training subjects, a diagnosis or prognosis and values of the one or more clinical variables, the training data comprising synthetic data generated using the methods described herein.
  • the machine learning model has been trained using data from “synthetic subjects”, i.e. using data that has been simulated and does not correspond to any real patient.
  • a diagnosis may be a diagnosis of a subtype (including molecular subtypes, histopathological subtypes, phenotypic subtype, therapy response groups, severity groups, or any other distinction of groups of patients or disease, etc.) of a disease that the patient has been diagnosed as having.
  • the patient may have been diagnosed as having bladder cancer and the diagnosis may be the classification of the patient between a group associated with response to a particular therapy, and a group associated with a lack of response to the particular therapy.
  • a prognosis may be survival, such as e.g. overall survival (OS), disease free survival (DFS), relapse-free survival (RFS) or any other survival metric or category derived therefrom.
  • a clinical predictor tool may classify a patient between a first class associated with “good” survival and a second class associated with “poor” survival, or any number of classes associated with different levels of survival (e.g. good prognosis, intermediate prognosis or poor prognosis).
  • a clinical predictor tool may be a regression model and may predict a continuous variable such as the probability of good or poor survival.
  • Any machine learning model known in the art may be used as a clinical predictor tool. In particular, any of the machine learning models described above in the context of the use of machine learning for synthetic data generation may be used.
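  • As an illustrative sketch of such a clinical predictor tool (the model choice, column names and split are assumptions, not requirements of the disclosure), a classifier could be trained on a mixture of real and synthetic records as follows.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_clinical_predictor(real: pd.DataFrame, synthetic: pd.DataFrame, outcome: str = "prognosis"):
    """Train a classifier predicting a diagnosis/prognosis variable from the other clinical
    variables, using real and synthetic patients together (features assumed numeric/encoded)."""
    data = pd.concat([real, synthetic], ignore_index=True)
    X, y = data.drop(columns=[outcome]), data[outcome]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    return model
```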
  • a DAG as described herein may have been obtained from a graph that comprises directed and/or undirected edges, including for example graphs that include a mixture of directed and undirected edges.
  • a graph may have been obtained using any graphical model (i.e. network) reconstruction method.
  • causal discovery methods may be used, including in particular any causal discovery constraint-based method (also referred to as constraint-based causal discovery method).
  • causal networks can also be referred to as causal Bayesian networks, and any method to identify such networks may be used.
  • a causal discovery method is a method for identifying relationships between variables in a dataset, which by default are likely causal relationships. Any method known in the art that can infer a DAG for a dataset may be used (i.e. any causal discovery method).
  • the Peter and Clark (PC) algorithm is an implementation of the Inductive Causation (IC) algorithm proposed by Verma and Pearl (1990).
  • the PC algorithm (described in Spirtes, Glymour, and Scheines, 2000) is a well-known method of identifying causal networks, that can be used as an alternative to the method used in the examples below.
  • the present inventors have found it advantageous to use an information theoretic constraint-based method that starts from a fully connected graph and iteratively removes edges between variables X and Y for which I(X;Y|{Ai}) = 0, i.e. for which X is independent of Y given a set of variables {Ai}.
  • causal discovery methods aim to identify all relationships, preferably causal relationships, that are supported by a dataset. In other words, these methods perform a statistical estimation of parameters describing a graphical causal structure.
  • Causal discovery methods typically assume that all edges are directed.
  • Constraint based methods identify structural constraints corresponding to all dispensable edges in a graph, and can identify both directed and undirected edges. Networks with both directed and undirected edges are more likely to capture real relationships in clinical datasets because unobserved (latent) variables frequently exist that impact the causal relationships between variables, leading to spurious causal associations.
  • a graph used for generating a DAG for synthetic data generation as described herein may have been obtained using an information-theoretic constraint-based causal discovery method.
  • Constraint-based approaches start from a fully connected network and iteratively remove edges between variables X and Y for which a conditional independence can be found.
  • An information theoretic constraint-based method iteratively removes edges between variables X and Y for which I(X;Y|{Ai}) = 0, i.e. for which there exists a set of variables {Ai} such that X is independent of Y given {Ai}.
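  • A highly simplified sketch of this pruning strategy is given below, using a plain conditional mutual information estimate for discrete variables, conditioning sets of size at most one, and a fixed threshold in place of the statistical criterion actually used.

```python
import itertools
import networkx as nx
import pandas as pd
from sklearn.metrics import mutual_info_score

def conditional_mi(df: pd.DataFrame, x: str, y: str, cond: list) -> float:
    """I(X;Y|{Ai}) for discrete variables: stratum-weighted mutual information."""
    if not cond:
        return mutual_info_score(df[x], df[y])
    total, n = 0.0, len(df)
    for _, group in df.groupby(cond):
        total += (len(group) / n) * mutual_info_score(group[x], group[y])
    return total

def prune_skeleton(df: pd.DataFrame, threshold: float = 1e-2, max_cond: int = 1) -> nx.Graph:
    """Start fully connected; drop edge X-Y if some small conditioning set renders X and Y
    (approximately) conditionally independent."""
    g = nx.complete_graph(df.columns)
    for x, y in list(g.edges()):
        others = [c for c in df.columns if c not in (x, y)]
        for size in range(max_cond + 1):
            if any(conditional_mi(df, x, y, list(cond)) < threshold
                   for cond in itertools.combinations(others, size)):
                g.remove_edge(x, y)
                break
    return g
```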
  • a “sample” as used herein may be a cell or tissue sample (e.g. a biopsy), or an extract from which biological material can be obtained for analysis, to obtain values of one or more clinical variables.
  • the sample may be a tumour sample or a blood sample.
  • the sample may be a tissue sample, such as a tumour sample.
  • a sample may be a tumour sample or a biological fluid sample, for example comprising circulating tumour DNA or tumour cells.
  • the sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extractions steps).
  • the sample may be a cell or tissue culture sample that has been derived from a tumour.
  • a sample as described herein may refer to any type of sample comprising biological material from which values of clinical variables may be determined.
  • The sample may be transported and/or stored, and collection may take place at a location remote from the biological data acquisition (e.g. sequencing) location, and/or the computer-implemented method steps may take place at a location remote from the sample collection location and/or remote from the clinical data acquisition (e.g. sequencing) location (e.g. the computer-implemented method steps may be performed by means of a networked computer, such as by means of a “cloud” provider).
  • a “tumour sample” refers to a sample that contains tumour cells or genetic material derived therefrom.
  • the tumour sample may be a cell or tissue sample (e.g. a biopsy) obtained directly from a tumour.
  • a subject or individual according to the present disclosure is preferably a mammal (including a human or a model animal such as a mouse, rat, etc.), preferably a human.
  • the terms “patient”, “subject” and “individual” are used interchangeably.
  • the patient may be a patient who has been diagnosed as having or being likely to have a disease.
  • the disease may be cancer.
  • the cancer may be breast cancer or bladder cancer.
  • the methods described herein have been specifically demonstrated on clinical data from breast cancer and bladder cancer patients. However, the approach is applicable to any clinical context, including but not limited to any cancer type.
  • Figure 1 is a flowchart illustrating a method of providing synthetic clinical data according to a general embodiment of the disclosure
  • At step 110, real clinical data comprising values for a plurality of clinical variables for a plurality of patients is obtained.
  • This step is optional because the methods of the disclosure can also start from a previously obtained reconstructed network, learned probability density functions, multivariate probability tables and machine learning models obtained from said data.
  • Obtaining data typically comprises receiving data from one or more computing devices, databases or memories, i.e. receiving previously collected data.
  • the real clinical data used may comprise data for the plurality of clinical variables for a plurality of patients each representing an independent sample, wherein the number of samples is at least 5 times, at least 6 times, at least 7 times, at least 8 times, at least 9 times, or at least 10 times, preferably at least 10 times larger than the number of clinical variables in the plurality of clinical variables (optionally after filtering to remove samples with a proportion of missing data above a predetermined threshold).
  • synthetic data may be generated for 20 variables (i.e. using a DAG comprising nodes associated with 20 clinical variables) using a real clinical dataset comprising at least 2000 samples (patients).
  • the clinical data may comprise one or more of preclinical data, clinical data (e.g. data from clinical trials) and real world data (e.g.
  • a network is obtained from the data obtained at step 110.
  • Step 112 may comprise obtaining a network between the variables in the real clinical data using an information theoretic constraint-based method that iteratively removes edges between variables X and Y for which there exists a set of variables {Ai} such that X is independent of Y given {Ai}.
  • Removing said edges may be performed by estimating the probability that the edge should be removed as the probability that I(X;Y|{Ai}) = 0, and removing the edge when this probability is above a predetermined threshold, where said probability is provided by an equation in which N is the number of independent samples in the real clinical data set from which the network is being learned.
  • Obtaining a network may be performed using the MIIC algorithm. The MIIC algorithm is described in Verny et al. 2017 and available at rdrr.io/cran/miic/man/miic.html.
  • the causal discovery method or constraint-based causal discovery method may determine conditional mutual information between variables associated with the nodes in the network, wherein the mutual information between variables associated with the nodes in the network is estimated for the purpose of identifying relationships between nodes using a discretisation scheme that is specific to the set of nodes for which conditional independence is being evaluated.
  • the discretisation scheme may comprise, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of the mutual information between the nodes for which conditional independence is being evaluated, amongst a plurality of values of said mutual information associated with respective candidate plurality of non-overlapping ranges for the continuous variable.
  • the discretisation scheme may be as described in Verny et al. 2017.
  • the discretisation scheme may be a discretisation that is specific to the target and parent nodes and that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes.
  • the objective criterion is expressed using equation (21) below, optionally wherein equations (19) and (20) also apply.
  • a directed acyclic graph is obtained from the network obtained at step 112.
  • the DAG comprises a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the value of the connected nodes.
  • the edges correspond to conditional dependence relationships inferred from the real clinical data at step 112.
  • the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes.
  • Obtaining a DAG may comprise receiving a network between the variables in the real clinical data using a constraint-based causal discovery method (obtained at step 112), wherein the network comprises directed and undirected edges, and obtaining a DAG from the network by using one of: (i) using a depth-first-search orientation algorithm; (ii) using a partially directed acyclic graph to DAG algorithm that transforms a graph that contains both directed edges and undirected edges, with no directed cycle in its directed subgraph into a fully directed acyclic graph on the same underlying set of edges, with the same orientation on the directed subgraph and the same set of v-structures by iteratively selecting a sink node x and where all nodes y connected to x by undirected edges are adjacent
  • synthetic clinical data is obtained for a (synthetic, i.e. simulated) patient by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node (step 116B), and iteratively obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data (step 116C).
  • the plurality of variables comprises at least one continuous variable and at least one discrete variable, and obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data obtained at step 110.
  • Step 116 may further comprise step 116A of imputing missing data for one or more continuous variables. This step may use any data imputation algorithm known in the art. Multivariate imputation methods are preferred, such as e.g. the MICE algorithm.
  • the method may further comprise, prior to the generation of synthetic clinical data, filtering the real clinical data to remove data for patients for which the proportion of missing data is above a predetermined threshold, such as e.g. 20%, 30%, 40% or 50%.
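  • A short sketch of this preprocessing, using scikit-learn's IterativeImputer as a stand-in for a MICE-style multivariate imputation and an assumed 30% missingness threshold, could look like this.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

def preprocess(real_data: pd.DataFrame, max_missing: float = 0.3) -> pd.DataFrame:
    """Drop patients with too much missing data, then impute remaining gaps in continuous columns."""
    kept = real_data[real_data.isna().mean(axis=1) <= max_missing].copy()
    continuous = kept.select_dtypes(include="number").columns
    imputer = IterativeImputer(random_state=0)      # chained-equations (MICE-like) imputation
    kept[continuous] = imputer.fit_transform(kept[continuous])
    return kept
```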
  • Step 116B may comprise, for each discrete source / isolated node, obtaining a probability table from the real clinical data obtained at step 110 for the node, and sampling from said probability table (i.e. generating data following the distribution in the probability table).
  • Step 116B may comprise, for each continuous source / isolated node, fitting a probability density function (i.e. fitting a distribution to the data, for example by determining kernel density estimates from the data for the node) to the real clinical data obtained at step 110 for the node, and sampling from said distribution.
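  • The two preceding items can be sketched as follows, with an empirical probability table for discrete source/isolated nodes and a Gaussian kernel density estimate (scipy's gaussian_kde, an illustrative choice) for continuous ones.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def make_discrete_sampler(values: pd.Series, rng: np.random.Generator):
    """Sampler following the empirical probability table of a discrete source/isolated node."""
    probs = values.value_counts(normalize=True, dropna=True)
    return lambda: rng.choice(probs.index.to_numpy(), p=probs.to_numpy())

def make_continuous_sampler(values: pd.Series):
    """Sampler drawing from a kernel density estimate fitted to a continuous source/isolated node."""
    kde = gaussian_kde(values.dropna().to_numpy())
    return lambda: float(kde.resample(1)[0, 0])

# Usage (hypothetical column name): rng = np.random.default_rng(0); sample = make_discrete_sampler(real_data["stage"], rng)()
```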
  • At step 116C, data is obtained iteratively using data obtained at the previous step for parent nodes of the target node for which data is being obtained at the current iteration. For example, a value is sampled for two source nodes, then a value is obtained for a target node that is the target of these source nodes using the respective sampled values for the source nodes and either a machine learning model or a multivariate conditional probability table.
  • Step 116C may distinguish multiple situations depending on the nature of the target and parent nodes: (a) discrete parents and target nodes; (b) discrete target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is below a first predetermined threshold; (c) discrete target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is at or above the first predetermined threshold; (d) continuous target nodes, parent nodes comprise discrete nodes or both discrete and continuous nodes and the number of continuous parents is below a second predetermined threshold; and (e) continuous target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is at or above the second predetermined threshold.
  • a multivariate conditional probability table may be used in cases (a), (b), (d).
  • a machine learning model may be used in cases (c) and (e).
  • the method may comprise determining that the target node and all parent nodes are associated with discrete variables (case (a)) and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with target nodes and parent nodes.
  • the method may comprise estimating a multivariate conditional probability table from the real clinical data for variables associated with target nodes and parent nodes that are all discrete variables.
  • the method may comprise determining that at least one of the target node and parent nodes is associated with a continuous variable (cases (b) to (e)) and discretising the target or parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme (e.g. in cases (b) and (d)).
  • the method may comprise determining that at least one of the target node and parent nodes is associated with a continuous variable (cases (b) to (e)) and using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes (e.g.
  • a machine learning model or a multivariate conditional probability table may be used in cases where at least one of the parent and target nodes is continuous, depending on whether the number of parent nodes that are associated with a continuous variable satisfies a predetermined criterion.
  • the predetermined criterion may be the number of parent nodes that are associated with a continuous variable being at or above a predetermined threshold (in which case a machine learning model is used), or below said predetermined threshold (in which case a multivariate conditional probability table is used).
  • the machine learning models used to obtain a value for a target node at step 116C (note that a separate machine learning model is used for each target node) have been trained using training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes.
  • the machine learning model used to obtain a value for a target node is configured to predict a value for the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node.
  • Obtaining a value for a target node at step 116C (e.g.
  • the machine learning model may be a regression or classification model.
  • the machine learning model may be a non-linear classification or regression model.
  • the machine learning model may be a tree-based classification or regression model, optionally a random forest model.
  • the method may comprise obtaining an adjusted value for a predicted value obtained using the machine learning model, wherein the predicted value is outside of a predetermined range and the adjusted value is the value of the nearest boundary of the predetermined range.
  • the predetermined range may be a range that is determined by a user for the particular variable that is predicted, or a range that is automatically determined between the smallest and largest value for the variable in the real clinical data. This may be particularly useful in the context of machine learning models that are regression models as these may provide outputs that are outside of the range of values present in the real clinical data on which they were trained.
  • obtaining a value for a target node using a machine learning model trained on the real clinical data may comprise predicting a value using said machine learning algorithm and adjusting the predicted value to fall within the observed range for the variables in the real clinical data, optionally by setting the predicted value to the nearest boundary of the range when the predicted value is outside of the observed range.
  • the method may comprise determining that at least one parent node is associated with a continuous variable (cases (b) to (e)), and when the number or proportion of parent nodes associated with a continuous variable is below a predetermined threshold, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme (e.g. cases (b) and (d)).
  • the method may comprise determining that at least one parent node is associated with a continuous variable (cases (b) to (e)), and when the number or proportion of parent nodes associated with a continuous variable is at or above a predetermined threshold, using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes (e.g. cases (c) and (e)).
  • the predetermined threshold may be 3 parent nodes associated with a continuous variable when the target node is discrete (case (b), first predetermined threshold) and 2 parent nodes associated with a continuous variable when the target node is continuous (case (d), second predetermined threshold).
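  • A minimal sketch of this routing decision is shown below; the function and its default thresholds simply restate the cases above and are otherwise illustrative.
```python
def choose_generation_strategy(target_is_discrete, n_continuous_parents,
                               discrete_target_threshold=3,
                               continuous_target_threshold=2):
    """Decide how to generate values for a target node: "cpt" means the
    (discretised) parents are handled with a multivariate conditional
    probability table, "ml_model" means a machine learning model is used."""
    if n_continuous_parents == 0:
        return "cpt"  # all parents discrete: classical Bayesian-network sampling
    threshold = (discrete_target_threshold if target_is_discrete
                 else continuous_target_threshold)
    if n_continuous_parents < threshold:
        return "cpt"  # few continuous parents: discretise them, then use a CPT
    return "ml_model"  # many continuous parents: classifier/regressor
```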
  • a parent node that is associated with a continuous variable is discretised (e.g.
  • this can use a respective discretisation scheme, i.e. a discretisation that is specific to the target and parent nodes.
  • the discretisation scheme may be one that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes.
  • the present inventors have found that such approaches materially improve the quality of the data generated. This is because the choice of discretisation scheme can substantially alter the apparent relationship (e.g. correlation) between variables, leading to multivariate conditional probability tables that do not accurately capture the relationships between the variables.
  • obtaining a value for a target node may comprise: determining that at least one of parent node is associated with a continuous variable (and optionally that the number of parent nodes associated with a continuous variable satisfy a predetermined criterion, e.g.
  • a statistical metric of dependence may be a correlation or mutual information.
  • Mutual information is advantageous as it is able to capture non-linear relationships between variables.
  • an optimisation step does not guarantee that a global optimal solution is identified, and merely implements a strategy to explore the space of possible solutions (each solution corresponding to a candidate plurality of non-overlapping ranges for the continuous variable to be discretised) and select the best solution amongst the solutions explored, according to the objective criterion used.
  • the method may comprise fitting a probability density function to the real clinical data for the continuous variable associated with the target node, wherein a separate probability density function is fitted for each combination of values of the discrete or discretised parent nodes.
  • the method may comprise obtaining a multivariate conditional probability table where the probability for the target node for each combination of discrete values of the parent nodes is defined by a fitted distribution (probability density function).
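  • A sketch of this per-combination density fitting and sampling is shown below, using scipy's gaussian_kde as a stand-in for the kernel density estimation described herein; the function names and the small-group fallback are assumptions.
```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def fit_conditional_densities(df, target, parents):
    """Fit one kernel density estimate of the continuous target per combination
    of (discrete or discretised) parent values."""
    densities = {}
    for combo, group in df.dropna(subset=[target]).groupby(parents):
        values = group[target].to_numpy(dtype=float)
        if len(values) > 1 and np.ptp(values) > 0:
            densities[combo] = gaussian_kde(values)
        else:
            densities[combo] = values  # too few points: resample observed values
    return densities

def sample_conditional(densities, combo, rng=None):
    rng = rng or np.random.default_rng()
    fitted = densities[combo]
    if isinstance(fitted, gaussian_kde):
        return float(fitted.resample(1)[0, 0])
    return float(rng.choice(fitted))
```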
  • the method may comprise, for each variable that is a discrete variable, including missing data (i.e. NA values) as a separate category of the variable.
  • the method may comprise generating missing data according to a conditional probability determined from the real clinical data.
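  • For illustration, a short sketch of sampling a discrete variable while treating missing values as their own category (so that NAs appear with frequencies matching the real data) is given below; the function name is illustrative.
```python
import numpy as np
import pandas as pd

def sample_discrete_with_na(series, n, rng=None):
    """Sample a discrete clinical variable treating missing values as an extra
    category, so NAs appear in the synthetic data with matching frequency."""
    rng = rng or np.random.default_rng()
    freq = series.value_counts(dropna=False, normalize=True)
    categories = np.array(freq.index.to_list(), dtype=object)  # includes NaN
    return pd.Series(rng.choice(categories, size=n, p=freq.to_numpy()))
```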
  • the results of any one or more of the preceding steps are provided to a user (e.g. through a user interface), or to another computing device, memory or database.
  • the methods described herein find application in a variety of contexts. For example, the methods described herein can be used to generate training data for training one or more machine learning models to predict clinically relevant characteristics (such as e.g. prognosis such as survival, response to a treatment, diagnosis including disease subtyping, disease severity, etc.).
  • the synthetic training data may be used alone or in combination with additional “real” data.
  • the methods described herein can be used to generate augmented clinical data, comprising real clinical data and additional synthetic clinical data that is generated from the real clinical data using a method as described herein.
  • This synthetic data may comprise whole samples (i.e. complete data for a whole synthetic patient) or predicted values for specific variables, for example in order to fill in missing values.
  • the augmented data set can be used for any clinical tool development (including e.g. machine learning predictors) or any clinical data analysis known in the art.
  • the method may enable new methods to be applied on the data that may not have been applicable to data comprising missing values.
  • FIG. 2 is a flowchart illustrating a method for analysing clinical data according to an embodiment of the disclosure.
  • the method comprises obtaining, at optional step 210, real clinical data comprising values for a plurality of clinical variables for a plurality of patients. This step is optional because the subsequent steps may use exclusively synthetic data or a combination of real and synthetic clinical data.
  • the method further comprises at step 212, obtaining synthetic clinical data comprising values for the plurality of clinical variables for one or more (simulated) patients using the method of any embodiment explained by reference to Figure 1.
  • the method may comprise performing one or more of: (i) providing a clinical predictor tool by obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables, wherein the training data comprises said synthetic data; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis (steps 210, 212, 214, 216); (ii) combining said synthetic data with the real clinical data used to obtain the synthetic data (steps 210, 212, 214, 220); (iii) outputting said synthetic data to a third party or public data repository (step 218); (iv) combining said synthetic data with another clinical dataset for the purpose of further analysis (steps 220,
  • methods of data augmentation comprising generating synthetic data using real clinical data and combining the real and synthetic data. This can result in smoother data, which can improve the performance of machine learning algorithms applied to said data, for example for the purpose of obtaining a clinical predictor tool.
  • method of data imputation comprising generating synthetic data using real clinical data and using the synthetic data to fill in missing values in the real clinical data.
  • the data obtained at step 212 and the data obtained at step 210 may optionally be combined at step 214 to form an “augmented” dataset comprising more data than would have been available if relying solely on the real clinical data obtained at step 210.
  • the plurality of variables comprise a variable indicative of a diagnosis or prognosis and one or more further clinical variables.
  • the data obtained at step 214 or at step 212 is used to train a clinical predictor model at step 216 to predict the variable indicative of a diagnosis or prognosis, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis.
  • the clinical predictor model may be a classification or regression model.
  • the training data may comprise real and synthetic clinical data.
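  • A minimal sketch of training such a clinical predictor on real data augmented with synthetic data is shown below; the column name, one-hot encoding and choice of classifier are illustrative assumptions, not the claimed implementation.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_clinical_predictor(real_df, synthetic_df, outcome_col):
    """Train a classifier on real data augmented with synthetic data."""
    augmented = pd.concat([real_df, synthetic_df], ignore_index=True)
    X = pd.get_dummies(augmented.drop(columns=[outcome_col]))  # simple encoding
    y = augmented[outcome_col]
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X, y)
    return model
```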
  • the diagnosis or prognosis may be selected from a disease severity, a disease subtype, and survival.
  • the diagnosis or prognosis may be survival and the disease may be cancer.
  • the results of any one or more of the preceding steps are provided to a user (e.g. through a user interface), or to another computing device, memory or database.
  • the methods described herein are computer-implemented unless context specifies otherwise (such as e.g. where measurement steps and/or wet steps are involved). Thus, the methods described herein are typically performed using a computer system or computer device. Any reference to an action such as “obtaining”, “processing”, “determining” may therefore refer to a processor performing the action, or a processor executing instructions that cause the processor to perform the action.
  • the methods of the present invention, comprising at least the training of machine learning algorithms to generate synthetic data, the training of machine learning algorithms to provide a clinical prediction using synthetically generated data, and the identification of mutual information based networks between multiple variables using data from hundreds of patients (involving multiple optimisation steps for each learned interaction, as described in the Examples below), are such that they cannot be performed in the human mind.
  • the terms “computer system” or “computer device” include the hardware, software and data storage devices for embodying a system or carrying out a computer implemented method.
  • a computer system may comprise one or more processing units such as a central processing unit (CPU) and/or a graphical processing unit (GPU), input means, output means and data storage, which may be embodied as one or more connected computing devices.
  • the computer system has a display or comprises a computing device that has a display to provide a visual output display (for example in the design of the business process).
  • the data storage may comprise RAM, disk drives or other computer readable media.
  • the computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network.
  • a computer system may be implemented as a cloud computer.
  • computer readable media includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system.
  • the media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.
  • Figure 3 shows an embodiment of a system for providing synthetic clinical data, for providing a clinical predictor tool, and/or for comparing or combining data according to the present disclosure.
  • the system comprises a computing device 1, which comprises a processor 101 and computer readable memory 102.
  • the computing device 1 also comprises a user interface 103, which is illustrated as a screen but may include any other means of conveying information to a user such as e.g.
  • the computing device 1 is communicably connected, such as e.g. through a network, to one or more databases 2 storing clinical data about a plurality of patients.
  • the one or more databases 2 may further store one or more of: one or more machine learning algorithms, training data, parameters (such as e.g. parameters of machine learning model, feature selection algorithm, data preprocessing methods, etc.), etc.
  • the computing device may be a smartphone, tablet, personal computer or other computing device.
  • the computing device is configured to implement a method as described herein.
  • the computing device 1 is configured to communicate with a remote computing device (not shown), which is itself configured to implement a method of as described herein.
  • the remote computing device may also be configured to send the result of the method to the computing device.
  • the various steps of the methods described herein may be split between the computing device 1 and the remote computing device.
  • the remote computing device may be a cloud computing device, a server node, etc. Any processing device known in the art may be used for this purpose.
  • Communication between the computing device 1 and the remote computing device may be through a wired or wireless connection, and may occur over a local or public network 3 such as e.g. over the public internet.
  • SDC: Statistical Disclosure Control
  • Identification disclosure: when an attacker is able to link some data to a specific individual
  • Attribute disclosure: when the attacker is able to learn new information on the subject, by using prior knowledge and the information contained in the data.
  • Classical anonymization techniques like k-anonymity (Samarati & Sweeney, 1998) protect user data by minimizing the risk of re-identification, while keeping in theory a good level of data utility.
  • K-anonymity is obtained through data suppression and data generalization, so that each person in the collection cannot be distinguished from at least k-1 individuals by using quasi-identifiers features (attributes available to an adversary).
  • Machanavajjhala et al. (2007) showed k-anonymity to be vulnerable to some attacks when using background knowledge and proposed a new privacy criterion, named l-diversity.
  • Li et al. (2007) published a novel privacy criterion named t-Closeness, with even stronger properties for privacy preservation.
  • these classical tools have been shown to deteriorate the data distribution, making the data no longer exploitable in many situations (Hernandez et al. 2022).
  • SDG: synthetic tabular data generation
  • GANs: Generative Adversarial Networks
  • CARTs: Classification and Regression Trees
  • VAE: Variational Autoencoders
  • MIIC-SDG: synthetic data generation method based on the MIIC algorithm (the method described herein)
  • MIIC-SDG builds on the MIIC algorithm by adding a new algorithm to transform a graph into a DAG by exploiting the information given by MIIC and a synthesizer (MIIC synthesizer) capable of generating samples that mimic the original data, while taking into account the multivariate distribution associated with the data.
  • MIIC-SDG is composed of three steps, illustrated on Figure 15:
  • 1. MIIC network reconstruction: inferring a graphical model associated with the original dataset using the MIIC algorithm (Figure 15A).
  • 2. MIIC DAG generation: creating a directed acyclic graph (DAG) using the previously inferred network (Figure 15B).
  • 3. MIIC synthesizer: generating synthetic samples based on the DAG and the original data using several approaches that depend on the nature of parent and child nodes in the graph (Figures 15C and D). Each of these steps is described in more detail below.
  • MIIC network reconstruction. MIIC (Multivariate Information-based Inductive Causation) is an algorithm that infers a graphical network to represent the direct and possibly causal associations between variables in a dataset (described in Verny et al. 2017 and Cabeli et al. 2020). The algorithm is able to estimate conditional mutual information, even when the dataset includes a mixture of categorical and continuous variables. MIIC does not have any hyperparameters and is not sensitive to the order of features in the input data.
  • MIIC has been shown to be robust to sampling noise and to reliably estimate (conditional) mutual information. These features have been demonstrated in multiple benchmarks (see Cabeli et al.2020).
  • the graph generated by MIIC is composed of both undirected and directed edges, which may also form some directed cycles (as with any other causal discovery constraint-based methods).
  • the directed edges originate from the discovery of v-structures (Verny et al.2017), which are signatures of causality in observational data, or through the propagation of orientation from upstream v-structures. Hence, directed edges do not necessarily correspond to causal associations.
  • MIIC is available through a web-server (Sella et al. 2018) and an R package and has been recently applied to a breast cancer cohort of patients treated at Institut Curie in Paris, providing a novel way to globally visualize, analyze, and understand the connections between well-known clinical features (Sella et al.2022).
  • the MIIC network reconstruction step uses the MIIC method as described in Verny et al. 2017 and Cabeli et al. 2020, also outlined below.
  • An example of a MIIC reconstructed network applied on the METABRIC dataset is shown on Figure 4.
  • the 3 point information can be positive or negative.
  • conditional information terms (conditioned on a set of variables A) are defined similarly but using conditional multivariate entropies H({Xi}|A).
  • the MIIC algorithm proceeds in 3 steps: 1. MIIC removes dispensable edges by iteratively subtracting the most significant information contributions from indirect paths between each pair of variables.
  • the score Slb for the addition of node Z given previously selected contributors {Ai} combines the maximum of three-point information and the minimum of two-point information: Slb(Z;XY|{Ai}) = min[Pnv(X;Y;Z|{Ai}), ...], where the three-point term is turned into a probability as Pnv(X;Y;Z|{Ai}) = 1/(1+exp(-N·I'(X;Y;Z|{Ai}))).
  • the most likely contributor An, after collecting the first n-1 contributors {Ai}n-1, is chosen by maximising this score: An = argmaxZ Slb(Z;XY|{Ai}n-1).
  • Significance is determined using the normalised maximum likelihood criterion or Bayesian information criterion (BIC)/minimum description length (MDL) criterion as described in Verny et al. 2017, see also below (comparing the mutual information to the NML or MDL complexity to obtain a regularised mutual information that has to be above a threshold to be significant – see in particular Verny et al.2017, supplementary file S1, section 1.2).
  • the strength of a retained edge is illustrated on Figure 4 by the thickness of the edge.
  • 2. Edge filtering / confidence of edges (optional – not performed in the present examples):
  • the remaining edges can be further filtered based on the confidence ratio assessment:
  • the lower the value of CXY, the higher the confidence in the XY edge.
  • filtering edges with CXY > 0.1 or 0.01 was found to limit the false discovery rate with small datasets, while maintaining satisfactory true positive rates.
  • 3. Edge orientation: remaining edges are then oriented based on the sign of (conditional) three-point information in the observed data.
  • Initially unspecified endpoint marks (o) can be established as arrow tails (-) or heads (>) by iteratively taking the triple (X, Z, Y) with the highest endmark orientation / propagation probability above 1/2, until no additional endmark orientation / propagation probability exceeds 1/2; a negative (conditional) three-point information I'(X;Y;Z|{Ai}) indicates a v-structure (X→Z←Y), whereas a positive value leads to the propagation of orientations.
  • the MIIC method further implements an information-maximizing discretization of continuous data.
  • l1, l2, ..., lr are r non-negative integers such that their sum is equal to n.
  • a local optimisation heuristic is implemented which finds the optimal partition (cut points) for each continuous variable iteratively, keeping the partitions of the other continuous variables fixed.
  • a similar scheme is used in some cases in the synthetic data generation step (see below), except that continuous variables used to generate synthetic data are discretised taking into account only the source and target nodes (at least one of which is a continuous variable to be discretised), i.e. this uses bivariate information.
  • the same scheme can be applied when computing conditional mutual information involving continuous or mixed-type variables.
  • the conditional mutual information is estimated in two equivalent ways: I'(X;Y|{Ai}) = I'(X;Y,{Ai}) - I'(X;{Ai}) (21), which is optimised with respect to the Y and {Ai} partitions using equation (17a) as parametric complexity extended to the multivariate categories ny,{ai} and n{ai}; and I'(X;Y|{Ai}) = I'(Y;X,{Ai}) - I'(Y;{Ai}) (22), which is optimised with respect to the X and {Ai} partitions using equation (17b) as parametric complexity extended to the multivariate categories nx,{ai} and n{ai}. Partitions {Ai} are optimised separately for each of the 4 terms in equations (21) and (22) before taking their differences.
  • MIIC DAG generation In the new methods described herein, the inventors expand MIIC’s ability to learn unparameterized network structures by incorporating a framework capable of generating synthetic data from MIIC reconstructed graphs.
  • DAG Directed Acyclic Graph
  • DFS Depth-First Search
  • each undirected edge in the MIIC network is oriented so as to minimize the number of directed cycles and possibly avoid them. This is performed in order of node degree: nodes are sorted by ascending degree, all undirected edges involving nodes with the same degree are considered in turn (moving to the next degree at each iteration), and any remaining undirected edges are oriented so as to minimise the number of cycles.
  • the sorting by ascending degree is optional, and was performed to minimise the number of V-structures. Note that there is not necessarily a single global solution, i.e. multiple solutions may result in the same number of cycles – this step mostly aims to generate a directed graph as a starting point that has as few cycles as possible.
  • All starting point directed graphs with the minimum number of cycles obtained with this step are equivalent from the point of view of the original data set since they only orient edges that could not be oriented based on the data (i.e. both orientations are consistent with the original data). 2. Then, all directed cycles are removed from the graph, if some are present. In order to guarantee the removal of all the cycles of the graph, the MIIC-to-DAG algorithm iteratively considers the longest cycle in the graph (the one with the most edges) and flips the edge that minimizes the number of remaining cycles in the graph. Taking the longest cycle guarantees the removal of at least one cycle at each iteration and therefore convergence towards a DAG.
  • this step evaluates, for each edge of the longest cycle, the number of remaining cycles if the edge is flipped, then selects for flipping the edge that, if flipped, results in the smallest number of cycles.
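  • A minimal sketch of this cycle-removal loop is given below, assuming a networkx DiGraph with at most one edge per node pair (as in a MIIC graph); it is an illustration of the idea only, not the MIIC-to-DAG pseudocode of Figure 14.
```python
import networkx as nx

def count_cycles(g):
    return sum(1 for _ in nx.simple_cycles(g))

def remove_cycles_by_flipping(g):
    """Iteratively flip one edge of the longest directed cycle, choosing the
    flip that leaves the fewest cycles, until the graph is acyclic."""
    g = g.copy()
    while not nx.is_directed_acyclic_graph(g):
        longest = max(nx.simple_cycles(g), key=len)
        candidates = list(zip(longest, longest[1:] + longest[:1]))
        best_edge, best_count = None, None
        for u, v in candidates:
            g.remove_edge(u, v); g.add_edge(v, u)   # tentatively flip the edge
            n_cycles = count_cycles(g)
            g.remove_edge(v, u); g.add_edge(u, v)   # undo the tentative flip
            if best_count is None or n_cycles < best_count:
                best_edge, best_count = (u, v), n_cycles
        u, v = best_edge
        g.remove_edge(u, v); g.add_edge(v, u)       # apply the best flip
    return g
```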
  • a pseudocode for this procedure, termed MIIC-to-DAG, is shown on Figure 14. The inventors compared these two methods and decided to use the second approach, which best exploits MIIC's capability of discovering a network that contains a mixture of directed and undirected edges (latent variables are excluded in the present setting). The second method also results in synthetic data generation with lower variability. The process described here was tested using both methods, leading to similar results (i.e. both would be usable), but the second method is better able to deal with cycles in the MIIC network.
  • a partially directed acyclic graph is a graph that contains both directed edges and undirected edges, with no directed cycle in its directed subgraph.
  • the data generation component leverages the Directed Acyclic Graph (DAG) obtained in the previous stage.
  • DAG Directed Acyclic Graph
  • the MIIC-SDG synthesizer is based on Bayesian assumptions, where data is initially generated for variables associated with isolated nodes or nodes without parents (source nodes) and then iteratively expands to nodes whose parent nodes have already been processed.
  • for source nodes and isolated nodes that are associated with discrete variables, a probability table is obtained from the input data and data is generated following the distribution in the probability table (this is the same approach that would be used when generating data for a classical Bayesian network).
  • for source nodes and isolated nodes that are associated with continuous variables, a probability density function is fitted to the data (using the “density” function from the R package stats).
  • the density function computes kernel density estimates from a data series (see www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density).
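  • A short sketch of this source-node sampling is given below; scipy's gaussian_kde is used here as a stand-in for the R stats::density function mentioned above, and the function name is illustrative.
```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_source_node(observed, is_discrete, n, rng=None):
    """Discrete source/isolated nodes: sample from the empirical frequency table.
    Continuous ones: sample from a kernel density estimate of the real data."""
    rng = rng or np.random.default_rng()
    observed = np.asarray(observed)
    if is_discrete:
        cats, counts = np.unique(observed, return_counts=True)
        return rng.choice(cats, size=n, p=counts / counts.sum())
    return gaussian_kde(observed.astype(float)).resample(n).ravel()
```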
  • P: data type of the parent node(s)
  • T: data type of the target variable
  • P discrete – T discrete: a multivariate conditional probability table for the target variable is estimated from the original data and then used to sample synthetic data based on the parents' values (as in a classical Bayesian network algorithm).
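  • For illustration, a minimal pandas sketch of such a conditional probability table and of sampling from it is shown below; handling of parent-value combinations absent from the real data is omitted, and the function names are assumptions.
```python
import numpy as np
import pandas as pd

def build_cpt(df, target, parents):
    """Multivariate conditional probability table P(target | parents), estimated
    as normalised counts per combination of parent values."""
    return df.groupby(parents)[target].value_counts(normalize=True)

def sample_from_cpt(cpt, parent_values, rng=None):
    """Sample a target value given a tuple of parent values."""
    rng = rng or np.random.default_rng()
    probs = cpt.loc[tuple(parent_values)]
    return rng.choice(np.array(probs.index.to_list(), dtype=object),
                      p=probs.to_numpy())
```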
  • P mixed – T discrete: two different methods are applied depending on the number of continuous parents.
  • a. continuous distributions are discretized using the optimum discretization algorithm described above by reference to equations (19)-(21) and in Cabeli et al. 2020, except that the discretisation of the continuous variable takes into account only source and target variables (bivariate information).
  • a discretisation, i.e. a set of bins, is thereby determined for a continuous variable or for a set of source-target continuous variables.
  • This approach has been shown to reliably estimate theoretical mutual information and to be adaptive to the number of samples and multimodal continuous distributions.
  • This approach finds the cut values (bins) for the continuous variables using the estimation of the mutual information minus a complexity term.
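  • A simplified sketch of the idea (choosing bins that maximise mutual information with the target minus a complexity penalty) is shown below; it uses equal-frequency candidate bins and an arbitrary penalty form, not the NML-based optimum discretisation of the MIIC framework.
```python
import numpy as np
from sklearn.metrics import mutual_info_score

def choose_quantile_bins(x_continuous, y_discrete, max_bins=20, penalty=0.5):
    """Pick a number of equal-frequency bins for a continuous parent by
    maximising mutual information with the target minus a complexity penalty."""
    n = len(x_continuous)
    best_k, best_score, best_edges = 2, -np.inf, None
    for k in range(2, max_bins + 1):
        edges = np.unique(np.quantile(x_continuous, np.linspace(0, 1, k + 1)))
        if len(edges) < 3:
            continue  # not enough distinct quantiles to form k bins
        binned = np.digitize(x_continuous, edges[1:-1])
        score = mutual_info_score(binned, y_discrete) - penalty * k * np.log(n) / n
        if score > best_score:
            best_k, best_score, best_edges = k, score, edges
    return best_k, best_edges
```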
  • the inventors limited the application of this method to cases where the number of continuous parents to be discretised is relatively low, for the following reasons. There is no limit on the number of bins that are found for the continuous variables, which could result in a very large conditional probability table for the target variable. This is problematic because the data generated would then be very similar to the input data. Further, when the number of continuous variables is larger, machine learning algorithms can also perform well, since they can capture the underlying multivariate distribution that explains the target variable. A multivariate conditional probability table for the target variable is then estimated from the discretized data and used to sample synthetic data based on the parents' values.
  • the density of the continuous target is learned by fitting a probability density function to the data (using the “density” function from the R package stats) separately for each combination of discrete or discretised predictors. These fitted density functions are then used to generate the data. This option was only applied to cases where the number of continuous predictors is low because it can otherwise create data that is quite similar to the original data (due to the discretisation scheme having an unbounded number of bins, as explained above). Thus, when the number of continuous predictors is high, a machine learning approach was preferred. b. If the number of continuous predictors is higher (2 or more), a random forest regression model is implemented. This was implemented as above.
  • Data imputation. The MICE algorithm (Multiple Imputation by Chained Equations, see Azur et al., 2011) was used for missing data imputation when there were many NAs in a continuous target node, as these would prevent the regression model from running. No data imputation was applied to discrete variables; instead, NA was used as a separate category of data, such that NAs are created in the synthetic data with probabilities matching the input data.
  • the MIIC-SDG algorithm was implemented as an R package. It contains one function that returns three objects: the synthetic data generated by the algorithm as a data frame, the adjacency matrix representing the network corresponding to the DAG used to sample the synthetic data and a third object containing the data type (discrete or continuous) of the input variables as a data frame.
  • Example 2 Benchmarking of MIIC-SDG relative to state of the art: quality metrics and machine learning performances
  • the novel method described in Example 1 is benchmarked against a series of state-of- the-art synthetic data generation methods, in particular in relation to the quality of the data generated and its usability in training a prognostic machine learning model. Methods MIIC-SDG. This is described in Example 1.
  • MIIC-SDG is composed of three steps: the first step discovers a network structure from the input data, the second step transforms this network into a DAG using the MIIC-to-DAG algorithm and the third step uses this DAG and the original data to generate synthetic samples resembling the original data.
  • This method aims to learn the models from the original data without overfitting the data and using the correct set of ancestors for each node. This was hypothesised to be better than classical Bayesian methods, which have a tendency to overfit the data (therefore necessitating the injection of noise into the data to preserve privacy).
  • Bayesian. This method builds a probabilistic graphical model (Bayesian network) that represents the joint multivariate distribution by exploiting dependencies between the random variables (Ankan & Panda 2015).
  • a directed acyclic graph and a corresponding conditional probability distribution are learned from the given data. Sampling from the model is finally performed to generate the resultant dataset.
  • the code used is the Synthcity package (Qian et al. 2023), which builds on the pgmpy package by Ankan and Panda (2015).
  • the DAG is obtained using the tree search (Chow–Liu tree) or hill climbing algorithms. These two approaches are referred to as “Bayesian tree search” and “Bayesian hill climbing”.
  • Synthpop. The Synthpop algorithm (described in Nowok et al. 2016) is a machine learning solution aimed at providing synthetic test data for users of confidential datasets.
  • the synthetic data generated through parametric and nonparametric methods, including the classification and regression trees (CART) model, aims to mimic the original data and can be used for exploratory analyses and for testing models.
  • CART model may result in final leaves representing a small number of individuals, potentially compromising the privacy of the synthesized data.
  • the authors suggest limiting this effect by specifying a minimum size for the final node produced by the CART model.
  • determining the appropriate value for this parameter is challenging as it depends on the data and the method does not offer a tuning procedure.
  • CTGAN (Conditional Tabular Generative Adversarial Networks) is a deep learning algorithm (described in Xu et al.2019), that aims at creating a generative model suitable for tabular data.
  • CTGAN differs from traditional GANs by adding a conditional structure to both the generator and the discriminator networks, allowing it to generate synthetic samples based on specific real-world conditions.
  • Xu et al. (2019) have reported CTGAN outperforming Bayesian methods on most of the real datasets they presented.
  • TVAE. Tabular Variational AutoEncoders (TVAE) are adapted from classical variational autoencoders (VAE) to enable the generation of mixed-type tabular data.
  • PrivBayes. This method was also used as a benchmark in the CTGAN paper (Xu et al. 2019) and is described therein. The authors claim that CTGAN achieves competitive performance across many datasets and outperforms TVAE on some benchmarks.
  • PrivBayes, described in Zhang et al. (2014), is a differentially private Bayesian network model capable of efficiently handling datasets with a large number of attributes. The authors present the package as a new implementation that requires the injection of less noise compared to other differential privacy algorithms, maintaining more signal in the synthetic data. To obtain differentially private synthetic data, PrivBayes starts by creating a Bayesian network that succinctly represents the correlations among the attributes and then injects noise into each marginal distribution to ensure differential privacy.
  • the method finally uses these noisy marginals and the Bayesian network to generate synthetic samples.
  • the most important parameter for the algorithm is epsilon, determining the amount of noise injected in the marginal distributions.
  • the choice of epsilon is not straightforward since the level of both quality and privacy of the generated data depends on the type of distributions, number of samples and complexity of the Bayesian network.
  • the inventors chose an epsilon equal to 1 as it proved to be the best compromise in these simulations. RANDOM. This approach does not correspond to a synthetic data generation algorithm in itself, but it is used as a lower bound for normalizing the other benchmark methods.
  • the synthetic dataset is obtained by generating random data using uniform distributions (inside the ranges of the original data).
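  • A minimal sketch of this RANDOM baseline is given below; the function name and the use of pandas are illustrative.
```python
import numpy as np
import pandas as pd

def random_baseline(real_df, n, rng=None):
    """RANDOM lower bound: each variable is drawn independently and uniformly
    within the observed range (continuous) or category set (discrete)."""
    rng = rng or np.random.default_rng()
    out = {}
    for col in real_df.columns:
        s = real_df[col].dropna()
        if pd.api.types.is_numeric_dtype(s):
            out[col] = rng.uniform(s.min(), s.max(), size=n)
        else:
            out[col] = rng.choice(s.unique(), size=n)
    return pd.DataFrame(out)
```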
  • the parameters used by the synthesizer are: the DAG creation method and the compression of values that exceed real data boundaries in the continuous variables (setting values that exceed the range of observed or acceptable/realistic values for a variable to the nearest boundary of this range). The latter is optional. It is performed because regression models (random forest regression models used to predict continuous variables in some cases, see above) can generate data outside the range of the original data.
  • the compression avoids the creation of synthetic data with strange behaviours or wrong values (such as small negative values for a clinical variable with many zeros, e.g. the number of positive sentinel nodes).
  • Benchmark data - Breast cancer (METABRIC).
  • the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset is a collection of over 2,000 clinically annotated primary breast cancer specimens obtained from tumor banks in the UK and Canada (Curtis et al. 2012).
  • the cohort encompasses clinical variables and genetic information including copy number alterations, copy number variations, and single nucleotide polymorphisms.
  • the METABRIC dataset was selected for this study due to its widespread usage and validation in the literature, as well as its suitable sample size for the application of machine learning algorithms and coexistence of numerical and categorical features.
  • the original dataset, consisting of 2491 patients and 36 variables, was pre-processed by removing patients with more than 20% missing values and variables with unique values. This is mostly to evaluate all the methods in a scenario with some missing values, but without samples consisting almost entirely of missing values.
  • the resulting filtered dataset comprised 1977 patients and 29 clinical variables, with 19 of them being discrete and 10 continuous.
  • Figure 4 shows the network reconstructed by the MIIC algorithm for this data.
  • the corresponding graph obtained by applying MIIC-to-DAG contains 63 edges reporting direct association between the 29 variables.
  • the variables are shown on Figure 4 and include: laterality, cellularity, claudin subtype, her2-snp6 (HER2 SNP6 loss), ER status (estrogen responsive), her2- status, PR-status (progesterone responsive), grade, histological subtype, chemotherapy, radiotherapy, tumour stage, breast surgery, ER-IHC (ER status by immunohistochemistry), histological subtype (detailed as per PAM50 subtypes, i.e.
  • the final data contains 297 samples and 24 features (e.g. Cancer subtype, mutational burden, sex, race, ECOG score, tobacco use, disease status, sample age, tissue type, TCGA subtype, types of therapy, Overall Survival, Clinical response).
  • Benchmark setting The inventors evaluated the different algorithms using the whole METABRIC dataset and different sample sizes, by subsampling the 1977 samples dataset in sizes 50, 100, 200, 500, 1000, 1500 and 1977. This allows one to assess the performance of each method in multiple subsets along with their stability. A total of 10 datasets were created for each sample size and each one of them was used to generate 10 synthetic datasets, each built with a different seed (100 datasets for each sample size). All comparative methods used a seed parameter.
  • Synthpop depends on the order of the features in the input dataset, and the order can change based on the seed.
  • the absence of a seed is a benefit of the present method as it is not necessary to generate multiple datasets with different seeds in order to get a representative picture of the performance of the method.
  • Quality metrics – Univariate analysis. The inventors assessed whether each feature follows the same distribution in the original and synthetic datasets.
  • MI: Mutual information
  • MI has been shown to robustly capture the association between variables even when their relation is nonlinear.
  • the inventors compared the MI matrices for real and generated data and computed the average difference between the two matrices. They estimated the MI for discrete-continuous or continuous-continuous variable pairs through the optimum discretization algorithm implemented in the MIIC package, as described above.
  • the quality of the generated data is directly associated with the (mean) distances between the MI matrices of the original data and the one of the generated data. Small distances correspond to data that reliably capture the underlying structure present in the original data.
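  • A simplified sketch of this MI-matrix comparison is shown below; continuous columns are quantile-binned to keep the example self-contained, whereas the benchmark itself uses MIIC's optimum discretisation.
```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def mi_matrix(df, n_bins=10):
    """Pairwise mutual information after quantile-binning continuous columns."""
    binned = df.copy()
    for col in binned.columns:
        if pd.api.types.is_numeric_dtype(binned[col]):
            binned[col] = pd.qcut(binned[col], q=n_bins, duplicates="drop")
        binned[col] = binned[col].astype(str)  # NA becomes its own category
    cols = list(binned.columns)
    m = np.zeros((len(cols), len(cols)))
    for i, a in enumerate(cols):
        for j, b in enumerate(cols):
            m[i, j] = mutual_info_score(binned[a], binned[b])
    return m

def mean_mi_distance(real_df, synthetic_df):
    return float(np.mean(np.abs(mi_matrix(real_df) - mi_matrix(synthetic_df))))
```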
  • Quality metrics – Correlations The inventors assessed whether the bivariate distributions between each pair of features are preserved in the synthetic data.
  • the inventors compared the correlation matrices in the original and synthetic data and computed the mean difference between the matrices.
  • the analysis is performed on all variable pairs by calculating their correlation using two approaches.
  • the lower triangular matrix was determined by computing Pearson's correlation coefficient between continuous variables and Cramer's V between categorical variables.
  • the upper triangular matrix was dedicated to analyzing the relationship between continuous and discrete variables. To this end, the inventors used the MIIC algorithm which has been shown to optimally discretize the continuous features by maximizing the mutual information for all potential cut-points on the continuous variables.
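  • For illustration, the two pairwise association measures used for the lower triangular matrix can be computed as sketched below (Cramér's V via a chi-squared contingency test, Pearson's r via pandas); the MIIC-based measure for mixed pairs is not reproduced here.
```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association measure between two categorical series."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1)))) if min(r, k) > 1 else 0.0

def pearson_r(x, y):
    return float(pd.Series(x).corr(pd.Series(y)))  # Pearson correlation by default
```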
  • a small Wasserstein distance corresponds to synthetic data that reliably represent the multivariate distribution.
  • Quality metrics – Machine learning performances. One way to evaluate the quality of a dataset is to assess whether the generated data can be used to perform classical machine learning tasks such as supervised learning. The inventors therefore chose to compare the algorithms based on their capability to build a relevant machine learning model to predict overall survival using a survival random forest model.
  • Survival Random Forest is a time-to-event model, similar to a Cox regression model, for censored data.
  • the inventors also evaluated whether each synthetic dataset retains robust relationships by comparing the variable permutation importance ranking with the “true” ranking obtained on the original dataset. Permutation importances were obtained using the permutation_importance function of the scikit-learn python library. This determines the decrease in a model's R² score when a single feature value is randomly shuffled. Results – Univariate analysis. When comparing two datasets that share the same set of features, the simplest analysis that can be conducted involves assessing the distribution of each variable within both the original and synthetic datasets. To accomplish this, the inventors applied the chi-squared test for categorical variables and the Wilcoxon test for numerical variables, with a significance level set at 0.05.
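  • A sketch of this univariate comparison is given below, reading the “Wilcoxon test” as the rank-sum variant for two independent samples; the function name and the way categories are cross-tabulated are assumptions.
```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ranksums

def count_differing_features(real_df, synthetic_df, alpha=0.05):
    """Chi-squared test for categorical features and a Wilcoxon rank-sum test
    for numerical features; returns how many features differ at level alpha."""
    n_diff = 0
    for col in real_df.columns:
        r, s = real_df[col].dropna(), synthetic_df[col].dropna()
        if pd.api.types.is_numeric_dtype(real_df[col]):
            p = ranksums(r, s).pvalue
        else:
            labels = np.r_[np.zeros(len(r)), np.ones(len(s))]
            p = chi2_contingency(pd.crosstab(pd.concat([r, s], ignore_index=True),
                                             labels))[1]
        n_diff += int(p < alpha)
    return n_diff
```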
  • Table 2 presents the average count of features that exhibited statistically significant differences based on these two tests across various sample sizes (columns) and algorithms (rows), for the METABRIC dataset.
  • the standard deviation is provided in parentheses.
  • the results indicate that Synthpop is the most effective method for replicating univariate distributions, followed closely by Bayesian algorithms and MIIC-SDG, which demonstrated similar performance.
  • the other algorithms fell short in reproducing the univariate distribution, with between 16 (CTGAN and TVAE) and 21 features exhibiting differences in the largest sample size (from a total of 29 features).
  • the random method flagged nearly all variables as different, owing to its random sampling approach within the original feature range.
  • FIG. 5 shows the correlation matrices for the datasets using 1000 samples (selecting 1000 samples of the 1977, 10 times, and for each dataset, generating 10 synthetic datasets, changing the seed for algorithms that use a seed) for the METABRIC dataset.
  • Values are obtained as a mean correlation over all executions from running the algorithms on the 1000 sample datasets (using bootstrap) and using multiple seeds.
  • the performance of the different methods is, in order: Bayesian with tree search algorithm, Synthpop, CTGAN, MIIC-SDG, TVAE and PrivBayes.
  • the score of the random method is the worst and it is strongly dependent on the type of associations between variables that exist in the original data (correlation structure and strength).
  • a dataset with only a few strong correlations will also yield small mean distances when taking random values, because most of the features are not correlated and taking random values therefore does not significantly alter the resulting correlation.
  • MIIC-SDG, Synthpop and Bayesian networks with tree search all report the same 2 features (Nottingham Prognostic Index and the number of positive lymph nodes found) as the most important for survival prediction.
  • Figure 9 shows the concordance index estimates from Survival Random Forest model to predict Overall Survival. It is important to notice that the ability to predict a target variable (OS in this case) from other features is also used as a metric for privacy, by building an inference attack on sensitive attributes. Having a high concordance with the true data also correlates to a high risk in the case of inference attacks.
  • Example 3 Benchmarking of MIIC-SDG relative to state of the art: privacy metrics
  • the algorithms described in Example 2 were assessed using the benchmark data as described in Example 2, and compared based on the level of privacy for each synthetic data generation method. Methods See Examples 1 and 2.
  • the method uses a weighted Euclidean distance as a metric, giving more importance to features with an unbalanced distribution, in which particular values are rarer.
  • Yoon et al. 2020 define ε-identifiability as the property of having less than an ε ratio of observations from the original dataset in the generated synthetic dataset that are “not different enough” from the original observations. ε corresponds to the defined identifiability score.
  • An identifiability score of zero would represent a perfectly non-identifiable (private) dataset and an identifiability score of one would represent a perfectly identifiable dataset.
  • the proposed identifiability is defined for all the samples or variables.
  • the described identifiability distance is implemented in the Synthcity package.
  • the derived privacy is evaluated as 1 - identifiability score.
  • Privacy metrics - Membership inference score.
  • the inventors used the partitioning membership disclosure attack method proposed by El Emam and colleagues where, instead of using the Hamming distance between samples as a similarity measure, the inventors used a unidimensional weighted Wasserstein distance in which the weights are defined as the entropy of each feature, as proposed by Yoon et al. This score evaluates whether it is possible to identify which patients were used to create the synthetic dataset, by subsampling the original dataset into a training and test set, for varying sample size.
  • the derived privacy metric is evaluated as 1 - membership inference score. Results. In 2006, the paradigm of differential privacy was introduced and it remains to date one of the most used techniques to try to preserve data privacy through mathematical constraints (Dwork et al. 2006).
  • the identifiability score corresponds to the probability of re-identification given the combination of all data on any individual patient. It is evaluated by measuring the identifiability of the finite original patient data using the finite generated synthetic data.
  • Figure 7A shows the identifiability score evaluated on the synthetic data generation algorithms for the METABRIC dataset.
  • the Bayesian algorithm with tree search has the highest identifiability scores (0.7–0.66), followed by the Synthpop algorithm (0.5–0.51), MIIC-SDG (0.5–0.40), TVAE (0.5–0.24), CTGAN (0.4–0.19), Bayesian with hill climbing (0.2–0.13), PrivBayes (0.0–0.01) and Random (0.0–0), with the numbers in parentheses corresponding to the smallest and largest sample sizes. Interestingly, the random algorithm did not reach 0 for the smallest sample sizes.
  • the membership inference score corresponds to the probability of identifying which patients have been used to generate the synthetic dataset.
  • Figure 7B shows the membership inference score evaluated on the synthetic data generation algorithms for the METABRIC dataset.
  • the Bayesian tree search algorithm is the method for which it is easiest to guess whether a sample has been used to generate the synthetic data, followed by the Synthpop method, whose scores never decrease below 0.5.
  • MIIC-SDG remains in third position, with scores that, in contrast, decrease markedly with larger sample sizes.
  • CTGAN obtains slightly better results in the membership inference attack, together with TVAE.
  • Bayesian hill climbing generates datasets where it is hard to guess the membership of samples in the original data.
  • PrivBayes and the random algorithm have similar scores, with values vanishing to 0. Literature results (van Breugel et al.
  • Normalized privacy (IS) = 1 - (identifiability score / identifiability score of random data)
  • Normalized privacy (MIS) = 1 - (membership inference score / membership inference score of random data), with the identifiability score and membership inference score ranging in [0,1].
  • Results. Quality-Privacy scores (QPS) can be evaluated by using different metrics for both quality and privacy. Both dimensions have been evaluated by calculating the ratio between the value obtained using the data of each algorithm and the value obtained using the corresponding random data, so that both quality and privacy range in [0,1] (normalized formula).
  • QPS are obtained through the F1 formula introduced in the Methods above.
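  • Since the Methods section defining the F1 formula is not reproduced in this passage, the sketch below assumes an F1-style harmonic-mean combination of the normalised quality and privacy values; the quality normalisation is likewise an assumption based on the description above.
```python
def normalized_quality(distance, distance_random):
    # 1 for a distance of zero, 0 for the distance obtained with random data (assumed)
    return 1.0 - distance / distance_random

def normalized_privacy(score, score_random):
    # identifiability or membership inference score, normalised as described above
    return 1.0 - score / score_random

def quality_privacy_score(quality, privacy):
    # assumed F1-style harmonic mean of normalised quality and privacy
    return 0.0 if quality + privacy == 0 else 2 * quality * privacy / (quality + privacy)
```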
  • Figure 10 shows the quality mutual information metric, the two privacy dimensions and the two derived QPS (one for each privacy metric), for the METABRIC dataset.
  • the QPS derived through the mutual information distance shows that the MIIC-SDG method is the best algorithm with respect to the quality-privacy trade-off for both QPS metrics, followed by Synthpop.
  • the MIIC-SDG algorithm obtained the best QPS results for the correlation distance for small sample sizes (up to 500 samples) and obtained the second best scores for larger sample sizes, after the CTGAN algorithm.
  • the QPS obtained through Wasserstein distance shows the Bayesian hill climbing as the best algorithm, followed by CTGAN and MIIC-SDG at comparable scores.
  • TVAE shows slightly worse performances, followed by Synthpop that does not show competitive results due to a poor privacy score.
  • Last in the group we find Bayesian tree search and PrivBayes, the first due to a poor privacy score and the second due to the poor quality of the generated data. Note that the correlation and Wasserstein distance metrics may not produce the same results because the Wasserstein distance can be highly influenced by the number of discrete variables and the number of levels for each variable.
  • 1. Defining Quality Metrics: the inventors first established various metrics for assessing the preservation of data quality. These metrics were designed to gauge the ability of the methods to generate data that closely resembles the original dataset. One of the most commonly used ways to compare datasets is through Pearson correlation coefficients. However, Pearson correlation is also known to be very sensitive to outliers, which may explain some of the apparently good relative rankings of certain methods under correlation scores, while they exhibit poorer performance under more robust statistical criteria such as MI, which only depends on the ranks (and not the specific values) of the variables of interest. The inventors hence focused their analysis on MI, but also presented results using the more classical correlation concept. 2. Privacy Considerations: the inventors then focused on the privacy of the original sensitive data.
  • MIIC-SDG execution times for a method according to the present disclosure
  • the computational runtime of MIIC-SDG is comparable to the one of CTGAN or TVAE, showing that it is able to obtain the demonstrated benefits while remaining computationally efficient.
  • References. All documents mentioned in this specification are incorporated herein by reference in their entirety.
  • Dor, Dorit and Michael Tarsi “A simple algorithm to construct a consistent extension of a partially oriented graph.” (1992).
  • Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about” or “approximately”, it will be understood that the particular value forms another embodiment.
  • the terms “about” or “approximately” in relation to a numerical value are optional and mean, for example, +/- 10%.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Biophysics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pathology (AREA)
  • Mathematical Optimization (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

Computer-implemented methods of providing a clinical predictor tool are described, comprising: obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein obtaining the training data comprises obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients by obtaining a directed acyclic graph (DAG) with edges corresponding to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and obtaining values for each node of the DAG using a machine learning model and/or a multivariate conditional probability table. Computer-implemented methods of obtaining synthetic clinical data are also described.

Description

CLINICAL DATA ANALYSIS
Field of the Present Disclosure
The present disclosure relates to methods for providing synthetic clinical data, and in particular to methods of providing synthetic clinical data using directed relationships between variables and a Bayesian inference framework, as well as methods to train clinical variable predictor tools using synthetic clinical data. Related methods, systems and products are described.
Background
Electronic healthcare data collection has become ubiquitous in the last two decades and is paramount for the development of new Artificial Intelligence (AI) algorithms and tools that can be deployed in the clinic, for example to improve prognosis, diagnosis, personalised medicine, etc. However, the development of such tools is often hindered by the availability of training data. Solutions have been proposed to generate synthetic medical records. However, this is a very complex task that becomes even more challenging when the preservation of patient privacy is a requirement. This is often the case, and restrictions on data sharing in order to protect patient privacy are one of the barriers limiting possibilities for data sharing and aggregation that could unlock new opportunities. Therefore, there is a need for new methods to generate synthetic healthcare data to enhance the capabilities of data-driven clinical tool development. The present inventors postulated that when generating synthetic healthcare data there is a trade-off between data quality and data privacy. On the one hand, small modifications of the original data are directly associated with good quality scores but poor privacy ratings, since almost all the information of the dataset is maintained. On the other hand, strong perturbations or noise addition lead to a net loss of quality and usually a concomitant gain in privacy. They therefore set out to test this by jointly evaluating the quality and privacy levels of clinical datasets generated using existing algorithms. They indeed found that while some algorithms performed very well in terms of simulated data quality or privacy, the existing methods could be improved in terms of how they balanced these requirements. In other words, given a strict requirement of maintaining a certain level of privacy, the inventors set out to develop a new method for synthetic clinical data generation that could achieve a better level of quality. This is essential because the privacy level is often a constraint that clinical data analysis projects have to work with, and maximizing the quality of the data within these constraints is essential to ensure that good data-driven clinical tools can be developed. The present inventors developed an approach that builds upon a network reconstruction method termed “MIIC” (Multivariate Information-based Inductive Causation), first described in Verny et al. (2017). This infers a graphical network to represent the direct and possibly causal associations between variables in a dataset. The algorithm can use data including a mixture of categorical and continuous variables, and is robust to the presence of missing data. The MIIC algorithm is a causal discovery constraint-based method and therefore the graphs produced contain a mixture of undirected and directed edges, as well as directed cycles.
Indeed, the method is designed to uncover relationships between clinical variables to provide clinically relevant insights that may not be immediately apparent when looking at the various clinical variables in isolation. In other words, the method is designed for data exploration and the graphs produced are not suitable or designed for data generation. Nevertheless, the present inventors recognized that the graphs produced by this method represented a promising starting point for data generation, and designed a method to use the information captured in these graphs for this purpose. The method therefore comprises a first step of obtaining a graph between variables in a clinical dataset using a causal-discovery constraint-based method (such as e.g. the MIIC algorithm), a step of obtaining a directed acyclic graph (DAG) that is consistent with the graph produced by MIIC, and a step of generating synthetic data using this DAG and either multivariate conditional probability tables learned from the data or machine learning algorithms trained on the data. In particular, the DAG is used to iteratively generate data starting from isolated nodes or nodes without parents (by sampling from learned distributions associated with these nodes), then propagating from parents to target nodes by sampling from multivariate conditional probability tables learned from the original data that capture the conditional dependence between parents and targets (in the case of discrete or discretised data for both parents and targets), or machine learning algorithms (e.g. classifiers or regressors) that can learn dependencies between target variable values and parent variable values from the original data (for target variables with continuous parents or both discrete and continuous parents (mixed type variables)). Further optional improvements of this method additionally include the provision of a new method to generate a DAG from the graph provided by the causal discovery constraint-based method, which operates by: orienting each undirected edge in a way that minimizes the number of directed cycles, and removing all directed cycles from the graph by iteratively flipping an edge in the longest cycle in the graph, the edge flipped being selected so as to minimize the number of remaining cycles. This approach was found to best exploit the information in the graph, and to generate synthetic data with lower variability compared to known methods to generate a DAG such as a Depth-First Search (DFS) orientation algorithm. Further optional improvements of this method additionally include the use of a discretization scheme for continuous data in the data generation step, which maximises the mutual information between each parent variable and the target variable. A similar concept is used in the MIIC algorithm in order to calculate the multivariate information between variables when performing network inference. However, the data is not in fact discretised in the MIIC algorithm in the sense that no equivalent discrete data is generated from which multivariate conditional probabilities are learned. The discretization of continuous data is simply used as a tool to enable the calculation of the mutual information terms involving continuous variables.
By contrast, in the present context, one or more continuous parent variables may be selected for discretization, and the resulting discretised data may be used to obtain a multivariate conditional probability table that is used to sample values for a target variable of the discretised parent variable(s). According to a first aspect of the disclosure, there is provided a computer implemented method of providing synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, the method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the value of the connected nodes, wherein the edges correspond to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and wherein the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes; and generating synthetic clinical data for a patient using the DAG by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node; and iteratively, obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data, wherein the plurality of variables comprise at least one continuous variable and at least one discrete variable, and wherein obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data. The method of the first aspect may have any one or any combination of the following optional features. The machine learning model used to obtain a value for a target node may have been trained using training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes. The machine learning model used to obtain a value for a target node may be configured to predict a value for the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node. Obtaining a value for a target node may comprise training a machine learning model to predict a value of the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node, wherein the training uses training data comprising for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes. The machine learning model may be a regression or classification model. The machine learning model may be a non-linear classification or regression model. The machine learning model may be a tree-based classification or regression model, such as a random forest model. 
Obtaining a value for a target node using a machine learning model trained on the real clinical data may comprise predicting a value using said machine learning model and adjusting the predicted value to fall within the observed range for the variable in the real clinical data, optionally by setting the predicted value to the nearest boundary of the range when the predicted value is outside of the observed range. Obtaining a value for a target node may comprise: determining that the target node and all parent nodes are associated with discrete variables and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with target nodes and parent nodes; or determining that at least one of the target node and parent nodes is associated with a continuous variable and performing one of: (i) discretising the target or parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or (ii) using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes. The method may comprise discretising a target or parent node that is associated with a continuous variable using a respective discretisation scheme, wherein the discretisation scheme is a discretisation that is specific to the target and parent nodes and that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes. The method may comprise determining that the target node is associated with a continuous variable and fitting a probability density function to the real clinical data for the continuous variable associated with the target node, wherein a separate probability density function is fitted for each combination of values of the discrete or discretised parent nodes. The objective criterion may be the maximisation of the absolute value of a statistical metric of dependence between the variables associated with the target and parent nodes. The statistical metric of dependence may be mutual information. Obtaining a value for a target node may comprise: determining that at least one of the parent nodes is associated with a continuous variable, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; wherein discretising the target or parent nodes that are associated with a continuous variable comprises, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of a statistical metric of dependence between the target and parent nodes, amongst a plurality of values of said statistical metric associated with respective candidate pluralities of non-overlapping ranges for the continuous variable.
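By way of illustration, the sketch below shows one possible (deliberately naive) way of choosing cut points for a single continuous parent variable so as to maximise its mutual information with a discrete target variable. It is a minimal sketch of the objective criterion described above, not the optimisation used by the actual method; the candidate quantile grid, the number of bins and the function names are illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def best_cut_points(parent_cont, target_disc, n_bins=3, n_candidates=20):
    """Grid-search cut points for one continuous parent that maximise
    the mutual information with a discrete target (illustrative only)."""
    # Candidate boundaries taken at empirical quantiles of the parent variable.
    candidates = np.unique(np.quantile(parent_cont, np.linspace(0.05, 0.95, n_candidates)))
    best_mi, best_cuts = -np.inf, None
    for cuts in combinations(candidates, n_bins - 1):
        binned = np.digitize(parent_cont, cuts)       # discretised parent
        mi = mutual_info_score(binned, target_disc)   # I(discretised parent ; target)
        if mi > best_mi:
            best_mi, best_cuts = mi, cuts
    return best_cuts, best_mi
```

In the methods described herein the analogous search is carried out per parent-target relationship, so that the same continuous variable may receive a different discretisation depending on which conditional probability table it contributes to.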
The method may comprise: determining that at least one of the parent nodes is associated with a continuous variable; and when the number or proportion of parent nodes associated with a continuous variable is below a predetermined threshold, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or when the number or proportion of parent nodes associated with a continuous variable is at or above a predetermined threshold, using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes. The step of obtaining a directed acyclic graph (DAG) may comprise obtaining a network between the variables in the real clinical data using a causal discovery method or a constraint-based causal discovery method, optionally wherein the causal discovery method or constraint-based causal discovery method determines conditional mutual information between variables associated with the nodes in the network. The step of obtaining a DAG may comprise: obtaining a network between the variables in the real clinical data using a constraint-based causal discovery method, wherein the network comprises directed and undirected edges; and obtaining a DAG from the network by: using a depth-first-search orientation algorithm; or selecting the direction of each undirected edge that minimises the number of directed cycles in the network, and removing all directed cycles in the resulting directed graph by iteratively identifying the longest cycle in the graph (the one with the most edges) and changing the direction of the edge that minimizes the number of remaining cycles in the graph. The real clinical data used may comprise data for the plurality of clinical variables for a plurality of patients each representing an independent sample, wherein the number of samples is at least 5 times, at least 6 times, at least 7 times, at least 8 times, at least 9 times, or at least 10 times, preferably at least 10 times larger than the number of clinical variables in the plurality of clinical variables.
According to a second aspect, there is provided a method of providing a clinical predictor tool, the method comprising: obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; wherein obtaining the training data comprises obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, obtained using a method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the value of the connected nodes, wherein the edges correspond to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and wherein the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes; and generating synthetic clinical data for a patient using the DAG by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node; and iteratively, obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data, wherein the plurality of variables comprise at least one continuous variable and at least one discrete variable, and wherein obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data. Thus, the step of obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients may be performed using the methods of any embodiment of the first aspect. The clinical predictor model may be a classification or regression model. The training data may comprise real and synthetic clinical data. The diagnosis or prognosis may be selected from a disease severity, a disease subtype, and survival. The diagnosis or prognosis may be survival and the disease may be cancer.
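As an illustration of this second aspect, the sketch below trains a simple classifier on a mixture of real and synthetic records to predict a binary prognosis label. It is a minimal sketch only: the column name "poor_prognosis", the assumption that all features are already numerically encoded, and the choice of a random forest are illustrative and not prescribed by the disclosure.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_clinical_predictor(real_df: pd.DataFrame, synthetic_df: pd.DataFrame):
    """Train a clinical predictor on real + synthetic data (illustrative sketch).
    Both DataFrames are assumed to share the same, numerically encoded columns,
    including a binary "poor_prognosis" label."""
    train_df = pd.concat([real_df, synthetic_df], ignore_index=True)
    features = [c for c in train_df.columns if c != "poor_prognosis"]
    X_tr, X_val, y_tr, y_val = train_test_split(
        train_df[features], train_df["poor_prognosis"], test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    # Hold-out AUC gives a rough check that the synthetic records carry signal.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return model, auc
```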
According to a third aspect, there is provided a computer-implemented method of analysing clinical data, the method comprising: obtaining synthetic clinical data using the method of any embodiment of the first aspect; and performing one or more of: providing a clinical predictor tool by obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables, wherein the training data comprises said synthetic data; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; combining said synthetic data with the real clinical data used to obtain the synthetic data; outputting said synthetic data to a third party or public data repository; combining said synthetic data with another clinical dataset for the purpose of further analysis, optionally including obtaining a clinical predictor tool; and analysing said data using any data-driven clinical discovery method. According to a fourth aspect, there is provided a system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any embodiment of any of the first, second or third aspect, or any method described herein. According to a further aspect, there is provided a non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any of the first, second or third aspect, or any method described herein. According to a further aspect, there is provided a computer program comprising code which, when the code is executed on a computer, causes the computer to perform the method of any embodiment of any of the first, second or third aspect, or any method described herein. Brief Description of the Drawings Embodiments of the present disclosure will now be described by way of example with reference to the accompanying drawings in which: Figure 1 is a flowchart illustrating a method for generating synthetic clinical data according to a general embodiment of the disclosure; Figure 2 is a flowchart illustrating a method for analysing clinical data according to an embodiment of the disclosure; Figure 3 illustrates schematically an exemplary system according to the disclosure; Figure 4 shows a MIIC network obtained for the preprocessed METABRIC dataset (see Examples - Benchmark dataset). Figure 5 shows correlation matrices evaluated on 1000 samples of the METABRIC dataset, for each of a plurality of synthetic data generation methods. Correlation for each x,y combination is evaluated as the mean value over all executions. A. Original data, Bayesian tree search, CT-GAN. B. MIIC-SDG (method according to the disclosure), Bayesian hill climbing, TVAE. C. Synthpop, PrivBayes, Random. 
Figure 6 shows quality metrics evaluated on synthetic data generated from the METABRIC dataset using a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (obtained by sampling from a uniform distribution over the range or levels of each feature) is used as a baseline. A. Mutual information distance between original and synthetic datasets. B. Wasserstein distance in a multivariate scenario. Left panel represents the Wasserstein distance evaluated using only categorical variables (19 variables), center panel represents Wasserstein distance on continuous variables (10 variables) and right panel represents the Wasserstein distance computed using all variables (29 variables). C. Correlation distance between original and synthetic datasets. Figure 7 shows privacy metrics evaluated on synthetic data generated from the METABRIC dataset using a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. A. Identifiability score. B. Membership inference score. Figure 8 shows the results of a feature permutation importance process from a survival random forest model trained to predict overall survival, fitted on a set of 1977 patients from the METABRIC dataset or corresponding synthetic datasets obtained using the methods indicated. A. Original data, Bayesian tree search. TVAE. B. MIIC-SDG (method according to the disclosure), Bayesian hill climbing, PrivBayes. C. Synthpop, CT-GAN. Figure 9 shows the results of k-fold cross-validated c-index estimates from Survival Random Forest models to predict Overall Survival in METABRIC dataset as a function of sample size used for model training (K=10). Each plot shows a different sample size as indicated (either original data or synthetic data obtained from a sample of the original data of the indicated sample size). C-index is indicated on the y-axis, and the different synthetic data generation methods are on the x-axis. Methods under comparison: Bayesian tree search, SynthPop, CT-GAN, TVAE, MIIC-SDG (method according to the disclosure), PrivBayes and Bayesian hill climbing. Figure 10 shows quality, privacy and quality-privacy scores (QPS) for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. Mutual Information distance is used as the quality measure and privacy is evaluated using identifiability and membership inference scores. Data from METABRIC. Figure 11 shows quality, privacy and quality-privacy scores (QPS) for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random permutation of all data columns) is used as a baseline. 
Mutual Information distance is used as the quality measure and privacy is evaluated using identifiability and membership inference scores. Data from the IMVIGOR210 trial (Bladder cancer). Figure 12 shows quality-privacy scores for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Randomly generated data (random sampling from uniform distributions) is used as a baseline. Wasserstein distance, Mutual Information distance and Correlation distance as quality measures and privacy evaluated with identifiability and membership inference scores. Data from METABRIC. Figure 13 shows execution times for a plurality of synthetic clinical data generation methods including a method according to the present disclosure (MIIC-SDG) and comparative methods (SynthPop, Bayesian tree search, Bayesian hill climbing, PrivBayes, CT-GAN and TVAE). Data from METABRIC. The x-axis indicates the number of samples generated. Figure 14 shows pseudocode for a method of generating a directed acyclic graph from a MIIC network. Figure 15 illustrates schematically the MIIC-SDG pipeline used in the examples of the disclosure. A) Execution of the MIIC algorithm from the original data table. This step generates a graph where nodes represent the variables of the data matrix and edges represent associations between variables. B) Transformation of the graph into a directed acyclic graph (DAG) through the MIIC-to-DAG algorithm. C) Generation of the data using the original data table and the reconstructed DAG. D) Details on the data generation phase: for each variable we need to take into consideration the variable type to best predict the anonymised data. RF = Random Forest, CPT = Conditional Probability Table, PT = Probability Table, Prob. Table = Probability Table, Emp. Dens. = Empirical density, Est. = estimation, Nb = number. Where the figures laid out herein illustrate embodiments of the present invention, these should not be construed as limiting to the scope of the invention. Where appropriate, like reference numerals will be used in different figures to relate to the same structural features of the illustrated embodiments. Detailed Description Specific embodiments of the invention will be described below with reference to the figures. The present disclosure describes methods for generating synthetic clinical data, and methods for using such data to compare clinical datasets, combine clinical datasets, augment clinical datasets, and train machine learning models to predict clinical variables. In the context of the present disclosure, “clinical data” refers to data associated with a subject that characterises the subject’s health status. Clinical data for a subject includes values for each of a plurality of clinical variables. Clinical data may include demographic data (e.g. data about a subject’s age, gender, ethnicity, etc.) about the subject, clinical history data (e.g. comorbidities, treatment history, exposure factors, characteristics of a subject’s disease such as e.g. tumour stage, etc.) about the subject, omics assay data about the subject (e.g. data obtained from assays that measure the presence and/or amounts of omic markers), data from medical images obtained from the subject (e.g. MRI, endoscopy, x-ray, ultrasound, CT scans, etc.; such as e.g.
tumour size determined by imaging, tumour laterality determined by imaging), and/or data from histopathology analysis of one or more samples from the subject (e.g. cellularity of a tumour, molecular subtype determined by histopathology, etc). Omics assay data may include data about gene expression (transcriptomics data, data about the presence and/or level of one or more transcripts) obtained from one or more samples from the subject, data about protein expression (proteomics data, data about the presence and/or level of one or more proteins) obtained from one or more samples from the subject, metabolomics data (data about the concentration of one or more metabolites) obtained from one or more samples from a subject, genomics data (data about the presence and/or characteristics of one or more genomic features, such as somatic mutations (of any kind including single base substitutions, indels and rearrangements), polymorphisms, epigenetic marks, copy number variations, etc.) obtained from one or more samples from the subject, and/or microbiome data obtained from one or more samples from the subject (e.g. presence and/or abundance of one or more microbial taxa in one or more samples from the subject). An omic assay may be of any scale, from targeted detection of individual omics markers to genome/transcriptome/proteome/epigenome/microbiome wide assays. Omic data includes information derived from omics data, such as e.g. disease molecular subtypes, tumour mutational burden, etc. Clinical data may be obtained from a sample, may be recorded in one or more databases, or may have been previously obtained from a sample and recorded in one or more databases from which it can be obtained for the purpose of performing the methods described herein. A clinical variable is any variable of the type above. In the present context, the clinical data used may be referred to as “tabular data”, in that the clinical data for a subject contains a single value (which may be a summarised value across a plurality of instances for the same variable) for each of a plurality of clinical variables. The clinical variables may be discrete or continuous. Methods of the present disclosure are advantageously able to handle clinical data comprising a mixture of discrete and continuous clinical variables. Discrete or discretised variables are variables for which all observed values in a data set are selected from a discrete set of values. The discrete set of values may consist of two values (binary variable) or more than two values. A continuous variable is a variable for which observed values in a dataset are on a continuous scale. The continuous scale may be within a specific range, but within this range any value is possible. Continuous variables may be discretised by identifying non-overlapping ranges and assigning all values that fall within a respective non-overlapping range the same value. The value may be a summarised value for the range, e.g. the middle of the range, or the mean or median of observed values in the range. Methods of the present disclosure make use of a directed acyclic graph (DAG) that captures relationships between clinical variables, in order to generate synthetic data comprising values for each of the clinical variables represented in the DAG. A “network” or “graph” is a data structure G comprising a set of nodes V, and a set of edges E between nodes (G=(V,E)). In the present context, nodes are each associated with a respective clinical variable.
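For concreteness, such a graph may be held in memory as a simple parent-list structure, so that the parents of any node can be read off directly when generating data. This is a minimal sketch of one possible representation; the class and field names are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalGraph:
    """G = (V, E): nodes are clinical variables; directed edges are stored
    as parent lists (target variable -> list of parent variables)."""
    nodes: list = field(default_factory=list)
    parents: dict = field(default_factory=dict)

    def source_or_isolated_nodes(self):
        # Nodes with no incoming edge (sources) or no edge at all (isolated nodes).
        return [v for v in self.nodes if not self.parents.get(v)]
```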
A directed acyclic graph is a graph where each edge is oriented, i.e. each edge connects a source (also referred to as “parent”) node to a target (also referred to as “child”) node, and there are no directed cycles in the graph (i.e. there is no path using the directed edges that starts and finishes on the same node). Edges in a DAG can be associated with weights. Weights can represent strength and/or confidence of relationships. These can be used, for example, to filter a DAG by applying a threshold on edge weights (e.g. maintaining only edges with a weight above a predetermined threshold). Edge weights may be ignored when generating data from a particular DAG (which may be a filtered DAG). A DAG is used according to methods of the present disclosure to generate synthetic data using one or both of: multivariate conditional probability tables and machine learning models. In particular, a DAG indicates, for each node, the identity of all of the node’s one or more parents. Thus, the DAG can be used to identify a plurality of sets of nodes, each set comprising a target node and all of its parents. For each such set of nodes consisting of nodes associated with discrete or discretised variables, a multivariate conditional probability table can be obtained that comprises the probability of the target node variable taking each of the values in the discrete set of values associated with the variable, depending on the values of the parent nodes (i.e. depending on which of the discrete set of values associated with the parent node variables are observed). For each such set of nodes comprising nodes associated with continuous variables, a machine learning model can be obtained that takes as input values for the parent node variables and produces as output a value for the target node. A DAG further comprises one or more source nodes, and optionally one or more isolated nodes. These may be associated with distributions estimated from data for the variable associated with the nodes. Clinical data may be simulated (i.e. generated) using a DAG by: obtaining a value for each variable associated with a source node or an isolated node by sampling from a distribution estimated from data for the respective node; and iteratively, obtaining the value for each target node of the DAG based on a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table or a trained machine learning model as described above. The iterative process progresses through the graph from source nodes to their target nodes, and from these target nodes to their own target nodes, etc. until a value has been generated for each node of the DAG. The term “machine learning model” refers to a mathematical model that has been trained to predict one or more output values (values of predicted variables or “target variables”) based on input data comprising values for one or more predictive variables (also referred to as parent variables in the context of synthetic clinical data generation). Training refers to the process of learning, using training data, the parameters of the mathematical model that are such that the mathematical model can predict output values that satisfy an optimality criterion or criteria.
In the case of supervised learning, training typically refers to the process of learning, using training data, the parameters of the mathematical model that result in a model that can predict output values with minimal error compared to comparative (known) values associated with the training data (where these comparative values are commonly referred to as “labels” or “ground truth”). In the context of generating synthetic data using machine learning models, the machine learning models are trained by supervised learning, using as training data at least a portion of the original clinical data from which synthetic clinical data is being generated. There are two major types of supervised learning models: classification models and regression models. Classification models aim to classify observations between a plurality of categories. They are typically suitable when the output to be predicted is a discrete category in the training data. Regression models aim to predict the value of a continuous variable and are therefore suitable when the output to be predicted is a continuous value in the training data. For example, regression models may be used to predict continuous variables such as tumour size, tumour mutational burden, etc. A classification model may provide as output a classification label (i.e. information identifying a class) and/or one or more probabilities of an observation belonging to respective one or more classes. The classes may correspond to specific values of a discrete variable (e.g. presence or absence of a particular clinical history variable such as e.g. previous exposure to chemotherapy; severity score for a disease of the subject e.g. tumour stage, etc.). A predetermined threshold may be applied to the one or more probabilities to assign a class label. The term “machine learning algorithm” or “machine learning method” refers to an algorithm or method that trains and/or deploys a machine learning model. The machine learning models of the present disclosure are trained by supervised learning. An optimality criterion may be the minimisation of a loss function that quantifies the model prediction error based on the observed (ground truth) and predicted values of the predicted variables. Suitable loss functions for use in training machine learning models are known in the art and include the mean squared error (MSE), and the mean absolute error (MAE). Any of these can be used according to the present disclosure. Regularised loss functions are functions that include a loss function as described above (e.g. MSE or MAE), and one or more terms penalizing model complexity in order to reduce the risk of overfitting. Overfitting is a phenomenon that occurs when a machine learning model is trained to very closely reproduce the features of a training data set, resulting in poorer performance on other datasets that do not have the same characteristics (i.e. poor generalizability). L1 regularisation (also known as “Lasso” in the context of regression) adds a regularization term to the loss function that penalizes models based on the sum of the absolute values of the coefficients of the model. L2 regularisation (also known as “Ridge” in the context of regression) adds a regularization term to the loss function that penalizes models based on the sum of the squared values of the coefficients of the model. L1 regularisation can be used as a feature selection method as it minimizes the coefficients associated with less informative predictive features.
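For concreteness, these quantities take their standard forms (standard definitions from the machine learning literature, not specific to this disclosure): for observed values $y_i$, predictions $\hat{y}_i$ and model coefficients $\beta_j$,

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i-\hat{y}_i\bigr)^2,\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\bigl|y_i-\hat{y}_i\bigr|,$$

and the corresponding L1- and L2-regularised objectives are

$$\mathcal{L}_{L1}=\mathrm{Loss}(y,\hat{y})+\lambda\sum_{j}\lvert\beta_j\rvert,\qquad \mathcal{L}_{L2}=\mathrm{Loss}(y,\hat{y})+\lambda\sum_{j}\beta_j^{2},$$

where $\mathrm{Loss}$ is, for example, the MSE or MAE and $\lambda\ge 0$ controls the strength of the penalty.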
A machine learning model as described herein may be any regression or classification model known in the art. For example, a machine learning model may be selected from: decision trees and variants thereof including regularised and/or gradient boosted decision trees and random forest models, regularised discriminant analysis, logistic regression models, artificial neural networks (ANNs) including multilayer perceptrons (with linear or non-linear activation functions), naïve Bayes classifiers, support vector machines (SVM, using linear or non-linear kernels such as radial basis function) and multivariate adaptive regression splines (MARS). For the purpose of generating synthetic data, non-linear models may be particularly useful as they are able to capture non-linear relationships between parents and target variables, which are likely to occur in clinical data and which can also be captured in the networks from which parent-target relationships are obtained through the use of mutual information-based network inference. Non-linear models include, for example, decision trees and variants thereof (including in particular random forests and gradient boosted trees), SVM with a non-linear kernel, and ANNs (e.g. multilayer perceptrons) with non-linear activation functions. In embodiments, a machine learning model used to generate synthetic data is a regularized model. In embodiments, the machine learning model is a tree-based model. In embodiments, the machine learning model is a regularized gradient boosted decision tree model. Examples of such models are available in the XGBoost software library (xgboost.ai/). Such models may be referred to as “XGBoost” models, although any other implementation of regularized gradient boosted models may equally be used. In embodiments, a machine learning model comprises an ensemble of models whose predictions are combined. Alternatively, a machine learning model may comprise a single model. Random forest models and gradient boosted tree models (such as XGBoost) are ensemble models. Ensemble versions of any models can be constructed. Ensemble models are expected to result in better prediction performance than single models and are therefore preferred in the context of the methods described herein. For example, the machine learning model may be a random forest classifier or regressor. A random forest classifier is a model that comprises an ensemble of decision trees and outputs a class obtained by combining (e.g. by majority vote over) the predictions of the individual trees. Similarly, a random forest regressor is a model that comprises an ensemble of decision trees and outputs a continuous value that is the average prediction of the individual trees. Each decision tree may be trained on a sampled subset of the training data. Decision trees perform recursive partitioning of a feature space until each leaf (final partition sets) is associated with a single value (for classification) or range of values (for regression) of the target. Gradient boosting is a machine learning method that forms an ensemble of weak prediction models (e.g. decision trees) from which a combined strong prediction is obtained. The algorithm iteratively adds new weak predictors to improve the prediction obtained by combining the outputs of the weak predictors. By contrast, a random forest trains a set number of trees independently, each using a random subset of the training data.
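The following is a minimal sketch of how a per-target-node random forest may be fitted on the real data and then used during generation, including the clipping of out-of-range predictions discussed elsewhere herein. It uses scikit-learn as one possible implementation; the function names are illustrative, all parent values are assumed to be numerically encoded, and the actual pipeline may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def fit_target_model(real_df: pd.DataFrame, target: str, parents: list, target_is_discrete: bool):
    """Train one model per target node (parents -> target) on the real clinical data."""
    X, y = real_df[parents], real_df[target]
    model = (RandomForestClassifier(n_estimators=100, random_state=0) if target_is_discrete
             else RandomForestRegressor(n_estimators=100, random_state=0))
    model.fit(X, y)
    # Observed range of a continuous target, used later to clip generated values.
    observed_range = None if target_is_discrete else (float(y.min()), float(y.max()))
    return model, observed_range

def predict_target(model, observed_range, parent_values: pd.DataFrame):
    """Predict the target from generated parent values; clip continuous predictions
    to the nearest boundary of the range observed in the real data."""
    pred = model.predict(parent_values)
    if observed_range is not None:
        pred = np.clip(pred, observed_range[0], observed_range[1])
    return pred
```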
The present disclosure also provides methods of obtaining a clinical predictor tool using a dataset comprising synthetic data generated using the methods described herein. A clinical predictor tool is typically a machine learning model. For example, a clinical predictor tool may be a machine learning model configured to predict a diagnosis or prognosis for a subject, based on the values of one or more clinical variables associated with the subject. The machine learning model may have been trained to provide as output a diagnosis or prognosis based on input comprising the values of the one or more clinical variables, using training data comprising, for each of a plurality of training subjects, a diagnosis or prognosis and values of the one or more clinical variables, the training data comprising synthetic data generated using the methods described herein. Thus, in some embodiments the machine learning model has been trained using data from “synthetic subjects”, i.e. using data that has been simulated and does not correspond to any real patient. A diagnosis may be a diagnosis of a subtype (including molecular subtypes, histopathological subtypes, phenotypic subtypes, therapy response groups, severity groups, or any other distinction of groups of patients or disease, etc.) of a disease that the patient has been diagnosed as having. For example, the patient may have been diagnosed as having bladder cancer and the diagnosis may be the classification of the patient between a group associated with response to a particular therapy, and a group associated with a lack of response to the particular therapy. A prognosis may be survival, such as e.g. overall survival (OS), disease free survival (DFS), relapse-free survival (RFS) or any other survival metric or category derived therefrom. For example, a clinical predictor tool may classify a patient between a first class associated with “good” survival and a second class associated with “poor” survival, or any number of classes associated with different levels of survival (e.g. good prognosis, intermediate prognosis or poor prognosis). As another example, a clinical predictor tool may be a regression model and may predict a continuous variable such as the probability of good or poor survival. Any machine learning model known in the art may be used as a clinical predictor tool. In particular, any of the machine learning models described above in the context of the use of machine learning for synthetic data generation may be used. A DAG as described herein may have been obtained from a graph that comprises directed and/or undirected edges, including for example graphs that include a mixture of directed and undirected edges. Such a graph may have been obtained using any graphical model (i.e. network) reconstruction method. For example, causal discovery methods may be used, including in particular any causal discovery constraint-based method (also referred to as constraint-based causal discovery method). Causal networks can also be referred to as causal Bayesian networks, and any method to identify such networks may be used. A causal discovery method is a method for identifying relationships between variables in a dataset, which by default are likely causal relationships. Any method known in the art that can infer a DAG for a dataset may be used (i.e. any causal discovery method). For example, the Peter and Clark (PC) algorithm is an implementation of the Inductive Causation (IC) algorithm proposed by Verma and Pearl (1990).
The PC algorithm (described in Spirtes, Glymour, and Scheines, 2000) is a well-known method of identifying causal networks that can be used as an alternative to the method used in the examples below. The present inventors have found that an information-theoretic constraint-based method that starts from a fully connected graph and iteratively removes edges between variables X and Y for which I(X;Y|{Ai})=0 (i.e. there exists a set of variables {Ai} such that X is independent of Y given {Ai}) performed better than PC at capturing underlying associations between variables, was more stable to small fluctuations in the data and ran with a lower computational cost than the PC algorithm. Relationships between variables indicate that two variables are not independent from each other. This does not imply that the variables are necessarily in a causal relationship. Edges between nodes can be directed when a likely causal relationship can be inferred between the connected nodes. In other words, the presence of a directed edge between nodes A and B indicates the belief that there exists an intervention on node A that will directly change the distribution or value of B. In a DAG, all relationships are assumed to be causal (even if in reality this may not be the case). In a directed acyclic graph, every variable is assumed to be independent of its non-descendants conditional on its parents (the variables with edges directed into the variable). Causal discovery methods aim to identify all relationships, preferably causal relationships, that are supported by a dataset. In other words, these methods perform a statistical estimation of parameters describing a graphical causal structure. Causal discovery methods typically assume that all edges are directed. Constraint-based methods identify structural constraints corresponding to all dispensable edges in a graph, and can identify both directed and undirected edges. Networks with both directed and undirected edges are more likely to capture real relationships in clinical datasets because unobserved (latent) variables frequently exist that impact the causal relationships between variables, leading to spurious causal associations. While the methods of the present disclosure ultimately use a DAG for data generation, it is believed to be beneficial for the graph from which such a DAG is obtained to have been obtained from a constraint-based causal discovery method. This is because the relationships that are identified are more likely to accurately represent reality. A graph used for generating a DAG for synthetic data generation as described herein may have been obtained using an information-theoretic constraint-based causal discovery method. Constraint-based approaches start from a fully connected network and iteratively remove edges between variables X and Y for which a conditional independence can be found. An information-theoretic constraint-based method iteratively removes edges between variables X and Y for which I(X;Y|{Ai})=0, i.e. there exists a set of variables {Ai} such that X is independent of Y given {Ai}. This is performed by estimating the probability that I(X;Y|{Ai}) is equal to 0, which is the probability that the edge XY should be removed. Edge XY can therefore be removed if there is an {Ai} such that the probability that I(X;Y|{Ai})=0 is above a predetermined threshold.
The probability that I(X;Y|{Ai})=0 can be estimated, up to a normalization constant, as P(I(X;Y|{Ai})=0) ~ exp(−N·I(X;Y|{Ai})) (equation (3) below), where N is the number of independent samples in the data set from which the graph is being learned. This process results in a pruned network that captures conditional independencies in the data, or in other words only maintains edges between variables that are dependent in the data. Some of the remaining edges can then be oriented by orienting V-structures and propagating the orientation of V-structures. Methods for obtaining a pruned and at least partially oriented graph, suitable for use in the context of the present disclosure, are described in Verny et al. 2017 and Cabeli et al. 2020. A “sample” as used herein may be a cell or tissue sample (e.g. a biopsy), or an extract from which biological material can be obtained for analysis, to obtain values of one or more clinical variables. For example, the sample may be a tumour sample or a blood sample. In the context of histopathology the sample may be a tissue sample, such as a tumour sample. In the context of cancer prognosis or diagnosis a sample may be a tumour sample or a biological fluid sample, for example comprising circulating tumour DNA or tumour cells. The sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extraction steps). In particular, the sample may be a cell or tissue culture sample that has been derived from a tumour. As such, a sample as described herein may refer to any type of sample comprising biological material from which values of clinical variables may be determined. Further, the sample may be transported and/or stored, and collection may take place at a location remote from the biological data acquisition (e.g. sequencing) location, and/or the computer-implemented method steps may take place at a location remote from the sample collection location and/or remote from the clinical data acquisition (e.g. sequencing) location (e.g. the computer-implemented method steps may be performed by means of a networked computer, such as by means of a “cloud” provider). A “tumour sample” refers to a sample that contains tumour cells or genetic material derived therefrom. The tumour sample may be a cell or tissue sample (e.g. a biopsy) obtained directly from a tumour. A subject or individual according to the present disclosure is preferably a mammal (including a human or a model animal such as a mouse, rat, etc.), preferably a human. The terms “patient”, “subject” and “individual” are used interchangeably. The patient may be a patient who has been diagnosed as having or being likely to have a disease. The disease may be cancer. The cancer may be breast cancer or bladder cancer. The methods described herein have been specifically demonstrated on clinical data from breast cancer and bladder cancer patients. However, the approach is applicable to any clinical context, including but not limited to any cancer type. Figure 1 is a flowchart illustrating a method of providing synthetic clinical data according to a general embodiment of the disclosure. At optional step 110, real clinical data comprising values for a plurality of clinical variables for a plurality of patients is obtained.
This step is optional because the methods of the disclosure can also start from a previously obtained reconstructed network, learned probability density functions, multivariate probability tables and machine learning models obtained from said data. Obtaining data typically comprises receiving data from one or more computing devices, databases or memories, i.e. receiving previously collected data. The real clinical data used may comprise data for the plurality of clinical variables for a plurality of patients each representing an independent sample, wherein the number of samples is at least 5 times, at least 6 times, at least 7 times, at least 8 times, at least 9 times, or at least 10 times, preferably at least 10 times larger than the number of clinical variables in the plurality of clinical variables (optionally after filtering to remove samples with a proportion of missing data above a predetermined threshold). For example, synthetic data may be generated for 20 variables (i.e. using a DAG comprising nodes associated with 20 clinical variables) using a real clinical dataset comprising at least 2000 samples (patients). The clinical data may comprise one or more of preclinical data, clinical data (e.g. data from clinical trials) and real world data (e.g. data from patient healthcare records). The clinical variables may be selected from: cancer subtype (such as e.g., TCGA subtype, grade, molecular subtype such as histological subtype (detailed as per PAM50 subtypes, i.e. basal, her2, luminal A, luminal B, normal), claudin subtype, histological subtype), a biomarker variable (such as e.g. tumour mutational burden and/or nonsynonymous tumour mutational burden, her2-snp6 (HER2 SNP6 loss), ER status (estrogen responsive), her2-status, PR-status (progesterone responsive), ER status by immunohistochemistry), a demographic or exposure variable (such as e.g. sex, race, age, tobacco use, cohort, age at diagnosis), a patient or disease status variable (such as e.g. ECOG score, disease status, laterality, tumour stage, vital status, menopausal status, lymph node positive, Nottingham prognostic index), a sample variable (such as e.g. tissue type, cellularity), a clinical history variable (such as e.g. types of therapy, history of prior hormone therapy, chemotherapy, radiotherapy, surgery), a survival and/or clinical response metric (such as e.g. overall survival, RFS status). At step 112, a network is obtained from the data obtained at step 110. This may comprise obtaining a network between the variables in the real clinical data using a causal discovery method or a constraint-based causal discovery method. The causal discovery method or constraint-based causal discovery method may determine conditional mutual information between variables associated with the nodes in the network. Step 112 may comprise obtaining a network between the variables in the real clinical data using an information-theoretic constraint-based method that iteratively removes edges between variables X and Y for which there exists a set of variables {Ai} such that X is independent of Y given {Ai}. Removing said edges may be performed by estimating the probability that the edge should be removed as the probability that I(X;Y|{Ai}) is equal to 0, and removing edge XY if there is an {Ai} such that the probability that I(X;Y|{Ai})=0 is above a predetermined threshold, where said probability may be estimated, up to a normalization constant, as exp(−N·I(X;Y|{Ai})) (equation (3)), where N is the number of independent samples in the real clinical data set from which the network is being learned.
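A deliberately simplified sketch of this edge-removal test for discrete (or discretised) variables is shown below. The plug-in conditional mutual information and the exhaustive search over small conditioning sets are illustrative only, and exp(−N·I) is just the unnormalised estimate quoted above; the actual constraint-based algorithm uses a complexity-corrected information term and a far more efficient search.

```python
import numpy as np
import pandas as pd
from itertools import combinations

def entropy(df: pd.DataFrame, cols) -> float:
    """Plug-in (empirical) joint entropy of a set of discrete columns, in nats."""
    if not cols:
        return 0.0
    p = df.groupby(list(cols)).size() / len(df)
    return float(-(p * np.log(p)).sum())

def cond_mutual_info(df: pd.DataFrame, x: str, y: str, cond=()) -> float:
    """I(X;Y|{Ai}) = H(X,{Ai}) + H(Y,{Ai}) - H(X,Y,{Ai}) - H({Ai})."""
    return (entropy(df, (x, *cond)) + entropy(df, (y, *cond))
            - entropy(df, (x, y, *cond)) - entropy(df, cond))

def edge_should_be_removed(df, x, y, other_vars, threshold=0.95, max_cond_size=2):
    """Remove edge X-Y if some conditioning set {Ai} makes the (unnormalised)
    probability estimate exp(-N * I(X;Y|{Ai})) exceed the threshold."""
    n = len(df)
    for size in range(max_cond_size + 1):
        for cond in combinations(other_vars, size):
            if np.exp(-n * cond_mutual_info(df, x, y, cond)) > threshold:
                return True, cond
    return False, None
```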
Obtaining a network may be performed using the MIIC algorithm. The MIIC algorithm is described in Verny et al. 2017 and available at rdrr.io/cran/miic/man/miic.html. The causal discovery method or constraint-based causal discovery method may determine conditional mutual information between variables associated with the nodes in the network, wherein the mutual information between variables associated with the nodes in the network is estimated for the purpose of identifying relationships between nodes using a discretisation scheme that is specific to the set of nodes for which conditional independence is being evaluated. The discretisation scheme may comprise, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of the mutual information between the nodes for which conditional independence is being evaluated, amongst a plurality of values of said mutual information associated with respective candidate pluralities of non-overlapping ranges for the continuous variable. The discretisation scheme may be as described in Verny et al. 2017. The discretisation scheme may be a discretisation that is specific to the target and parent nodes and that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes. The objective criterion may be expressed using equation (21) below, optionally wherein equations (19) and (20) also apply. At step 114, a directed acyclic graph (DAG) is obtained from the network obtained at step 112. The DAG comprises a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the values of the connected nodes. The edges correspond to conditional dependence relationships inferred from the real clinical data at step 112. The DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes. Obtaining a DAG may comprise receiving a network between the variables in the real clinical data using a constraint-based causal discovery method (obtained at step 112), wherein the network comprises directed and undirected edges, and obtaining a DAG from the network using one of: (i) a depth-first-search orientation algorithm; (ii) a partially-directed-acyclic-graph-to-DAG algorithm that transforms a graph that contains both directed edges and undirected edges, with no directed cycle in its directed subgraph, into a fully directed acyclic graph on the same underlying set of edges, with the same orientation on the directed subgraph and the same set of v-structures, by iteratively selecting a sink node x for which all nodes y connected to x by undirected edges are adjacent to each other, and making all undirected edges connected to x into edges directed toward x; or (iii) selecting the direction of each undirected edge that minimises the number of directed cycles in the network, and removing all directed cycles in the resulting directed graph by iteratively identifying the longest cycle in the graph (the one with the most edges) and changing the direction of the edge that minimizes the number of remaining cycles in the graph.
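A minimal sketch of option (iii) is given below, using networkx for cycle enumeration. The greedy cycle counting shown here is illustrative and not efficient (enumerating simple cycles can be expensive), and the tie-breaking may differ from the MIIC-to-DAG pseudocode of Figure 14; it is intended only to make the two steps, orientation and cycle removal, concrete.

```python
import networkx as nx

def count_cycles(g: nx.DiGraph, extra_edge=None, flipped_edge=None) -> int:
    """Count directed cycles in a copy of g with an optional added or flipped edge."""
    h = g.copy()
    if extra_edge is not None:
        h.add_edge(*extra_edge)
    if flipped_edge is not None:
        h.remove_edge(*flipped_edge)
        h.add_edge(flipped_edge[1], flipped_edge[0])
    return sum(1 for _ in nx.simple_cycles(h))

def orient_to_dag(directed_edges, undirected_edges) -> nx.DiGraph:
    g = nx.DiGraph(directed_edges)
    # Step 1: orient each undirected edge in the direction creating fewer cycles.
    for a, b in undirected_edges:
        if count_cycles(g, extra_edge=(a, b)) <= count_cycles(g, extra_edge=(b, a)):
            g.add_edge(a, b)
        else:
            g.add_edge(b, a)
    # Step 2: while cycles remain, flip one edge of the longest cycle,
    # choosing the flip that leaves the fewest remaining cycles.
    while not nx.is_directed_acyclic_graph(g):
        longest = max(nx.simple_cycles(g), key=len)
        cycle_edges = list(zip(longest, longest[1:] + longest[:1]))
        best = min(cycle_edges, key=lambda e: count_cycles(g, flipped_edge=e))
        g.remove_edge(*best)
        g.add_edge(best[1], best[0])
    return g
```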
At step 116, synthetic clinical data is obtained for a (synthetic, i.e. simulated) patient by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node (step 116B), and iteratively obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data (step 116C). The plurality of variables comprises at least one continuous variable and at least one discrete variable, and obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data obtained at step 110. Step 116 may further comprise step 116A of imputing missing data for one or more continuous variables. This step may use any data imputation algorithm known in the art. Multivariate imputation methods are preferred, such as e.g. the MICE algorithm. The method may further comprise, prior to the generation of synthetic clinical data, filtering the real clinical data to remove data for patients for which the proportion of missing data is above a predetermined threshold, such as e.g. 20%, 30%, 40% or 50%. Embodiments of steps 116B and steps 116C will now be described in more detail. Step 116B may comprise, for each discrete source / isolated node, obtaining a probability table from the real clinical data obtained at step 110 for the node, and sampling from said probability table (i.e. generating data following the distribution in the probability table). Step 116B may comprise, for each continuous source / isolated node, fitting a probability density function (i.e. fitting a distribution to the data, for example by determining kernel density estimates from the data for the node) to the real clinical data obtained at step 110 for the node, and sampling from said distribution. For non-isolated/source nodes (step 116C), data is obtained iteratively using data obtained at the previous step for parent nodes of the target node for which data is being obtained at the current iteration. For example, a value is sampled for two source nodes, then a value is obtained for a target node that is the target of these source nodes using the respective sampled values for the source nodes and either a machine learning model or a multivariate conditional probability table. Step 116C may distinguish multiple situations depending on the nature of the target and parent nodes: (a) discrete parents and target nodes; (b) discrete target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is below a first predetermined threshold; (c) discrete target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is at or above the first predetermined threshold; (d) continuous target nodes, parent nodes comprise discrete nodes or both discrete and continuous nodes and the number of continuous parents is below a second predetermined threshold; and (e) continuous target nodes, parent nodes comprise both discrete and continuous nodes and the number of continuous parents is at or above the second predetermined threshold. Note that the first and second predetermined thresholds may be the same or different.
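Before turning to cases (a) to (e) in more detail, the sampling for source and isolated nodes at step 116B may be sketched as follows. This is a minimal sketch: an empirical probability table is used for discrete variables and a Gaussian kernel density estimate (scipy's gaussian_kde, one possible choice of density estimator) for continuous variables; the function name and seeding are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def sample_source_node(real_values: pd.Series, is_discrete: bool, n_samples: int):
    """Step 116B sketch: draw n_samples values for a source or isolated node
    from its marginal distribution estimated on the real clinical data."""
    values = real_values.dropna()
    if is_discrete:
        # Probability table estimated from the real data, then sampled from.
        probs = values.value_counts(normalize=True)
        return rng.choice(probs.index.to_numpy(), size=n_samples, p=probs.to_numpy())
    # Continuous variable: fit a probability density function (here a kernel
    # density estimate) and sample from it.
    kde = gaussian_kde(values.to_numpy())
    return kde.resample(n_samples)[0]
```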
A multivariate conditional probability table may be used in cases (a), (b), (d). A machine learning model may be used in cases (c) and (e). The method may comprise determining that the target node and all parent nodes are associated with discrete variables (case (a)) and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with target nodes and parent nodes. Thus, the method may comprise estimating a multivariate conditional probability table from the real clinical data for variables associated with target nodes and parent nodes that are all discrete variables. The method may comprise determining that at least one of the target node and parent nodes is associated with a continuous variable (cases (b) to (e)) and discretising the target or parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme (e.g. in cases (b) and (d)). The method may comprise determining that at least one of the target node and parent nodes is associated with a continuous variable (cases (b) to (e)) and using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes (e.g. in cases (c) and (e)). Thus, either a machine learning model or a multivariate conditional probability table may be used in cases where at least one of the parent and target nodes is continuous, depending on whether the number of parent nodes that are associated with a continuous variable satisfies a predetermined criterion. The predetermined criterion may be the number of parent nodes that are associated with a continuous variable being at or above a predetermined threshold (in which case a machine learning model is used), or below said predetermined threshold (in which case a multivariate conditional probability table is used). The predetermined criterion (e.g. predetermined threshold) may depend on whether the target node is associated with a continuous or discrete variable (i.e. there may be a first predetermined threshold and a second predetermined threshold that are different from each other). The machine learning model used to obtain a value for a target node at step 116C (note that a separate machine learning model is used for each target node) has been trained using training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes. In other words, the machine learning model used to obtain a value for a target node is configured to predict a value for the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node. Obtaining a value for a target node at step 116C (e.g. 
in cases (c) and (e)) may comprise training a machine learning model to predict a value of the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node, wherein the training uses training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes. The machine learning model may be a regression or classification model. The machine learning model may be a non-linear classification or regression model. The machine learning model may be a tree-based classification or regression model, optionally a random forest model. The method may comprise obtaining an adjusted value for a predicted value obtained using the machine learning model, wherein the predicted value is outside of a predetermined range and the adjusted value is the value of the nearest boundary of the predetermined range. The predetermined range may be a range that is determined by a user for the particular variable that is predicted, or a range that is automatically determined between the smallest and largest value for the variable in the real clinical data. This may be particularly useful in the context of machine learning models that are regression models, as these may provide outputs that are outside of the range of values present in the real clinical data on which they were trained. Thus, obtaining a value for a target node using a machine learning model trained on the real clinical data may comprise predicting a value using said machine learning model and adjusting the predicted value to fall within the observed range for the variables in the real clinical data, optionally by setting the predicted value to the nearest boundary of the range when the predicted value is outside of the observed range (see the sketch after this paragraph). The method may comprise determining that at least one parent node is associated with a continuous variable (cases (b) to (e)), and when the number or proportion of parent nodes associated with a continuous variable is below a predetermined threshold, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme (e.g. cases (b) and (d)). The method may comprise determining that at least one parent node is associated with a continuous variable (cases (b) to (e)), and when the number or proportion of parent nodes associated with a continuous variable is at or above a predetermined threshold, using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes (e.g. cases (c) and (e)). The predetermined threshold may be 3 parent nodes associated with a continuous variable when the target node is discrete (case (b), first predetermined threshold) and 2 parent nodes associated with a continuous variable when the target node is continuous (case (d), second predetermined threshold). When a parent node that is associated with a continuous variable is discretised (e.g. cases (b) and (d)), this can use a respective discretisation scheme, i.e. a discretisation that is specific to the target and parent nodes. 
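As referenced above, the sketch below illustrates the machine-learning route for a continuous target node (e.g. case (e)): a random forest regression model is trained on the real data (the randomForest R package is the one named in the examples below) and out-of-range predictions are set to the nearest boundary of the observed range. The variable names and the toy data are hypothetical.

```r
library(randomForest)

# Hedged sketch: train a random forest regression on the real data, predict the
# target from synthetic parent values, then clamp predictions to the observed range.
fit_target_model <- function(real_data, target, parents) {
  randomForest(x = real_data[, parents, drop = FALSE], y = real_data[[target]])
}

predict_target <- function(model, synthetic_parents, observed_range) {
  pred <- predict(model, newdata = synthetic_parents)
  # Set out-of-range predictions to the nearest boundary of the observed range.
  pmin(pmax(pred, observed_range[1]), observed_range[2])
}

# Usage (toy example with hypothetical columns):
real  <- data.frame(age = c(44, 57, 61, 70, 52), npi = c(3.2, 4.4, 5.0, 6.1, 4.0))
model <- fit_target_model(real, target = "npi", parents = "age")
synth_parents <- data.frame(age = c(48, 66))
synth_npi <- predict_target(model, synth_parents, observed_range = range(real$npi))
```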
The discretisation scheme may be one that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes. The present inventors have found such approaches to substantially improve the quality of the generated data. This is because the choice of discretisation scheme can alter the apparent relationship (e.g. correlation) between variables, leading to multivariate conditional probability tables that do not accurately capture the relationships between the variables. Using a discretisation scheme that takes into account the effect of the discretisation on the relationship between the target and parent nodes ensures that the discretisation preserves the information in this relationship (where the information is estimated based on the objective criterion chosen). The objective criterion may be the maximisation of the absolute value of a statistical metric of dependence between the variables associated with the target and parent nodes. The statistical metric of dependence may be mutual information. Thus, obtaining a value for a target node may comprise: determining that at least one parent node is associated with a continuous variable (and optionally that the number of parent nodes associated with a continuous variable satisfies a predetermined criterion, e.g. cases (b) and (d)), discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data (and optionally respective fitted probability density functions for the target variable if it is associated with a continuous variable) for the variables associated with the target nodes and parent nodes using said discretisation scheme; wherein discretising the parent nodes that are associated with a continuous variable comprises, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of a statistical metric of dependence between the target and parent nodes, amongst a plurality of values of said statistical metric associated with respective candidate pluralities of non-overlapping ranges for the continuous variable. A statistical metric of dependence may be a correlation or mutual information. Mutual information is advantageous as it is able to capture non-linear relationships between variables. As the skilled person understands, an optimisation step does not guarantee that a global optimal solution is identified, and merely implements a strategy to explore the space of possible solutions (each solution corresponding to a candidate plurality of non-overlapping ranges for the continuous variable to be discretised) and select the best solution amongst the solutions explored, according to the objective criterion used. 
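A simplified illustration of such an optimisation is sketched below: among candidate equal-frequency binnings of a continuous parent, the one maximising the mutual information with a discrete target minus a BIC-style complexity penalty (of the form of equation (16) in the examples below) is retained. The restriction to equal-frequency candidate binnings is an assumption made for brevity; the method described in the examples optimises cut points more finely.

```r
# Hedged, simplified sketch of the discretisation objective: keep the candidate
# binning that maximises mutual information with the target minus a penalty.
mutual_information <- function(a, b) {
  p_ab <- prop.table(table(a, b))
  p_a  <- rowSums(p_ab)
  p_b  <- colSums(p_ab)
  sum(p_ab * log(p_ab / outer(p_a, p_b)), na.rm = TRUE)  # 0*log(0) terms dropped
}

best_equal_frequency_binning <- function(x, y, max_bins = 10) {
  n <- length(x)
  candidates <- 2:max_bins
  scores <- sapply(candidates, function(r) {
    breaks  <- unique(quantile(x, probs = seq(0, 1, length.out = r + 1)))
    bins    <- cut(x, breaks = breaks, include.lowest = TRUE)
    # BIC-style penalty per bin count, in the spirit of equation (16).
    penalty <- 0.5 * (nlevels(bins) - 1) * (nlevels(factor(y)) - 1) * log(n) / n
    mutual_information(bins, y) - penalty
  })
  candidates[which.max(scores)]  # number of bins with the best penalised score
}

# Usage (toy example):
x <- rnorm(500)
y <- ifelse(x + rnorm(500, sd = 0.5) > 0, "responder", "non-responder")
best_equal_frequency_binning(x, y)
```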
The objective criterion may be expressed as: I'(X;Y) = supP,Q [ I([X]P;[Y]Q) - k'P:Q(N)/N ], where k'P:Q(N) is a complexity term that discriminates between variable dependence and independence given the number N of observations for variables X and Y in the real clinical data, P and Q are partitions for parent variable X and target variable Y respectively, where a partition is a set of non-overlapping subranges for a continuous variable or the discrete values observed in the real clinical data for a discrete variable, and I is the mutual information between discrete or discretised variables X and Y, optionally wherein log(binom(N-1, r-1)) = (r-1)CN,r (20), where CN,r is the cost associated with each of the r-1 cut points defining the plurality of non-overlapping subranges for variables X and Y, with r = rx or ry, as defined in equation (18). When the target node is associated with a continuous variable and all parent nodes are associated with discrete or discretised variables (e.g. case (d)), the method may comprise fitting a probability density function to the real clinical data for the continuous variable associated with the target node, wherein a separate probability density function is fitted for each combination of values of the discrete or discretised parent nodes. In other words, the method may comprise obtaining a multivariate conditional probability table where the probability for the target node for each combination of discrete values of the parent nodes is defined by a fitted distribution (probability density function). When using a multivariate conditional probability table (e.g. cases (a), (b), (d)), the method may comprise, for each variable that is a discrete variable, including missing data (i.e. NAs) as a category of the discrete variable. Thus, the method may comprise generating missing data according to a conditional probability determined from the real clinical data. At step 118, the results of any one or more of the preceding steps are provided to a user (e.g. through a user interface), or to another computing device, memory or database. The methods described herein find application in a variety of contexts. For example, the methods described herein can be used to generate training data for training one or more machine learning models to predict clinically relevant characteristics (such as e.g. prognosis such as survival, response to a treatment, diagnosis including disease subtyping, disease severity, etc.). The synthetic training data may be used alone or in combination with additional “real” data. As another example, the methods described herein can be used to generate augmented clinical data, comprising real clinical data and additional synthetic clinical data that is generated from the real clinical data using a method as described herein. This synthetic data may comprise whole samples (i.e. complete data for a whole synthetic patient) or predicted values for specific variables, for example in order to fill in missing values. The augmented data set can be used for any clinical tool development (including e.g. machine learning predictors) or any clinical data analysis known in the art. In particular in the context of augmentation by filling in missing values (i.e. data imputation), the method may enable new methods to be applied on the data that may not have been applicable to data comprising missing values. 
This can make a substantial difference where data for many patients have a small number of missing fields, as filtering to remove patients with any missing data would drastically reduce the size of the dataset available for clinical investigations. As another example, the methods described herein may be used to combine multiple datasets for subsequent analysis, when at least one of the datasets cannot be shared in its original form. Each such dataset can be used to generate synthetic data using a method as described herein, and the resulting data can then be combined and jointly analysed in any desired way. For example, the combined dataset (which may comprise a mixture of real and synthetic datasets) can then be used for any clinical investigation known in the art, including the training of machine learning predictors. Figure 2 is a flowchart illustrating a method for analysing clinical data according to an embodiment of the disclosure. The method comprises obtaining, at optional step 210, real clinical data comprising values for a plurality of clinical variables for a plurality of patients. This step is optional because the subsequent steps may use exclusively synthetic data or a combination of real and synthetic clinical data. The method further comprises, at step 212, obtaining synthetic clinical data comprising values for the plurality of clinical variables for one or more (simulated) patients using the method of any embodiment explained by reference to Figure 1. The method may comprise performing one or more of: (i) providing a clinical predictor tool by obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables, wherein the training data comprises said synthetic data; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis (steps 210, 212, 214, 216); (ii) combining said synthetic data with the real clinical data used to obtain the synthetic data (steps 210, 212, 214, 220); (iii) outputting said synthetic data to a third party or public data repository (step 218); (iv) combining said synthetic data with another clinical dataset for the purpose of further analysis (steps 220, 222); and analysing said data using any data-driven clinical discovery method (step 222). Thus, also described herein are methods of data augmentation comprising generating synthetic data using real clinical data and combining the real and synthetic data. This can result in smoother data, which can improve the performance of machine learning algorithms applied to said data, for example for the purpose of obtaining a clinical predictor tool. Similarly, also described herein are methods of data imputation comprising generating synthetic data using real clinical data and using the synthetic data to fill in missing values in the real clinical data. The data obtained at step 212 and the data obtained at step 210 may optionally be combined at step 214 to form an “augmented” dataset comprising more data than would have been available if relying solely on the real clinical data obtained at step 210. 
In the illustrated embodiment, the plurality of variables comprises a variable indicative of a diagnosis or prognosis and one or more further clinical variables. The data obtained at step 214 or at step 212 is used to train a clinical predictor model at step 216 to predict the variable indicative of a diagnosis or prognosis, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis. The clinical predictor model may be a classification or regression model. The training data may comprise real and synthetic clinical data. The diagnosis or prognosis may be selected from a disease severity, a disease subtype, and survival. The diagnosis or prognosis may be survival and the disease may be cancer. At step 218, the results of any one or more of the preceding steps are provided to a user (e.g. through a user interface), or to another computing device, memory or database. The methods described herein are computer-implemented unless context specifies otherwise (such as e.g. where measurement steps and/or wet steps are involved). Thus, the methods described herein are typically performed using a computer system or computer device. Any reference to an action such as “obtaining”, “processing” or “determining” may therefore refer to a processor performing the action, or a processor executing instructions that cause the processor to perform the action. Indeed, the methods of the present invention, comprising at least the training of machine learning algorithms to generate synthetic data, the training of machine learning algorithms to provide a clinical prediction using synthetically generated data, and the identification of mutual information based networks between multiple variables using data from hundreds of patients (involving multiple optimisation steps for each learned interaction as described in the Examples below), are such that they cannot be performed in the human mind. As used herein, the terms “computer system” or “computer device” include the hardware, software and data storage devices for embodying a system or carrying out a computer implemented method. For example, a computer system may comprise one or more processing units such as a central processing unit (CPU) and/or a graphical processing unit (GPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display (for example in the design of the business process). The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. For example, a computer system may be implemented as a cloud computer. The term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media. 
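A minimal sketch of steps 214 and 216 described above (augmenting the real data with synthetic data and training a clinical predictor) is shown below. The column names and the use of a random forest classifier are illustrative assumptions; any suitable classification or regression model could be substituted.

```r
library(randomForest)

# Hedged sketch of step 214 (augmentation) followed by step 216 (training a
# clinical predictor). `real` and `synthetic` are assumed to share the same
# columns; `outcome` is a hypothetical variable indicative of diagnosis/prognosis.
augment_and_train <- function(real, synthetic, outcome = "outcome") {
  augmented <- rbind(real, synthetic)                   # step 214: augmented dataset
  augmented[[outcome]] <- factor(augmented[[outcome]])  # classification target
  # Step 216: a random forest is shown purely for illustration.
  randomForest(reformulate(setdiff(names(augmented), outcome), response = outcome),
               data = augmented, na.action = na.omit)
}

# Usage (toy example with hypothetical columns):
real      <- data.frame(age = c(50, 63, 47, 71), npi = c(3.1, 5.2, 2.8, 6.0),
                        outcome = c("alive", "deceased", "alive", "deceased"))
synthetic <- data.frame(age = c(55, 68), npi = c(3.9, 5.7),
                        outcome = c("alive", "deceased"))
predictor <- augment_and_train(real, synthetic)
```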
Figure 3 shows an embodiment of a system for providing synthetic clinical data, for providing a clinical predictor tool, and/or for comparing or combining data according to the present disclosure. The system comprises a computing device 1, which comprises a processor 101 and computer readable memory 102. In the embodiment shown, the computing device 1 also comprises a user interface 103, which is illustrated as a screen but may include any other means of conveying information to a user such as e.g. through audible or visual signals. The computing device 1 is communicably connected, such as e.g. through a network, to one or more databases 2 storing clinical data about a plurality of patients. The one or more databases 2 may further store one or more of: one or more machine learning algorithms, training data, parameters (such as e.g. parameters of a machine learning model, feature selection algorithm, data preprocessing methods, etc.), etc. The computing device may be a smartphone, tablet, personal computer or other computing device. The computing device is configured to implement a method as described herein. In alternative embodiments, the computing device 1 is configured to communicate with a remote computing device (not shown), which is itself configured to implement a method as described herein. In such cases, the remote computing device may also be configured to send the result of the method to the computing device. Further, the various steps of the methods described herein may be split between the computing device 1 and the remote computing device. The remote computing device may be a cloud computing device, a server node, etc. Any processing device known in the art may be used for this purpose. Communication between the computing device 1 and the remote computing device may be through a wired or wireless connection, and may occur over a local or public network 3 such as e.g. over the public internet. The following is presented by way of example and is not to be construed as a limitation to the scope of the claims.
Examples
Electronic data collection has become ubiquitous in the last two decades and is paramount for the development of new Artificial Intelligence (AI) algorithms and tools. However, in the context of clinical data there is often a need to preserve privacy, which hinders the construction of large training datasets since datasets cannot easily be shared publicly. This ultimately impacts the performance of the models that can be developed, both in terms of accuracy and generalisability, since training datasets are smaller than they could otherwise be. For example, in 2015, EU countries agreed on the General Data Protection Regulation (GDPR), which was enforced starting 2018. It applies to anyone or any organization that is based in Europe and/or collects and processes information related to EU citizens, bringing strong restrictions on data collection and sharing. Concerning health records, GDPR requires data to be processed for health-related purposes only and under full patient consent. Under GDPR legislation, data collection, usage and sharing require careful attention, pushing companies towards the development of multiple organizational measures capable of protecting users’ personal data. Data publication needs to guarantee Statistical Disclosure Control (SDC), referring to techniques able to ensure that no person is identifiable from the published data. 
This includes two possible cases: i) identification disclosure, when an attacker is able to link some data to a specific individual, and ii) attribute disclosure, when the attacker is able to learn new information on the subject by using prior knowledge and the information contained in the data. Classical anonymization techniques like k-anonymity (Samarati & Sweeney, 1998) protect user data by minimizing the risk of re-identification, while keeping in theory a good level of data utility. K-anonymity is obtained through data suppression and data generalization, so that each person in the collection cannot be distinguished from at least k-1 individuals by using quasi-identifier features (attributes available to an adversary). Machanavajjhala et al. (2007) showed k-anonymity to be vulnerable to some attacks when using background knowledge and proposed a new privacy criterion, named l-diversity. Li et al. (2007) published a novel privacy criterion named t-Closeness, with even stronger properties for privacy preservation. In addition to being computationally expensive, as well as requiring prior knowledge to be able to anonymize the data, these classical tools have been shown to deteriorate the data distribution, making the data no longer exploitable in many situations (Hernandez et al. 2022). In addition to the above traditional anonymization techniques, recent technological advances in artificial intelligence, notably in generative modeling, led to the development of synthetic tabular data generation (SDG) algorithms. SDG is performed by training a machine learning model using the real data set and generating data that mimic the original data (Hernandez et al. 2022). This process is done by learning the underlying data distribution and using it to generate synthetic samples. A vast number of methods have been developed over the last decade for synthetic tabular data generation (reviewed in Hernandez et al. 2022). Algorithms that can deal with mixed-type data include Generative Adversarial Networks (GANs), Classification and Regression Trees (CARTs), Bayesian Networks and Variational Autoencoders (VAEs). However, there is no universal method or metric to evaluate and benchmark the performance of the various approaches, both in terms of quality and privacy preservation. The present inventors set out to design a method that can satisfy these two requirements. They developed a novel synthetic tabular data generation algorithm based on the results of the Multivariate Information-based Inductive Causation (MIIC) algorithm (Verny et al. 2017; Cabeli et al. 2020). In order to properly assess this new method, the authors further introduced a trade-off measure between quality and privacy metrics. They used this to compare the proposed method with state-of-the-art synthetic healthcare data generation methods, demonstrating that the proposed approach is not only competitive with the best existing methods in terms of both privacy and quality, but also outperforms these methods when it comes to balancing these contradictory requirements. This example describes a novel synthetic data generation method: the MIIC-SDG algorithm is able to generate synthetic data that accurately captures the information present in the original data, without copying the data. The method is based on the reconstruction of a Bayesian network that preserves direct associations and causal relationships in the dataset. 
Method overview
The proposed method, termed MIIC-SDG, takes advantage of the MIIC algorithm (MIIC network reconstruction, described in Cabeli et al. 2020), which can reliably capture the set of direct associations in complex heterogeneous datasets such as healthcare medical records. MIIC-SDG builds on the MIIC algorithm by adding a new algorithm to transform a graph into a DAG by exploiting the information given by MIIC, and a synthesizer (MIIC synthesizer) capable of generating samples that mimic the original data while taking into account the multivariate distribution associated with the data. MIIC-SDG is composed of three steps, illustrated on Figure 15:
1. MIIC network reconstruction: inferring a graphical model associated with the original dataset using the MIIC algorithm (Figure 15A).
2. MIIC DAG generation: creating a directed acyclic graph (DAG) using the previously inferred network (Figure 15B).
3. MIIC synthesizer: generating synthetic samples based on the DAG and the original data using several approaches that depend on the nature of parent and child nodes in the graph (Figures 15C and D).
Each of these is described in more detail below.
MIIC network reconstruction
MIIC (Multivariate Information-based Inductive Causation) is an algorithm that infers a graphical network to represent the direct and possibly causal associations between variables in a dataset (described in Verny et al. 2017 and Cabeli et al. 2020). The algorithm is able to estimate conditional mutual information even when the dataset includes a mixture of categorical and continuous variables. MIIC does not have any hyperparameters and is not sensitive to the order of features in the input data. It can estimate the set of associations between variables in the presence of missing data without the need for an a priori data imputation technique, which could introduce bias into the data. MIIC has been shown to be robust to sampling noise and to reliably estimate (conditional) mutual information. These features have been demonstrated in multiple benchmarks (see Cabeli et al. 2020). The graph generated by MIIC is composed of both undirected and directed edges, which may also form some directed cycles (as with other constraint-based causal discovery methods). The directed edges originate from the discovery of v-structures (Verny et al. 2017), which are signatures of causality in observational data, or through the propagation of orientation from upstream v-structures. Hence, directed edges do not necessarily correspond to causal associations. MIIC is available through a web-server (Sella et al. 2018) and an R package, and has been recently applied to a breast cancer cohort of patients treated at Institut Curie in Paris, providing a novel way to globally visualize, analyze, and understand the connections between well-known clinical features (Sella et al. 2022). The MIIC network reconstruction step uses the MIIC method as described in Verny et al. 2017 and Cabeli et al. 2020, also outlined below. An example of a MIIC reconstructed network applied on the METABRIC dataset (see Example 2) is shown on Figure 4. MIIC reconstruction was performed using the default parameters (orientation and propagation enabled, no latent variables, no filtering on edge confidence, no separating set consistency constraint and KL-distance enabled on the search for confounders in the presence of missing data). The MIIC algorithm is based on the concept of multivariate information I(X;Y;Z;…) 
which extends the concept of mutual information to more than two variables. The binary mutual information between a pair of variables (equation (1)) is decomposed relative to a set of variables {Ai}: I(X;Y) = I(X;Y;{Ai}) + I(X;Y|{Ai}) (2), where I(X;Y;{Ai}) can be seen as the global indirect contribution of {Ai} to I(X;Y), and I(X;Y|{Ai}) as the remaining (direct) contribution. Conditional independence, i.e. I(X;Y|{Ai}) = 0, indicates that {Ai} is a “separation set” which intercepts all indirect paths contributing to the total mutual information, since then I(X;Y) = I(X;Y;{Ai}). This relates to the concept of signatures of causality in observational data, which are associated with unique correlation patterns: two mutually independent variables (I(X;Y) = 0) are not connected to each other, but if they are both individually dependent on a third variable Z, then there must be a v-structure X→Z←Y, because this correlation pattern is not compatible with edges XZ and YZ being undirected or with Z being a cause of X or Y. In practice, mutual information cannot be exactly zero for finite datasets, but the probability that the edge XY should be removed (indicating that X and Y are mutually independent) can be estimated from data as PXY ~ exp(-N·I'N(X;Y)), up to a normalisation constant (hence the “~”), where N is the number of independent samples. This is performed in an edge pruning step (see below). The resulting network is undirected but can be partially directed by orienting all v-structures based on the signature of causality and propagating these orientations on downstream edges. Thus, the MIIC algorithm learns graphical models by progressively uncovering the information contributions of indirect paths in terms of multivariate information. The multivariate information between a set of variables X1 to Xp is obtained based on multivariate entropies as: I(X1;…;Xp) = -Σ over non-empty subsets T of {X1,…,Xp} of (-1)^|T| H(T). For example, for 3 variables this leads to: I(X;Y;A) = H(X)+H(Y)+H(A)-H(X,Y)-H(X,A)-H(Y,A)+H(X,Y,A) (6). The three-point information can be positive or negative. Conditional multivariate information terms I(X1;…;Xp|A) are defined similarly but using conditional multivariate entropies H({Xi}|A). The MIIC algorithm proceeds in 3 steps:
1. Edge pruning: Starting from a fully connected network, MIIC removes dispensable edges by iteratively subtracting the most significant information contributions from indirect paths between each pair of variables. The global indirect contribution is obtained iteratively as: I(X;Y;{Ai}n) = I(X;Y;{Ai}n-1) + I(X;Y;An|{Ai}n-1) (7), where n refers to the iteration and I(X;Y;An|{Ai}n-1) > 0 corresponds to the contribution of the most likely nth variable An after collecting the first n-1 most likely contributors {Ai}n-1. Only positive information terms contribute to the global mutual information between X and Y, hence the requirement that I(X;Y;An|{Ai}n-1) > 0. Negative information terms I(X;Y;An|{Ai}n-1) < 0 do not contribute to I(X;Y) but are signatures of causality in observational data and are used to orient v-structures, i.e. X→An←Y, in the edge orientation step (see below). Significant contributors are collected based on the 3off2 score (Affeldt & Isambert, 2015; Affeldt et al. 2016), which maximises conditional three-point information while minimizing conditional two-point (mutual) information. This score reliably assesses conditional independence. The most likely contributor An after collecting the first n-1 contributors {Ai}n-1 is chosen by maximising this score. 
In particular, the score (Slb) for the addition of node Z on previously selected {Ai} combines the maximum of three-point information and the minimum of two-point information as Slb(Z;XY|{Ai}) = min[Pnv(X;Y;Z|{Ai}), Pdpi(XY;Z|{Ai})] (8), and the pair of nodes XY with the most likely contribution from a third node Z and likely to be absent from the model can be ordered according to their rank R(XY;Z|{Ai}) = maxZ(Slb(Z;XY|{Ai})) (9), where Pnv(X;Y;Z|{Ai}) = 1/(1+exp(-N·I'(X;Y;Z|{Ai}))) (10) is the probability that Z is likely to be included in {Ai} when I'(X;Y;Z|{Ai}) > 0 (non-v-structure probability), and Pdpi(XY;Z|{Ai}) = [1 + exp(-N·I'(X;Z|{Ai}))/exp(-N·I'(X;Y|{Ai})) + exp(-N·I'(Z;Y|{Ai}))/exp(-N·I'(X;Y|{Ai}))]^(-1) (11) is the probability that the edge removal associated with the symmetric contribution I(X;Y;Z|{Ai}) is consistent with the Data Processing Inequality, DPI (DPI-consistency probability). Thus, the algorithm removes dispensable edges by iteratively: selecting the top edge XY with highest rank R(XY;An|{Ai}n-1), updating the contributor list {Ai}n ← {Ai}n-1 + An; if I(X;Y|{Ai}n) is not significant given the finite number N of samples, removing edge XY, else searching for the next best contributor An+1 of edge XY, if one exists with I(X;Y;An+1|{Ai}n) > 0, and updating the ranking order R(XY;An+1|{Ai}n). This is repeated until no more edges can be removed. Significance is determined using the normalised maximum likelihood (NML) criterion or Bayesian information criterion (BIC)/minimum description length (MDL) criterion as described in Verny et al. 2017, see also below (comparing the mutual information to the NML or MDL complexity to obtain a regularised mutual information that has to be above a threshold to be significant – see in particular Verny et al. 2017, supplementary file S1, section 1.2). The residual (conditional) mutual information I'N(X;Y|{Ai}) (i.e. after indirect effects of significant contributors {Ai} have been subtracted from I'N(X;Y)), including finite size corrections (see below), is related to the removal probability of each edge, PXY = exp(-N·I'N(X;Y|{Ai})), where N·I'N(X;Y|{Ai}) > 0 corresponds to the strength of the retained edge. The strength of a retained edge is illustrated on Figure 4 by the thickness of the edge.
2. Edge filtering / confidence of edges (optional – not performed in the present examples): The remaining edges can be further filtered based on the confidence ratio assessment: CXY = PXY/<PrandXY>, where <PrandXY> is the average of the probability to remove the XY edge after randomly permuting the dataset for each variable, and PXY = exp(-N·I'N(X;Y|{Ai})) (3b) as above. The lower the value of CXY, the higher the confidence in the XY edge. In previous experiments, filtering edges with CXY > 0.1 or 0.01 was found to limit the false discovery rate with small datasets, while maintaining satisfactory true positive rates. This was not found to be necessary in the present examples, and was at least in some cases found to result in lower quality (but higher privacy) synthetic data due to the starting networks being sparser and therefore potentially missing informative relationships. As described above, the method without edge filtering was already found to strike a very good balance between privacy and quality, and it was therefore not deemed necessary to increase privacy by edge filtering if this negatively impacted quality. 
3. Edge orientation: remaining edges are then oriented based on the sign of (conditional) three-point information in the observed data. Initially unspecified endpoint marks (o) can be established as arrow tail (-) or head (>) by iteratively taking the triple (X,Y,Z), X≠Y, with the highest endmark orientation/propagation probability > 1/2 (until no additional endmark orientation/propagation probability exceeds 1/2):
- if I'N(X;Y;Z|{Ai}n) < 0 and X*-o Z o-*Y or X*→Z o-*Y, orient edge(s) to form a v-structure: X*→Z←*Y;
- else if I'N(X;Y;Z|{Ai}n) > 0 and X*→Z o-*Y, propagate the second edge direction to form a non-v-structure X*→Z→Y.
In the above the symbol * stands for any endpoint mark. The endmark orientation/propagation probabilities are given by:
- probability of X*→Z←*Y if X,Y,Z form an unshielded triple X*-o Z o-*Y with X≠Y: PoX*→Z = (1+exp(N·I'(X;Y;Z|{Ai})))/(1+3·exp(N·I'(X;Y;Z|{Ai}))) (12);
- probability of X*→Z←*Y if X,Y,Z form an unshielded triple with one already known converging arrow into the middle node, X*→Z o-*Y: PoY*→Z = PoX*→Z·((1/(1+exp(N·I'(X;Y;Z|{Ai}))))-1/2)+1/2 (13).
The final network contains up to three types of edges: undirected, directed, as well as bidirected edges, which originate from a latent variable L, unobserved in the dataset but predicted to be a common cause of X and Y, i.e. X ← (L) → Y. In the present examples, latent variables were removed and there are no bidirected edges. This is because latent variables could not be used in the synthetic data generation step (see below) since there is no data for them in the input data. The MIIC method further implements an information-maximizing discretization of continuous data. Indeed, in the above, a general definition of mutual information is used which can be applied to continuous or mixed-type variables, instead of the discrete summation over nominal variables I(X;Y) = Σx,y p(x,y)·log(p(x,y)/(p(x)p(y))): I(X;Y) = supP,Q I([X]P;[Y]Q) (14), where the mutual information is the supremum of the values of mutual information calculated over all finite partitions P,Q of the variables X and Y. However, in practice with real datasets of finite sample size, this eventually assigns each of the N different samples into different bins. Therefore, a modification of this is used: I'(X;Y) = supP,Q [ I([X]P;[Y]Q) - k'P:Q(N)/N ] (15), where k'P:Q(N) is a complexity term introduced to discriminate between variable dependence and independence given a finite dataset of size N. This works as a penalty term to outweigh the information gain in refining bin partitions further when there is not enough data to support such a refined model. For discrete variables the following complexity terms can be used:
- Bayesian Information Criterion (BIC): k'P:Q(N) = (1/2)·(rx-1)(ry-1)·logN (16), where rx and ry are the number of bins for X and Y;
- X and Y normalised maximum likelihood (NML) criteria: C(ny; rx) (17a) is the parametric complexity associated with the yth bin of variable Y containing ny data points and, similarly, C(nx; ry) (17b) is the parametric complexity for the nx-size bin of variable X, which are defined by summing a multinomial likelihood function over all possible partitions of n data points into a maximum of r bins as: C(n; r) = Σ(l1+…+lr=n) [n!/(l1!…lr!)]·Π(k=1..r) (lk/n)^lk (18), where lk is the number of data points in bin k and l1, l2,…,lr are r non-negative integers such that their sum is equal to n.
For continuous variables, the variable categories (cut points) also need to be specified and encoded in the model complexity (equation (19), which adds the cost of the cut points to the complexity term), with log(binom(N-1, r-1)) = (r-1)·CN,r (20), where CN,r is the cost associated with each of the r-1 cut points, with r = rx or ry (equation (18)). 
In the MIIC algorithm, instead of exhaustively calculating over all possible partitions P and Q when solving equation (15), a local optimisation heuristic is implemented which finds the optimal partition (cut points) for each continuous variable iteratively, keeping the partitions of the other continuous variables fixed. For example, for two variables this starts from an initial X partition with rx bins and an estimate of the number of Y bins, ry. The sample-scaled mutual information with finite size correction, i.e. n·I'n(X;Y), is optimised iteratively for n = 1,…,N samples over all Y partitions. Then, adopting this optimised partition for Y, the same scheme is applied to optimise X. The X and Y partitions are iteratively optimised until a stable two-state limit cycle is reached. Note that this scheme does not identify a single set of bin boundaries for each variable; instead, the evaluations depend on the conditioning nodes. A similar scheme is used in some cases in the synthetic data generation step (see below), except that continuous variables used to generate synthetic data are discretised taking into account only the source and target nodes (at least one of which is a continuous variable to be discretised), i.e. this uses bivariate information. The same scheme can be applied when computing conditional mutual information involving continuous or mixed-type variables. Starting from an initial partition for X, each term of the following equation: I'N(X;Y|{Ai}) = I'N(X;Y,{Ai}) - I'N(X;{Ai}) (21) is optimised with respect to Y and {Ai} partitions, using equation (17a) as parametric complexity extended to multivariate categories ny,{ai} and n{ai}. Then, each term of the following equation: I'N(X;Y|{Ai}) = I'N(Y;X,{Ai}) - I'N(Y;{Ai}) (22) is optimised with respect to X and {Ai} partitions, using equation (17b) as parametric complexity extended to multivariate categories nx,{ai} and n{ai}. Partitions {Ai} are optimised separately for each of the 4 terms in equations (21) and (22) before taking their differences.
MIIC DAG generation
In the new methods described herein, the inventors expand MIIC's ability to learn unparameterized network structures by incorporating a framework capable of generating synthetic data from MIIC reconstructed graphs. This approach takes advantage of the Bayesian framework, where the starting point is a Directed Acyclic Graph (DAG), which can be parametrized with the original data. For this reason, prior to data generation, the initial graph has to be transformed into a DAG. The inventors implemented two different algorithms for the generation of DAGs from MIIC networks. First, the MIIC graph can be treated as a non-directed graph and converted into a DAG via a Depth-First Search (DFS) orientation algorithm. In this first scenario multiple DAGs can be generated by changing the starting node for the DFS visit (root node), which was found to generate some variability between synthetic datasets when using different root nodes. The second approach consists of two steps: 1. First, each undirected edge in the MIIC network is oriented so as to minimize the number of directed cycles and possibly avoid them. This is performed in order of degree, i.e. nodes are sorted by ascending degree, then, considering all undirected edges involving nodes with the same degree (and increasing to the next degree at each iteration), any remaining undirected edges are oriented so as to minimise the number of cycles. 
The sorting by ascending degree is optional, and was performed to minimise the number of v-structures. Note that there is not necessarily a single global solution, i.e. multiple solutions may result in the same number of cycles – this step mostly aims to generate a directed graph as a starting point that has as few cycles as possible. All starting-point directed graphs with the minimum number of cycles obtained with this step are equivalent from the point of view of the original data set, since they only orient edges that could not be oriented based on the data (i.e. both orientations are consistent with the original data). 2. Then, all directed cycles are removed from the graph, if any are present. In order to guarantee the removal of all the cycles of the graph, the MIIC-to-DAG algorithm iteratively considers the longest cycle in the graph (the one with the most edges) and flips the edge that minimizes the number of remaining cycles in the graph. Taking the longest cycle guarantees the removal of at least one cycle at each iteration and therefore convergence towards a DAG. In other words, this step evaluates, for each edge of the longest cycle, the number of remaining cycles if the edge is flipped, then selects for flipping the edge that, if flipped, results in the smallest number of remaining cycles. A pseudocode for this procedure, termed MIIC-to-DAG, is shown on Figure 14. The inventors compared these two methods and decided to use the second approach, which best exploits MIIC's capability of discovering a network that contains a mixture of directed and undirected edges (latent variables are excluded in the present setting). The second method also results in synthetic data generation with lower variability. The process described here was tested using both methods, leading to similar results (i.e. both would be usable), but the second method is better able to deal with cycles in the MIIC network. An additional approach that was tested was to use the MIIC network as a Partially Directed Acyclic Graph (PDAG) and transform undirected edges to directed edges through the application of an algorithm that extends a PDAG to a DAG without contradicting already oriented edges (keeping the v-structures already found). The inventors used the pdag2dag function of the pcalg R package for this purpose, which implements the method described in Dor & Tarsi (1992); a minimal usage sketch is shown below. A partially directed acyclic graph is a graph that contains both directed edges and undirected edges, with no directed cycle in its directed subgraph. In the method of Dor & Tarsi (1992) this is transformed into a fully directed acyclic graph on the same underlying set of edges, with the same orientation on the directed subgraph and the same set of v-structures, by selecting a sink node x such that all nodes y connected to x by undirected edges are adjacent to each other, and making all undirected edges connected to x into edges directed toward x. This is iteratively repeated until no node x remains. This method is less widely applicable than the two methods above because the MIIC algorithm is not guaranteed to generate a PDAG. Indeed, although cycles are rare in real data, when they do occur this method would not be applicable.
MIIC synthesizer
The data generation component leverages the Directed Acyclic Graph (DAG) obtained in the previous stage. 
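As referenced above, the sketch below shows a minimal call to pcalg::pdag2dag on a small partially directed graph. The pcalg package and the Dor & Tarsi (1992) method are named in the text; the way the graphNEL input is constructed here (via a graphAM adjacency-matrix object from the graph package) is an assumption made for illustration and may differ across package versions.

```r
library(graph)   # graphAM / graphNEL classes
library(pcalg)   # pdag2dag(), implementing Dor & Tarsi (1992)

# Adjacency matrix of a small PDAG: a directed edge A -> B and an undirected
# edge B - C (undirected edges are encoded by 1s in both directions).
amat <- matrix(c(0, 1, 0,
                 0, 0, 1,
                 0, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
pdag <- as(new("graphAM", adjMat = amat, edgemode = "directed"), "graphNEL")

res <- pdag2dag(pdag, keepVstruct = TRUE)
if (res$success) dag <- res$graph  # fully directed extension, when one exists
```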
The MIIC-SDG synthesizer is based on Bayesian assumptions, where data is initially generated for variables associated with isolated nodes or nodes without parents (source nodes) and then iteratively expands to nodes whose parent nodes have already been processed. For source nodes and isolated nodes that are discrete variables, a probability table is obtained from the input data and data is generated following the distribution in the probability table (this is the same approach that would be used when generating data for a classical Bayesian network). For continuous source/isolated variables, a probability density function is fitted to the data (using the “density” function from the R package stats). The density function computes kernel density estimates from a data series (see www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density). To address the presence of mixed-type variables in the original data, specific modifications have been implemented for the data generation algorithm. There are 3 possible scenarios depending on the data type of the parent (P) variables involved and the data type of the target variable (T) (see also the decision tree for data generation on Figure 15D):
1. P discrete – T discrete: a multivariate conditional probability table for the target variable is estimated from the original data and then used to sample synthetic data based on parent values (as in a classical Bayesian network algorithm).
2. P mixed – T discrete: two different methods are applied depending on the number of continuous parents.
a. If the number of continuous parents is low (less than 3), continuous distributions are discretized using the optimum discretization algorithm described above by reference to equations (19)-(21) and in Cabeli et al. 2020, except that the discretisation of the continuous variable takes into account only the source and target variables (bivariate information). In other words, a discretisation (i.e. a set of bins for a continuous variable or set of source-target continuous variables) is identified that maximizes the mutual information between each predictor and the target variable. This approach has been shown to reliably estimate theoretical mutual information and to be adaptive to the number of samples and to multimodal continuous distributions. This approach finds the cut values (bins) for the continuous variables using the estimation of the mutual information minus a complexity term. The inventors limited the application of this method to cases where the number of continuous parents to be discretised is relatively low for the following reasons. There is no limit on the number of bins that are found for the continuous variables, which could result in a very large conditional probability table for the target variable. This is problematic because the data generated would then be very similar to the input data. Further, if the number of continuous variables is larger, machine learning algorithms can also perform well, since they can capture the multivariate underlying distribution that can well explain the target variable. A multivariate conditional probability table for the target variable is then estimated from the discretized data and then used to sample synthetic data based on parent values. The choice of applying this method instead of directly using a machine learning approach also prevents the issue of predicting a strongly unbalanced discrete variable with a low number of continuous predictors. 
Indeed, if the variable to predict is discrete and there is a large class imbalance, then many machine learning algorithms would just predict the majority class, particularly if cross-validation is used in the model construction. Using a classical probability table minimizes this issue.
b. When the number of continuous predictors is higher (3 or more), a random forest classification model is used to predict the target variable, as it is able to capture non-linear associations between multiple predictors without having to discretize continuous parents. This was implemented using the randomForest function from the RandomForest R package with default parameters (see cran.r-project.org/web/packages/randomForest/randomForest.pdf, Oct 14, 2022, package version 4.7.1.1).
3. P mixed – T continuous: this case is also addressed with two methods depending on the number of continuous predictors (parent nodes).
a. If the number of continuous predictors is low (less than 2), the optimum discretization algorithm from the MIIC framework is used (as in step 2a) and then the method learns to reproduce the density of the continuous target for each combination of discrete or discretized predictors. The density of the continuous target is learned by fitting a probability density function to the data (using the “density” function from the R package stats) separately for each combination of discrete or discretised predictors. These fitted density functions are then used to generate the data. This option was only applied to cases where the number of continuous predictors is low because it can otherwise create data that is quite similar to the original data (due to the discretisation scheme having an unbounded number of bins, as explained above). Thus, when the number of continuous predictors is high, a machine learning approach was preferred.
b. If the number of continuous predictors is higher (2 or more), a random forest regression model is implemented. This was implemented as above.
Data imputation. The MICE algorithm (Multiple Imputation by Chained Equations, see Azur et al., 2011) was used for missing data imputation when there were many NAs in a continuous target node, as that would prevent the regression model from running. No data imputation was applied to discrete variables; instead, NA was used as a separate category of data, such that NAs are created in the synthetic data with probabilities matching the input data.
Implementation. The MIIC-SDG algorithm was implemented as an R package. It contains one function that returns three objects: the synthetic data generated by the algorithm as a data frame, the adjacency matrix representing the network corresponding to the DAG used to sample the synthetic data, and a third object containing the data types (discrete or continuous) of the input variables as a data frame. The returned network can be very important in real scenarios when the investigator wants to inspect the set of associations between variables to better understand the structure of the data frame in terms of correlations.
Example 2 – Benchmarking of MIIC-SDG relative to the state of the art: quality metrics and machine learning performances
In this example, the novel method described in Example 1 is benchmarked against a series of state-of-the-art synthetic data generation methods, in particular in relation to the quality of the data generated and its usability in training a prognostic machine learning model.
Methods
MIIC-SDG. This is described in Example 1. 
MIIC-SDG is composed of three steps: the first step discovers a network structure from the input data, the second step transforms this network into a DAG using the MIIC-to-DAG algorithm, and the third step uses this DAG and the original data to generate synthetic samples resembling the original data. This method aims to learn the models from the original data without overfitting the data and using the correct set of ancestors for each parent. This was hypothesised to be better than classical Bayesian methods, which have a tendency to overfit the data (therefore necessitating the injection of noise in the data to preserve privacy).
Bayesian. This method builds a probabilistic graphical model (Bayesian network) that represents the joint multivariate distribution by exploiting dependencies between the random variables (Ankan & Panda 2015). In this framework, a directed acyclic graph and a corresponding conditional probability distribution are learned from the given data. Sampling from the model is finally performed to generate the resultant dataset. The code used is the Synthcity package (Qian et al. 2023), which builds on the pgmpy package by Ankan & Panda (2015). The DAG is obtained using the tree search (Chow–Liu tree) or hill climbing algorithms. These two approaches are referred to as “Bayesian tree search” and “Bayesian hill climbing”.
Synthpop. The Synthpop algorithm (described in Nowok et al. 2016) is a machine learning solution aimed at providing synthetic test data for users of confidential datasets. The synthetic data, generated through parametric and nonparametric methods, including the classification and regression trees (CART) model, aims to mimic the original data and can be used for exploratory analyses and for testing models. However, the CART model may result in final leaves representing a small number of individuals, potentially compromising the privacy of the synthesized data. The authors suggest limiting this effect by specifying a minimum size for the final node produced by the CART model. However, determining the appropriate value for this parameter is challenging as it depends on the data, and the method does not offer a tuning procedure.
CTGAN. CTGAN (Conditional Tabular Generative Adversarial Networks) is a deep learning algorithm (described in Xu et al. 2019) that aims at creating a generative model suitable for tabular data. CTGAN differs from traditional GANs by adding a conditional structure to both the generator and the discriminator networks, allowing it to generate synthetic samples based on specific real-world conditions. Xu et al. (2019) have reported CTGAN outperforming Bayesian methods on most of the real datasets they presented.
TVAE. Tabular Variational AutoEncoders are adapted from classical variational autoencoders (VAE) to enable the generation of mixed-type tabular data. This method was also used as a benchmark in the CTGAN paper (Xu et al. 2019) and is described therein. The authors claim that CTGAN achieves competitive performance across many datasets and outperforms TVAE on some benchmarks.
PrivBayes. PrivBayes, described in Zhang et al. (2014), is a differentially private Bayesian network model capable of efficiently handling datasets with a large number of attributes. The authors present the package as a new implementation that requires the injection of less noise compared to other differential privacy algorithms, maintaining more signal in the synthetic data. 
To obtain differentially private synthetic data, PrivBayes starts by creating a Bayesian network that succinctly represents the correlations among the attributes and then injects noise into each marginal distribution to ensure differential privacy. The method finally uses these noisy marginals and the Bayesian network to generate synthetic samples. The most important parameter of the algorithm is epsilon, which determines the amount of noise injected in the marginal distributions. However, the choice of epsilon is not straightforward since the level of both quality and privacy of the generated data depends on the type of distributions, the number of samples and the complexity of the Bayesian network. For this benchmark, the inventors chose an epsilon equal to 1 as it was found to be the best compromise in these simulations.
RANDOM. This approach does not correspond to a synthetic data generation algorithm in itself, but it is used as a lower bound for normalizing the other benchmark methods. The synthetic dataset is obtained by generating random data using uniform distributions (inside the ranges of the original data). For categorical data this corresponds to a random sampling with replacement from all possible levels of categories of each feature. For continuous variables the sampling is made using the minimum and maximum values as ranges and sampling within the range with a uniform distribution. This is the most random benchmark that can be obtained, ensuring worst quality and best privacy, as it also loses the marginal distribution of the data (which is preserved when doing random permutations of the original data).
The parameters used in each algorithm are listed in Table 1 below.
Bayesian: Struct_learning_search_method: tree_search / hill climbing; Struct_learning_score: bic; Struct_max_indegree: 4
PrivBayes: category_threshold: 20; Epsilon: 1; Degree_bayesian_net: 3
CTGAN: embedding_dim: 128; generator_dim: (256, 256); discriminator_dim: (256, 256); generator_decay: 1e-6; Batch_size: 500; Epochs: 500
Synthpop: Cart.minsplit: 20
MIIC-SDG: Orientation: Enabled; Propagation: Enabled; Latent variables: Disabled; Edge confidence filtering: Disabled; Consistency: Disabled; KL-distance check: Enabled; DAG creation: MIIC-to-DAG; keepContinuousVariableBoundaries: True
Table 1. Parameters of the benchmark algorithms and method of the disclosure.
For MIIC-SDG the default parameters of the MIIC algorithm were used (orientation and propagation enabled, no latent variables, no filtering on edge confidence, no separating set consistency constraint and KL-distance enabled on the search for confounders in the presence of missing data). Using this option, when looking for possible confounders or mediators (Ai), the columns for the variables x, y and Ai are taken, only the rows without NAs are kept, and the analysis is performed on that subset. Ai is accepted as an ancestor if and only if the distribution of Ai in the x, y, Ai subset is similar to the distribution of Ai over the whole dataset (evaluated by KL-distance, i.e. Kullback–Leibler divergence, a measure of similarity between statistical distributions). This avoids conditioning on small sample sizes (due to NAs) that do not correspond to the actual variable distribution. The parameters used by the synthesizer are: the DAG creation method and the compression of values that exceed real data boundaries in the continuous variables (setting values that exceed the range of observed or acceptable/realistic values for a variable to the nearest boundary of this range). The latter is optional.
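The optional boundary compression just mentioned simply maps any generated value that falls outside the range observed in the real data back to the nearest boundary. A minimal sketch (illustrative function and variable names) is:

    import pandas as pd

    def clip_to_real_boundaries(synth: pd.DataFrame, real: pd.DataFrame, continuous_cols):
        """Set synthetic values outside the observed real range to the nearest boundary."""
        out = synth.copy()
        for col in continuous_cols:
            out[col] = out[col].clip(lower=real[col].min(), upper=real[col].max())
        return out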
This compression is performed because regression models (random forest regression models used to predict continuous variables in some cases, see above) can generate data outside the range of the original data. The compression avoids the creation of synthetic data with strange behaviours or wrong values (such as small negative values for a clinical variable with many zeros, e.g. the number of positive sentinel nodes). This step is optional and can be performed after the whole dataset has been generated, because in practice these are quite small fluctuations around the minimum or maximum that do not really affect the whole distribution.
Benchmark data - Breast cancer (METABRIC). The METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset is a collection of over 2,000 clinically annotated primary breast cancer specimens obtained from tumor banks in the UK and Canada (Curtis et al. 2012). The cohort encompasses clinical variables and genetic information including copy number alterations, copy number variations, and single nucleotide polymorphisms. The METABRIC dataset was selected for this study due to its widespread usage and validation in the literature, as well as its suitable sample size for the application of machine learning algorithms and the coexistence of numerical and categorical features. The original dataset, consisting of 2491 patients and 36 variables, was pre-processed by removing patients with more than 20% missing values and variables with unique values. This is mostly to evaluate all the methods in a scenario where there are some NAs but no samples with almost all NAs. The resulting filtered dataset comprised 1977 patients and 29 clinical variables, 19 of them discrete and 10 continuous. Figure 4 shows the network reconstructed by the MIIC algorithm for this data. The corresponding graph obtained by applying MIIC-to-DAG contains 63 edges representing direct associations between the 29 variables. The variables are shown on Figure 4 and include: laterality, cellularity, claudin subtype, her2-snp6 (HER2 SNP6 loss), ER status (estrogen responsive), her2-status, PR-status (progesterone responsive), grade, histological subtype, chemotherapy, radiotherapy, tumour stage, breast surgery, ER-IHC (ER status by immunohistochemistry), histological subtype (detailed as per PAM50 subtypes, i.e. basal, her2, luminal A, luminal B, normal), RFS status, vital status, inferred menopausal state, hormone therapy, cohort, int clust (integrative subgroups as detailed in Curtis et al. 2012), three gene, nonsynonymous tumour mutational burden, RFS in months, OS (overall survival) in months, lymph nodes positive, NPI (Nottingham prognostic index), age at diagnosis, detailed cancer type (as assessed by baseline histology). Unconnected nodes represent features that are not associated with any other variable in the data (following MIIC residual mutual information evaluation). It is important to remember that the MIIC algorithm does not have hyperparameters, does not need any tuning, can deal with missing data and is not sensitive to the order of features in the data. The data used comes from public data available in the cBioPortal repository: www.cbioportal.org/study/summary?id=brca_metabric.
Benchmark data - bladder cancer. This work also uses the Phase II single-arm study IMvigor210 (clinicaltrials.gov/study/NCT02951767 - Hoffmann-La Roche.
A Phase II, Multicenter, Single-Arm Study of Atezolizumab in Patients With Locally Advanced or Metastatic Urothelial Bladder Cancer) to test the ability of the methods to reproduce clinical data and protect patients' privacy in a real phase II scenario, where the number of patients is limited by the nature of the study. This clinical trial originally contained 310 participants but only data belonging to patients with known outcomes was kept. The final data contains 297 samples and 24 features (e.g. Cancer subtype, mutational burden, sex, race, ECOG score, tobacco use, disease status, sample age, tissue type, TCGA subtype, types of therapy, Overall Survival, Clinical response).
Benchmark setting. The inventors evaluated the different algorithms using the whole METABRIC dataset and different sample sizes, by subsampling the 1977-sample dataset into sizes 50, 100, 200, 500, 1000, 1500 and 1977. This allows one to assess the performance of each method on multiple subsets along with their stability. A total of 10 datasets were created for each sample size and each one of them was used to generate 10 synthetic datasets, each built with a different seed (100 datasets for each sample size). All comparative methods used a seed parameter. For example, Synthpop depends on the order of the features in the input dataset, and the order can change based on the seed. The absence of a seed is a benefit of the present method, as it is not necessary to generate multiple datasets with different seeds in order to get a representative picture of the performance of the method. For the IMvigor210 study the dataset was analysed in sizes 100, 200 and 297 samples.
Quality metrics – Univariate analysis. The inventors assessed whether each feature follows the same distribution in the original and synthetic datasets.
Quality metrics – Mutual information (MI). In order to compare bivariate associations the inventors used the mutual information, a measure of dependence between two variables rooted in information theory and linked to the entropy of random variables. MI has been shown to robustly capture the association between variables even when their relation is nonlinear. The inventors compared the MI matrices for real and generated data and computed the average difference between the two matrices. They estimated the MI for discrete-continuous or continuous-continuous variable pairs through the optimum discretization algorithm implemented in the MIIC package, as described above. The quality of the generated data is directly associated with the (mean) distance between the MI matrix of the original data and that of the generated data. Small distances correspond to data that reliably capture the underlying structure present in the original data.
Quality metrics – Correlations. The inventors assessed whether the bivariate distributions between each pair of features are preserved in the synthetic data. They compared the correlation matrices in the original and synthetic data and computed the mean difference between the matrices. The analysis is performed on all variable pairs by calculating their correlation using two approaches. The lower triangular matrix was determined by computing Pearson's correlation coefficient between continuous variables and Cramer's V between categorical variables. The upper triangular matrix was dedicated to analyzing the relationship between continuous and discrete variables.
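For illustration, the bivariate comparison just described can be sketched as follows: an association matrix is built with Pearson correlation for continuous pairs, Cramer's V for categorical pairs and, for mixed pairs, Cramer's V after a simple quantile binning used here only as a stand-in for the MIIC optimum discretisation; the mean absolute difference between the real and synthetic matrices is then reported. Function names and the binning choice are illustrative assumptions, not the implementation used by the inventors.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(x, y):
        """Cramer's V between two categorical series."""
        table = pd.crosstab(x, y)
        k = min(table.shape) - 1
        if k <= 0:
            return 0.0
        chi2 = chi2_contingency(table)[0]
        n = table.to_numpy().sum()
        return float(np.sqrt(chi2 / (n * k)))

    def association_matrix(df, discrete_cols, n_bins=5):
        cols = list(df.columns)
        mat = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                da, db = a in discrete_cols, b in discrete_cols
                if not da and not db:
                    val = df[a].corr(df[b])               # Pearson, continuous pair
                elif da and db:
                    val = cramers_v(df[a], df[b])         # Cramer's V, categorical pair
                else:                                     # mixed pair: bin the continuous one
                    cont, disc = (a, b) if not da else (b, a)
                    val = cramers_v(pd.qcut(df[cont], n_bins, duplicates="drop"), df[disc])
                mat.loc[a, b] = mat.loc[b, a] = val
        return mat

    def correlation_distance(real, synth, discrete_cols):
        m_real = association_matrix(real, discrete_cols)
        m_synth = association_matrix(synth, discrete_cols)
        return float(np.abs(m_real - m_synth).to_numpy().mean())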
For these continuous-discrete pairs, the inventors used the MIIC algorithm, which has been shown to optimally discretize the continuous features by maximizing the mutual information over all potential cut-points on the continuous variables. If a discretisation was found (i.e. there is a significant association between the features; no discretisation is found if there is no association between the features, and I(x,y)=0), Cramer's V was then evaluated between the discrete and discretised variables. In this case too, the quality is directly associated with the distance between the correlation matrices. Small distances correspond to data that reliably capture the structure present in the original data.
Quality metrics - Multivariate distribution. To assess whether the joint multivariate distribution is preserved in the synthetic dataset, the Wasserstein distance (earth mover's distance) between the original and the synthetic data was computed. The main advantage of the Wasserstein distance over other metrics such as the Kullback-Leibler divergence is that it is a proper distance with good properties such as symmetry, and it does not require both measures (original and synthetic) to be defined on the same probability space. A small Wasserstein distance corresponds to synthetic data that reliably represent the multivariate distribution.
Quality metrics - Machine learning performances. One way to evaluate the quality of a dataset is to assess whether the generated data can be used to perform classical machine learning tasks such as supervised learning. The inventors therefore chose to compare the algorithms based on their capability to build a relevant machine learning model to predict overall survival using a survival random forest model. Survival Random Forest is a time-to-event model, similar to a Cox regression model, for censored data. The Overall Survival of patients is predicted, i.e. the number of months before an event occurs (death or a censoring event). The evaluated metric is the c-index (concordance index), which represents the model’s ability to provide a reliable ranking of the survival times based on the individual risk scores. Implementation was done with the scikit-survival python library with the following hyper-parameters for the model: n_estimators = 1000, min_samples_split = 10, min_samples_leaf = 15. The c-index was evaluated with a K-fold cross validation procedure (K = 5). The model was trained on the synthetic datasets and performance was evaluated on the original dataset, which acts as the test set. The inventors also evaluated whether each synthetic dataset retains robust relationships by comparing the variable permutation importance ranking with the “true” ranking obtained on the original dataset. Permutation importances were obtained using the permutation_importance function of the scikit-learn python library. This determines the decrease in a model's score when a single feature value is randomly shuffled.
Results
Univariate analysis
When comparing two datasets that share the same set of features, the simplest analysis that can be conducted involves assessing the distribution of each variable within both the original and synthetic datasets. To accomplish this, the inventors applied the chi-squared test for categorical variables and the Wilcoxon test for numerical variables, with a significance level set at 0.05.
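The univariate comparison can be sketched as follows, counting the features whose real and synthetic marginal distributions differ significantly. The Wilcoxon rank-sum (Mann-Whitney) test is used here for the numerical variables, which is an assumption about the exact variant of the Wilcoxon test applied; the function name is illustrative.

    import pandas as pd
    from scipy.stats import chi2_contingency, mannwhitneyu

    def count_significantly_different_features(real, synth, discrete_cols, alpha=0.05):
        """Number of features whose marginal distribution differs between real and synthetic data."""
        n_diff = 0
        for col in real.columns:
            if col in discrete_cols:
                # chi-squared test on the feature-by-dataset contingency table
                table = pd.crosstab(
                    pd.concat([real[col], synth[col]], ignore_index=True),
                    ["real"] * len(real) + ["synthetic"] * len(synth))
                p = chi2_contingency(table)[1]
            else:
                p = mannwhitneyu(real[col].dropna(), synth[col].dropna()).pvalue
            n_diff += int(p < alpha)
        return n_diff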
Table 2 below presents the average count of features that exhibited statistically significant differences based on these two tests across various sample sizes (columns) and algorithms (rows), for the METABRIC dataset. The standard deviation is provided in parentheses. The results indicate that Synthpop is the most effective method for replicating univariate distributions, followed closely by the Bayesian algorithms and MIIC-SDG, which demonstrated similar performance. Conversely, the other algorithms fell short in reproducing the univariate distributions, with between 16 (CTGAN and TVAE) and 21 features exhibiting differences at the largest sample size (from a total of 29 features). As expected, the random method flagged nearly all variables as different, owing to its random sampling approach within the original feature range.

method                  50          100         200         500         1000        1500        1977
MIIC-SDG                1.3 (1.3)   0.8 (0.9)   1.1 (0.9)   1.7 (1.6)   2.5 (1.6)   4.3 (2.7)   3.8 (1.2)
Synthpop                0.3 (0.5)   0.2 (0.4)   0.3 (0.7)   0.8 (0.8)   0.3 (0.7)   0.3 (0.7)   0.5 (0.7)
Bayesian tree search    0.1 (0.3)   0.9 (0.9)   1.7 (0.5)   2.0 (0)     2.4 (0.5)   2.6 (1.3)   2.0 (0)
Bayesian hill climbing  0.3 (0.5)   0.9 (0.9)   1.8 (0.4)   2.1 (0.3)   2.0 (0)     2.9 (0.6)   2.0 (0)
PrivBayes               11.3 (2.8)  16.2 (1.1)  18.1 (2.2)  17.2 (4.4)  15.8 (2.1)  20.7 (3.6)  21.0 (0)
CTGAN                   6.8 (2.7)   8.5 (3.4)   10.9 (2.5)  12.8 (4.2)  15.6 (4)    16.0 (2.9)  16.3 (2.7)
TVAE                    4.5 (1.8)   5.1 (2.4)   10.1 (3.2)  14.6 (4.7)  19.9 (4)    16.4 (3.7)  15.9 (2.1)
RandomInRange           13.6 (2)    19.2 (1)    22.9 (1.5)  26.0 (0.9)  26.6 (0.7)  27.2 (0.8)  27.5 (0.5)

Table 2. Number of features found to be different when comparing real and synthetic features using chi-squared or Wilcoxon tests, with a 0.05 p-value threshold. Standard deviations are reported in parentheses.
Mutual information (MI) distance
Mutual information distances are evaluated by calculating, for all pairs of variables, the difference in mutual information between the real and synthetic data. The results are presented in Figure 6A, for the METABRIC dataset. The data generated by the MIIC-SDG algorithm reproduces well the mutual information between variables present in the original data, obtaining better scores than the other methods for small sample sizes (50 and 100 samples). This is believed to be at least in part because the MIIC algorithm performs very well at capturing the set of associations between variables. With an increasing number of samples the mutual information is best reproduced by Synthpop, with MIIC-SDG positioning at second or third place (close to Bayesian tree search). CTGAN obtained the worst score at the smallest sample size, eventually improving its ranking to fourth position at the largest sample size. TVAE and the Bayesian hill climbing technique scored similarly on the largest datasets, with Bayesian hill climbing reaching the third or fourth position in smaller samples (< 500 samples).
Wasserstein distance
The result of the multivariate Wasserstein distance is presented in Figure 6B. In this multivariate setting the Bayesian approach with tree DAG estimation reached the best scores with very small distances, followed by Synthpop, MIIC-SDG and Bayesian hill climbing (BIC criterion), with CTGAN and TVAE in similar ranges. The differential privacy Bayesian approach with epsilon set to 1 generates datasets with distances much larger than those given by the other methods. These results show that there are large gaps between the different methods, with the Bayesian approach with tree DAG estimation obtaining much better results than its competitors.
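For reference, a multivariate Wasserstein (earth mover's) distance between two tabular datasets can be computed along the following lines with the POT (Python Optimal Transport) library, after one-hot encoding the discrete variables and scaling. This is only an illustrative sketch; the exact implementation used in the benchmark may differ.

    import numpy as np
    import pandas as pd
    import ot                                   # POT: Python Optimal Transport

    def multivariate_wasserstein(real: pd.DataFrame, synth: pd.DataFrame) -> float:
        """Earth mover's distance between two datasets, treated as empirical distributions."""
        enc = pd.get_dummies(pd.concat([real, synth], ignore_index=True), dummy_na=True)
        enc = enc.astype(float)
        enc = (enc - enc.mean()) / enc.std().replace(0, 1)     # simple feature scaling
        enc = enc.fillna(0.0)
        x = enc.iloc[:len(real)].to_numpy()
        y = enc.iloc[len(real):].to_numpy()
        a = np.full(len(x), 1.0 / len(x))                      # uniform sample weights
        b = np.full(len(y), 1.0 / len(y))
        cost = ot.dist(x, y, metric="euclidean")               # pairwise ground costs
        return float(ot.emd2(a, b, cost))                      # optimal transport cost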
The results in Figure 6A suggest that this is probably linked to the generation of discrete variables that closely resemble the original ones for the Bayesian tree search approach. By contrast, in the new method described herein a mixture of classical Bayesian approaches and machine learning is used to generate discrete variables, which introduces less overfitting of the data. When computing the Wasserstein distance using only continuous features, Bayesian tree search instead obtains similar scores to Synthpop and MIIC-SDG (data not shown). Thus, with larger proportions of continuous features the methods of the present disclosure would be expected to have a performance more similar to the Bayesian tree search method using this metric.
Correlation distance
Bivariate correlation analysis was performed on all feature pairs by calculating the correlation matrices and comparing them to those of the corresponding original data. Figure 6C shows the correlation distances between the synthetic data and the corresponding original data, for the METABRIC dataset. It can be observed that the Bayesian tree search and Synthpop algorithms have comparable performances, both for low and high sample sizes. MIIC-SDG has better results than its competitors on small sample sizes (< 200 samples) but its improvement with increasing sample size stabilizes at a higher plateau than the first two competitors. CTGAN obtained poor scores for smaller sample sizes (< 500 samples), but its correlation distance decreases quickly as the number of samples increases. TVAE fails to capture the correlation structure, even though it improves from 1500 samples onwards. Bayesian hill climbing with the BIC criterion does not obtain competitive scores, and it does not significantly improve with larger sample sizes. PrivBayes showed results similar to those of the random data method, with a completely lost correlation pattern. Figure 5 shows the correlation matrices for the datasets using 1000 samples (selecting 1000 samples out of the 1977, 10 times, and for each dataset generating 10 synthetic datasets, changing the seed for algorithms that use a seed) for the METABRIC dataset. Values are obtained as a mean correlation over all executions from running the algorithms on the 1000-sample datasets (using bootstrap) and using multiple seeds. The performances of the different methods are, in order: Bayesian with tree search algorithm, Synthpop, CTGAN, MIIC-SDG, TVAE and PrivBayes. As expected, the score of the random method is the worst and it is strongly dependent on the type of associations between variables that exist in the original data (correlation structure and strength). A dataset with only a few strong correlations will give small mean distances even when taking random values, because most of the features are not correlated and taking random values does not significantly alter the resulting correlation.
Machine learning performance: predicting overall survival (OS) response
The aim of this study was to evaluate the ability of the synthetic data generation algorithms to preserve multivariate information for the purpose of predicting survival in the METABRIC dataset. This comparison was made using survival Random Forest as the machine learning algorithm. To achieve this, a K-fold cross-validation approach (K=10) was employed, where for each fold, a classification model was trained on 75% of the synthetic data and evaluated on 25% of the real data (i.e.
the model was trained on 75 synthetic samples out of the 100 synthetic samples and tested on 25 original samples that were not used to generate the 100 synthetic samples, to avoid any risk of overfitting). Figure 8 shows the feature importance for the prediction of survival in the original data and in all benchmark algorithms. MIIC-SDG, Synthpop and Bayesian networks with tree search all report the same 2 features (the Nottingham Prognostic Index and the number of positive lymph nodes found) as the most important for survival prediction. Figure 9 shows the concordance index estimates from the Survival Random Forest model to predict Overall Survival. It is important to note that the ability to predict a target variable (OS in this case) from other features is also used as a metric for privacy, by building an inference attack on sensitive attributes. Having a high concordance with the true data also correlates with a high risk in the case of inference attacks. Indeed, if one is able to correctly predict one variable using the others, it means that with some information on a patient's attributes it is possible to infer quite precisely the value of a sensitive attribute that one would like not to disclose.
Example 3 – Benchmarking of MIIC-SDG relative to state of the art: privacy metrics
In this example, the algorithms described in Example 2 were assessed using the benchmark data as described in Example 2, and compared based on the level of privacy achieved by each synthetic data generation method.
Methods
See Examples 1 and 2.
Privacy metrics – Identifiability score. To ensure data privacy, generated synthetic patient records should be “different enough” from the original patient records. Following this idea the inventors used a framework for the evaluation of privacy risks described in Yoon et al. 2020. This aims at mathematically defining a new concept of identifiability, relating an identifiability property to the minimum distance between real patients and the distance between real and synthetic samples. In order to weight each feature according to its probability of identifying patients having the same values, the method uses a weighted Euclidean distance as metric, giving more importance to features with an unbalanced distribution, whose events are rarer. Yoon et al. 2020 define ε-identifiability as the property of having less than a ratio ε of observations from the original dataset that are “not different enough” from observations in the generated synthetic dataset. ε corresponds to the defined identifiability score. In this scenario, an identifiability of zero would represent a perfectly non-identifiable (private) dataset and an identifiability of one would represent a perfectly identifiable dataset. The proposed identifiability is defined for all the samples or variables. The described identifiability distance is implemented in the Synthcity package. The derived privacy is evaluated as 1 - identifiability score.
Privacy metrics - Membership inference score. Secondly, inspired by the work of El Emam et al. (2022) and Yoon et al. (2020), the inventors also proposed to compute a membership inference metric. The inventors used the partitioning membership disclosure attack method proposed by El Emam and colleagues where, instead of using the Hamming distance between samples as a similarity measure, the inventors used a unidimensional weighted Wasserstein distance where the weights are defined as the entropy of each feature, as proposed by Yoon et al.
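A minimal sketch of the identifiability check described above is given below: a real record is counted as identifiable when its weighted Euclidean distance to the closest synthetic record is smaller than its distance to the closest other real record. The feature weights are taken here as the reciprocal of each feature's entropy, in line with the idea of up-weighting unbalanced features, and the handling of discrete features is simplified; this is an illustrative approximation rather than the Synthcity implementation.

    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import cdist
    from scipy.stats import entropy

    def identifiability_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
        """Fraction of real records closer to a synthetic record than to any other real record."""
        enc = pd.get_dummies(pd.concat([real, synth], ignore_index=True), dummy_na=True)
        enc = enc.astype(float).fillna(0.0)
        # weight unbalanced/rare features more: reciprocal of each feature's entropy
        w = np.array([1.0 / max(entropy(enc[c].value_counts(normalize=True)), 1e-6)
                      for c in enc.columns])
        x = enc.iloc[:len(real)].to_numpy() * np.sqrt(w)   # scaling columns by sqrt(w) makes the
        y = enc.iloc[len(real):].to_numpy() * np.sqrt(w)   # Euclidean distance a weighted distance
        d_rr = cdist(x, x)                                 # real-to-real distances
        np.fill_diagonal(d_rr, np.inf)                     # ignore self-distances
        d_rs = cdist(x, y)                                 # real-to-synthetic distances
        return float(np.mean(d_rs.min(axis=1) < d_rr.min(axis=1)))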
The membership inference score evaluates whether one is able to identify which patients were used to create the synthetic dataset, by subsampling the original dataset into a training and a test set, for varying sample sizes. The derived privacy metric is evaluated as 1 - membership inference score.
Results
The paradigm of differential privacy was introduced in 2006 and remains to date one of the most widely used techniques to try to preserve data privacy through mathematical constraints (Dwork et al. 2006). However, differential privacy has been shown not to fully mitigate the risk of re-identification. Stadler et al. (2022) have shown that under certain circumstances, neither the original implementation of PrivBayes nor PATEGAN reliably prevents linkage attacks, leaving some samples vulnerable to membership inference attacks. Moreover, it has been reported in many scenarios that strong differential privacy constraints lead to the generation of synthetic data exhibiting a disrupted correlation structure between features, making the resulting data problematic (Zhang et al. 2014), as also shown in the present work. Therefore, in the present work the inventors used the identifiability score and the membership inference score described above to assess privacy.
1 - Identifiability score
The identifiability score corresponds to the probability of re-identification given the combination of all data on any individual patient. It is evaluated by measuring the identifiability of the finite original patient data using the finite generated synthetic data. Figure 7A shows the identifiability score evaluated on the synthetic data generation algorithms for the METABRIC dataset. The Bayesian algorithm with tree search has the highest identifiability scores (0.7 – 0.66), followed by the Synthpop algorithm (0.5 – 0.51), MIIC-SDG (0.5 – 0.40), TVAE (0.5 – 0.24), CTGAN (0.4 – 0.19), Bayesian with hill climbing (0.2 – 0.13), PrivBayes (0.0 – 0.01) and Random (0.0 – 0), with the numbers in parentheses corresponding to the smallest and biggest sample sizes. Interestingly, the random algorithm did not reach 0 for the smallest sample sizes.
The membership inference score corresponds to the probability of identifying which patients have been used to generate the synthetic dataset. Figure 7B shows the membership inference score evaluated on the synthetic data generation algorithms for the METABRIC dataset. The Bayesian tree search algorithm is the method for which it is easiest to guess whether a sample has been used to generate the synthetic data, followed by the Synthpop method, whose scores never fall below 0.5. MIIC-SDG remains in third position, with scores that instead decrease markedly with bigger sample sizes. CTGAN obtains slightly better results in the membership inference attack, together with TVAE. Bayesian hill climbing generates datasets where it is hard to guess the membership of samples in the original data. PrivBayes and the random algorithm have similar scores, with values vanishing to 0.
Literature results (van Breugel et al. 2023) and the inventors’ own findings clearly point to the necessity of a trade-off between data quality and data privacy. On the one hand, small modifications of the original data are directly associated with good quality scores but with poor privacy ratings, since almost all the information of the dataset is maintained. On the other hand, strong perturbations or noise addition lead to a net loss in quality and usually a concomitant gain in privacy.
In this example, the algorithms described in Examples 1 and 2 are benchmarked in terms of how they balance these contradictory requirements.
Methods
See Examples 1 and 2. The trade-off highlighted above between privacy and quality is analogous to the classical machine learning dilemma of simultaneously obtaining good precision and recall scores, which calls for defining a trade-off measure such as the F1 score:
F1 = 2 * Prec * Recall / (Prec + Recall)
Inspired by the F1 score definition, the inventors formulated a quality-versus-privacy trade-off measure by adapting the classical F1 formula. In order to compare algorithms on the same scale, they normalized each quality and privacy metric by a reference value, namely the one computed on the data generated by the Random method. Normalized quality is defined from each of the previously described quality distances and the corresponding reference value as:
Normalized quality MI = 1 - (mutual information distance / mutual information distance of random data)
Normalized quality Wass = 1 - (Wasserstein distance / Wasserstein distance of random data)
Normalized quality Corr = 1 - (correlation distance / correlation distance of random data)
In the same way, normalized privacy is defined using the identifiability score or the membership inference score as:
Normalized privacy IS = 1 - (identifiability score / identifiability score of random data)
Normalized privacy MIS = 1 - (membership inference score / membership inference score of random data)
with the identifiability score and membership inference score ranging in [0,1]. In this setting, Quality-Privacy scores (QPS) are defined as:
QPS = 2 * Normalized Quality * Normalized Privacy / (Normalized Quality + Normalized Privacy)
using one of the normalized quality measures and one of the normalized privacy measures defined above.
Results
Quality-Privacy scores (QPS) can be evaluated by using different metrics for both quality and privacy. Both dimensions have been evaluated by calculating the ratio between the value obtained using the data of each algorithm and the value obtained using the corresponding random data, so that both quality and privacy range in [0,1] (normalized formula). For the quality measure the inventors focused on the mutual information distance, which reliably captures the bivariate associations between variables for both linear and non-linear associations and clearly discriminates the different approaches, using the formula in its normalized version. As privacy metrics the inventors considered the results coming from both the identifiability and the membership inference score, in their normalized versions. QPS are obtained through the F1 formula introduced in the Methods above. Figure 10 shows the quality mutual information metric, the two privacy dimensions and the two derived QPS (one for each privacy metric), for the METABRIC dataset. The QPS derived through the mutual information distance shows that the MIIC-SDG method is the best algorithm with respect to the quality-privacy trade-off for both QPS metrics, followed by Synthpop. However, it is important to note that the privacy evaluated from the identifiability score is always smaller than 0.5 for Synthpop, a value much lower than the one obtained with MIIC-SDG, which reaches approximately 0.7 for higher sample sizes (> 200 samples). This makes synthetic data generated with Synthpop significantly less private than synthetic data generated with MIIC-SDG.
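The normalisation and F1-style combination defined above can be restated compactly as follows; this is an illustrative sketch that simply mirrors the formulas given in the Methods, with hypothetical argument names.

    def quality_privacy_score(quality_distance, quality_distance_random,
                              privacy_risk, privacy_risk_random):
        """Harmonic-mean (F1-style) combination of normalised quality and normalised privacy."""
        norm_quality = 1.0 - quality_distance / quality_distance_random
        norm_privacy = 1.0 - privacy_risk / privacy_risk_random
        if norm_quality + norm_privacy == 0:
            return 0.0
        return 2.0 * norm_quality * norm_privacy / (norm_quality + norm_privacy)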
The QPS obtained using MI distances thus ranks MIIC-SDG as the best algorithm for all sample sizes considered here, highlighting MIIC-SDG's ability to reliably generate quality synthetic data while best preserving the privacy of the original sensitive data. The inventors also analyzed the results coming from the other quality distances, the two privacy measures and the corresponding QPS (4 supplementary metrics). These complete results are shown for the METABRIC dataset in Figure 12. The MIIC-SDG algorithm obtained the best QPS results for the correlation distance for small sample sizes (< 500 samples) and obtained the second best scores for larger sample sizes, after the CTGAN algorithm. The QPS obtained through the Wasserstein distance shows Bayesian hill climbing as the best algorithm, followed by CTGAN and MIIC-SDG at comparable scores. TVAE shows slightly worse performances, followed by Synthpop, which does not show competitive results due to a poor privacy score. Last in the group are Bayesian tree search and PrivBayes, the former due to a poor privacy score and the latter due to the poor quality of the generated data. Note that the correlation and Wasserstein distance metrics may not produce the same results because the Wasserstein distance can be highly influenced by the number of discrete variables and the number of levels for each variable. The results in Figure 10 are therefore considered to be more representative of the general performance of the algorithms compared. The inventors also ran the same pipeline on the second dataset (IMvigor210), presented in Figure 11, obtaining results comparable to those on the METABRIC data. In this case too, MIIC-SDG shows better results than its competitors, with a good trade-off between quality and privacy scores.
Example 4 – Discussion
Over the past few decades, there has been a significant increase in the recruitment of patients for clinical trials and the collection of real-life health-related datasets. This surge has been witnessed in both public institutions and private companies, resulting in the accumulation of vast amounts of patient information. As the number of collected studies continues to grow, it becomes increasingly urgent to explore effective solutions for harnessing this wealth of data. This entails facilitating new research initiatives and promoting data sharing, all with the ultimate goal of pushing the boundaries of medical research. In recent years, various machine learning and deep learning approaches have been employed to synthesize health data. These approaches hold the promise of enabling data sharing while safeguarding patient privacy. Regulatory standards such as the European General Data Protection Regulation (GDPR) mandate that data holders implement robust measures to ensure data security and prevent potential data breaches, often leading to restrictions on data sharing and secondary data usage. However, few established standards exist to guarantee adequate data anonymization and data security. Previously proposed methods like k-anonymity, l-diversity, and t-closeness have limitations when it comes to preserving privacy while maintaining sufficient data quality for research purposes. Therefore, the development of new quantitative standards is imperative to facilitate data anonymization through the generation of synthetic data and to assess the level of risk associated with data publication.
To identify a suitable algorithm for synthetic data generation, it is essential to consider the preservation of data quality and the protection of privacy simultaneously. However, there has been relatively limited research that addresses these two critical aspects in tandem. To address this challenge, the inventors conducted a comprehensive evaluation of various state-of-the-art algorithms across multiple scenarios, examining data quality and privacy both separately and, more importantly, in combination. They further proposed a new synthetic data generation algorithm that better balances these two aspects, providing a tool to generate shareable data, which can be used to generate clinical predictors with better performance than was previously possible. The benchmarking approach involved the following steps:
1. Defining Quality Metrics: the inventors first established various metrics for assessing the preservation of data quality. These metrics were designed to gauge the ability of the methods to generate data that closely resembles the original dataset. One of the most commonly used approaches to compare datasets relies on Pearson correlation coefficients. However, Pearson correlation is also known to be very sensitive to outliers, which may explain some of the apparent good relative rankings of certain methods under correlation scores, while they exhibit poorer performance under more robust statistical criteria such as MI, which only depends on the ranks (and not the specific values) of the variables of interest. The inventors hence focused their analysis on MI, but also presented results using the more classical correlation concept.
2. Privacy Considerations: the inventors then focused on the privacy of the original sensitive data. The aim was to prevent the re-identification of patients and safeguard against the disclosure of sensitive patient information.
3. Defining a trade-off metric: to provide a comprehensive assessment of the generated synthetic data, the inventors proposed to combine the quality and privacy metrics. The resulting combination of metrics introduced herein defines a novel quality-privacy score, which was used to rank all the algorithms included in the benchmark.
The results of this study demonstrate that the proposed solution, available as the MIIC-SDG method and tool, achieves commendable scores in both data quality and privacy preservation. Furthermore, it outperforms competing methods when focusing on the delicate balance between these two dimensions. Finally, as shown in Figure 13, which shows execution times for a method according to the present disclosure (MIIC-SDG) and comparative methods, the computational runtime of MIIC-SDG is comparable to that of CTGAN or TVAE, showing that it is able to obtain the demonstrated benefits while remaining computationally efficient.
References
All documents mentioned in this specification are incorporated herein by reference in their entirety.
Dor, Dorit and Michael Tarsi. “A simple algorithm to construct a consistent extension of a partially oriented graph.” (1992).
Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, Debbie Rankin. Synthetic data generation for tabular health records: A systematic review. Neurocomputing, Volume 493, 2022, Pages 28-45.
Samarati, Pierangela and Latanya Sweeney. “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.” (1998).
Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M.
L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 3-es (2007).
Li, N., Li, T. & Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. in 2007 IEEE 23rd International Conference on Data Engineering 106–115 (2007).
Affeldt, S., Isambert, H. Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information. Proceedings of the 31st conference on Uncertainty in Artificial Intelligence (UAI), Amsterdam. Morgan Kaufmann (2015).
Affeldt, S., Verny, L., Isambert, H. 3off2: A network reconstruction algorithm based on 2-point and 3-point information statistics. BMC Bioinformatics, 17 Suppl 2:12 (2016).
Verny, L., Sella, N., Affeldt, S., Singh, P. P. & Isambert, H. Learning causal networks with latent variables from multivariate information in genomic data. PLOS Comput. Biol. 13, e1005662 (2017).
Cabeli, V. et al. Learning clinical networks from medical records based on information estimates in mixed-type data. PLOS Comput. Biol. 16, e1007866 (2020).
Sella, N., Verny, L., Uguzzoni, G., Affeldt, S. & Isambert, H. MIIC online: a web server to reconstruct causal or non-causal networks from non-perturbative data. Bioinformatics 34, 2311–2313 (2018).
Sella, N. et al. Interactive exploration of a global clinical network from a large breast cancer cohort. Npj Digit. Med. 5, 1–10 (2022).
Ankan, A. & Panda, A. pgmpy: Probabilistic Graphical Models using Python. Proceedings of the 14th Python in Science Conference (SciPy 2015) (2015).
Qian, Z., Cebere, B.-C. & van der Schaar, M. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv:2301.07573 (2023).
Nowok, B., Raab, G. M. & Dibben, C. synthpop: Bespoke Creation of Synthetic Data in R. J. Stat. Softw. 74, 1–26 (2016).
Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling Tabular data using Conditional GAN. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2017. PrivBayes: Private Data Release via Bayesian Networks. ACM Trans. Database Syst. 42, 4, Article 25 (December 2017), 41 pages.
Dwork, C., McSherry, F., Nissim, K. & Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. in Theory of Cryptography (eds. Halevi, S. & Rabin, T.) 265–284 (Springer, 2006).
Stadler, T., Oprisanu, B. & Troncoso, C. Synthetic Data – Anonymisation Groundhog Day. arxiv.org/abs/2011.07018 (2022).
Yoon, J., Drumright, L. N. & van der Schaar, M. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN). IEEE J. Biomed. Health Inform. 24, 2378–2388 (2020).
El Emam, K., Mosquera, L. & Fang, X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 5, ooac083 (2022).
van Breugel, B., Sun, H., Qian, Z. & van der Schaar, M. Membership Inference Attacks against Synthetic Data through Overfitting Detection. Preprint at https://doi.org/10.48550/arXiv.2302.12580 (2023).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Hoffmann-La Roche. A Phase II, Multicenter, Single-Arm Study of Atezolizumab in Patients With Locally Advanced or Metastatic Urothelial Bladder Cancer. https://clinicaltrials.gov/study/NCT02951767 (2023).
Verma and J. Pearl, “Equivalence and synthesis of causal models,” in Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, July 27-29, 1990, pp. 220–227.
P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. Cambridge, MA: MIT Press, 2000.
Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011 Mar;20(1):40-9.
Equivalents and Scope
Unless context dictates otherwise, the descriptions and definitions of the features set out above are not limited to any particular aspect or embodiment of the invention and apply equally to all aspects and embodiments which are described.
“and/or” where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example “A and/or B” is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.
It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about” or “approximately”, it will be understood that the particular value forms another embodiment. The term “about” or “approximately” in relation to a numerical value is optional and means, for example, +/- 10%.
Throughout this specification, including the claims which follow, unless the context requires otherwise, the words “comprise” and “include”, and variations such as “comprises”, “comprising”, and “including” will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term “comprising” replaced by the term “consisting of” or “consisting essentially of”, unless the context dictates otherwise.
The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
While the invention has been described in conjunction with the exemplary embodiments described above, many equivalent modifications and variations will be apparent to those skilled in the art when given this disclosure. Accordingly, the exemplary embodiments of the invention set forth above are considered to be illustrative and not limiting. Various changes to the described embodiments may be made without departing from the spirit and scope of the invention.
For the avoidance of any doubt, any theoretical explanations provided herein are provided for the purposes of improving the understanding of a reader. The inventors do not wish to be bound by any of these theoretical explanations.
Any section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.


CLAIMS 1. A computer-implemented method of providing a clinical predictor tool, the method comprising: obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; wherein obtaining the training data comprises obtaining synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, obtained using a method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the value of the connected nodes, wherein the edges correspond to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and wherein the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes; and generating synthetic clinical data for a patient using the DAG by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node; and iteratively, obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data, wherein the plurality of variables comprise at least one continuous variable and at least one discrete variable, and wherein obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data. 2. 
A computer-implemented method of providing synthetic clinical data comprising values for a plurality of clinical variables for one or more patients, the method comprising: obtaining a directed acyclic graph (DAG) comprising a set of nodes each corresponding to a respective one of the plurality of clinical variables and a set of directed edges between nodes in the set of nodes, each edge indicating a dependency between the value of the connected nodes, wherein the edges correspond to conditional dependence relationships inferred from real clinical data comprising values for the plurality of clinical variables for a plurality of patients, and wherein the DAG comprises one or more source nodes that do not have any incoming edge and one or more target nodes that have one or more edges incoming from respective parent nodes; and generating synthetic clinical data for a patient using the DAG by: obtaining a value for each variable associated with a source node or an isolated node of the DAG by sampling from a distribution estimated from the real clinical data for the respective node; and iteratively, obtaining a value for each target node of the DAG using: a value obtained for each of the variables associated with parent nodes of the target node, and either a multivariate conditional probability table estimated from the real clinical data or a machine learning model trained on the real clinical data, wherein the plurality of variables comprise at least one continuous variable and at least one discrete variable, and wherein obtaining a value for at least one target node of the DAG comprises using a machine learning model trained on the real clinical data. 3. The method of any preceding claim, wherein the machine learning model used to obtain a value for a target node has been trained using training data comprising, for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes; and/or wherein the machine learning model used to obtain a value for a target node is configured to predict a value for the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node. 4. The method of any preceding claim, wherein obtaining a value for a target node comprises training a machine learning model to predict a value of the variable associated with the target node using as input values for the variables associated with the parent nodes of the target node, wherein the training uses training data comprising for each of a plurality of patients in the real clinical data, values for the variable associated with the target node and corresponding values for the variables associated with the parent nodes. 5. 
The method of any preceding claim, wherein obtaining a value for a target node comprises: determining that the target node and all parent nodes are associated with discrete variables and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with target nodes and parent nodes; or determining that at least one of the target node and parent nodes is associated with a continuous variable and performing one of: (i) discretising the target or parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or (ii) using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes. 6. The method of claim 5, wherein the method comprises discretising a target or parent node that is associated with a continuous variable using a respective discretisation scheme, wherein the discretisation scheme is a discretisation that is specific to the target and parent nodes and that has been obtained by applying an optimisation step using an objective criterion that applies to the relationship between the target and parent nodes, and/or wherein the method comprises determining that the target node is associated with a continuous variable and fitting a probability density function to the real clinical data for the continuous variable associated with the target node, wherein a separate probability density function is fitted for each combination of values of the discrete or discretised parent nodes, optionally wherein the objective criterion is the maximisation of the absolute value of a statistical metric of dependence between the variables associated with the target and parent nodes, optionally wherein the statistical metric of dependence is mutual information. 7. The method of any of claims 5 or 6, wherein obtaining a value for a target node comprises: determining that at least one of the parent nodes is associated with a continuous variable, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; wherein discretising the target or parent nodes that are associated with a continuous variable comprises, for each continuous variable to be discretised, determining a discretisation scheme using an optimisation method comprising identifying a plurality of non-overlapping subranges for the continuous variable that are associated with a supremum value of a statistical metric of dependence between the target and parent nodes, amongst a plurality of values of said statistical metric associated with respective candidate plurality of non-overlapping ranges for the continuous variable. 8. 
The method of any of claims 5 to 7, wherein the method comprises: determining that at least one of the parent nodes is associated with a continuous variable; and when the number or proportion of parent nodes associated with a continuous variable is below a predetermined threshold, discretising the parent nodes that are associated with a continuous variable using a respective discretisation scheme for each continuous variable, and using a multivariate conditional probability table estimated from the real clinical data for the variables associated with the target nodes and parent nodes using said discretisation scheme; or when the number or proportion of parent nodes associated with a continuous variable is at or above a predetermined threshold, using a machine learning model trained on the real clinical data for the variables associated with the target node and parent nodes. 9. The method of any preceding claim, wherein the machine learning model is a regression or classification model, wherein the machine learning model is a non-linear classification or regression model, or wherein the machine learning model is a tree-based classification or regression model, optionally a random forest model. 10. The method of any preceding claim, wherein the step of obtaining a directed acyclic graph (DAG) comprises obtaining a network between the variables in the real clinical data using a causal discovery method or a constraint-based causal discovery method, optionally wherein the causal discovery method or constraint-based causal discovery method determines conditional mutual information between variables associated with the nodes in the network. 11. The method of any preceding claim, wherein the step of obtaining a DAG comprises: obtaining a network between the variables in the real clinical data using a constraint-based causal discovery method, wherein the network comprises directed and undirected edges; and obtaining a DAG from the network by: using a depth-first-search orientation algorithm; or selecting the direction of each undirected edge that minimises the number of directed cycles in the network, and removing all directed cycles in the resulting directed graph by iteratively identifying the longest cycle in the graph (the one with the most edges) and changing the direction of the edge that minimizes the number of remaining cycles in the graph. 12. The method of any preceding claim, wherein the real clinical data used comprises data for the plurality of clinical variables for a plurality of patients each representing an independent sample, wherein the number of samples is at least 5 times, at least 6 times, at least 7 times, at least 8 times, at least 9 times, or at least 10 times, preferably at least 10 times larger than the number of clinical variables in the plurality of clinical variables; and/or wherein the clinical predictor model is a classification or regression model, wherein the training data comprises real and synthetic clinical data, wherein the diagnosis or prognosis is selected from a disease severity, a disease subtype, and survival, optionally wherein the diagnosis or prognosis is survival and the disease is cancer. 13. 
A computer-implemented method of analysing clinical data, the method comprising: obtaining synthetic clinical data using the method of any of claims 2 to 12; and performing one or more of: providing a clinical predictor tool by obtaining training data comprising, for each of a plurality of patients, values for a plurality of clinical variables comprising a variable indicative of a diagnosis or prognosis and one or more further clinical variables, wherein the training data comprises said synthetic data; and training a clinical predictor model to predict the variable indicative of a diagnosis or prognosis using said training data, wherein the clinical predictor model is a machine learning model configured to take as input the values of the one or more further clinical variables and produce as output a prediction of the variable indicative of a diagnosis or prognosis; combining said synthetic data with the real clinical data used to obtain the synthetic data; outputting said synthetic data to a third party or public data repository; combining said synthetic data with another clinical dataset for the purpose of further analysis, optionally including obtaining a clinical predictor tool; and analysing said data using any data-driven clinical discovery method. 14. A system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1 to 13. 15. A non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any of claims 1 to 13.
PCT/EP2025/051373 2024-01-22 2025-01-21 Clinical data analysis Pending WO2025157774A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP24305127 2024-01-22
EP24305127.3 2024-01-22

Publications (1)

Publication Number Publication Date
WO2025157774A1 true WO2025157774A1 (en) 2025-07-31

Family

ID=89897220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2025/051373 Pending WO2025157774A1 (en) 2024-01-22 2025-01-21 Clinical data analysis

Country Status (1)

Country Link
WO (1) WO2025157774A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230229906A1 (en) * 2022-01-20 2023-07-20 Microsoft Technology Licensing, Llc Estimating the effect of an action using a machine learning model

Non-Patent Citations (26)

* Cited by examiner, † Cited by third party
Title
AFFELDT, S., ISAMBERT, H.: "Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence", 2015, MORGAN KAUFMANN, article "Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information"
AFFELDT, S., VERNY, L., ISAMBERT, H.: "3off2: A network reconstruction algorithm based on 2-point and 3-point information statistics", BMC BIOINFORMATICS, vol. 17, 2016, pages 12
ANKAN, A., PANDA, A.: "pgmpy: Probabilistic Graphical Models using Python", PROCEEDINGS OF THE 14TH PYTHON IN SCIENCE CONFERENCE (SCIPY 2015), 2015
AZUR, M. J., STUART, E. A., FRANGAKIS, C., LEAF, P. J.: "Multiple imputation by chained equations: what is it and how does it work?", INT J METHODS PSYCHIATR RES., vol. 20, no. 1, March 2011 (2011-03-01), pages 40 - 9, XP093088612, DOI: 10.1002/mpr.329
CABELI, V. ET AL.: "Learning clinical networks from medical records based on information estimates in mixed-type data", PLOS COMPUT. BIOL., vol. 16, 2020, pages 1007866
CURTIS, C. ET AL.: "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups", NATURE, vol. 486, 2012, pages 346 - 352
DOR, DORIT, TARSI, MICHAEL: A SIMPLE ALGORITHM TO CONSTRUCT A CONSISTENT EXTENSION OF A PARTIALLY ORIENTED GRAPH, 1992
DWORK, C., MCSHERRY, F., NISSIM, K., SMITH, A.: "Theory of Cryptography", 2006, SPRINGER, article "Calibrating Noise to Sensitivity in Private Data Analysis", pages: 265 - 284
EL EMAM, K., MOSQUERA, L., FANG, X.: "Validating a membership disclosure metric for synthetic health data", JAMIA OPEN, vol. 5, 2022
HOFFMANN-LA ROCHE: "A Phase II, Multicenter, Single-Arm Study of Atezolizumab in Patients with Locally Advanced or Metastatic Urothelial Bladder Cancer", 2023, Retrieved from the Internet <URL:https://clinicaltrials.gov/study/NCT02951767>
JUN ZHANG, GRAHAM CORMODE, CECILIA M. PROCOPIUC, DIVESH SRIVASTAVA, XIAOKUI XIAO: "PrivBayes: Private Data Release via Bayesian Networks", ACM TRANS. DATABASE SYST., vol. 42, no. 4, 2017, pages 41
LI, N., LI, T., VENKATASUBRAMANIAN, S.: "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity", 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, 2007, pages 106 - 115
MACHANAVAJJHALA, A., KIFER, D., GEHRKE, J., VENKITASUBRAMANIAM, M.: "L-diversity: Privacy beyond k-anonymity", ACM TRANS. KNOWL. DISCOV., 2007
MIKEL HERNANDEZ, GORKA EPELDE, ANE ALBERDI, RODRIGO CILLA, DEBBIE RANKIN: "Synthetic data generation for tabular health records: A systematic review", NEUROCOMPUTING, vol. 493, 2022, pages 28 - 45, XP087053983, DOI: 10.1016/j.neucom.2022.04.053
NOWOK, B., RAAB, G. M., DIBBEN, C.: "synthpop: Bespoke Creation of Synthetic Data in R", J. STAT. SOFTW., vol. 74, 2016, pages 1 - 26
P. SPIRTES, C. GLYMOUR, R. SCHEINES: "Causation, Prediction, and Search", 2000, MA: MIT PRESS
QIAN, Z., CEBERE, B.-C., VAN DER SCHAAR, M.: "Synthcity: facilitating innovative use cases of synthetic data in different data modalities", ARXIV:2301.07573, 2023
SAMARATI, PIERANGELA, LATANYA SWEENEY: PROTECTING PRIVACY WHEN DISCLOSING INFORMATION: K-ANONYMITY AND ITS ENFORCEMENT THROUGH GENERALIZATION AND SUPPRESSION, 1998
SELLA, N. ET AL.: "Interactive exploration of a global clinical network from a large breast cancer cohort", NPJ DIGIT. MED., vol. 5, 2022, pages 1 - 10
SELLA, N., VERNY, L., UGUZZONI, G., AFFELDT, S., ISAMBERT, H.: "MIIC online: a web server to reconstruct causal or non-causal networks from non-perturbative data", BIOINFORMATICS, vol. 34, 2018, pages 2311 - 2313
STADLER, T., OPRISANU, B., TRONCOSO, C.: "Synthetic Data - Anonymisation Groundhog Day", ARXIV.ORG/ABS/2011.07018, 2022
VAN BREUGEL, B., SUN, H., QIAN, Z., VAN DER SCHAAR, M.: MEMBERSHIP INFERENCE ATTACKS AGAINST SYNTHETIC DATA THROUGH OVERFITTING DETECTION, 2023, Retrieved from the Internet <URL:https://doi.org/10.48550/arXiv.2302.12580>
VERMA, J. PEARL: "Equivalence and synthesis of causal models", PROCEEDINGS OF THE SIXTH CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 1990, pages 220 - 227
VERNY, L., SELLA, N., AFFELDT, S., SINGH, P. P., ISAMBERT, H.: "Learning causal networks with latent variables from multivariate information in genomic data", PLOS COMPUT. BIOL., vol. 13, 2017, pages 1005662
XU, L., SKOULARIDOU, M., CUESTA-INFANTE, A., VEERAMACHANENI, K.: "Advances in Neural Information Processing Systems", vol. 32, 2019, CURRAN ASSOCIATES, INC., article "Modeling Tabular data using Conditional GAN"
YOON, J., DRUMRIGHT, L. N., VAN DER SCHAAR, M.: "Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN)", IEEE J. BIOMED. HEALTH INFORM., vol. 24, 2020, pages 2378 - 2388, XP011802599, DOI: 10.1109/JBHI.2020.2980262

Similar Documents

Publication Publication Date Title
Bazgir et al. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks
Duan et al. Evaluation and comparison of multi-omics data integration methods for cancer subtyping
Lee et al. Review of statistical methods for survival analysis using genomic data
Hassanzadeh et al. Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms
Sridhar et al. A probabilistic approach for collective similarity-based drug–drug interaction prediction
Djebbari et al. Seeded Bayesian Networks: constructing genetic networks from microarray data
Fan et al. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions
Gupta et al. A comprehensive data‐level investigation of cancer diagnosis on imbalanced data
Meduri et al. Leveraging federated learning for privacy-preserving analysis of multi-institutional electronic health records in rare disease research
Golestan Hashemi et al. Intelligent mining of large-scale bio-data: Bioinformatics applications
Hosni et al. A mapping study of ensemble classification methods in lung cancer decision support systems
Kim et al. RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Yaqoob et al. SGA-Driven feature selection and random forest classification for enhanced breast cancer diagnosis: A comparative study
Ganie et al. Improved liver disease prediction from clinical data through an evaluation of ensemble learning approaches
Zhou et al. Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization
Kolasseri et al. Comparative study of machine learning and statistical survival models for enhancing cervical cancer prognosis and risk factor assessment using SEER data
Chen et al. Improved interpretability of machine learning model using unsupervised clustering: predicting time to first treatment in chronic lymphocytic leukemia
Xing et al. Non-imaging medical data synthesis for trustworthy AI: A comprehensive survey
Romano et al. Improving QSAR modeling for predictive toxicology using publicly aggregated semantic graph data and graph neural networks
Mohtasham et al. Comparative analysis of feature selection techniques for COVID-19 dataset
Li et al. A comprehensive evaluation of disease phenotype networks for gene prioritization
Agyemang et al. Addressing Class Imbalance Problem in Health Data Classification: Practical Application from an Oversampling Viewpoint
Kim et al. The Fermi–Dirac distribution provides a calibrated probabilistic output for binary classifiers
Rahman et al. BIRDMAn: A Bayesian differential abundance framework that enables robust inference of host-microbe associations
Wei Integrative analyses of cancer data: a review from a statistical perspective

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25701004

Country of ref document: EP

Kind code of ref document: A1