WO2025106454A1

WO2025106454A1 - Training data augmentation and validation system

Info

Publication number: WO2025106454A1
Application number: PCT/US2024/055586
Authority: WO
Inventors: Christopher J. T. VILLONGCO; Christian David MARTON
Original assignee: Vektor Group Inc
Current assignee: Vektor Group Inc
Priority date: 2023-11-13
Filing date: 2024-11-12
Publication date: 2025-05-22
Anticipated expiration: 2026-05-13

Abstract

A system is described that validates training data. The system may generate simulated data sets that each include simulated data and a simulated label. The simulated data of a simulated data set is generated based on a simulation that factored in the simulated label of that simulated data set. The system may access a collection of input data sets. Each input data set has input data and an input label. The system may, for each of a plurality of input data sets, identify a simulated data set with simulated data that is similar to the input data of that input data set. When the input label of that input data set and the simulated label of the identified simulated data set satisfy a valid training data criterion, the system indicates that the input data set represents valid training data.

Description

TRAINING DATA AUGMENTATION AND VALIDATION SYSTEM

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Prov. App. No. 63/598,477 titled “Training Data Augmentation and Validation System” and filed on November 13, 2023, which is hereby incorporated by reference.

BACKGROUND

[0002] Many heart disorders can cause symptoms, morbidity (e.g., syncope or stroke), and mortality. Common heart disorders include atrial fibrillation (“AF”), ventricular fibrillation (“VF”), atrial tachycardia (“AT”), ventricular tachycardia (“VT”), atrial flutter, and premature ventricular contractions (“PVC”). The sources of heart disorders include stable electrical rotors, recurring electrical focal sources, and so on. These sources are important drivers of sustained or clinically significant episodes of heart disorders. These heart disorders can be treated with therapeutic ablation, radio frequency waves, cryogenic, ultrasound, external radiation sources, and so on by targeting the source of the heart disorder. To target the source of a heart disorder, the source location of the heart disorder should be identified.

[0003] Unfortunately, many methods for reliably identifying the source locations of a heart disorder can be complex, cumbersome, and expensive. For example, one method uses an electrode basket catheter that needs to be inserted into the heart (e.g., left ventricle) to collect from within the heart measurements of the electrical activity of the heart, such as during an induced VF. The measurements can then be analyzed to help identify a possible source location. Such an electrode basket catheter is expensive. Moreover, the use of an electrode basket catheter can lead to serious complications. Another method uses a body surface vest with electrodes to collect from the patient’s body surface measurements, which can be analyzed to help identify a possible source location. A body surface vest is expensive, is difficult to manufacture, and may interfere with the placement of defibrillator pads needed after inducing a fibrillation to collect measurements during fibrillation. In addition, the vest analysis requires a CT scan and is unable to sense the interventricular and interatrial septa.

[0004] Some techniques employ machine learning models to identify the source location of an arrhythmia given an arrhythmia cardiogram. However, such machine learning models may need significant amounts of training data that include arrhythmia cardiograms and corresponding source locations. Electronic health records of patients who underwent ablation procedures may be a potential source of training data. However, the arrhythmia cardiograms and the source locations of the electronic health records may be just approximations of the actual arrhythmia portions of larger cardiograms and of the actual source locations. A machine learning model trained using such training data may be unable to effectively identify the source location of a target patient’s arrhythmia. As a result, the ablation procedure that is planned assuming such a source location may have an outcome that is less than optimal.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Figure 1 is a flow diagram that illustrates the processing of a TDAV system in some embodiments.

[0006] Figure 2 is a block diagram illustrating the components of a TDAV system in some embodiments.

[0007] Figure 3 is a flow diagram that illustrates the processing of a generate simulated data sets component of the TDAV system in some embodiments.

[0008] Figure 4 is a flow diagram that illustrates the processing of a generate derived ECGs component of the TDAV system in some embodiments.

DETAILED DESCRIPTION

[0009] A system is provided for augmenting existing training data for training a target machine learning (ML) model to identify the source location of an arrhythmia. The existing training data (if any) may not be sufficient to effectively train the target ML model. To overcome this lack of training data, the target ML model is trained with additional training data that is determined to be valid. (The term “additional” is used even though there may not be any existing training data.) Potential training data may include training data sets that are not valid training data because, for example, a person assigned an incorrect feature or label to a training data set. If the target ML model is trained with invalid training data, the accuracy of the target ML model is likely to be diminished. So, additional training data that satisfies a valid training data criterion may be used to augment the existing training data. The valid training data criterion is based on comparison of the additional training data to the existing training data. By augmenting the existing training data with the additional valid training data, the target ML model will likely be a more accurate target ML model. The target ML model may also be trained with training data that includes only the additional valid training data and does not include the existing training data.

[0010] In some embodiments, a training data augmentation and validation (TDAV) system validates input data sets (i.e., additional training data) that each includes input data and an input label. For example, the input data may be clinical data such as an arrhythmia cardiogram collected from a patient, and the input label may be a clinical label such as a source location or a type of the arrhythmia represented by the arrhythmia cardiogram. A cardiogram may be represented as an image or a voltage-time series or data derived from the image or voltage-time series (e.g., QRS integral). (Note: An image represents a voltage-time series.) The clinical data sets may be derived from an electronic health record (EHR) system. An ablation may target a source location or another location. An input label may be an ablation location and/or a source location. Although the TDAV system is described primarily as processing cardiograms that are electrocardiograms (ECGs) represented as voltage-time series, the TDAV system may process vectorcardiograms (VCGs) represented as direction and magnitude of electrical activity at time intervals. An ECG may be converted to a VCG and vice versa depending on whether the TDAV system is adapted to directly process ECGs or VCGs.

[0011] However, because the arrhythmia cardiogram and the ablation location of a clinical data set would typically be specified by a person, the specified arrhythmia cardiogram may cover more than or less than the actual arrhythmia cardiogram (e.g., including a non-arrhythmia portion or not including the full arrhythmia portion) and/or the specified ablation location may be somewhat different from the actual ablation location. As a result, a target ML model that is trained using a clinical data set that includes such a specified arrhythmia cardiogram and/or specified ablation location may have its accuracy diminished. Such a clinical data set is considered to be training data that is not valid. Moreover, it would be impractical for various reasons for a person to review clinical data and specify an accurate arrhythmia cardiogram and an ablation location. One reason is that since the volume of clinical data needed to train an ML model may be very large, it would be extremely time consuming for a person to even review the clinical data of each EHR. Another reason is that specifying the relevant portion of an arrhythmia cardiogram is a subjective process and prone to error. Another reason is that a person would likely not know whether the specified ablation location is correct or not. Even if a person knew that it was incorrect, the person would likely not know what the actual ablation location was.

[0012] In some embodiments, the TDAV system validates the input data sets using simulated data sets that each includes simulated data and a simulated label. The simulated data sets may be generated by running simulations with a simulated label as a simulation parameter and with the simulated data derived from the results of the simulation. Continuing with the arrhythmia cardiogram example, a simulation may simulate electrical activity of a heart based on a cardiac specification that includes a simulated source location of an arrhythmia. A simulated cardiogram may be generated based on the simulated electrical activity. A simulated data set includes the simulated cardiogram as the simulated data and the simulated source location as the simulated label.

[0013] To determine if an input data set satisfies the valid training data criterion, the TDAV system determines “data similarity” between the input data of that input data set and the simulated data of a simulated data set. If the data similarity satisfies a data similarity criterion, then the TDAV system determines “label similarity” between the input label of that input data set and the simulated label of that simulated data set. If the label similarity satisfies a label similarity criterion, then the TDAV system designates the input data set as valid training data. The TDAV system may determine data similarity and label similarity for multiple simulated data sets to identify one for which the data similarity criterion and the label similarity criterion are both satisfied. If none can be identified, the TDAV system designates the input data set as not valid training data. The valid training data criterion is based on the data similarity criterion and the label similarity criterion. The TDAV system may determine label similarity before data similarity or may employ a combined criterion based on both input data and an input label. A combined criterion may weight data similarity and label similarity differently.
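As an illustrative, non-limiting sketch, the two-stage check described above might be expressed as follows. The function names, the toy similarity measure, and the threshold values are assumptions introduced for illustration only and are not taken from this application.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

# A data set is modeled here as (data, label); the label is a 3D location in mm.
DataSet = Tuple[Sequence[float], Tuple[float, float, float]]


def find_validating_simulation(
    input_set: DataSet,
    simulated_sets: Sequence[DataSet],
    data_similarity: Callable[[Sequence[float], Sequence[float]], float],
    label_distance: Callable[[Sequence[float], Sequence[float]], float],
    min_data_similarity: float = 0.9,   # hypothetical data similarity criterion
    max_label_distance: float = 1.0,    # hypothetical label similarity criterion
) -> Optional[int]:
    """Return the index of a simulated set that validates input_set, else None."""
    input_data, input_label = input_set
    for index, (sim_data, sim_label) in enumerate(simulated_sets):
        # Stage 1: the data must be similar enough.
        if data_similarity(input_data, sim_data) < min_data_similarity:
            continue
        # Stage 2: only then are the labels compared.
        if label_distance(input_label, sim_label) <= max_label_distance:
            return index
    return None  # no simulated set satisfies both criteria


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in similarity in (0, 1]; equals 1.0 when the series are identical."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


simulated = [
    ([0.0, 1.0, 0.0], (10.0, 0.0, 0.0)),  # similar data, distant label
    ([0.0, 1.0, 0.1], (0.0, 0.0, 0.5)),   # similar data, nearby label
]
clinical = ([0.0, 1.0, 0.0], (0.0, 0.0, 0.0))
match = find_validating_simulation(clinical, simulated, toy_similarity, math.dist)
print("valid" if match is not None else "not valid", match)
```

In this sketch an input data set is designated valid as soon as one simulated data set satisfies both criteria, which mirrors the search over multiple simulated data sets described above.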

[0014] Continuing with the arrhythmia cardiogram example, the TDAV system attempts to identify a simulated data set in which the simulated cardiogram is similar to a clinical cardiogram of a clinical data set and the simulated source location is similar to the clinical source location of that clinical data set. If one can be identified, the TDAV system designates the input data set as valid training data. Otherwise, the TDAV system designates the input data set as not valid training data. The TDAV system may employ various techniques to determine the similarity between cardiograms such as cosine similarity, sum of the squares of the differences, and Pearson correlation coefficient. When cosine similarity is used, a cardiogram similarity criterion may be based on the cosine being above a threshold such as 0.9. The TDAV system may determine similarity between source locations based on distance between the source locations. A source location similarity criterion may be that the source locations are within 1.0 mm or within a distance that may vary based on, for example, the cardiac segment (e.g., American Heart Association segments) in which the source locations are located.
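For illustration only, the three cardiogram similarity measures named above and a distance-based source location criterion might be computed as in the following sketch. It assumes that the two cardiograms have already been resampled to equal-length voltage-time series and that source locations are 3D coordinates in millimeters; the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length series (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sum_squared_differences(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the squares of the sample-wise differences (lower means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson coefficient, computed as the cosine similarity of the mean-centered series."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def locations_similar(p: Sequence[float], q: Sequence[float], max_mm: float = 1.0) -> bool:
    """Source location criterion: Euclidean distance within max_mm millimeters."""
    return math.dist(p, q) <= max_mm


clinical = [0.0, 0.4, 1.0, 0.3, 0.0]
simulated = [0.0, 0.5, 1.0, 0.3, 0.1]
# Cardiogram criterion from the text: cosine similarity above a threshold such as 0.9.
print(cosine_similarity(clinical, simulated) > 0.9)
```

Cosine similarity and Pearson correlation are insensitive to overall amplitude scaling, whereas the sum of the squares of the differences is not, so the choice of measure may depend on whether amplitudes are normalized beforehand.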

[0015] A target ML model may be trained using training data that includes both simulated data sets and input data sets that are designated as valid training data. Thus, the training data may be considered to be the simulated data sets augmented with those input data sets. The target ML model may be trained using supervised learning, semi-supervised learning, or unsupervised learning techniques. The target ML model may employ various ML architectures such as a neural network (NN), a convolutional neural network (CNN), a recurrent neural network (RNN), a support vector machine (SVM), and a transformer. Various ML architectures are described below.

[0016] Once the target ML model is trained, it may be used to identify an input label given input data. For example, the input data may be an arrhythmia cardiogram collected from a patient during an ablation procedure. The arrhythmia cardiogram is input to the target ML model which outputs information, such as a source location, an ablation pattern, or a target location, to help inform the ablation procedure. The patient may then be treated based on the output information. For example, an electrophysiologist may perform an ablation on the patient targeting the target location and with the ablation pattern. As another example, if the electrophysiologist determines that an ablation at the target location would be too risky (e.g., given extensive scarring near the target location), the patient may be treated with medicine such as an anticoagulant, a sodium channel blocker, a beta blocker, and so on.

[0017] To generate the simulated data sets, the TDAV system runs simulations of electrical activity of a heart that are each based on a cardiac specification. A cardiac specification specifies various cardiac characteristics including anatomical characteristics and electrophysiological characteristics such as chamber size, conduction velocity, action potential, scar tissue, ablation location, and source location of an arrhythmia. Techniques for running such simulations are described in U.S. Pat. No. 10,860,754 titled “Calibration of Simulated Cardiograms” and issued on December 8, 2020, which is hereby incorporated by reference. For each simulation, a simulated cardiogram is generated based on the simulated electrical activity and a thorax specification. The simulated data set for that simulation is the simulated cardiogram as the simulated data and the source location as the simulated label. A cardiogram may be an ECG or a VCG. An ECG may have one or more leads such as a 1-lead ECG, a 3-lead ECG, and a 12-lead ECG.

[0018] In some embodiments, the TDAV system employs computational modeling to simulate the electromagnetic (EM) output of a heart over time based on a source configuration of the heart. The EM output may represent an electrical voltage, a current, a magnetic field, and so on. The source configuration may include information on cardiac geometry, cardiac muscle fibers, scar locations, a source location of an arrhythmia, electrical properties (e.g., action potential and conduction velocity), and so on, and the EM output is a collection of the electrical characteristics (e.g., voltages) at various heart locations within the myocardium over time. The source configurations may be derived from simulated data, clinical data, or patient-specific data. To generate the EM output, a simulation may be performed with simulation steps at step intervals (e.g., 1 ms) to generate an EM mesh for each step. The EM mesh may be a finite-element three-dimensional (3D) mesh that stores an EM value (e.g., voltage) at each heart location (i.e., vertex of the mesh) for that step. For example, the left ventricle may be defined as having approximately 70,000 heart locations with the EM mesh storing an EM value for each heart location. With such an EM mesh, a three-second simulation with a step interval of 1 ms would generate 3,000 sets of 70,000 EM values. The sets of EM values are the EM output of the simulation. Computational modeling is described in Villongco, C., Krummen, D., et al., “Patient-Specific Modeling of Ventricular Activation Pattern using Surface ECG-derived Vectorcardiogram in Bundle Branch Block,” Progress in Biophysics and Molecular Biology, vol. 115, iss. 2-3, Aug. 2014, pp. 305-313. The TDAV system also generates an ECG or other cardiogram for each simulation based on the EM outputs of the simulation steps assuming various thoracic characteristics (e.g., body fat composition).
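The size of the EM output in the example above can be checked with a short calculation. The following sketch is illustrative only; the function name is hypothetical, and the storage estimate assumes 4 bytes per EM value (e.g., a 32-bit floating-point voltage), which is an assumption not stated in this application.

```python
def em_output_size(duration_s: float, step_ms: float, heart_locations: int):
    """Return (number of simulation steps, total EM values) for one simulation."""
    steps = round(duration_s * 1000 / step_ms)  # one EM mesh per step
    return steps, steps * heart_locations


steps, values = em_output_size(duration_s=3.0, step_ms=1.0, heart_locations=70_000)
bytes_needed = values * 4  # assumed 4 bytes per stored EM value
print(steps, values, bytes_needed)  # 3000 210000000 840000000
```

Under these assumptions, a single three-second simulation of the left ventricle yields 210 million EM values, or roughly 840 MB before any compression.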

[0019] The TDAV system may be employed to validate clinical data sets that include clinical cardiograms and clinical labels other than a source location. For example, a clinical label may be a clinical ablation pattern used in a successful ablation or a clinical ablation location used in a successful ablation that is other than the source location of the arrhythmia. The simulated ablation patterns that are used to validate a clinical ablation pattern may be generated as described in U.S. Pat. No. 11,259,871 titled “Identify Ablation Pattern For Use In An Ablation” and issued on March 1, 2022, which is hereby incorporated by reference. A simulated ablation location may be based on a pulmonary vein isolation (PVI) even though the actual source location may be in the right pulmonary vein. Such a simulated ablation location may be identified using the techniques described in U.S. Pat. No. 11,259,871. A source location may be specified as being a general area such as in the right ventricular outflow tract (RVOT) or in the left ventricular outflow tract (LVOT).

[0020] The simulated data sets that are used to validate the clinical data set of a patient may be calibrated to the characteristics of the patient. For example, if the patient has scar tissue in a certain area of the heart, the simulated data sets used to validate may be limited to those of simulations with a cardiac specification that specifies scar tissue at that area of the simulated heart. Because the clinical data sets are validated based on simulated data sets generated based on cardiac characteristics that are similar to the patient’s cardiac characteristics, the chance of incorrectly validating a clinical data set is reduced. Techniques for such calibrating are described in U.S. Pat. No. 10,860,754.
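As a minimal sketch of the calibration described above, the simulated data sets may be filtered so that only those whose cardiac specification includes scar tissue in the patient’s scarred regions are used for validation. The class, field names, and region names below are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class SimulatedDataSet:
    scar_regions: FrozenSet[str]                  # from the cardiac specification
    source_location: Tuple[float, float, float]   # simulated label
    cardiogram: Tuple[float, ...]                 # simulated data


def calibrate(simulated_sets: Iterable[SimulatedDataSet],
              patient_scar_regions: Iterable[str]) -> List[SimulatedDataSet]:
    """Keep only simulations whose specification has scar in every patient scar region."""
    required = frozenset(patient_scar_regions)
    return [s for s in simulated_sets if required <= s.scar_regions]


library = [
    SimulatedDataSet(frozenset({"lateral wall"}), (1.0, 2.0, 3.0), (0.0, 0.5, 1.0)),
    SimulatedDataSet(frozenset(), (4.0, 5.0, 6.0), (0.0, 0.4, 0.9)),
    SimulatedDataSet(frozenset({"lateral wall", "apex"}), (7.0, 8.0, 9.0), (0.1, 0.6, 1.0)),
]
calibrated = calibrate(library, ["lateral wall"])
print(len(calibrated))  # 2
```

A similar filter could be applied to other cardiac characteristics, such as chamber size or conduction velocity, by matching on ranges rather than on region names.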

[0021] The TDAV system may, before validating a clinical data set, refine the clinical cardiogram of that clinical data set to represent more accurately the actual arrhythmia portion of a cardiogram. As described above, a clinical cardiogram may not be an accurate representation because, for example, the person who selected the clinical cardiogram may not have sufficient experience to do so accurately. The extents of the clinical cardiogram (e.g., start time and end time) may be adjusted as described in PCT Pub. No. 2024/044719 titled “Automatic Refinement of Electrogram Selection” and published on February 29, 2024, which is hereby incorporated by reference.

[0022] The TDAV system may also employ a validation ML model to validate clinical data sets. The validation ML model may be trained with simulated data sets that are labeled as being valid training data and derived data sets that are labeled as invalid training data. The derived data sets may be generated using techniques such as varying the extent of a simulated cardiogram or a clinical cardiogram as described in PCT Pub. No. 2024/044719. For example, a derived data set has a derived cardiogram that is derived from a simulated cardiogram or a clinical cardiogram by, for example, varying the extent of, adjusting baseline drift of, and/or adding noise to that simulated cardiogram. A derived data set may include a clinical data set whose clinical cardiogram has not been modified. The TDAV system may determine whether the derived data set is a valid training data set based on similarity to the simulated data sets as described above. The training data for the validation ML model may include the simulated data sets labeled as valid and the invalid derived data sets labeled as invalid. To determine whether a clinical data set is valid, the clinical data set is input to the validation ML model which outputs an indication of whether the clinical data is valid or invalid. Also, the valid derived data sets may be used as additional training data for the target ML model.
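As an illustrative sketch, a derived cardiogram may be produced from a base voltage-time series by varying its extent, adding a baseline drift, and adding noise. The linear drift model, the Gaussian noise model, and all parameter names below are assumptions chosen for illustration; the referenced publication may use different techniques.

```python
import random
from typing import List, Optional, Sequence


def derive_cardiogram(
    base: Sequence[float],
    start: int,
    end: int,
    drift_per_sample: float = 0.0,   # assumed linear baseline drift
    noise_std: float = 0.0,          # assumed zero-mean Gaussian noise
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return base[start:end] with linear drift and Gaussian noise added."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    segment = base[start:end]   # vary the extent (start and end times)
    return [
        value + drift_per_sample * i + rng.gauss(0.0, noise_std)
        for i, value in enumerate(segment)
    ]


base = [0.0, 1.0, 2.0, 3.0, 4.0]
print(derive_cardiogram(base, 1, 4, drift_per_sample=0.5))  # [1.0, 2.5, 4.0]
noisy = derive_cardiogram(base, 0, 5, noise_std=0.05, rng=random.Random(42))
```

Each derived cardiogram would then be checked against the simulated data sets as described above, with those that fail the check labeled as invalid for training the validation ML model.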

[0023] In some embodiments, the TDAV system may be employed to validate clinical data sets with a data label that indicates whether a source location is in the LVOT or RVOT. The source location may be specified as being LVOT or RVOT or specified as being within a subdivision of an OT. A subdivision may be RV septum, RV free wall, RV near the His-bundle, LV endocardium, and so on. The TDAV system may employ various OT algorithms to perform the validation. For example, one such algorithm analyzes morphology of an ECG to identify various characteristics indicative of a subdivision. The analysis may be based on magnitude of the QRS complex (e.g., in leads II, III, aVF, aVR, and aVL), presence of an S-wave, and so on. Such an algorithm is described in Ito, S., Tada, H., Naito, S., Kurosaki, K., Ueda, M., Hoshizaki, H., Miyamori, I., Oshima, S., Taniguchi, K. and Nogami, A., 2003. Development and validation of an ECG algorithm for identifying the optimal ablation site for idiopathic ventricular outflow tract tachycardia. Journal of cardiovascular electrophysiology, 14(12), pp.1280-1286, which is hereby incorporated by reference. As another example, machine learning algorithms may be employed to classify arrhythmias as LVOT or RVOT. Such algorithms are described in (1) Doste, R., Lozano, M., Jimenez-Perez, G., Mont, L., Berruezo, A., Penela, D., Camara, O. and Sebastian, R., 2022. Training machine learning models with synthetic data may improve the prediction of ventricular origin in outflow tract ventricular arrhythmias. Frontiers in Physiology, 13, p.909372 and (2) Zheng, J., Fu, G., Abudayyeh, I., Yacoub, M., Chang, A., Feaster, W.W., Ehwerhemuepha, L., El-Askary, H., Du, X., He, B. and Feng, M., 2021. A high-precision machine learning algorithm to classify left and right outflow tract ventricular tachycardia. Frontiers in physiology, 12, p.641066, which are hereby incorporated by reference.
More generally, various algorithms may be employed to identify a source location that may be in an OT or not in an OT. For example, a source location may be in the left atrium. One such algorithm is described in Mohammadi, F., Sheikhani, A., Razzazi, F. and Sharif, A.G., 2021. Non-invasive localization of the ectopic foci of focal atrial tachycardia by using ECG signal based sparse decomposition algorithm. Biomedical Signal Processing and Control, 70, p.103014, which is hereby incorporated by reference.

[0024] To validate a clinical data set based on such an OT algorithm, the TDAV system inputs a clinical cardiogram into the OT algorithm which outputs an OT source location. If the OT source location satisfies a source location similarity criterion, then the clinical data set is valid training data. The TDAV system may employ multiple OT algorithms to validate a clinical data set. When multiple OT algorithms are employed, the TDAV system inputs a clinical cardiogram to each OT algorithm which generates an OT source location. If the OT source locations satisfy a combined source location similarity criterion, then the TDAV system designates the clinical data set as valid training data. Various combined source location similarity criteria may be employed. One combined source location similarity criterion may weight the OT algorithms differently. The weight of each OT algorithm may be based on its sensitivity and/or specificity (or other assessment measure) which may be derived from a publication relating to that OT algorithm or which may be calculated using simulated data sets. An OT algorithm with a sensitivity and/or specificity that is higher than that of another OT algorithm would be given a higher weight. For example, if a higher weighted OT algorithm identifies a source location that is 0.2 mm from the clinical source location and a lower weighted OT algorithm identifies a source location that is only 0.05 mm from the clinical source location, the source location may be considered to not satisfy the combined source location similarity criterion. Conversely, if a higher weighted OT algorithm identifies a source location that is only 0.05 mm from the clinical source location and a lower weighted OT algorithm identifies a source location that is 0.2 mm from the clinical source location, the source location may be considered to satisfy the combined source location similarity criterion.
Another combined source location similarity criterion may be based on majority or super-majority of the OT algorithms each satisfying the source location similarity criterion. For example, if 4 out of 5 OT algorithms identify a source location that is less than or equal to 0.1 mm from the clinical source location, the clinical data set may be considered valid.
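The two combined criteria described above might be sketched as follows. The description does not specify how the weights are combined, so the weighted-mean-distance rule, the weights of 0.8 and 0.2, and the 0.1 mm thresholds below are assumptions chosen so that the sketch reproduces the outcomes of the examples in the text.

```python
from typing import Sequence


def weighted_criterion(distances_mm: Sequence[float],
                       weights: Sequence[float],
                       max_weighted_mm: float = 0.1) -> bool:
    """Assumed rule: the weighted mean distance must not exceed max_weighted_mm."""
    weighted_mean = sum(d * w for d, w in zip(distances_mm, weights)) / sum(weights)
    return weighted_mean <= max_weighted_mm


def supermajority_criterion(distances_mm: Sequence[float],
                            max_mm: float = 0.1,
                            min_fraction: float = 0.8) -> bool:
    """At least min_fraction of the algorithms must each be within max_mm."""
    agreeing = sum(1 for d in distances_mm if d <= max_mm)
    return agreeing >= min_fraction * len(distances_mm)


weights = [0.8, 0.2]  # hypothetical weights: first algorithm is more sensitive/specific
print(weighted_criterion([0.2, 0.05], weights))   # False: higher-weighted algorithm is far
print(weighted_criterion([0.05, 0.2], weights))   # True: higher-weighted algorithm is close
print(supermajority_criterion([0.05, 0.1, 0.08, 0.02, 0.5]))  # True: 4 of 5 within 0.1 mm
```

Other combinations are possible, such as requiring the highest-weighted algorithm alone to satisfy the criterion or weighting votes rather than distances.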

[0025] An OT ML model may be trained using the OT clinical data sets that are determined to be valid. The training data may include feature vectors with features derived from various leads (e.g., II and aVL) of an ECG such as QRS complexes and S waves and labels relating to OT. The TDAV system may also generate additional training data based on the clinical data sets. To generate the additional training data, a derived ECG may be derived from an ECG of a valid clinical data set, which is referred to as a base ECG. The various OT algorithms may be applied to a derived ECG to determine based on the label of its base ECG whether the derived ECG and the label represent a valid derived clinical data set. If so, the derived clinical data set may be used as training data. Techniques for generating derived ECGs from base ECGs are described in PCT App. No. PCT/US2023/072866. The ‘866 application also describes the automatic refinement of an ECG (e.g., a manually selected ECG) so that it more accurately represents a portion related to a characteristic of interest (e.g., source location). The derived ECGs and automatically refined ECGs may be based on clinical ECGs of clinical data sets representing successful OT-related ablations. The derived ECGs and labels may also be validated based on an initial OT ML model that is trained using clinical data sets (and possibly derived clinical data sets). To validate a derived ECG, the derived ECG is input to the initial OT ML model to generate a label. If the label is the same (or nearly the same based on a similarity criterion) as that of the valid clinical data set from which the derived ECG is derived, then the derived ECG and its label may be added to the training data for additional training of the OT ML model.

[0026] To generate additional training data, a generative ML model may be trained using valid clinical data sets. For example, a generative adversarial network (GAN) may be trained using valid clinical ECGs associated with the same label (e.g., RV septum or RV free wall). Once trained, the generator of the GAN may be used to generate ECGs corresponding to that same label. As another example, a diffusion ML model may be similarly trained to generate ECGs. A diffusion ML model is described in Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695), which is hereby incorporated by reference.

[0027] The techniques described above for validating clinical data using OT algorithms may also be employed to validate source locations that are not OT source locations. Examples of such algorithms are described in (1) Shahsavari, M., Delfan, N. and Forouzanfar, M., 2023. Localizing the Origin of Idiopathic Ventricular Arrhythmia from ECG Using an Attention-Based Recurrent Convolutional Neural Network. arXiv preprint arXiv:2302.10824 and (2) Mohammadi, F., Sheikhani, A., Razzazi, F. and Sharif, A.G., 2021. Non-invasive localization of the ectopic foci of focal atrial tachycardia by using ECG signal based sparse decomposition algorithm. Biomedical Signal Processing and Control, 70, p.103014, which are hereby incorporated by reference.

[0028] In some embodiments, the TDAV system may employ the UniTS (Unifying Time Series) model to both generate training data and classify training data as being valid or invalid. The UniTS model employs a transformer-based architecture that inputs sample tokens, prompt tokens, and task tokens. Each sample token may be a non-overlapping patch of a voltage-time series. The prompt tokens provide context for the generative and classification tasks. The prompt tokens may be, for example, “atrial” and “fibrillation” to generate a voltage-time series representing AF. The task tokens specify that a generative task and a classification task are to be performed. The generative task is to generate new voltage-time series, and the classification task is to classify the voltage-time series as being valid or invalid.
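As a minimal illustration of how a voltage-time series might be split into non-overlapping patches before being embedded as sample tokens, consider the following sketch. The function name is hypothetical, and dropping an incomplete trailing patch is an assumption made for simplicity; an actual implementation may instead pad the series.

```python
from typing import List, Sequence


def to_patches(series: Sequence[float], patch_len: int) -> List[List[float]]:
    """Split a series into consecutive non-overlapping patches of patch_len samples.

    Trailing samples that do not fill a whole patch are dropped (an assumption).
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    usable = len(series) - len(series) % patch_len
    return [list(series[i:i + patch_len]) for i in range(0, usable, patch_len)]


voltages = [0.0, 0.1, 0.4, 1.0, 0.3, 0.0, -0.1]
print(to_patches(voltages, 3))  # [[0.0, 0.1, 0.4], [1.0, 0.3, 0.0]]
```

Each patch would then be mapped to an embedding vector and combined with the prompt tokens and task tokens as input to the model.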

[0029] The UniTS model comprises a series of UniTS blocks that each includes a time multi-headed self-attention (MHSA) component, a voltage MHSA component, and a dynamic feed forward network (FFN) component. Each component is associated with a gate module that inputs the output of the component, scales the elements of the input, and outputs the scaled elements. The output of the series of UniTS blocks is input to a generative tower and a classification tower. The generative tower transforms the input to a voltage-time series, and the classification tower generates a classification (e.g., valid or invalid) for the voltage-time series. The TDAV system may employ a pre-trained UniTS model that is then further trained using simulated voltage-time series to customize the model for generating and classifying voltage-time series data. The UniTS model is described in Gao, S., Koker, T., Queen, O., Hartvigsen, T., Tsiligkaridis, T. and Zitnik, M., 2024. UniTS: A Unified Multi-Task Time Series Model. arXiv preprint arXiv:2403.00131, which is hereby incorporated by reference.

[0030] The UniTS model may also be employed to separately generate valid voltage-time series and to validate existing (e.g., clinical) voltage-time series. When generating, an initial voltage-time series is input to the UniTS model to generate a number of next voltage-time values. The TDAV system may repeatedly input the previously input voltage-time series concatenated with the previously generated next voltage-time values to incrementally generate a valid voltage-time series. When classifying, a voltage-time series is input to the UniTS model which outputs the classification. The training for the generation and classification tasks is performed in parallel so that the weights that are learned support both tasks.
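The incremental generation described above can be sketched as a simple loop. The `predict_next` callable below stands in for the trained model and is purely hypothetical; the sketch shows only the concatenate-and-repeat control flow, not the UniTS model itself.

```python
from typing import Callable, List, Sequence


def generate_series(
    initial: Sequence[float],
    predict_next: Callable[[Sequence[float]], Sequence[float]],
    target_len: int,
) -> List[float]:
    """Repeatedly feed the series so far to the model and append its next values."""
    series = list(initial)
    while len(series) < target_len:
        next_values = predict_next(series)
        if not next_values:
            raise ValueError("predict_next returned no values")
        series.extend(next_values)  # previous input concatenated with new values
    return series[:target_len]


def repeat_last_two(series: Sequence[float]) -> Sequence[float]:
    """Toy stand-in for the model: predicts that the last two samples repeat."""
    return list(series[-2:])


print(generate_series([0.0, 1.0], repeat_last_two, 7))
# [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

In practice the model may accept only a bounded context, in which case only the most recent portion of the series would be input at each iteration.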

[0031] The techniques described above may also be employed to validate clinical data relating to atrial flutter (AFL) for training an AFL ML model. A clinical data set may comprise an AFL ECG and flutter data of an EHR. The flutter data may include flutter type, source location, ablation pattern, and so on. A clinical data set may be validated using various AFL algorithms. Examples of such AFL algorithms are described in (1) Luongo, G., Vacanti, G., Nitzke, V., Nairn, D., Nagel, C., Kabiri, D., Almeida, T.P., Soriano, D.C., Rivolta, M.W., Ng, G.A. and Dossel, O., 2022. Hybrid machine learning to localize atrial flutter substrates using the surface 12-lead electrocardiogram. EP Europace, 24(7), pp.1186-1194 and (2) Azman, M.H.K., Meste, O., Kadir, K., Laţcu, D.G., Saoudi, N. and Bun, S.S., 2021. Variability in the atrial flutter vectorcardiographic loops and non-invasive localization of circuits. Biomedical Signal Processing and Control, 66, p.102472, which are hereby incorporated by reference. An AFL ML model is described in PCT Pub. No. 2024/173597 titled “Atrial Flutter Classification System” and published on August 22, 2024, which is hereby incorporated by reference. An assessment of the effectiveness of an ablation pattern may be based on simulations of cardiac electrical activity assuming that ablation pattern. Techniques for evaluating the effectiveness of an ablation pattern are described in U.S. Pat. No. 11,259,871 titled “Identify Ablation Pattern for Use in an Ablation” and issued on March 1, 2022, which is hereby incorporated by reference.

[0032] The techniques described above may also be employed to validate clinical data relating to heart failure risk (HFR) for training an HFR ML model and myocardial scarring (MS) for training an MS ML model. A clinical data set relating to HFR may include an ECG, various cardiac characteristics, and an indication relating to heart failure. A clinical data set relating to MS may include an ECG, various cardiac characteristics, and scar data. The scar data may include scar shape, scar location, and percent of scarring (e.g., 30% of lateral wall). These clinical data sets may be validated using various HFR algorithms and MS algorithms. Examples of such algorithms are described in (1) Boehmer, J.P., Hariharan, R., Devecchi, F.G., Smith, A.L., Molon, G., Capucci, A., An, Q., Averina, V., Stolen, C.M., Thakur, P.H. and Thompson, J.A., 2017. A multisensor algorithm predicts heart failure events in patients with implanted devices: results from the MultiSENSE study. JACC: Heart Failure, 5(3), pp.216-225 and (2) Virmani, R. and Roberts, W.C., 1980. Quantification of coronary arterial narrowing and of left ventricular myocardial scarring in healed myocardial infarction with chronic, eventually fatal, congestive cardiac failure. The American journal of medicine, 58(6), pp.831-838, which are hereby incorporated by reference.

[0033] Although described primarily in the context of clinical cardiograms, the TDAV system may be employed to validate other types of electrograms that represent electrical activity of other electromagnetic sources within a body such as a brain (e.g., electroencephalogram) or gastrointestinal tract (e.g., gastroenterogram) or, more generally, electromagnetic waveforms whose regions are mapped to data such as radar waves with regions or signatures mapped to reflecting object types. Techniques for simulating electrical activity of the digestive system are described in U.S. Pub. No. 2022/0415518 titled “Digestive System Simulation and Pacing” and published on December 29, 2022. In the case of radar waves, the simulated radar waves may be generated by simulation of reflected signals that would be received based on a transmitted signal and a reflecting object (e.g., drone). In such a case, the simulated data may be a description of transmitted and received signals and the simulated label may be object type (e.g., drone type). The data sets to be validated may be actual transmitted and received signals along with an indication of the object type of the reflecting object. Techniques for identifying the type of an object are described in U.S. Pat. No. 10,353,052 titled “Object Discrimination Based On A Swarm of Agents” and issued on July 16, 2019, which is hereby incorporated by reference. As another example, the TDAV system may be employed to generate augmented training data for training an autonomous vehicle (AV) ML model. The AV ML model may input data collected by an AV (e.g., photograph or radar signals) and output controls for the AV (e.g., turn right or stop). The training data may be collected from AVs labeled with a control or simulated using an artificial intelligence (AI) program to generate simulated images of the environment in which the AV is traveling and simulated controls for the AV. 
The AV may be, for example, an automobile, an aerial drone, a robot, and so on. As another example, the TDAV system may be employed to generate augmented training data for training various robot devices such as used in automotive manufacturing, electronics manufacturing (e.g., assembly of circuit boards and inspection of micro components using cameras and sensors), pharmaceutical and medical device manufacturing (e.g., automatic packing and labeling of prescription medicines), food and beverage manufacturing (e.g., sort and quality control of fruits), aerospace manufacturing (e.g., inspection of materials and components), and logistics and warehousing (e.g., sorting and moving goods in a warehouse and packing orders for e-commerce platforms).

[0034] Figure 1 is a flow diagram that illustrates the processing of a TDAV system in some embodiments. The TDAV system 100 identifies clinical data sets that are valid training data for a target ML model. In block 101, the TDAV system selects the next clinical data set. In decision block 102, if all the clinical data sets have already been selected, then the TDAV system completes, else the TDAV system continues at block 103. In block 103, the TDAV system selects the next simulated data set. In decision block 104, if all the simulated data sets have already been selected, then the TDAV system loops to block 101 to select the next clinical data set, else the TDAV system continues at block 105. In block 105, the TDAV system generates a similarity score that may be a combined similarity score based on cardiogram similarity and source location similarity. Alternatively, a similarity score may be generated based on the cardiograms and a separate similarity score may be generated based on the source locations. In decision block 106, if the similarity score (or similarity scores) satisfy a valid training data criterion, then the TDAV system continues at block 107, else the TDAV system loops to block 103 to select the next simulated data set. In block 107, the TDAV system adds the clinical data set to training data for the target ML model and loops to block 101 to select the next clinical data set.
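
The nested loop of Figure 1 may be sketched in Python as follows. This is an illustrative sketch only: the `DataSet` class, the use of Pearson correlation as the cardiogram similarity, the Euclidean distance between source locations, and the two thresholds are assumptions chosen for the example rather than details taken from the description.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class DataSet:
    cardiogram: Sequence[float]                   # voltage samples
    source_location: Tuple[float, float, float]   # hypothetical x, y, z in mm


def cardiogram_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (assumed similarity metric)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def satisfies_criterion(clinical: DataSet, simulated: DataSet,
                        min_corr: float = 0.9, max_dist_mm: float = 10.0) -> bool:
    """Blocks 105-106: separate scores for cardiograms and source locations."""
    corr = cardiogram_similarity(clinical.cardiogram, simulated.cardiogram)
    dist = math.dist(clinical.source_location, simulated.source_location)
    return corr >= min_corr and dist <= max_dist_mm


def select_valid_training_data(clinical_sets: List[DataSet],
                               simulated_sets: List[DataSet]) -> List[DataSet]:
    training_data = []
    for clinical in clinical_sets:            # blocks 101-102
        for simulated in simulated_sets:      # blocks 103-104
            if satisfies_criterion(clinical, simulated):
                training_data.append(clinical)  # block 107
                break                           # continue with next clinical data set
    return training_data
```

A clinical data set is kept as soon as one simulated data set satisfies the criterion; a clinical data set that matches no simulated data set is simply not added.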

[0035] Figure 2 is a block diagram that illustrates the components of a TDAV system in some embodiments. A TDAV system 200 includes a TDAV controller component 201, a run simulations component 202, a generate simulated data sets component 203, a validate clinical data sets component 204, and a train ML model component 205. The TDAV system accesses a cardiac specifications data store 211, a simulated electrical activity data store 212, a simulated data sets data store 213, a clinical data sets data store 214, an ML training data data store 215, and an ML weights data store 216. The TDAV controller invokes the run simulations component, the generate simulated data sets component, the validate clinical data sets component, and the train ML model component to augment the simulated data sets with clinical data sets for training the ML model. The cardiac specifications data store stores the cardiac specifications used by the run simulations component to simulate electrical activity that is stored in the simulated electrical activity data store. The simulated data sets data store stores simulated data sets that are generated by the generate simulated data sets component based on the cardiac specifications and simulated electrical activity. The clinical data sets data store stores clinical data sets that may be retrieved from EHRs. The validate clinical data sets component validates the clinical data sets based on the simulated data sets and stores the validated clinical data sets in the ML training data data store. The train ML model component trains the target ML model using the ML training data that includes valid clinical data sets and simulated data sets. The weights (and biases) learned during training are stored in the ML weights data store.

[0036] The computing systems (e.g., network nodes or collections of network nodes) on which the TDAV system and the other described systems may be implemented may include a central processing unit, input devices, output devices (e.g., display devices and speakers), storage devices (e.g., memory and disk drives), network interfaces, graphics processing units, communications links (e.g., Ethernet, Wi-Fi, cellular, and Bluetooth), global positioning system devices, and so on. The input devices may include keyboards, pointing devices, touch screens, gesture recognition devices (e.g., for air gestures), head and eye tracking devices, microphones for voice recognition, and so on. The computing systems may include high-performance computing systems, distributed systems, cloud-based computing systems, client computing systems that interact with cloud-based computing systems, desktop computers, laptops, tablets, smartphones, servers, and so on. The computing systems may access computer-readable media that include computer-readable storage mediums and data transmission mediums. The computer-readable storage mediums are tangible storage means that do not include a transitory, propagating signal. Examples of computer-readable storage mediums include memory such as primary memory, cache memory, and secondary memory (e.g., DVD), and other storage. The computer-readable storage media may have recorded on them or may be encoded with computer-executable instructions or logic that implements the described systems. The data transmission media are used for transmitting data via transitory, propagating signals or carrier waves (e.g., electromagnetism) via a wired or wireless connection. 
The computing systems may include a secure crypto processor as part of a central processing unit (e.g., Intel Software Guard Extensions (SGX)) for generating and securely storing keys and for encrypting and decrypting data using the keys and for securely executing all or some of the computer-executable instructions of the system. Some of the data sent by and received by the systems may be encrypted, for example, to preserve personal privacy (e.g., to comply with government regulations, such as the European General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA) of the United States). The systems may employ asymmetric encryption (e.g., using private and public keys of the Rivest-Shamir-Adleman (RSA) standard) or symmetric encryption (e.g., using a symmetric key of the Advanced Encryption Standard (AES)).

[0037] The one or more computing systems may include client-side computing systems and cloud-based computing systems (e.g., public or private) that each execute computer-executable instructions of the systems. A client-side computing system, such as a device in an ablation procedure room, may send data to and receive data from one or more servers of the cloud-based computing systems of one or more cloud data centers. For example, a client-side computing system may send a request to a cloud-based computing system to perform tasks such as identify a source location given an ECG. A cloud-based computing system may respond to the request by sending to the client-side computing system data derived from performing the task. The servers may perform computationally expensive tasks in advance of processing by a client-side computing system, such as training a machine learning model, or in response to data received from a client-side computing system. A client-side computing system may provide a user experience (e.g., user interface) to a user of the systems. The user experience may originate from a client computing device or a server computing device. For example, a client computing device may generate a graphic representing the ECG received from the ECG processing system. Alternatively, a cloud-based computing system may generate the graphic (e.g., in a Hypertext Markup Language (HTML) format or an Extensible Markup Language (XML) format) and provide it to the client-side computing system for display. A client-side computing system (e.g., ablation planning device) may also send data to and receive data from various medical devices, such as an arrhythmia mapping system, an EHR system, and so on. The data received from the medical devices may include an ECG, actual ablation characteristics (e.g., ablation location and ablation pattern), and so on. 
The term “cloud-based computing system” may encompass computing systems of a public cloud data center provided by a cloud provider (e.g., Azure provided by Microsoft Corporation) or computing systems of a private server farm (e.g., operated by a hospital for its internal use).

[0038] Figure 3 is a flow diagram that illustrates the processing of a generate simulated data sets component of the TDAV system in some embodiments. The generate simulated data sets component 300 runs simulations based on cardiac specifications to generate simulated electrical activity and then generates simulated data sets based on the simulated electrical activity. In block 301, the component selects the next cardiac specification. In decision block 302, if all the cardiac specifications have already been selected, then the component completes, else the component continues at block 303. In block 303, the component invokes the run simulations component to simulate electrical activity based on the cardiac specification. In block 304, the component generates a simulated ECG based on the simulated electrical activity. In block 305, the component stores a simulated data set that includes the simulated ECG and the source location used in the selected simulation and then loops to block 301 to select the next cardiac specification.

[0039] Figure 4 is a flow diagram that illustrates the processing of a generate derived ECGs component of the TDAV system in some embodiments. The generate derived ECGs component 400 generates derived ECGs from clinical ECGs so that the derived ECGs may be used to further augment the training data for a target ML model. In block 401, the component selects the next clinical data set. In decision block 402, if all the clinical data sets have already been selected, then the component completes, else the component continues at block 403. In block 403, the component selects a next derivation technique. A derivation technique may include a combination of subderivation techniques such as increasing the start time and the end time of and adding noise to a clinical ECG. In decision block 404, if all the derivation techniques have already been selected, then the component loops to block 401 to select the next clinical data set, else the component continues at block 405. In block 405, the component generates a derived ECG based on the selected derivation technique. In block 406, the component stores a derived data set that includes the derived ECG and then loops to block 403 to select the next derivation technique.
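
One derivation technique of the kind described above, shifting the start and end times and adding noise, may be sketched as follows. The sketch assumes that the clinical ECG is a window into a longer recording and that the noise is Gaussian; the function name `derive_ecg` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence


def derive_ecg(recording: Sequence[float], start: int, end: int,
               shift: int, noise_sigma: float, rng: random.Random) -> List[float]:
    """Shift the [start, end) window by `shift` samples and add Gaussian noise."""
    new_start, new_end = start + shift, end + shift
    if new_start < 0 or new_end > len(recording):
        raise ValueError("shifted window falls outside the recording")
    window = recording[new_start:new_end]
    # Each sample is perturbed independently; noise_sigma = 0 leaves it unchanged.
    return [v + rng.gauss(0.0, noise_sigma) for v in window]


rng = random.Random(0)                      # fixed seed for reproducibility
recording = [float(i) for i in range(10)]   # toy stand-in for a clinical recording
derived = [derive_ecg(recording, 0, 4, shift, 0.05, rng) for shift in (1, 2, 3)]
```

Each derived ECG would be stored with the label of the clinical data set from which it was derived.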

[0040] An ML model (e.g., the target ML model and validation ML model) may be any of a variety or combination of supervised, semi-supervised, self-supervised, unsupervised, or reinforcement learning ML models including a neural network such as fully connected, convolutional, recurrent, or autoencoder neural network, or restricted Boltzmann machine, a support vector machine, a Bayesian classifier, k-means clustering, decision tree, generative adversarial networks, transformer, a diffusion model, and so on. When the ML model is a deep neural network, the model is trained using training data that includes features derived from data and labels corresponding to the data. For example, the data may be images of ECGs with a feature being the image itself, and the labels may be a characteristic indicated by the ECGs (e.g., source location). The training results in a set of weights for the activation functions of the layers of the deep neural network. The trained deep neural network can then be applied to new data to generate a label for that new data. When the ML model is a support vector machine, a hyper-surface is found to divide the space of possible inputs. For example, the hyper-surface attempts to split the positive examples (e.g., valid training data) from the negative examples (e.g., invalid training data) by maximizing the distance between the nearest of the positive and negative examples to the hyper-surface. The trained support vector machine can then be applied to new data to generate a classification (e.g., normal sinus rhythm or arrhythmia) for the new data. An ML model may generate classification values such as values of a discrete domain (e.g., AF v. VF) and/or values of a continuous domain (e.g., source location or probability of AFL).

[0041] Various techniques can be used to train a support vector machine such as adaptive boosting, which is an iterative process that runs multiple tests on a collection of training data. Adaptive boosting transforms a weak learning algorithm (an algorithm that performs at a level only slightly better than chance) into a strong learning algorithm (an algorithm that displays a low error rate). The weak learning algorithm is run on different subsets of the training data. The algorithm concentrates increasingly on those examples in which its predecessors tended to show mistakes. The algorithm corrects the errors made by earlier weak learners. The algorithm is adaptive because it adjusts to the error rates of its predecessors. Adaptive boosting combines rough and moderately inaccurate rules of thumb to create a high-performance algorithm. Adaptive boosting combines the results of each separately run test into a single, very accurate classifier. Adaptive boosting may use weak classifiers that are single-split trees with only two leaf nodes.
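
Adaptive boosting with single-split trees may be illustrated with the following minimal Python sketch on one-dimensional data with labels of +1 and -1. The function names and the toy data are illustrative assumptions; a practical implementation would handle multi-dimensional features.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, int]  # (vote weight, threshold, polarity)


def stump_predict(x: float, threshold: float, polarity: int) -> int:
    """Single-split tree with two leaves: +polarity at or above the threshold."""
    return polarity if x >= threshold else -polarity


def best_stump(xs: Sequence[float], ys: Sequence[int], weights: Sequence[float]):
    """Weak learner: pick the split with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs: Sequence[float], ys: Sequence[int], rounds: int) -> List[Stump]:
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble: List[Stump], x: float) -> int:
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On data such as `xs = [1, 2, 3, 4, 5, 6]` with `ys = [1, 1, -1, -1, 1, 1]`, no single split classifies every example correctly, whereas a weighted vote of three stumps does.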

[0042] A neural network model has three major components: architecture, loss function, and search algorithm. The architecture defines the functional form relating the inputs to the outputs (in terms of network topology, unit connectivity, and activation functions). The search in weight space for a set of weights that minimizes the loss function is the training process. A neural network model may use a radial basis function (RBF) network and a standard or stochastic gradient descent as the search technique with backpropagation.
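
The search in weight space may be illustrated with a deliberately small example: gradient descent on a single linear unit with a mean-squared-error loss. This is a sketch under simplifying assumptions (one weight, one bias, full-batch updates), not a full neural network with backpropagation through multiple layers.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float],
             learning_rate: float = 0.1, epochs: int = 500) -> Tuple[float, float]:
    """Minimize mean((w*x + b - y)^2) over w and b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # data from y = 2x + 1
```

Stochastic gradient descent follows the same update rule but estimates the gradients from a subset of the training data at each step.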

[0043] A CNN has multiple layers such as a convolutional layer, a rectified linear unit (ReLU) layer, a pooling layer, a fully connected (FC) layer, and so on. Some more complex CNNs may have multiple convolutional layers, pooling layers, and FC layers. Each layer includes a neuron for each output of the layer. A neuron inputs outputs of prior layers (or original input) and applies an activation function to the inputs to generate an output.

[0044] A convolutional layer may include multiple filters (also referred to as kernels or activation functions). A filter inputs a convolutional window, for example, of an image, applies weights to each pixel of the convolutional window, and outputs a value for that convolutional window. For example, if the static image is 256 by 256 pixels representing an ECG, the convolutional window may be 8 by 8 pixels. The filter may apply a different weight to each of the 64 pixels in a convolutional window to generate the value.

[0045] An activation function has a weight for each input and generates an output by combining the inputs based on the weights. The activation function may be an ReLU that sums the values of each input times its weight to generate a weighted value and outputs max(0, weighted value) to ensure that the output is not negative. The weights of the activation functions are learned when training an ML model. The ReLU function of max(0, weighted value) may be represented as a separate ReLU layer with a neuron for each output of the prior layer that inputs that output and applies the ReLU function to generate a corresponding “rectified output.”

[0046] A pooling layer may be used to reduce the size of the outputs of the prior layer by downsampling the outputs. For example, each neuron of a pooling layer may input 16 outputs of the prior layer and generate one output resulting in a 16-to-1 reduction in outputs.

[0047] An FC layer includes neurons that each inputs all the outputs of the prior layer and generates a weighted combination of those inputs. For example, if the penultimate layer generates 256 outputs and the FC layer includes a neuron for each of two classifications, each neuron inputs the 256 outputs and applies weights to generate a value for its classification.
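
The convolutional, ReLU, pooling, and FC layers described above may be sketched in plain Python as follows. The sketch assumes a stride of one without padding for the convolution and non-overlapping max pooling; the tiny image, filter, and weights are illustrative values only.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the filter over the image; each window yields one weighted sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def relu(feature_map: Matrix) -> Matrix:
    """max(0, value) applied to every output of the prior layer."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map: Matrix, size: int) -> Matrix:
    """Downsample with non-overlapping size-by-size windows (size=4 gives 16-to-1)."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


def fully_connected(inputs: List[float], weights: Matrix,
                    biases: List[float]) -> List[float]:
    """Each neuron combines all inputs using its own weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[-1.0, 0.0], [0.0, 1.0]]
pooled = max_pool(relu(conv2d(image, kernel)), size=2)
flat = [v for row in pooled for v in row]
scores = fully_connected(flat, weights=[[0.5], [-1.0]], biases=[0.0, 1.0])
```

Here `scores` holds one value per classification, corresponding to the two-neuron FC layer in the example above.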

[0048] One example of a CNN is a U-Net ML model. The U-Net ML model includes a contracting path and an expansive path. The contracting path includes a series of max pooling layers to reduce spatial information of the input image and increase feature information. The expansive path includes a series of upsampling layers to convert the feature information to the output image. The input and output of a U-Net represent an image such as an image of a patient ECG as input and an image of a base region as output.

[0049] A generative adversarial network (GAN) or an attribute GAN (AttGAN) may also be used. (See, Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen, “AttGAN: Facial Attribute Editing by Only Changing What You Want,” IEEE Transactions on Image Processing, 2019; and Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Nets,” Advances in Neural Information Processing Systems, pp. 2672-2680, 2014, which are hereby incorporated by reference.) The TDAV system may employ a GAN to generate data sets that may be considered valid training data or compared to simulated data sets to determine the generated data sets that are valid training data. A GAN employs a generator and discriminator and is trained using simulated training data. The generator generates generated data sets based on random input. The generator is trained to generate generated data sets that cannot be distinguished from simulated data sets. The discriminator indicates whether a data set is simulated or generated. The generator and discriminator are trained in parallel to learn weights. The generator is trained to generate increasingly more realistic generated data sets, and the discriminator is trained to discriminate between simulated data sets and generated data sets more effectively. After being trained, the generator can be used to generate generated data sets that can be employed to augment the training data or be validated before augmenting the training data.

[0050] The ML models that input a cardiogram input a feature vector of one or more features derived from the cardiogram. The features may include an image of the cardiogram, a voltage-time series specifying voltages and time increments of the cardiogram, images and voltage-time series of portions of the cardiogram (e.g., QRS complex), length in seconds of various intervals (e.g., R-R interval, QRS complex, T wave, T-Q interval, and Q-R interval), QRS integral, maximum, minimum, mean, and variance of voltages of portions of the cardiogram, a maximal vector of QRS loop and angle of the vector derived from VCG, location of a peak (Q peak) or zero crossing relative to a maximum peak (T peak) in an interval, and so on. The features used by an ML model may be manually or automatically selected. Features that are useful in providing an accurate output for an ML model are referred to as informative features. The assessment of which features are informative may be based on various feature selection techniques such as a predictive power score, a lasso regression, a mutual information analysis, and so on.
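
A few of the features listed above may be derived from a voltage-time series as in the following sketch. The threshold-based peak detection is a simplifying assumption made for illustration (clinical R-peak detection is considerably more involved), and the function names and dictionary keys are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def r_peak_indices(voltages: Sequence[float], threshold: float) -> List[int]:
    """Indices of local maxima at or above a threshold (assumed R-peak detector)."""
    return [i for i in range(1, len(voltages) - 1)
            if voltages[i] >= threshold
            and voltages[i] > voltages[i - 1]
            and voltages[i] >= voltages[i + 1]]


def extract_features(voltages: Sequence[float], sample_rate_hz: float,
                     threshold: float) -> Dict[str, object]:
    peaks = r_peak_indices(voltages, threshold)
    # R-R interval: time in seconds between consecutive detected peaks.
    rr_intervals = [(b - a) / sample_rate_hz for a, b in zip(peaks, peaks[1:])]
    return {
        "max": max(voltages),
        "min": min(voltages),
        "mean": mean(voltages),
        "variance": pvariance(voltages),
        "rr_intervals_s": rr_intervals,
    }


toy_ecg = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
features = extract_features(toy_ecg, sample_rate_hz=4.0, threshold=0.5)
```

The resulting values could be concatenated into a feature vector, and a feature selection technique could then be applied to determine which of them are informative.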

[0051] In some embodiments, the TDAV system may employ a diffusion ML model to generate additional training data using a generative process. (See, Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695), which is hereby incorporated by reference.) A diffusion ML model is a generative ML model that inputs noisy data and progressively denoises the data until the denoised data appears to be indistinguishable from real data such as an image of an ECG. A diffusion ML model is trained using a forward diffusion process that successively adds noise to input training data such as ECG images to generate noisy data and a reverse diffusion process that successively denoises the noisy data to generate denoised data that approximates the input training data. The training learns weights for the reverse diffusion process that tend to minimize the difference between the input training data and the denoised data. After a diffusion model is trained, the reverse diffusion process is employed to generate data that can be used to train a target ML model. To do so, randomly generated noisy data is input to the reverse diffusion process which denoises the noisy data to generate the denoised data that appears to be real data.

[0052] The forward diffusion process employs a Markov chain that incrementally adds Gaussian noise to the training data over a series of steps. This process transforms the training data from its initial distribution to a Gaussian distribution. The reverse diffusion process employs a neural network to incrementally approximate and remove the noise that was added at each step of the forward diffusion process. When generating data, randomly generated noisy data is input to the reverse diffusion process which incrementally removes the noise that was learned during training.
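
The forward diffusion process may be sketched in Python as follows. The sketch assumes that each step scales the data by the square root of one minus the noise level and adds Gaussian noise with variance equal to the noise level, which is a common parameterization in the diffusion literature and is not necessarily the one used here; the function name and the constant noise schedule are illustrative.

```python
import math
import random
from typing import List, Sequence


def forward_diffusion(x0: Sequence[float], noise_levels: Sequence[float],
                      rng: random.Random) -> List[List[float]]:
    """Return the Markov chain [x_0, x_1, ..., x_T] of increasingly noisy data."""
    trajectory = [list(x0)]
    x = list(x0)
    for level in noise_levels:  # one noise level per timestep, each in [0, 1]
        # Each step depends only on the previous step (Markov property).
        x = [math.sqrt(1.0 - level) * v + math.sqrt(level) * rng.gauss(0.0, 1.0)
             for v in x]
        trajectory.append(x)
    return trajectory


rng = random.Random(0)
samples = [5.0] * 2000                       # toy data far from a Gaussian
noisy = forward_diffusion(samples, [0.05] * 200, rng)[-1]
sample_mean = sum(noisy) / len(noisy)
sample_var = sum((v - sample_mean) ** 2 for v in noisy) / len(noisy)
```

After enough steps the sample mean approaches zero and the sample variance approaches one, consistent with the data being transformed toward a Gaussian distribution.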

[0053] The forward diffusion process systematically adds Gaussian noise to the original data x_0 over T timesteps, resulting in a sequence of increasingly noisy data x_1, x_2, ..., x_T. The process at each time step t may be represented by the equation:

where x_t is the data at timestep t, ε_t is Gaussian noise, α_t is the amount of noise added, and I is the identity matrix.

[0054] The reverse diffusion process learns the distribution of the training data by starting from noise and progressively denoising it over the timesteps. The training estimates the reverse of the forward diffusion process using an NN that may be represented by the equation:

where L

represent the cumulative noise and

represents the NN.

[0055] The goal of training a diffusion model is to minimize the difference between the original data and the data reconstructed by the reverse diffusion process using a loss function that may be represented by the equation.
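
The equations referred to in paragraphs [0053] through [0055] are not reproduced in this text. As a hedged sketch, one commonly used formulation from the denoising-diffusion literature that is consistent with the symbol descriptions in those paragraphs is given below. It assumes that α_t denotes the variance of the noise added at timestep t, and it introduces \bar{\alpha}_t (the cumulative noise), \epsilon_\theta (the NN), \sigma_t, and L as illustrative symbols; the original equations may use a different parameterization.

```latex
% Forward step (assumption: \alpha_t is the variance of the noise added at step t)
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\alpha_t}\,x_{t-1},\ \alpha_t I\right),
\qquad
x_t = \sqrt{1-\alpha_t}\,x_{t-1} + \sqrt{\alpha_t}\,\epsilon_t,
\quad \epsilon_t \sim \mathcal{N}(0, I)

% Cumulative noise after t steps and the resulting closed form
\bar{\alpha}_t = 1 - \prod_{s=1}^{t} (1-\alpha_s),
\qquad
x_t = \sqrt{1-\bar{\alpha}_t}\,x_0 + \sqrt{\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)

% Reverse step, with the NN \epsilon_\theta predicting the noise in x_t
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right),
\qquad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\alpha_t}}
  \left(x_t - \frac{\alpha_t}{\sqrt{\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)

% Training loss (noise-prediction form)
L = \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]
```

In this form the loss compares the added noise with the noise predicted by the NN, which, up to a timestep-dependent weighting, corresponds to comparing the original data with the data reconstructed by the reverse diffusion process.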

[0056] A diffusion model may also include a conditioning mechanism that allows for factoring in domain-specific information into the reverse diffusion process. The domain-specific information may be employed by a cross-attention mechanism of the NN (e.g., U-Net architecture) of the reverse diffusion process. The TDAV system may train the reverse diffusion process with domain-specific information that includes an ECG image and a source location. To generate ECG images, the TDAV system inputs a noisy image and a source location into the reverse diffusion process which generates an ECG image corresponding to that source location.

[0057] The features may also be latent vectors generated using an ML model such as an autoencoder. For example, an autoencoder may be trained using ECG images or an ECG voltage-time series. In such a case, when an ECG voltage-time series is input into the trained autoencoder, the latent vector that is generated is a feature vector that represents the ECG. That feature vector can be input into a target ML model such as an NN or SVM to generate an output. When training the target ML model, for example, to identify a source location, the training ECG voltage-time series are input to the autoencoder to generate training feature vectors that are labeled with source location. The target ML model is then trained using the labeled feature vectors. The autoencoder may be trained using the training ECG voltage-time series. Rather than pretraining an autoencoder, only the portion of the autoencoder that generates the latent vector may be trained in parallel with the target ML model using a combined loss function. In such a case, no autoencoding is performed. Rather, the latent vector represents features of an ECG voltage-time series that are particularly relevant to generating the output of the target ML model.

[0058] The following paragraphs describe various aspects of the TDAV system. Implementations of these systems may employ any combination or sub-combination of the aspects and may employ additional aspects. The processing of the aspects may be performed by one or more computing systems with one or more processors that execute computer-executable instructions that implement the aspects and that are stored on one or more computer-readable storage mediums.

[0059] In some aspects, the techniques described herein relate to a method performed by one or more computing systems for generating training data for training a machine learning model, the method including: generating simulated data sets that include simulated data and a simulated label, the simulated data and the simulated label of a simulated data set generated based on a simulation; accessing a collection of input data sets, each input data set having input data and an input label; and for each of a plurality of input data sets, identifying a simulated data set with simulated data that is similar to the input data of that input data set; and when the input label of that input data set and the simulated label of the identified simulated data set satisfy a valid training data criterion, indicating that that input data set represents valid training data. In some aspects, the techniques described herein relate to a method further including training the machine learning model using training data that includes the input data sets that represent valid training data. In some aspects, the techniques described herein relate to a method wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes an ablation pattern of an arrhythmia, wherein a simulation simulates electrical activity of a heart having a cardiac specification that includes an ablation pattern, and wherein each simulated cardiogram is generated based on simulated electrical activity that is simulated prior to including the ablation pattern in the simulation. 
In some aspects, the techniques described herein relate to a method wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes a target location of an ablation, wherein a simulation simulates electrical activity of a heart having a cardiac specification that includes a target location, and wherein each simulated cardiogram is generated based on the simulated electrical activity. In some aspects, the techniques described herein relate to a method wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes a cardiac characteristic that the simulation is based on, wherein a simulation simulates electrical activity of a heart having a cardiac specification that includes the cardiac characteristic, and wherein each simulated cardiogram is generated based on the simulated electrical activity. In some aspects, the techniques described herein relate to a method wherein a simulated data set includes simulated data that includes a simulated radar return signal and a simulated label that includes an object type, wherein a simulation simulates a radar received signal based on a radar transmitted signal and an object type. In some aspects, the techniques described herein relate to a method wherein a simulated data set includes simulated data that is a simulated cardiogram and a simulated label that is a source location of an arrhythmia, wherein a simulation simulates electrical activity of a heart having a cardiac specification that includes a source location, and wherein each simulated cardiogram is generated based on the simulated electrical activity. In some aspects, the techniques described herein relate to a method wherein the input data sets are derived from clinical data that includes a clinical cardiogram and a clinical source location of an arrhythmia. 
In some aspects, the techniques described herein relate to a method further including automatically refining a clinical cardiogram to represent more accurately an arrhythmia cardiogram. In some aspects, the techniques described herein relate to a method further including for each of a plurality of cardiac specifications, running the simulations to generate simulated electrical activity of a heart based on the source location of that cardiac specification. In some aspects, the techniques described herein relate to a method further including, for each simulation, generating a simulated cardiogram based on the simulated electrical activity. In some aspects, the techniques described herein relate to a method further including training the machine learning model using training data that includes the input data sets that represent valid training data.
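
The validation flow of this paragraph can be illustrated with a minimal sketch. It is not the claimed implementation: the `DataSet` type, the use of cosine similarity as the similarity measure, and the exact-match label criterion are all assumptions chosen for illustration, since the paragraph leaves the similarity measure and the valid training data criterion open.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Sequence


@dataclass(frozen=True)
class DataSet:
    """Hypothetical container for one data set (input or simulated)."""
    data: Sequence[float]  # e.g., a sampled signal or feature vector
    label: object          # e.g., a source location or an object type


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure; the text does not prescribe one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_valid(
    input_sets: Sequence[DataSet],
    simulated_sets: Sequence[DataSet],
    criterion: Callable[[object, object], bool],
) -> list:
    """Keep each input data set whose label satisfies the criterion
    against the label of its most similar simulated data set."""
    valid = []
    for inp in input_sets:
        # Identify the simulated data set most similar to this input data.
        best = max(simulated_sets,
                   key=lambda s: cosine_similarity(inp.data, s.data))
        # Apply the valid training data criterion to the two labels.
        if criterion(inp.label, best.label):
            valid.append(inp)
    return valid


# Toy data: two simulated data sets and two input data sets.
simulated = [DataSet((1.0, 0.0, 0.0), "A"), DataSet((0.0, 1.0, 0.0), "B")]
inputs = [DataSet((0.9, 0.1, 0.0), "A"), DataSet((0.1, 0.8, 0.1), "A")]

# Assumed criterion: the labels must match exactly.
valid = select_valid(inputs, simulated, criterion=lambda a, b: a == b)
```

In this toy run the second input data set is rejected because its data most resembles the simulated data set labeled "B" while its own label is "A".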

[0060] In some aspects, the techniques described herein relate to a method performed by one or more computing systems for augmenting training data for a machine learning model, the method including: accessing a plurality of cardiac specifications that each specify cardiac characteristics including a simulated source location of an arrhythmia; for each of the plurality of cardiac specifications, simulating electrical activity of a heart having that cardiac specification; and generating a simulated cardiogram based on the simulated electrical activity, the simulated cardiogram associated with the simulated source location of that cardiac specification; and for a plurality of clinical cardiograms that are each associated with a clinical source location, identifying a simulated cardiogram based on similarity to the clinical cardiogram; and when the simulated source location associated with the identified simulated cardiogram and that clinical source location satisfy a valid training data criterion, indicating that the clinical cardiogram and the associated clinical source location are valid training data. In some aspects, the techniques described herein relate to a method further including training the machine learning model using training data that includes the valid training data.
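
As a hedged illustration of the cardiogram case, the sketch below assumes that source locations are 3D coordinates in millimeters, that similarity is a Pearson correlation between equal-length cardiogram samples, and that the valid training data criterion is a distance threshold. The simulated library is hard-coded here rather than produced by a cardiac simulation, and the 10 mm threshold and all names are hypothetical.

```python
from math import dist, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length signals (assumed measure)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0


def is_valid_clinical(clinical_cardiogram, clinical_location,
                      simulated_library, max_distance_mm=10.0):
    """Return True when the clinical source location lies within
    max_distance_mm of the source location of the most similar
    simulated cardiogram (hypothetical criterion)."""
    _, simulated_location = max(
        simulated_library,
        key=lambda entry: pearson(clinical_cardiogram, entry[0]),
    )
    return dist(clinical_location, simulated_location) <= max_distance_mm


# Stand-in for simulation output: (simulated cardiogram, source location).
library = [
    ([0.0, 1.0, 0.0, -1.0], (10.0, 20.0, 30.0)),
    ([0.0, -1.0, 0.0, 1.0], (60.0, 20.0, 30.0)),
]
clinical = [0.1, 0.9, 0.0, -0.8]

# Same cardiogram, two candidate clinical labels.
near = is_valid_clinical(clinical, (12.0, 21.0, 30.0), library)
far = is_valid_clinical(clinical, (58.0, 20.0, 30.0), library)
```

Here the clinical cardiogram correlates best with the first library entry, so a clinical source location about 2 mm away is accepted and one roughly 48 mm away is rejected.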

[0061] In some aspects, the techniques described herein relate to one or more computing systems for generating training data to improve performance of a machine learning model, the one or more computing systems including: one or more computer-readable storage mediums that store a plurality of clinical cardiograms, each clinical cardiogram associated with a clinical source location of an arrhythmia; and computer-executable instructions for controlling the one or more computing systems to: for each of a plurality of clinical cardiograms, derive one or more derived clinical cardiograms based on that clinical cardiogram; and associate the clinical source location of that clinical cardiogram with each derived clinical cardiogram; and for one or more derived clinical cardiograms, identify a simulated cardiogram that is similar to that derived clinical cardiogram; and indicate that the derived clinical cardiogram and the associated clinical source location are valid training data based on similarity of the associated clinical source location to a simulated source location associated with the simulated cardiogram; and one or more processors for controlling the one or more computing systems to execute one or more of the computer-executable instructions. In some aspects, the techniques described herein relate to one or more computing systems wherein the computer-executable instructions further include instructions to run simulations of electrical activity of heart based on a cardiac specification that includes a simulated source location and generate simulated cardiograms from the simulated electrical activity. In some aspects, the techniques described herein relate to one or more computing systems wherein the computer-executable instructions further include instructions to train the machine learning model using training data that includes one or more derived clinical cardiograms and the associated source location that are indicated as valid training data.
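
This paragraph does not state how a derived clinical cardiogram is produced from a clinical cardiogram. The sketch below assumes one possibility, fixed-length sliding windows over the sampled signal, purely to show the derive-and-associate step; the function name, window parameters, and coordinate label are hypothetical.

```python
def derive_cardiograms(cardiogram, source_location, window, step):
    """Derive overlapping fixed-length segments from one clinical
    cardiogram and associate each with the original source location.
    Sliding windows are an assumption, not the described method."""
    derived = []
    for start in range(0, len(cardiogram) - window + 1, step):
        segment = cardiogram[start:start + window]
        # Each derived cardiogram inherits the clinical source location.
        derived.append((segment, source_location))
    return derived


clinical_cardiogram = [0, 1, 0, -1, 0, 1, 0, -1]
clinical_location = (10.0, 20.0, 30.0)  # hypothetical 3D coordinates

derived = derive_cardiograms(clinical_cardiogram, clinical_location,
                             window=4, step=2)
```

Each derived pair could then be checked against simulated cardiograms in the same way as an original clinical cardiogram, so that only derived cardiograms consistent with the simulations are indicated as valid training data.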

[0062] In some aspects, the techniques described herein relate to a method performed by one or more computing systems for generating training data for a machine learning model, the training data including a plurality of input data sets, each input data set having input data and an input label, the method including: for each input data set, identifying simulated data that is similar to the input data of that input data set; determining whether a simulated label associated with the identified simulated data and the input label of that input data set satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that input data set represents a valid training data set. In some aspects, the techniques described herein relate to a method further including training the machine learning model using training data that includes the valid training data sets. In some aspects, the techniques described herein relate to a method further including, for each of a plurality of simulations, running the simulation based on a simulated label and generating simulated data based on the simulation. In some aspects, the techniques described herein relate to a method further including training the machine learning model using training data that includes the valid training data sets. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to target data to identify a target label. In some aspects, the techniques described herein relate to a method wherein the input data and the simulated data are cardiograms and the input labels and the simulated labels are source locations of an arrhythmia. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to a patient cardiogram to identify a patient source location of an arrhythmia. 
In some aspects, the techniques described herein relate to a method wherein the input data and the simulated data relate to sensor data collected by an autonomous vehicle and the input labels and the simulated labels are vehicle control instructions. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to sensor data collected by an autonomous vehicle while moving to identify vehicle control instructions. In some aspects, the techniques described herein relate to a method wherein the input data and the simulated data relate to sensor data collected by a robotic device and the input labels and the simulated labels are robotic device control instructions. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to sensor data collected by a robotic device in real time to identify robotic device control instructions for that robotic device. In some aspects, the techniques described herein relate to a method wherein the input data and the simulated data relate to radar signal reflected off an object and the input labels and the simulated labels are object identifiers. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to signals reflected off an object to identify an object type of the object.

[0063] In some aspects, the techniques described herein relate to a method performed by one or more computing systems for training a machine learning model using training data including a plurality of training data sets, each training data set having training data and a training label, the method including: accessing a plurality of clinical data sets representing patient data, each clinical data set having clinical data and a clinical label; for each clinical data set, applying an algorithm to the clinical data of that clinical data set to generate an algorithm label for the clinical data; determining whether the clinical label of that clinical data set and the algorithm label satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that that clinical data set represents a valid training data set; and training the machine learning model using valid training data sets and simulated data sets. In some aspects, the techniques described herein relate to a method further including, for each of a plurality of simulations, running the simulation based on a simulated label and generating simulated data based on the simulation, the simulated data and the simulated label being a simulated data set. In some aspects, the techniques described herein relate to a method further including applying the trained machine learning model to target data to identify a target label. In some aspects, the techniques described herein relate to a method wherein multiple algorithms are applied to a clinical data of a clinical data set and wherein the valid training data criterion is based on the algorithm labels generated based on the multiple algorithms.
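
The algorithm-based variant, including the case where multiple algorithms are applied, might be sketched as follows. The three labeling functions are toy stand-ins rather than real clinical algorithms, and the majority-agreement rule is one assumed form of the valid training data criterion; the paragraph does not fix either.

```python
def is_valid(clinical_data, clinical_label, algorithms, min_agreement=0.5):
    """Return True when more than min_agreement of the algorithm labels
    match the clinical label (assumed majority criterion)."""
    algorithm_labels = [algorithm(clinical_data) for algorithm in algorithms]
    agreeing = sum(1 for label in algorithm_labels if label == clinical_label)
    return agreeing / len(algorithm_labels) > min_agreement


# Toy labeling algorithms (hypothetical).
def by_peak_sign(signal):
    return "left" if max(signal, key=abs) > 0 else "right"


def by_mean(signal):
    return "left" if sum(signal) / len(signal) > 0 else "right"


def by_first_sample(signal):
    return "left" if signal[0] > 0 else "right"


algorithms = [by_peak_sign, by_mean, by_first_sample]

# Clinical data sets as (clinical data, clinical label) pairs.
clinical_sets = [
    ([0.2, 0.9, 0.1], "left"),    # all three algorithms agree
    ([0.1, -0.9, -0.2], "left"),  # only one algorithm agrees
]
valid_sets = [(data, label) for data, label in clinical_sets
              if is_valid(data, label, algorithms)]

# Training data combines valid clinical sets with simulated sets.
simulated_sets = [([-0.3, -0.8, -0.1], "right")]
training_sets = valid_sets + simulated_sets
```

A stricter criterion, such as requiring unanimous agreement, would only change the comparison inside `is_valid`.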

[0064] In some aspects, the techniques described herein relate to a method for treating a patient having a medical condition, the method including: performing by one or more computing systems training of a machine learning model using training data including a plurality of training data sets, each training data set having training data and a training label, the training including: accessing a plurality of clinical data sets representing patient data, each clinical data set having clinical data and a clinical label; for each clinical data set, applying an algorithm to the clinical data of that clinical data set to generate an algorithm label for the clinical data; determining whether the clinical label of that clinical data set and the algorithm label satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that that clinical data set represents a valid training data set; and training the machine learning model using valid training data sets and simulated data sets; accessing a patient data of the patient; and applying the trained machine learning model to the patient data to generate a patient label; and treating the patient for the medical condition factoring in the patient label. In some aspects, the techniques described herein relate to a method wherein the medical condition is an arrhythmia, the patient data includes a patient cardiogram collected from the patient, the patient label is based on a source location of an arrhythmia, and the treating is an ablation procedure targeting the source location of the arrhythmia. In some aspects, the techniques described herein relate to a method wherein the medical condition is an arrhythmia, the patient data includes a patient cardiogram collected from the patient, the patient label is based on a source location of an arrhythmia, and the treating is a medication.

[0065] Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method performed by one or more computing systems for generating training data for training a machine learning model, the method comprising: generating simulated data sets that include simulated data and a simulated label, the simulated data and the simulated label of a simulated data set generated based on a simulation; accessing a collection of input data sets, each input data set having input data and an input label; and for each of a plurality of input data sets, identifying a simulated data set with simulated data that is similar to the input data of that input data set; and when the input label of that input data set and the simulated label of the identified simulated data set satisfy a valid training data criterion, indicating that that input data set represent valid training data.

2. The method of claim 1 further comprising training the machine learning model using training data that includes the input data sets that represent valid training data.

3. The method of claim 1 wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes an ablation pattern of an arrhythmia, wherein a simulation simulates electrical activity of heart having a cardiac specification that includes an ablation pattern, and wherein each simulated cardiogram is generated based on simulated electrical activity that is simulated prior to including the ablation pattern in the simulation.

4. The method of claim 1 wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes a target location of an ablation, wherein a simulation simulates electrical activity of heart having a cardiac specification that includes a target location, and wherein each simulated cardiogram is generated based on the simulated electrical activity.

5. The method of claim 1 wherein a simulated data set includes simulated data that includes a simulated cardiogram and a simulated label that includes a cardiac characteristic that the simulation is based on, wherein a simulation simulates electrical activity of heart having a cardiac specification that includes the cardiac characteristic, and wherein each simulated cardiogram is generated based on the simulated electrical activity.

6. The method of claim 1 wherein a simulated data set includes simulated data that includes a simulated radar return signal and a simulated label that includes an object type, wherein a simulation simulates a radar received signal based on a radar transmitted signal and an object type.

7. The method of claim 1 wherein a simulated data set includes simulated data that is a simulated cardiogram and a simulated label that is a source location of an arrhythmia, wherein a simulation simulates electrical activity of heart having a cardiac specification that includes a source location, and wherein each simulated cardiogram is generated based on the simulated electrical activity.

8. The method of claim 7 further wherein the input data sets are derived from clinical data that includes a clinical cardiogram and a clinical source location of an arrhythmia.

9. The method of claim 8 further comprising automatically refining a clinical cardiogram to represent more accurately an arrhythmia cardiogram.

10. The method of claim 7 further comprising for each of a plurality of cardiac specifications, running the simulations to generate simulated electrical activity of a heart based on the source location of that cardiac specification.

11. The method of claim 10 further comprising, for each simulation, generating a simulated cardiogram based on the simulated electrical activity.

12. The method of claim 7 further comprising training the machine learning model using training data that includes the input data sets that represent valid training data.

13. A method performed by one or more computing systems for augmenting training data for a machine learning model, the method comprising: accessing a plurality of cardiac specifications that each specify cardiac characteristics including a simulated source location of an arrhythmia; for each of the plurality of cardiac specifications, simulating electrical activity of a heart having that cardiac specification; and generating a simulated cardiogram based on the simulated electrical activity, the simulated cardiogram associated with the simulated source location of that cardiac specification; and for a plurality of clinical cardiograms that are each associated with a clinical source location, identifying a simulated cardiogram based on similarity to the clinical cardiogram; and when the simulated source location associated with the identified simulated cardiogram and that clinical source location satisfy a valid training data criterion, indicating that the clinical cardiogram and the associated clinical source location are valid training data.

14. The method of claim 13 further comprising training the machine learning model using training data that includes the valid training data.

15. One or more computing systems for generating training data to improve performance of a machine learning model, the one or more computing systems comprising: one or more computer-readable storage mediums that stores a plurality of clinical cardiograms, each clinical cardiogram associated with a clinical source location of an arrhythmia; and computer-executable instructions for controlling the one or more computing systems to: for each of a plurality of clinical cardiograms, derive one or more derived clinical cardiograms based on that clinical cardiogram; and associate the clinical source location of that clinical cardiogram with each derived clinical cardiogram; and for one or more derived clinical cardiograms, identify a simulated cardiogram that is similar to that derived clinical cardiogram; and indicate that the derived clinical cardiogram and the associated clinical source location are valid training data based on similarity of the associated clinical source location to a simulated source location associated with the simulated cardiogram; and one or more processors for controlling the one or more computing systems to execute one or more of the computer-executable instructions.

16. The one or more computing systems of claim 15 wherein the computer-executable instructions further include instructions to run simulations of electrical activity of heart based on a cardiac specification that includes a simulated source location and generate simulated cardiograms from the simulated electrical activity.

17. The one or more computing systems of claim 15 wherein the computer-executable instructions further include instructions to train the machine learning model using training data that includes one or more derived clinical cardiogram and the associated source location that are indicated as valid training data.

18. A method performed by one or more computing systems for generating training data for a machine learning model, the training data comprising a plurality of input data sets, each input data set having input data and an input label, the method comprising: for each input data set, identifying simulated data that is similar to the input data of that input data set; determining whether a simulated label associated with the identified simulated data and the input label of that input data set satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that input data set represents a valid training data set.

19. The method of claim 18 further comprising training the machine learning model using training data that includes the valid training data sets.

20. The method of claim 18 further comprising, for each of a plurality of simulations, running the simulation based on a simulated label and generating simulated data based on the simulation.

21. The method of claim 20 further comprising training the machine learning model using training data that includes the valid training data sets.

22. The method of claim 21 further comprising applying the trained machine learning model to target data to identify a target label.

23. The method of claim 18 wherein the input data and the simulated data are cardiograms and the input labels and the simulated labels are source locations of an arrhythmia.

24. The method of claim 23 further comprising applying the trained machine learning model to a patient cardiogram to identify a patient source location of an arrhythmia.

25. The method of claim 18 wherein the input data and the simulated data relate to sensor data collected by an autonomous vehicle and the input labels and the simulated labels are vehicle control instructions.

26. The method of claim 25 further comprising applying the trained machine learning model to sensor data collected by an autonomous vehicle while moving to identify vehicle control instructions.

27. The method of claim 18 wherein the input data and the simulated data relate to sensor data collected by a robotic device and the input labels and the simulated labels are robotic device control instructions.

28. The method of claim 27 further comprising applying the trained machine learning model to sensor data collected by a robotic device in real time to identify robotic device control instructions for that robotic device.

29. The method of claim 18 wherein the input data and the simulated data relate to radar signal reflected off an object and the input labels and the simulated labels are object identifiers.

30. The method of claim 29 further comprising applying the trained machine learning model to signals reflected off an object to identify an object type of the object.

31. A method performed by one or more computing systems for training a machine learning model using training data comprising a plurality of training data sets, each training data set having training data and a training label, the method comprising: accessing a plurality of clinical data sets representing patient data, each clinical data set having clinical data and a clinical label; for each clinical data set, applying an algorithm to the clinical data of that clinical data set to generate an algorithm label for the clinical data; determining whether the clinical label of that clinical data set and the algorithm label satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that that clinical data set represents a valid training data set; and training the machine learning model using valid training data sets and simulated data sets.

32. The method of claim 31 further comprising, for each of a plurality of simulations, running the simulation based on a simulated label and generating simulated data based on the simulation, the simulated data and the simulated label being a simulated data set.

33. The method of claim 31 further comprising applying the trained machine learning model to target data to identify a target label.

34. The method of claim 31 wherein multiple algorithms are applied to a clinical data of a clinical data set and wherein the valid training data criterion is based on the algorithm labels generated based on the multiple algorithms.

35. A method for treating a patient having a medical condition, the method comprising: performing by one or more computing systems training of a machine learning model using training data comprising a plurality of training data sets, each training data set having training data and a training label, the training comprising: accessing a plurality of clinical data sets representing patient data, each clinical data set having clinical data and a clinical label; for each clinical data set, applying an algorithm to the clinical data of that clinical data set to generate an algorithm label for the clinical data; determining whether the clinical label of that clinical data set and the algorithm label satisfy a valid training data criterion; and when the valid training data criterion is satisfied, designating that that clinical data set represents a valid training data set; and training the machine learning model using valid training data sets and simulated data sets; accessing a patient data of the patient; and applying the trained machine learning model to the patient data to generate a patient label; and treating the patient for the medical condition factoring in the patient label.

36. The method of claim 35 wherein the medical condition is an arrhythmia, the patient data includes a patient cardiogram collected from the patient, the patient label is based on a source location of an arrhythmia, and the treating is an ablation procedure targeting the source location of the arrhythmia.

37. The method of claim 36 wherein the medical condition is an arrhythmia, the patient data includes a patient cardiogram collected from the patient, the patient label is based on a source location of an arrhythmia, and the treating is a medication.