WO2025229197A1 - Device and method for generating a numerical twin - Google Patents
Device and method for generating a numerical twin
- Publication number
- WO2025229197A1 (PCT/EP2025/062101)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ontology
- representation
- numerical
- twin
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Definitions
- the present invention relates to the technical field of computational modeling and process simulation.
- the invention relates to a device and method for generating at least one numerical twin.
- the invention further relates to a device for modeling a process using a numerical twin generated using the device for generating at least one numerical twin.
- a numerical twin refers to a mathematical or computational model configured to emulate the behavior, characteristics and response of a given process.
- a numerical twin serves as a digital counterpart that mirrors the complexities and dynamics of its real-world counterpart, enabling users to gain a deeper insight, conduct predictive analysis by simulating future states based on current and previous data, anticipate trends, identify potential risks, and devise proactive strategies to optimize the process.
- numerical twins offer a cost-effective and time-efficient alternative to real-world experimentation, allowing for virtual testing and scenario analysis without the need for costly prototypes or trials.
- Numerical twins find applications across a diverse range of industries and domains, including but not limited to manufacturing, healthcare, transportation and logistics, finance and economy, energy, environmental monitoring, aerospace and defense.
- This invention thus relates to a device for generating at least one numerical twin configured to model a process based on at least one request expressed by a user, said request describing said process, said device comprising: at least one input configured to receive: o said at least one request, o a user database and at least one entity associated with said user database, said user database comprising at least one instance, o an ontology ensemble comprising at least one ontology of reference; o an ensemble of models comprising at least one architecture of machine learning predictive model, at least one processor configured to: o build a training dataset by:
- ⁇ generating a representation of the defined ontology (e.g., in a widely accepted format such as OWL, Turtle/TriG, etc.), said representation of the defined ontology comprising at least one feature (optionally, further comprising at least one NameSpace, at least one Class, at least one attribute), each feature corresponding to an entity of said defined ontology,
- ⁇ defining the obtained representation of the defined ontology as the training dataset, o choose one architecture of machine learning predictive model from the ensemble of models, o obtain optimal hyperparameters for said chosen architecture, o train said chosen architecture using said optimal hyperparameters and said training dataset so as to obtain said numerical twin, o generate statistical quality control metadata relative to said numerical twin, at least one output configured to output said numerical twin and said statistical quality control metadata.
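The generation steps enumerated above (build a training dataset from the ontology representation, choose an architecture, obtain hyperparameters, train, emit quality-control metadata) can be sketched as a minimal pipeline. All names (`build_training_dataset`, `generate_twin`) and the completeness metric are illustrative assumptions, not taken from the application; tuning and training are deliberately stubbed out.

```python
def build_training_dataset(ontology):
    """Flatten an ontology {entity: [instances]} into rows, one column per entity."""
    entities = sorted(ontology)
    n_rows = max(len(instances) for instances in ontology.values())
    rows = [[ontology[e][i] if i < len(ontology[e]) else None for e in entities]
            for i in range(n_rows)]
    return entities, rows

def generate_twin(ontology, architectures):
    """End-to-end sketch of the claimed generation steps (hypothetical names)."""
    # 1. build the training dataset from the ontology representation
    entities, rows = build_training_dataset(ontology)
    # 2. choose an architecture from the ensemble of models (placeholder choice)
    architecture = architectures[0]
    # 3./4. hyperparameter tuning and training are stubbed out in this sketch
    twin = {"architecture": architecture, "columns": entities, "n_samples": len(rows)}
    # 5. statistical quality-control metadata (here: a simple completeness ratio)
    filled = sum(value is not None for row in rows for value in row)
    metadata = {"completeness": filled / (len(rows) * len(entities))}
    return twin, metadata
```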
- the invention relates to a numerical twin factory wherein the numerical twins are generated by training a machine learning predictive model with an ontology or an adapted representation (i.e., adapted to the chosen framework and/or chosen architecture of machine learning predictive model) of the ontology as training dataset.
- a tabular representation of the defined ontology may be generated, the tabular representation comprising at least one column corresponding to one entity of the defined ontology, the at least one instance being entered into the generated tabular representation.
- Tabular representations are particularly adapted for frameworks such as "TensorFlow" or "PyTorch" that process "tensors" (i.e., n*p flat matrices).
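As an illustration of why tabular representations suit tensor-based frameworks, the sketch below encodes a tabular ontology representation as an n*p numeric matrix; the integer-coding scheme and function name are assumptions, and a real pipeline would hand the result to `torch.tensor` or `tf.convert_to_tensor`.

```python
def table_to_matrix(rows, columns):
    """Encode a tabular ontology representation as an n*p numeric matrix.

    Categorical instances are mapped to integer codes column by column;
    numeric instances pass through as floats."""
    codebooks = {column: {} for column in columns}
    matrix = []
    for row in rows:
        encoded = []
        for column in columns:
            value = row[column]
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                encoded.append(float(value))
            else:
                # assign the next free integer code to unseen categories
                code = codebooks[column].setdefault(value, len(codebooks[column]))
                encoded.append(float(code))
        matrix.append(encoded)
    return matrix, codebooks
```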
- an ontology refers to a multidimensional set of entities (e.g. descriptor headers of a data field), wherein the entities are connected through “relations”. These relations can be mathematical or conceptual. According to one embodiment, the ontology may be represented as a graph.
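The graph view of an ontology described above can be captured minimally as entities (nodes) connected by named relations (edges). This class and its method names are illustrative only, not part of the application.

```python
class Ontology:
    """Entities connected by named relations; equivalent to a directed graph."""

    def __init__(self):
        self.entities = set()
        self.relations = []  # (source, relation_name, target) triples

    def add_relation(self, source, relation, target):
        """Connect two entities through a mathematical or conceptual relation."""
        self.entities.update((source, target))
        self.relations.append((source, relation, target))

    def neighbors(self, entity):
        """Entities directly reachable from `entity` along any relation."""
        return {target for source, _, target in self.relations if source == entity}
```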
- Ontologies are particularly advantageous for generating numerical twins.
- This structured approach not only facilitates the modeling of complex systems, but also serves directly as a training dataset for machine learning models.
- an ontology — or a representation derived from it, such as a tabular or tensor-compatible format adapted to the chosen framework and/or model architecture — can be used as input for training the numerical twin.
- the ontology can also inform the automatic selection of the most suitable model architecture by providing semantic insight into the type and structure of the data to be modeled.
- Ontologies provide a perspective onto the dataset that translates into “meaning” during the machine learning phase. This semantic enrichment enhances the performance of artificial intelligence algorithms by accelerating convergence with fewer data points and improving overall model reliability. Additionally, ontologies offer a rich structural foundation that captures complex relationships and embeds domain knowledge into the training process, ultimately contributing to the development of more interpretable and context-aware machine learning models.
- the device of the invention offers the possibility to design, develop, and implement a user's own numerical twin, tailored to their specific request.
- the device of the invention allows for a streamlined and cost-effective process in creating numerical twins.
- by automating various stages of twin generation, it significantly reduces the time and resources required for development.
- traceability and auditability features ensure compliance with stringent regulations, providing a robust framework for regulatory approval.
- the device ensures the longevity and reliability of digital twins over time, facilitating their certification by regulatory authorities.
- the device comprises one or more of the features described in the following embodiments, taken alone or in any possible combination.
- said at least one request is collected using a natural language processing interface.
- Collecting user requests through a natural language processing interface provides a more natural and intuitive way for users to interact with the device of the invention, especially users with no technical or computational background, leading to higher user engagement and satisfaction. Additionally, it allows for the collection of detailed and context-rich information from users, enabling a better understanding of their needs and preferences. This rich data can then be used to tailor the numerical twin more effectively.
- said training dataset further comprises the defined ontology.
- said at least one input is further configured to receive an enrichment database, said enrichment database being associated with at least one additional entity, said at least one processor being further configured to define said ontology using said additional entity associated with said enrichment database.
- the data provided by the user may not be comprehensive enough to construct a sufficiently detailed ontology and corresponding representation for modeling the intended process.
- while the initial data from the user serves as a valuable starting point, it may lack the depth necessary to capture all entities and relationships relevant to accurately modeling the process. Therefore, the additional data enrichment step makes it possible to enhance the ontology and ensure its effectiveness in representing the process accurately.
- This enrichment process involves supplementing the user-provided data with additional entities obtained from various other sources or databases. By integrating a broader range of data, the ontology can better capture the complexities and nuances of the process, leading to more robust and informative numerical twins.
- said representation of the defined ontology is sparse and wherein said at least one processor is further configured to complete said sparse representation of the defined ontology by generating instances using a pretrained filling model, said pretrained filling model being trained on a filling training dataset comprising the instances from the sparse representation of the defined ontology.
- the filling model may include at least one of a generative approach (e.g., GANs, Variational Autoencoders) or a predictive approach (e.g., Gradient Boosted Decision Trees such as XGBoost or CatBoost) capable of inferring missing instances or generating synthetic instances.
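The filling model described above could be a GBDT or a generative model; as a deliberately simplified stand-in, the sketch below "trains" a per-column most-frequent-value imputer on the observed instances and uses it to complete a sparse representation, showing where a real model such as XGBoost would plug in. Function names are hypothetical.

```python
from collections import Counter

def fit_filling_model(rows, columns):
    """Fit a trivial filling model: most frequent observed instance per column.

    A production system would train e.g. XGBoost or a variational autoencoder
    on the observed instances instead of this frequency heuristic."""
    model = {}
    for column in columns:
        observed = [row[column] for row in rows if row[column] is not None]
        model[column] = Counter(observed).most_common(1)[0][0] if observed else None
    return model

def fill(rows, model):
    """Complete a sparse representation using the fitted filling model."""
    return [{column: (value if value is not None else model[column])
             for column, value in row.items()}
            for row in rows]
```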
- the data provided by the user may prove insufficient to construct a comprehensive ontology due to its sparsity and lack of instances.
- additional steps are required to complete the ontology and corresponding representation.
- This enrichment process may involve generating synthetic data to complete the user dataset or integrating data from external sources. By incorporating additional information such as simulated or real-world data, the ontology can achieve greater completeness and accuracy, thereby enhancing its utility for modeling and analysis purposes.
- said at least one processor is further configured to augment said representation of the defined ontology.
- said augmentation is performed using a model selected from a family of machine learning architectures configured for data generation or representation learning, the architecture comprising at least one of a Generative Adversarial Network (GAN), a Tabular Variational Auto-Encoder (TVAE), a Boltzmann Machine, or a Support Vector Machine (SVM).
- the model comprises a modular or chimeric architecture assembled by combining subcomponents from different root models, such as integrating multi-head attention mechanisms from Transformer architectures with sequential processing layers derived from convolutional or recurrent neural networks.
- Data augmentation involves the process of enhancing the existing dataset by creating synthetic instances or incorporating instances from external sources. By expanding the dataset in this manner, the ontology can ensure a more robust representation of the underlying process. This augmentation facilitates more accurate modeling and analysis, thereby enhancing the overall effectiveness and reliability of the generated numerical twin.
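As a simplified stand-in for GAN- or TVAE-based augmentation, the sketch below creates synthetic instances by perturbing numeric fields of sampled rows with Gaussian noise; a real generative model would learn the joint distribution instead of applying local jitter. All names and parameters are assumptions.

```python
import random

def augment(rows, n_synthetic, noise_scale=0.05, seed=0):
    """Create synthetic instances by perturbing numeric fields of sampled rows.

    A GAN or TVAE would learn the joint distribution; this local jitter is
    only a stand-in to show where augmentation happens in the pipeline."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(rows)
        new_row = {}
        for key, value in base.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                # jitter numeric instances proportionally to their magnitude
                new_row[key] = value + rng.gauss(0, noise_scale * abs(value))
            else:
                new_row[key] = value  # categorical instances copied unchanged
        synthetic.append(new_row)
    return rows + synthetic
```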
- the architecture of the machine learning predictive model and associated hyperparameters used to obtain the numerical twin can either be chosen by the user or be automatically chosen based on several parameters, such as the nature of the request and associated process to be modeled, the content of the user database and the architecture of the defined ontology.
- said at least one input is configured to receive: a selection, by a user, of one architecture of machine learning predictive model from the ensemble of models, and a selection, by a user, of hyperparameters for said selected architecture, said at least one processor being configured to define the chosen architecture as the architecture selected by said user and to define the optimal hyperparameters as the hyperparameters selected by said user.
- said at least one processor is further configured to choose said architecture of machine learning predictive model from the ensemble of models based on said user database, said defined ontology and said request.
- said optimal hyperparameters are obtained using an auto-tuning model configured to receive as input a sub-part of the representation of the defined ontology.
- Utilizing an auto-tuning model on a subset of the training dataset streamlines the hyperparameter tuning process, saving time and computational resources by focusing on a smaller subset rather than the entire training dataset. This targeted approach allows for more efficient exploration of hyperparameter configurations, leading to improved model performance. Additionally, by iteratively adjusting hyperparameters based on performance metrics derived from the subset, the auto-tuning model can quickly converge towards optimal settings, enhancing the overall effectiveness and accuracy of the machine learning model.
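The auto-tuning step can be illustrated with a generic random search scored on a sub-part of the dataset; the GAN-based auto-tuner of the application is replaced here by this simpler loop, and all names (`auto_tune`, `train_fn`) are hypothetical.

```python
import random

def auto_tune(train_fn, search_space, dataset, subset_fraction=0.2,
              n_trials=20, seed=0):
    """Random hyperparameter search scored on a sub-part of the dataset.

    `train_fn(params, data)` must return a validation loss (lower is better).
    Searching on a subset keeps each trial cheap, as described above."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * subset_fraction))
    subset = rng.sample(dataset, k)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # draw one candidate configuration from the search space
        params = {name: rng.choice(values) for name, values in search_space.items()}
        loss = train_fn(params, subset)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```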
- said auto-tuning model is a Generative Adversarial Network (GAN).
- said at least one processor is further configured to anonymize the instances comprised in the representation of the defined ontology.
- Anonymizing the instances in the representation of the defined ontology enhances data privacy and security while ensuring compliance with regulations. By masking sensitive information, such as personally identifiable data, the risk of data breaches and unauthorized access is reduced. This approach not only safeguards individuals' privacy but also fosters trust and transparency in data handling practices. Moreover, anonymized datasets enable responsible data sharing and collaboration, facilitating innovation while respecting data subjects' rights. Beyond data privacy, it also addresses differential privacy/confidentiality: the goal is not only to protect data privacy through Personally Identifiable Information (PII) protection, but also to ensure that the extraction of valuable/meaningful insights from these data still protects privacy (i.e., that a model, under certain circumstances and specific prompting, will not disclose portions of its training dataset).
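A minimal sketch of the anonymization step, assuming keyed hashing (HMAC-SHA256) as the pseudonymization technique; this choice and the function names are assumptions, not the application's method. Note this is pseudonymization only: the differential-privacy guarantee discussed above would require additional mechanisms not shown here.

```python
import hashlib
import hmac

def anonymize(rows, pii_columns, secret_key=b"example-key"):
    """Replace PII values with keyed-hash pseudonyms (HMAC-SHA256).

    Equal inputs map to equal pseudonyms, preserving joinability across the
    representation without exposing the original values. This is
    pseudonymization; a differential-privacy guarantee would additionally
    bound what a trained model can leak about its training data."""
    def pseudonym(value):
        return hmac.new(secret_key, str(value).encode(), hashlib.sha256).hexdigest()[:12]
    return [{column: (pseudonym(value) if column in pii_columns else value)
             for column, value in row.items()}
            for row in rows]
```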
- the present invention further relates to a device for modeling a process using a numerical twin generated using the device described above, said device comprising: at least one input configured to receive at least one input variable of the numerical twin, at least one processor configured to feed said at least one input variable to said numerical twin so as to generate at least one output variable of the numerical twin, at least one output configured to provide said at least one output variable.
- the device comprises one or more of the features described in the following embodiments, taken alone or in any possible combination.
- said at least one input variable of the numerical twin is obtained using a user interface generated by the at least one processor from the device for generating at least one numerical twin.
- Integrating a user interface with the numerical twin simplifies the process of inputting variables and observing the numerical twin's responses, allowing users to quickly test different scenarios and understand the process’s behavior in various conditions.
- the at least one numerical twin is configured to model a physical process.
- Such physical process may be a biological process such as organs functioning, metabolic pathways, or immune responses.
- the at least one numerical twin is configured to model consumer behavior.
- the at least one numerical twin is configured to model consumer behavior in response to a marketing campaign.
- the at least one numerical twin is configured to model traffic patterns in urban areas to optimize transportation routes.
- the at least one numerical twin is configured to model weather patterns and simulate potential climate change scenarios.
- the at least one numerical twin is configured to model the spread of infectious diseases and assess the effectiveness of various containment measures.
- the at least one numerical twin is configured to model the structural integrity of buildings and simulate potential seismic events.
- the at least one numerical twin is configured to model the interactions between genes and environmental factors to predict disease susceptibility.
- said numerical twin is configured to model synthetic populations.
- the present invention further relates to a computer implemented method for generating at least one numerical twin configured to model a process based on at least one request expressed by a user, said request describing said process, said method comprising: receiving said at least one request, receiving a user database and at least one entity associated with said user database, said user database comprising at least one instance, receiving an ontology ensemble comprising at least one ontology of reference; receiving an ensemble of models comprising at least one architecture of machine learning predictive model, building a training dataset by: o extracting at least one entity associated with said process based on said at least one request received as input, o defining an ontology comprising at least two entities connected by a relation using said at least one ontology of reference, said ontology being defined using:
- the present invention further relates to a computer implemented method for modeling a process using a numerical twin generated using the method for generating at least one numerical twin of the present invention, said method for modeling comprising:
- the disclosure relates to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the methods for generating and for modeling compliant with any of the above execution modes.
- the present disclosure further pertains to a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the methods for generating and for modeling compliant with any of the above execution modes.
- Such a non-transitory program storage device can be, without limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples, is merely an illustrative and not exhaustive listing as readily appreciated by one of ordinary skill in the art: a portable computer diskette, a hard disk, a ROM, an EPROM (Erasable Programmable ROM) or a Flash memory, a portable CD-ROM (Compact-Disc ROM).
- processor should not be construed to be restricted to hardware capable of executing software, and refers in a general way to a processing device, which can for example include a computer, a microprocessor, an integrated circuit, or a programmable logic device (PLD).
- the processor may also encompass one or more Graphics Processing Units (GPU), whether exploited for computer graphics and image processing or other functions.
- the instructions and/or data enabling to perform associated and/or resulting functionalities may be stored on any processor-readable medium such as, e.g., an integrated circuit, a hard disk, a CD (Compact Disc), an optical disc such as a DVD (Digital Versatile Disc), a RAM (Random-Access Memory) or a ROM (Read-Only Memory). Instructions may be notably stored in hardware, software, firmware or in any combination thereof.
- a “hyperparameter” presently means a parameter used to carry out an upstream control of a model construction, such as a remembering-forgetting balance in sample selection or a width of a time window, by contrast with a parameter of a model itself, which depends on specific situations. In ML applications, hyperparameters are used to control the learning process.
- Datasets are collections of data used to build an ML mathematical model, so as to make data-driven predictions or decisions.
- supervised learning i.e. inferring functions from known input-output examples in the form of labelled training data
- three types of ML datasets are typically dedicated to three respective kinds of tasks: “training”, i.e. fitting the parameters, “validation”, i.e. tuning ML hyperparameters (which are parameters used to control the learning process), and “testing”, i.e. checking independently of a training dataset exploited for building a mathematical model that the latter model provides satisfying results.
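The three dataset roles described above can be illustrated with a simple deterministic split; the 70/15/15 proportions and the function name are arbitrary examples.

```python
import random

def split_dataset(data, train=0.7, validation=0.15, seed=0):
    """Shuffle and partition data into training / validation / testing subsets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],                 # fit model parameters
            shuffled[n_train:n_train + n_val],  # tune hyperparameters
            shuffled[n_train + n_val:])         # independent final check
```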
- a “neural network (NN)” designates a category of ML comprising nodes (called “neurons”), and connections between neurons modeled by “weights”. For each neuron, an output is given in function of an input or a set of inputs by an “activation function”. Neurons are generally organized into multiple “layers”, so that neurons of one layer connect only to neurons of the immediately preceding and immediately following layers.
- Figure 1 is a block diagram representing schematically a particular mode of a device for generating at least one numerical twin, compliant with the present disclosure
- Figure 2 is a flow chart showing successive steps executed with the device for generating at least one numerical twin of figure 1;
- Figure 3 is a block diagram representing schematically a particular mode of a device for modeling a process using a numerical twin generated using the device of figure 1;
- Figure 4 is a flow chart showing successive steps executed with the device for modeling a process of figure 3;
- Figure 5 is an ontology according to one embodiment of the invention.
- Figure 6 is a sparse tabular representation corresponding to the ontology of figure 5,
- Figure 7 is the filled tabular representation from figure 6, and
- Figure 8 is the augmented tabular representation from figure 7.
- the functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
- the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.
- the device 1 is adapted to produce at least one numerical twin 30 (e.g. a computational model configured to mirror a process such as for instance any process that may be observed in the real-world).
- an ontology and associated representation is constructed based on an input request 21 formulated by a user, describing the process to be modeled, and a user database 22 comprising data collected by said user.
- the constructed ontology and associated representation may be used to train a chosen machine learning predictive model architecture so as to obtain the numerical twin 30.
- the device 1 for generating the numerical twin 30 is associated with a device 2 for modeling the process using the generated numerical twin 30, represented on Figure 3, which will be subsequently described.
- Each of the devices 1 and 2 is advantageously an apparatus, or a physical part of an apparatus, designed, configured and/or adapted for performing the mentioned functions and produce the mentioned effects or results.
- any of the device 1 and the device 2 is embodied as a set of apparatus or physical parts of apparatus, whether grouped in a same machine or in different, possibly remote, machines.
- the device 1 and/or the device 2 may e.g. have functions distributed over a cloud infrastructure and be available to users as a cloud-based service, or have remote functions accessible through an API.
- the device 1 and the device 2 may be integrated in a same apparatus or set of apparatuses, and intended for the same users.
- the structure of device 2 may be completely independent of the structure of device 1, and may be provided for other users.
- modules are to be understood as functional entities rather than material, physically distinct, components. They can consequently be embodied either as grouped together in a same tangible and concrete component, or distributed into several such components. Also, each of those modules is possibly itself shared between at least two physical components. In addition, the modules are implemented in hardware, software, firmware, or any mixed form thereof as well. They are preferably embodied within at least one processor of the device 1 or of the device 2.
- the device 1 may comprise a module 11 for receiving the user request 21, the user database 22, an ontology ensemble 23 and an ensemble of models 20 and optionally an enrichment database 24, from one or more local or remote database(s) 10.
- the latter can take the form of storage resources available from any kind of appropriate storage means, which can be notably a RAM or an EEPROM (Electrically-Erasable Programmable Read-Only Memory) such as a Flash memory, possibly within an SSD (Solid-State Disk).
- the user request 21 may comprise elements describing the process to be modeled gathered through a natural language processing interface (e.g. chat).
- the user may describe the process using free text, providing a narrative description of the process to be modeled, along with, for instance, its inputs, outputs, main components, and their interactions.
- This free-text description allows the user to convey nuanced details, specific requirements, and contextual information relevant to the modeling task.
- the user may be asked questions such as: “What is the statistical issue to be addressed?”, “Is it a multivariable optimization?”, “Is it a clustering task?”, “Is it a regression task?”, “Is it a classification task?”.
- the questions may be asked to the user through a “scripted/guided” interview.
- the questions may be closed questions (yes/no answers) or open questions.
- the user database 22 may comprise data relating to the process to be modeled that the user wants to include in the numerical twin 30.
- the user database 22 may comprise structured and unstructured data from which entities (e.g. objects, concepts, or items of interest within a dataset or system that help conceptualize the process under consideration, such as for example genes, proteins, chemical compounds, cities, countries, landmarks, customers, suppliers, orders, invoices, etc.) and instances (e.g. specific occurrences or examples of an entity, such as for example a supplier name, an invoice number, a gene sequence) may be directly or indirectly derived.
- the user database 22 may be associated with multiple entities, such as for example at least two entities.
- the user database 22 does not necessarily refer to data in a database format (e.g. SQL or DDL), but to any raw data in any format that the user may see fit (e.g. .txt files, .csv files, .json files, etc.) that inherently comprises entities and instances.
- Module 11 may then be configured to read and structure the data as required to construct the ontology. For instance, module 11 may be configured to reconcile the user database 22 structure with a pre-built ontology (e.g., from the ontology ensemble 23), using an algorithm based on semantic proximity.
- Databases such as the user database 22 are often represented as tables comprising at least one column and at least one row, each column having a descriptor corresponding to an entity and each row of a given column representing a specific instance of the entity.
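Reading such a table from a user-supplied CSV, where each column descriptor becomes an entity and each cell an instance, might look like the following sketch (function name assumed):

```python
import csv
import io

def read_entities_and_instances(csv_text):
    """Parse a CSV table: column descriptors become entities, cells instances."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entities = list(reader.fieldnames)
    instances = {entity: [] for entity in entities}
    for row in reader:
        for entity in entities:
            instances[entity].append(row[entity])
    return entities, instances
```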
- the ontology ensemble 23 comprises a collection of ontologies previously constructed based on existing ontologies from the literature. Such a collection may comprise multiple ontologies, such as for instance at least two ontologies. This catalog houses a collection of structured representations of entities and their relationships. Each ontology within the ensemble is associated with a specific process and presents a specific framework (e.g. specific relations between the entities). The ontology ensemble may be used for constructing a new ontology based on the request 21 and database 22 provided by the user. Additionally, the ontology ensemble 23 may be further enriched using the ontology frameworks previously constructed for previous users.
- the ensemble of models 20 comprises a collection of machine learning predictive model architectures previously constructed using existing numerical twin architectures from the literature. Such a collection may comprise multiple architectures, such as for instance at least two architectures. These architectures may serve as templates for the construction of numerical twins tailored to the specific process to be modeled. Each architecture encapsulates a set of design choices, algorithms, and methodologies that have been demonstrated to be effective in capturing and simulating real-world phenomena. By leveraging these pre-existing architectures, the system can accelerate the development and deployment of the numerical twins, minimizing the need for experimentation from scratch.
- the enrichment database 24 comprises a collection of entities previously collected from various sources of data available in the literature. These entities are associated with a diverse range of processes. The enrichment database 24 may be further enriched using the entities extracted from previous requests of other users and from previous user databases.
- the device 1 further comprises optionally a module 12 for preprocessing the user database 22 and the user request 21 received by module 11. For instance, the user database 22 and the enrichment database 24 may be cleaned to remove or correct errors, inconsistencies, missing values.
- Module 12 may for instance be configured to execute at least one preprocessing pipeline tailored to the nature of the user database 22.
- These pipelines may extract numeric insights from unstructured or semi-structured data formats such as images (via domain-specific computer vision algorithms), audio signals (e.g., heartbeat recordings in healthcare applications, using frequency analysis or signal processing techniques), or even free-text inputs (e.g., clinical notes or reports, through Natural Language Processing).
- These numeric insights are then converted into structured formats (typically numerical vectors or embeddings) that are aligned with the ontology-based representation, thereby making them usable for downstream training by machine learning predictive models.
- the device 1 may further comprise a module 13 for generating a training dataset for training a machine learning predictive model in order to obtain the numerical twin 30.
- module 13 may first be configured to extract 131 entities from the user request 21 received by module 11.
- the user request 21 may be processed using a Large Language Model (LLM) to vectorize the user request 21.
- Entities may be extracted from the vectorized user request.
- the LLM may be trained and fine-tuned to identify the entities from the user request 21.
- the LLM may rely on a structure of said user database 22 to identify the entities from the user request 21.
- the LLM may rely on the entities comprised in the enrichment database 24 to identify the entities from the user request 21.
- a semantic distance measurement between the vectorized user request and the enrichment database 24 may be calculated to determine the entities that best match the meaning of the user request 21. New entities may be determined through this process and added to the enrichment database 24.
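This semantic-distance matching can be sketched as follows. The bag-of-words embedding is a toy stand-in for the LLM vectorization assumed by the description, and the `match_entities` helper, its threshold and the cosine metric are illustrative assumptions:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding standing in for an LLM encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_entities(request, enrichment_entities, threshold=0.1):
    # Rank enrichment-database entities by semantic proximity to the
    # vectorized user request and keep the closest matches.
    req_vec = embed(request)
    scored = [(cosine(req_vec, embed(e)), e) for e in enrichment_entities]
    return [e for score, e in sorted(scored, reverse=True) if score >= threshold]
```

For example, matching the request "predict patient age from birth date" against a small enrichment set would retain "patient age" and "birth date" while discarding unrelated entities such as "blood pressure".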
- module 13 may be further configured to extract the entities from the user database 22 and/or from the preprocessed user database 22.
- module 13 may be configured to query the user database 22 (with consent and secure access conditions) to read the user database 22 schemas (tabular or graph databases).
- module 13 does not access the user’s data; rather, it reads only the data schemas and associated metadata (database descriptors).
- the descriptors of each column of the user database 22 may be extracted and correspond to entities.
- a semantic distance measurement between the extracted descriptors and the enrichment database 24 may be calculated to determine the entities that best match the meaning of the user database 22. New entities may be determined through this process and added to the enrichment database 24.
- the ensemble of extracted entities may be further enriched using the enrichment database 24.
- an LLM may be trained to provide recommendations of missing entities, based on the intent of the user as deduced from the user request 21, that could improve the statistical relevance of the model.
- the LLM may recommend entities based on the enrichment database 24.
- the entities extracted from the user request 21 and the user database 22 may be classified and standardized to ensure that their taxonomy aligns directly with the entities already present in the enrichment database 24.
- the ensemble of extracted entities may be edited directly by the user using a user interface 120.
- the user may add or remove new entities from the ensemble.
- the user may also directly import a pre-existing ensemble of entities.
- module 13 may be configured to evaluate the model's entities in terms of relevance to the process at hand, as well as the quality and quantity of information they provide. Moreover, module 13 may be configured to generate metadata to form an identity profile for the created entity model, facilitating further analysis, interpretation, and utilization of the extracted entities.
- Rationalization may be performed using a semantic distance measurement to a reference (e.g. a list of entities already validated, for example by the user). Some entities may be added or merged.
- Module 13 may be further configured to define 132 an ontology based on the rationalized ensemble of entities.
- a machine learning architecture may be used, such as Graph Neural Networks (GNN), a Large Language Model (LLM) or a transformer-based model, pretrained using a training dataset comprising ontology structures wherein relationships between the entities have already been established.
- the training dataset may for instance include ontologies from the ontology ensemble 23.
- an ontology may comprise nine entities 101, 102, 103: “Thing”, “Patient”, “Social Security Number”, “Name”, “Surname”, “Address”, “Birth date”, “Age”, “Place of birth”.
- the nine entities 101, 102, 103 may be connected by a relation 105.
- Each entity 101, 102, 103 may be connected to one or several other entities 101, 102, 103 through one or several relations 105.
- the relation 105 may convey a relationship between the entities 101, 102, 103, such as a hierarchical relation, an associative relation, an attribute relation, a temporal relation, a spatial relation or a functional relation.
- “Patient” may be connected to “Thing” through a subclass relation and to “Age” through an attribute relation.
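The nine-entity "Patient" example above can be held in a minimal in-memory structure with entities as nodes and typed relations as labeled edges. The `Ontology` class and the relation labels `subclass_of` and `has_attribute` are illustrative assumptions, not a format prescribed by the description:

```python
class Ontology:
    # Minimal in-memory ontology: entities are nodes, typed relations
    # are (source, relation_type, target) triples.
    def __init__(self):
        self.entities = set()
        self.relations = []

    def add_relation(self, source, rel_type, target):
        self.entities.update([source, target])
        self.relations.append((source, rel_type, target))

    def neighbors(self, entity):
        # All outgoing (relation_type, target) pairs of an entity.
        return [(r, t) for s, r, t in self.relations if s == entity]

# Rebuild the "Patient" example: a subclass relation to "Thing" and
# attribute relations to the seven descriptive entities.
onto = Ontology()
onto.add_relation("Patient", "subclass_of", "Thing")
for attr in ["Social Security Number", "Name", "Surname", "Address",
             "Birth date", "Age", "Place of birth"]:
    onto.add_relation("Patient", "has_attribute", attr)
```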
- Module 13 may be further configured to generate 133 a representation of the defined ontology that is compatible with the framework and/or architecture of the machine learning predictive model chosen from the ensemble of models 20.
- the machine- processable representation may comprise features corresponding to the entities of the ontology.
- the representation of the defined ontology may for instance be a tabular representation that may comprise columns and rows, wherein each column corresponds to an entity.
- the tabular representation may notably be a tensor-compatible flat matrix structure suited for deep learning frameworks such as TensorFlow or PyTorch.
- the representation of the defined ontology may be a graph-based representation in which entities are modeled as nodes and relations as edges, particularly useful for graph neural networks (GNNs).
- the representation of the defined ontology may be a sequence-based representation, such as a tokenized entity-relation sequence, configured to be processed by transformer-based or recurrent architectures.
- the representation of the defined ontology may be a vector embedding format derived from the ontology, enabling compatibility with models using vector space projections.
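Two of the representations listed above, the tensor-compatible tabular form and the GNN edge list, can be sketched as follows. The function names and the `None`-padding convention are assumptions:

```python
def to_tabular(instances, columns):
    # Project ontology instances (dicts keyed by entity name) onto a
    # flat, tensor-compatible matrix: one row per instance, one column
    # per entity, None where no value is available. The fixed column
    # order defines the feature layout expected by DL frameworks.
    return [[inst.get(col) for col in columns] for inst in instances]

def to_edge_list(relations):
    # Graph-based representation for GNNs: entities indexed as nodes,
    # relations flattened to (source_index, target_index) edges.
    nodes = sorted({s for s, _, _ in relations} | {t for _, _, t in relations})
    index = {n: i for i, n in enumerate(nodes)}
    edges = [(index[s], index[t]) for s, _, t in relations]
    return nodes, edges
```

For instance, `to_tabular([{"Age": 34}], ["Name", "Age"])` yields a single row with a missing "Name" slot, ready to be tensorized.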
- Module 13 may be further configured to complete the representation of the defined ontology and/or the ontology using instances 134 from the user database 22. For instance, the rows of the tabular representation may be completed using the instances 134 from the user database 22.
- the tabular representation corresponding to the ontology from Figure 5 comprises a first row comprising the “Thing” entity 101, a second row comprising the “Patient” entity 102 and a third row subdivided into seven columns comprising each of the entities 103 “Social Security Number”, “Name”, “Surname”, “Address”, “Birth date”, “Age” and “Place of birth”. Under each entity 103 among “Social Security Number”, “Name”, “Surname”, “Address”, “Birth date”, “Age” and “Place of birth”, several instances 106 are entered.
- the defined ontology and associated representation may only be partially completed by the instances from the user database 22.
- some boxes of the tabular representation are empty.
- module 13 may be configured to synthesize (i.e., generate or predict using generative or predictive models) instances to fill the voids in the representation and/or ontology.
- the instance synthesis ensures that there is no loss of informational fidelity, and that the sparse representation of the defined ontology and the filled representation of the defined ontology share the same statistical profile.
- Synthesis of new instances may be performed using statistically informed data feed-in, such as Gradient Boosted Decision Trees (e.g., XGBoost or CatBoost), and/or using a filling generative model such as a Generative Adversarial Network (GAN) or a Variational Autoencoder.
- Statistically informed data feed-in involves building missing instances based on instances already present in the representation of the defined ontology.
- the new instances may be deduced by applying statistical or mathematical equations to the already existing instances.
- For instance, the Body Mass Index (BMI) may be deduced from the weight and height instances already present.
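This statistically informed feed-in can be illustrated with the Body Mass Index, the classic case of an instance deducible by a mathematical equation from instances already present. The column names, units (kg, m) and rounding below are assumptions:

```python
def fill_bmi(rows):
    # Statistically informed feed-in: a missing BMI instance is
    # rebuilt deterministically as weight / height**2 from the weight
    # (kg) and height (m) instances already in the representation.
    for row in rows:
        if row.get("BMI") is None and row.get("weight") and row.get("height"):
            row["BMI"] = round(row["weight"] / row["height"] ** 2, 1)
    return rows
```

A row with weight 70 kg and height 1.75 m, for example, would have its empty BMI slot filled with 22.9, while rows lacking either source instance are left untouched.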
- GANs consist of at least one generator and at least one discriminator, which are trained simultaneously in a competitive manner.
- the GAN receives as input a filling training dataset comprising the instances from the sparse representation of the defined ontology. The generator generates new instances from random noise, while the discriminator either receives real instances from the sparse representation of the defined ontology or synthetic instances generated by the generator, and evaluates whether the received instance is real or not.
- the generator improves its ability to generate realistic new instances, while the discriminator enhances its ability to differentiate between real and generated data. This adversarial process leads to the synthesis of high-quality instances that closely resemble the characteristics of the original filling training dataset and respect its statistical profile.
- Module 13 may be further configured to generate new instances in the representation of the ontology and/or the ontology to augment the quality of the information comprised in the ontology and associated representation.
- the instances are once again generated so as to ensure that there is no loss of informational fidelity, and that the statistical profile of the augmented representation and/or augmented ontology is the same as that of the sparse representation and/or sparse ontology.
- Synthesis of new instances may be performed using a generative model such as a Generative Adversarial Network (GAN), a Tabular Variational Auto-Encoder (TVAE), a Boltzmann Machine, or a Support Vector Machine (SVM).
- Module 13 may be further configured to anonymize the instances.
- the instances may be anonymized by employing a generative model (such as for instance a GAN) specifically designed to generate artificial data that closely resembles real Personally Identifiable Information (PII) data comprised in the instances, while mimicking the statistical properties and patterns of the original PII data.
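As a lightweight stand-in for the GAN-based anonymization described above, the following sketch replaces real values with synthetic draws that approximately preserve the statistical profile of the originals. The Gaussian model and the `anonymize_ages` helper are assumptions, not the generative architecture of the description:

```python
import random
import statistics

def anonymize_ages(ages, seed=0):
    # Replace each real age with a synthetic draw mimicking the mean
    # and spread of the originals: no real instance survives, but the
    # statistical profile is approximately preserved. A GAN would play
    # the same role with a learned, rather than Gaussian, model.
    rng = random.Random(seed)
    mu = statistics.mean(ages)
    sigma = statistics.stdev(ages)
    return [max(0, round(rng.gauss(mu, sigma))) for _ in ages]
```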
- Module 13 may further be configured to define the completed representation and/or ontology as the training dataset.
- the device 1 may further comprise a module 14 for choosing an appropriate model architecture for the numerical twin 30.
- the model architecture may be chosen from the model ensemble 20 using the user interface 120. The architecture may be selected directly by the user. Alternatively, module 14 may be configured to automatically select the appropriate model architecture based on the user request 21, and/or the user database 22 and/or the defined ontology and/or its representation.
- Automatic selection of the appropriate model architecture may be performed by first selecting the optimal Deep Neural Network (DNN) architecture (e.g., Transformers, GANs, autoencoders, LSTM, CNN, RNN, etc.) for the problem to be modeled (clustering, regression, decision tree, etc.). The model sizing (e.g., number of attention heads in attention-based models) may then be determined using a specific recommendation machine learning model, leveraging for instance a Decision Tree Algorithm or an LLM.
- the chosen model architectures may be compatible with major deep learning frameworks such as PyTorch, TensorFlow, and Keras, which operate on tensor representations of input data.
- the semantic graph may first be projected onto a tabular structure, which is then tensorized. This projection does not constrain the choice of architecture but influences the structure of the input tensor and thus the inductive behavior of the model. Consequently, any DNN architecture supported by these frameworks may be used and trained using the tensorized form of the ontology-derived dataset.
- the ontology and/or its representation not only guides the selection of the model architecture but also constitutes the core input used for training. It ensures that the learning process is grounded in a structured and semantically coherent view of the data to be modeled.
- the ontology-based dataset captures both the semantic relationships among entities and their corresponding values, providing a rich and structured basis for training the numerical twin.
- the structure of the ontology has a direct impact on how the data is vectorized and subsequently used for training.
- clinical, biological, and omics variables are modeled as attributes of the "Patient” entity, leading to a projection where each patient is represented by a dense feature vector (e.g., 10,000 patients x 2,000 variables).
- the key entity is "TimePoint”
- the data is structured as a 3D tensor (e.g., 500 patients x 12 timepoints x 300 features), suitable for temporal models like LSTM or Transformers.
- Each projection induces specific structural biases (e.g., variable repetition, density, context preservation), which influence model learning and performance.
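The "patients x timepoints x features" projection described above can be sketched as a nested-list tensor. The zero-padding convention and the `records` mapping keyed by (patient, timepoint, feature) are assumptions:

```python
def to_patient_time_tensor(records, patients, timepoints, features):
    # Project longitudinal records onto the 3D layout
    # patients x timepoints x features, with 0.0 wherever no
    # measurement exists for a (patient, timepoint, feature) triple.
    return [[[records.get((p, t, f), 0.0) for f in features]
             for t in timepoints]
            for p in patients]
```

With 500 patients, 12 timepoints and 300 features, this yields exactly the 500 x 12 x 300 tensor suited to temporal models such as LSTMs or Transformers.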
- the device 1 may further comprise a module 15 for obtaining optimal hyperparameters for the chosen machine learning predictive model architecture for the numerical twin 30.
- the hyperparameters may be selected directly by the user via the user interface 120.
- Alternatively, module 15 may be configured to automatically determine the optimal hyperparameters.
- An auto-tuning model such as Bayesian Optimization, Random Search, Genetic Algorithms, gradient-based optimization, meta-learning or reinforcement learning may be applied to a subset of the representation of the defined ontology. Auto-tuning works by iteratively evaluating different combinations of hyperparameters within predefined ranges and selecting the set that maximizes the performance metric of the chosen model architecture. Typically, cross-validation may be used. This iterative process continues until the algorithm converges to the hyperparameters that yield the best performance for the chosen machine learning predictive model.
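This auto-tuning loop can be sketched for the Random Search case. The `evaluate` callback (which would wrap e.g. cross-validation of the chosen architecture) and the search `space` are assumptions:

```python
import random

def random_search(evaluate, space, n_trials=50, seed=0):
    # Minimal Random Search auto-tuning: sample hyperparameter
    # combinations from predefined discrete ranges, score each one via
    # the caller-supplied evaluate() (e.g. cross-validated accuracy),
    # and keep the best-performing set.
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

A Bayesian Optimization variant would replace the uniform sampling with a surrogate-model-guided proposal, but the evaluate-and-keep-best loop is the same.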
- the device 1 may further comprise a module 16 for training the chosen machine learning predictive model architecture using the optimal hyperparameters and the training dataset derived from the ontology and/or its representation.
- the training dataset, structured from the ontology, provides not only raw features but embeds domain-specific semantic relationships that enhance model learning. For instance, a Patient-rooted ontology will emphasize individualized representation, potentially improving model specificity, whereas a TimePoint-centric ontology favors the capture of temporal dynamics. These different ontology-driven geometries of the input space influence the performance and generalization of the trained model. The use of an ontology thus enables incorporation of domain knowledge into the training process by structurally guiding the way data is represented and learned.
- All previously described modules may be further configured to generate metadata for the purpose of helping with the certification of the generated numerical twin 30.
- the metadata generated may include at least one of the following: Metadata linked to the entity ensemble: metadata related to the statistical quality of the entity ensemble, and metadata associated with the production of the entity ensemble; Metadata associated with the ontology ensemble 23, including semantic search capabilities within the ontology ensemble 23, and metadata related to previously generated ontologies;
- Metadata linked to the data augmentation and anonymization processes: statistical metadata related to the quality of data augmentation and anonymization processes, and metadata regarding the version of the model used, augmentation data sources, and nature of standardization or normalization processes;
- Metadata associated with the machine learning generative model used for generation of the numerical twin 30: quality control of the machine learning generative model, including metadata inspection for conformity to legal standards (e.g. states, return codes, production metadata, statistical metadata, and predictive maintenance);
- Metadata on the device 1 outputs including device typology, versioning, statistical quality control metadata, storage location/path, and availability metadata.
- Figure 5 shows examples of metadata 104 associated with the entities 101, 102, 103.
- modules 11, 12, 13, 14, 15, and 16 are not necessarily successive in time, and may overlap, proceed in parallel or alternate, in any appropriate manner.
- a new input may be progressively received over time and preprocessed, while the module 13 is dealing with the previously obtained inputs.
- a batch of inputs may be fully received and preprocessed before it is submitted to module 13.
- the device 1 may for example execute the following computer implemented method (Figure 2): receiving said user request 21, user database 22 and at least one entity and at least one instance associated with said user database, said ontology ensemble 23, said ensemble of models 20 and optionally said enrichment database 24 (step 41); optionally preprocessing said user request 21 and user database 22 (step 42); constructing said training dataset (step 43) by: o extracting at least one entity associated with said process based on said user request 21 (step 431); o defining an ontology comprising at least two entities connected by a relation using said at least one ontology of reference from said ontology ensemble, said ontology being defined using said at least one entity extracted from said at least one user request 21 received as input (step 432); o generating a representation of the defined ontology, said representation comprising at least one feature, each feature corresponding to an entity of said defined ontology (step 433); o entering the at least one instance from said user database in the generated representation and defining the obtained representation as the training dataset (step 434); choosing one architecture of machine learning predictive model from the ensemble of models (step 44); obtaining optimal hyperparameters for said chosen architecture (step 45); training said chosen architecture using said optimal hyperparameters and said training dataset so as to obtain said numerical twin and generating statistical quality control metadata 31 relative to said numerical twin 30 (step 46); outputting said numerical twin 30 and said statistical quality control metadata 31.
- the present invention also relates to the device 2 for modeling a process using a numerical twin 30 generated using the device 1.
- the device 2 will be described in reference to a particular function embodiment as illustrated in Figure 3.
- the device 2 is adapted to receive as input at least one input variable 35 of the numerical twin 30 and to output at least one output variable 61 of the numerical twin 30.
- the device 2 comprises a module 17 for receiving at least one input variable 35 of said numerical twin 30 and said numerical twin 30, that may be stored in one or more local or remote database(s) 10.
- the latter can take the form of storage resources available from any kind of appropriate storage means, which can be notably a RAM or an EEPROM (Electrically-Erasable Programmable Read-Only Memory) such as a Flash memory, possibly within an SSD (Solid-State Disk).
- the numerical twin 30 and all its hyperparameters may have been previously generated by a system including the device 1.
- the trained numerical twin 30 and its hyperparameters may be received from a communication network.
- the device 2 further comprises optionally a module 18 for preprocessing the at least one input variable 35.
- the device 2 may also comprise a module 19 configured to feed said at least one input variable 35 to said numerical twin 30 so as to generate at least one output variable 61.
- the numerical twin 30 may be used for population modelling.
- a population according to the invention may be understood as a statistical definition of a group of variables that describes an actual system (static or dynamic) and the relationships between these variables that represent and explain the internal/external dynamic behavior of the system.
- the numerical twin 30 architecture may be akin to a multidimensional mesh of data points within an N-dimensional space, where N denotes the number of entities. These entities are interconnected by relationships, some of which can be conceptual or mathematically modeled.
- the input variables 35 to this numerical twin 30 may encompass demographic information, socio-economic indicators, geographic attributes, and any other relevant factors influencing the behavior or characteristics of the population being modeled.
- with each input variable 35, a distribution may be associated (e.g. the distribution is an instance).
- the output variables 61 may be the same as the input variables 35 or new information deduced from the input variables 35.
- with each output variable 61, a synthetic distribution may be associated (e.g. the synthetic distribution is an instance).
- the output variables 61 of the numerical twin 30 may reflect how the population being modeled changes or behaves in response to at least one parameter.
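The inference path of the device 2 — receiving input variables (module 17), optionally preprocessing them (module 18), and feeding them to the numerical twin (module 19) — can be sketched as follows; treating the twin as a generic callable is an assumption:

```python
def run_twin(twin, input_variables, preprocess=None):
    # Device-2 style inference: optionally preprocess the input
    # variables (module 18), then feed them to the numerical twin
    # (module 19) to obtain the output variables.
    if preprocess is not None:
        input_variables = preprocess(input_variables)
    return twin(input_variables)
```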
- the device 2 may interact with a user interface 119, via which information can be entered and retrieved by a user.
- the user interface 119 includes any means appropriate for entering or retrieving data, information or instructions, notably visual, tactile and/or audio capacities that can encompass any or several of the following means as well known by a person skilled in the art: a screen, a keyboard, a trackball, a touchpad, a touchscreen, a loudspeaker, a voice recognition system.
- a particular apparatus 9, visible on Figure 4, is embodying the device 1 as well as the device 2 described above. It corresponds for example to a workstation, a laptop, a tablet, a smartphone, or a head-mounted display (HMD).
- That apparatus 9 is suited to numerical twin generation and process modeling and associated ML training. It comprises the following elements, connected to each other by a bus 95 of addresses and data that also transports a clock signal:
- a microprocessor 91 (or CPU);
- a graphics card 92 comprising several Graphical Processing Units (or GPUs) 920 and a Graphical Random Access Memory (GRAM) 921;
- I/O devices 94 such as for example a keyboard, a mouse, a trackball, a webcam; other modes for introduction of commands such as for example vocal recognition are also possible;
- the apparatus 9 also comprises a display device 93 of display screen type directly connected to the graphics card 92 to display synthesized images calculated and composed in the graphics card.
- a dedicated bus to connect the display device 93 to the graphics card 92 offers the advantage of having much greater data transmission bitrates and thus reducing the latency time for the displaying of images composed by the graphics card.
- a display device is external to apparatus 9 and is connected thereto by a cable or wirelessly for transmitting the display signals.
- the apparatus 9, for example through the graphics card 92 comprises an interface for transmission or connection adapted to transmit a display signal to an external display means such as for example an LCD or plasma screen or a video-projector.
- the RF unit 99 can be used for wireless transmissions.
- the word “register” used hereinafter in the description of memories 97 and 921 can designate in each of the memories mentioned, a memory zone of low capacity (some binary data) as well as a memory zone of large capacity (enabling a whole program to be stored or all or part of the data representative of data calculated or to be displayed).
- the registers represented for the RAM 97 and the GRAM 921 can be arranged and constituted in any manner, and each of them does not necessarily correspond to adjacent memory locations and can be distributed otherwise (which covers notably the situation in which one register includes several smaller registers).
- the microprocessor 91 loads and executes the instructions of the program contained in the RAM 97.
- the apparatus 9 may include only the functionalities of the device 1, and not those of the device 2.
- the device 1 and/or the device 2 may be implemented other than as standalone software, and an apparatus or set of apparatuses comprising only parts of the apparatus 9 may be exploited through an API call or via a cloud interface.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a device (1) and a method for generating at least one numerical twin (30) configured to model a process based on at least one request (21) expressed by a user. The invention further relates to a device for modeling a process using a numerical twin (30) generated by means of the device (1).
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24305699.1 | 2024-05-02 | ||
| EP24305699 | 2024-05-02 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025229197A1 true WO2025229197A1 (fr) | 2025-11-06 |
Family
ID=92458305
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2025/062101 Pending WO2025229197A1 (fr) | 2024-05-02 | 2025-05-02 | Dispositif et procédé de génération d'un jumeau numérique |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025229197A1 (fr) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180165604A1 (en) * | 2016-12-09 | 2018-06-14 | U2 Science Labs A Montana | Systems and methods for automating data science machine learning analytical workflows |
| WO2021051031A1 (fr) * | 2019-09-14 | 2021-03-18 | Oracle International Corporation | Techniques de composition de service automatisée adaptative et sensible au contexte pour l'apprentissage automatique (ml) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180165604A1 (en) * | 2016-12-09 | 2018-06-14 | U2 Science Labs A Montana | Systems and methods for automating data science machine learning analytical workflows |
| WO2021051031A1 (fr) * | 2019-09-14 | 2021-03-18 | Oracle International Corporation | Techniques de composition de service automatisée adaptative et sensible au contexte pour l'apprentissage automatique (ml) |