WO2024115777A1 - Synthetic data generation - Google Patents
- Publication number
- WO2024115777A1 (PCT/EP2023/084012)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- synthetic
- computer
- surgical
- implemented method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/03—Recognition of patterns in medical or anatomical images
Definitions
- the disclosure relates in general to computing technology and relates more particularly to computing technology for synthetic data generation.
- Computer-assisted systems, particularly computer-assisted surgery systems (CASs), rely on video data that can be stored and/or streamed.
- the video data can be used to augment a person’s physical sensing, perception, and reaction capabilities.
- such systems can effectively provide the information corresponding to an expanded field of vision, both temporal and spatial, that enables a person to adjust current and future actions based on the part of an environment not included in his or her physical field of view.
- the video data can be stored and/or transmitted for several purposes, such as archival, training, post-surgery analysis, and/or patient consultation.
- training data is needed to train various models.
- when available training data is limited, the trained models may not perform as well as models trained with more robust sets of training data.
- a computer-implemented method includes receiving one or more inputs each defining one or more conditions for synthetic data generation and applying one or more encoders to the one or more inputs to generate a shared feature space of feature representations.
- a generator generates synthetic data based on the feature representations of the shared feature space.
- a discriminator compares real data and the synthetic data to distinguish one or more differences between the real data and the synthetic data.
- Adversarial training is applied using adversarial loss information based on the one or more differences to train the one or more encoders, the generator, and the discriminator, modifying one or more components so that the synthetic data more closely aligns with the one or more conditions and more closely matches the distribution of the real data.
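- As an illustration of how such a pipeline could be wired, the following minimal PyTorch sketch connects one or more condition encoders to a shared feature space, a generator that consumes those features, and a discriminator used for adversarial training; the class names (ConditionEncoder, Generator, Discriminator) and layer sizes are hypothetical and are not taken from the patent.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Maps one conditioning input (e.g., a segmentation map) to a feature vector."""
    def __init__(self, in_channels: int, feat_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Generates a synthetic image from the shared feature space."""
    def __init__(self, feat_dim: int = 512, out_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 8 * 8 * 128), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, feats):
        return self.net(feats)

class Discriminator(nn.Module):
    """Scores how real an image looks (a single logit per image)."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x)

# Shared feature space: average the feature representations produced by all encoders.
encoders = nn.ModuleList([ConditionEncoder(1), ConditionEncoder(3)])
generator, discriminator = Generator(), Discriminator()

conditions = [torch.randn(2, 1, 32, 32), torch.randn(2, 3, 32, 32)]
shared = torch.stack([enc(c) for enc, c in zip(encoders, conditions)]).mean(dim=0)
synthetic = generator(shared)                       # synthetic data from the shared feature space
logits_fake = discriminator(synthetic)              # discriminator scores synthetic data
logits_real = discriminator(torch.randn_like(synthetic))  # and real data (placeholder tensor here)
```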
- a system includes a data store including video data associated with a surgical procedure.
- the system also includes a machine learning training system configured to train a generative adversarial network model to generate synthetic surgical data, combine the synthetic surgical data with real surgical data to form an enhanced training data set in the data store, and train one or more machine learning models using the enhanced training data set.
- a computer program product includes a memory device having computer executable instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a plurality of operations.
- the operations include applying one or more encoders to one or more inputs associated with surgical data to generate a shared feature space of feature representations and generating, by a generator, synthetic data based on the feature representations of the shared feature space, where the generator includes a decoder with sequentially nested blocks of normalization layers including at least one pixel-level normalization block and at least one channel-level normalization block.
- FIG. 1 depicts a computer-assisted surgery (CAS) system according to one or more aspects
- FIG. 2 depicts a surgical procedure system according to one or more aspects
- FIG. 3 depicts a system for analyzing video captured by a video recording system according to one or more aspects
- FIG. 4 depicts a block diagram of a synthetic image generator according to one or more aspects
- FIG. 5 depicts a block diagram of a synthetic image generator according to one or more aspects
- FIG. 6 depicts a block diagram of a synthetic video generator according to one or more aspects
- FIG. 7 depicts a block diagram of a synthetic image generator according to one or more aspects
- FIG. 8 depicts a block diagram of a discriminator architecture according to one or more aspects
- FIG. 9 depicts examples of synthetic images according to one or more aspects
- FIGS. 10A and 10B depict examples of synthetic image comparisons according to one or more aspects
- FIG. 11 depicts example results of segmentation model performance according to one or more aspects
- FIG. 12 depicts an example of synthetic video generation applied to a simulation video according to one or more aspects
- FIG. 13 depicts a flowchart of a method of synthetic data generation according to one or more aspects.
- FIG. 14 depicts a block diagram of a computer system according to one or more aspects.
- Exemplary aspects of the technical solutions described herein include systems and methods for synthetic data generation.
- Training Machine Learning (ML) models that are accurate, robust, and can be used for informing users, such as surgeons, typically requires large amounts of annotated data, such as images (e.g., surgical images).
- Manual annotations can be time consuming and prone to errors.
- where fine-grained annotations are desired, the process can become even more complicated.
- annotation can be externalized, which may be complicated or even unfeasible where the process could compromise protected health information of patients under some privacy regulations.
- aspects as further described herein include a framework that can generate synthetic surgical data, such as images or video, conditioned on inputs.
- the generated synthetic data can appear both realistic and diverse.
- the input can be used as annotation, avoiding annotation problems associated with separate annotation generation.
- the synthetic data together with the input can be used to train machine learning models for downstream tasks, such as semantic segmentation, surgical phase recognition, and text-to-image generation.
- a lack of large datasets and high-quality annotated data can limit the development of accurate and robust machine-learning models within medical and surgical domains.
- Generative models can produce novel and diverse synthetic images that closely resemble reality while controlling content with various types of annotations.
- generative models have not been yet fully explored in the surgical domain, partially due to the lack of large datasets and specific challenges present in the surgical domain, such as the large anatomical diversity.
- a generative model can produce synthetic images from segmentation maps.
- An architecture based on a combination of channel-level and pixel-level normalization layers, which boost image quality while granting adherence to the input segmentation map, can produce surgical images with improved quality when compared to early generative models.
- SuGAN can generate realistic and diverse surgical images in different surgical datasets, such as: cholecystectomy, partial nephrectomy, and radical prostatectomy.
- the use of synthetic images together with real ones can be used to improve the performance of other machine-learning models.
- SuGAN can be used to generate large synthetic datasets which can be used to train different segmentation models. Results illustrate that using synthetic images can improve mean segmentation performance with respect to only using real images. For example, when considering radical prostatectomy, mean segmentation performance can be boosted by up to 5.43%. Further, a performance improvement can be larger in classes that are underrepresented in the training sets, for example, where the performance boost of specific classes reaches up to 61.6%. Other levels of improvement can be achieved according to various aspects.
- SuGAN can synthesize images or video using an architecture conditioned to segmentation maps that embraces recent advances in conditional and unconditional pipelines to support the generation of multi-modal (i.e., diverse) surgical images, and to prevent overfitting, lack of diversity, and cartooning, which are often present in synthetic images generated by state-of-the-art models.
- channel- and pixel-level normalization blocks can be used, where the former allow for realistic image generation and multimodality through latent space manipulation, while the latter enforce the adherence of the synthetic images to an input segmentation map.
- the CAS system 100 includes at least a computing system 102, a video recording system 104, and a surgical instrumentation system 106.
- an actor 112 can be medical personnel that uses the CAS system 100 to perform a surgical procedure on a patient 110. Medical personnel can be a surgeon, assistant, nurse, administrator, or any other actor that interacts with the CAS system 100 in a surgical environment.
- the surgical procedure can be any type of surgery, such as but not limited to cataract surgery, laparoscopic cholecystectomy, endoscopic endonasal transsphenoidal approach (eTSA) to resection of pituitary adenomas, or any other surgical procedure.
- actor 112 can be a technician, an administrator, an engineer, or any other such personnel that interacts with the CAS system 100.
- actor 112 can record data from the CAS system 100, configure/update one or more attributes of the CAS system 100, review past performance of the CAS system 100, repair the CAS system 100, and/or the like including combinations and/or multiples thereof.
- a surgical procedure can include multiple phases, and each phase can include one or more surgical actions.
- a “surgical action” can include an incision, a compression, a stapling, a clipping, a suturing, a cauterization, a sealing, or any other such actions performed to complete a phase in the surgical procedure.
- a “phase” represents a surgical event that is composed of a series of steps (e.g., closure).
- a “step” refers to the completion of a named surgical objective (e.g., hemostasis).
- certain surgical instruments 108 (e.g., forceps) can be used to perform the surgical action(s), and a particular anatomical structure of the patient may be the target of the surgical action(s).
- the video recording system 104 includes one or more cameras 105, such as operating room cameras, endoscopic cameras, and/or the like including combinations and/or multiples thereof.
- the cameras 105 capture video data of the surgical procedure being performed.
- the video recording system 104 includes one or more video capture devices that can include cameras 105 placed in the surgical room to capture events surrounding (i.e., outside) the patient being operated upon.
- the video recording system 104 further includes cameras 105 that are passed inside (e.g., endoscopic cameras) the patient 110 to capture endoscopic data.
- the endoscopic data provides video and images of the surgical procedure.
- the computing system 102 includes one or more memory devices, one or more processors, a user interface device, among other components. All or a portion of the computing system 102 shown in FIG. 1 can be implemented for example, by all or a portion of computer system 1400 of FIG. 14. Computing system 102 can execute one or more computer-executable instructions. The execution of the instructions facilitates the computing system 102 to perform one or more methods, including those described herein.
- the computing system 102 can communicate with other computing systems via a wired and/or a wireless network.
- the computing system 102 includes one or more trained machine learning models that can detect and/or predict features of/from the surgical procedure that is being performed or has been performed earlier.
- Features can include structures, such as anatomical structures, surgical instruments 108 in the captured video of the surgical procedure.
- Features can further include events, such as phases and/or actions in the surgical procedure.
- Features that are detected can further include the actor 112 and/or patient 110.
- the computing system 102 in one or more examples, can provide recommendations for subsequent actions to be taken by the actor 112.
- the computing system 102 can provide one or more reports based on the detections.
- the detections by the machine learning models can be performed in an autonomous or semi-autonomous manner.
- the machine learning models can include artificial neural networks, such as deep neural networks, convolutional neural networks, recurrent neural networks, vision transformers, encoders, decoders, or any other type of machine learning model.
- the machine learning models can be trained in a supervised, unsupervised, or hybrid manner.
- the machine learning models can be trained to perform detection and/or prediction using one or more types of data acquired by the CAS system 100.
- the machine learning models can use the video data captured via the video recording system 104.
- the machine learning models use the surgical instrumentation data from the surgical instrumentation system 106.
- the machine learning models use a combination of video data and surgical instrumentation data.
- the machine learning models can also use audio data captured during the surgical procedure.
- the audio data can include sounds emitted by the surgical instrumentation system 106 while activating one or more surgical instruments 108.
- the audio data can include voice commands, snippets, or dialog from one or more actors 112.
- the audio data can further include sounds made by the surgical instruments 108 during their use.
- the machine learning models can detect surgical actions, surgical phases, anatomical structures, surgical instruments, and various other features from the data associated with a surgical procedure. The detection can be performed in real-time in some examples.
- the computing system 102 analyzes the surgical data, i.e., the various types of data captured during the surgical procedure, in an offline manner (e.g., post-surgery).
- the machine learning models detect surgical phases based on detecting some of the features, such as the anatomical structure, surgical instruments, and/or the like including combinations and/or multiples thereof.
- a data collection system 150 can be employed to store the surgical data, including the video(s) captured during the surgical procedures.
- the data collection system 150 includes one or more storage devices 152.
- the data collection system 150 can be a local storage system, a cloud-based storage system, or a combination thereof. Further, the data collection system 150 can use any type of cloud-based storage architecture, for example, public cloud, private cloud, hybrid cloud, and/or the like including combinations and/or multiples thereof.
- the data collection system can use a distributed storage, i.e., the storage devices 152 are located at different geographic locations.
- the storage devices 152 can include any type of electronic data storage media used for recording machine-readable data, such as semiconductor-based, magnetic-based, optical-based storage media, and/or the like including combinations and/or multiples thereof.
- the data storage media can include flash-based solid-state drives (SSDs), magnetic-based hard disk drives, magnetic tape, optical discs, and/or the like including combinations and/or multiples thereof.
- the data collection system 150 can be part of the video recording system 104, or vice-versa.
- the data collection system 150, the video recording system 104, and the computing system 102 can communicate with each other via a communication network, which can be wired, wireless, or a combination thereof.
- the communication between the systems can include the transfer of data (e.g., video data, instrumentation data, and/or the like including combinations and/or multiples thereof), data manipulation commands (e.g., browse, copy, paste, move, delete, create, compress, and/or the like including combinations and/or multiples thereof), data manipulation results, and/or the like including combinations and/or multiples thereof.
- the computing system 102 can manipulate the data already stored/being stored in the data collection system 150 based on outputs from the one or more machine learning models (e.g., phase detection, anatomical structure detection, surgical tool detection, and/or the like including combinations and/or multiples thereof). Alternatively, or in addition, the computing system 102 can manipulate the data already stored/being stored in the data collection system 150 based on information from the surgical instrumentation system 106.
- the video captured by the video recording system 104 is stored on the data collection system 150.
- the computing system 102 curates parts of the video data being stored on the data collection system 150.
- the computing system 102 filters the video captured by the video recording system 104 before it is stored on the data collection system 150.
- the computing system 102 filters the video captured by the video recording system 104 after it is stored on the data collection system 150.
- Turning now to FIG. 2, a surgical procedure system 200 is generally shown according to one or more aspects.
- the example of FIG. 2 depicts a surgical procedure support system 202 that can include or may be coupled to the CAS system 100 of FIG. 1.
- the surgical procedure support system 202 can acquire image or video data using one or more cameras 204.
- the surgical procedure support system 202 can also interface with one or more sensors 206 and/or one or more effectors 208.
- the sensors 206 may be associated with surgical support equipment and/or patient monitoring.
- the effectors 208 can be robotic components or other equipment controllable through the surgical procedure support system 202.
- the surgical procedure support system 202 can also interact with one or more user interfaces 210, such as various input and/or output devices.
- the surgical procedure support system 202 can store, access, and/or update surgical data 214 associated with a training dataset and/or live data as a surgical procedure is being performed on patient 110 of FIG. 1.
- the surgical procedure support system 202 can store, access, and/or update surgical objectives 216 to assist in training and guidance for one or more surgical procedures.
- User configurations 218 can track and store user preferences.
- Turning now to FIG. 3, a system 300 for analyzing video and data is generally shown according to one or more aspects. In accordance with aspects, the video and data are captured from the video recording system 104 of FIG. 1.
- System 300 can be the computing system 102 of FIG. 1, or a part thereof in one or more examples.
- System 300 uses data streams in the surgical data to identify procedural states according to some aspects.
- System 300 includes a data reception system 305 that collects surgical data, including the video data and surgical instrumentation data.
- the data reception system 305 can include one or more devices (e.g., one or more user devices and/or servers) located within and/or associated with a surgical operating room and/or control center.
- the data reception system 305 can receive surgical data in real-time, i.e., as the surgical procedure is being performed. Alternatively, or in addition, the data reception system 305 can receive or access surgical data in an offline manner, for example, by accessing data that is stored in the data collection system 150 of FIG. 1.
- System 300 further includes a machine learning processing system 310 that processes the surgical data using one or more machine learning models to identify one or more features, such as surgical phase, instrument, anatomical structure, and/or the like including combinations and/or multiples thereof, in the surgical data.
- machine learning processing system 310 can include one or more devices (e.g., one or more servers), each of which can be configured to include part or all of one or more of the depicted components of the machine learning processing system 310.
- a part or all of the machine learning processing system 310 is cloud-based and/or remote from an operating room and/or physical location corresponding to a part or all of data reception system 305.
- the machine learning processing system 310 is depicted with several components; however, the depicted components are just one example structure of the machine learning processing system 310, and in other examples, the machine learning processing system 310 can be structured using a different combination of components. Such variations in the combination of the components are encompassed by the technical solutions described herein.
- the machine learning processing system 310 includes a machine learning training system 325, which can be a separate device (e.g., server) that stores its output as one or more trained machine learning models 330.
- the trained machine learning models 330 are accessible by a machine learning execution system 340.
- the machine learning execution system 340 can be separate from the machine learning training system 325 in some examples.
- devices that “train” the models are separate from devices that “infer,” i.e., perform real-time processing of surgical data using the trained machine learning models 330.
- Machine learning processing system 310 further includes a data generator 315 to generate simulated surgical data, such as a set of synthetic images and/or synthetic video as synthetic data 317, in combination with real image and video data from the video recording system 104 (e.g., real data 322), to generate trained machine learning models 330.
- Data generator 315 can access (read/write) a data store 320 to record data, including multiple images and/or multiple videos.
- the images and/or videos can include images and/or videos collected during one or more procedures (e.g., one or more surgical procedures).
- the images and/or video may have been collected by a user device worn by the actor 112 of FIG.
- the data store 320 is separate from the data collection system 150 of FIG. 1 in some examples. In other examples, the data store 320 is part of the data collection system 150.
- Each of the images and/or videos recorded in the data store 320 for performing training can be defined as a base image and can be associated with other data that characterizes an associated procedure and/or rendering specifications.
- the other data can identify a type of procedure, a location of a procedure, one or more people involved in performing the procedure, surgical objectives, and/or an outcome of the procedure.
- the other data can indicate a stage of the procedure with which the image or video corresponds, rendering specification with which the image or video corresponds and/or a type of imaging device that captured the image or video (e.g., and/or, if the device is a wearable device, a role of a particular person wearing the device, and/or the like including combinations and/or multiples thereof).
- the other data can include image-segmentation data that identifies and/or characterizes one or more objects (e.g., tools, anatomical objects, and/or the like including combinations and/or multiples thereof) that are depicted in the image or video.
- the characterization can indicate the position, orientation, or pose of the object in the image.
- the characterization can indicate a set of pixels that correspond to the object and/or a state of the object resulting from a past or current user handling. Localization can be performed using a variety of techniques for identifying objects in one or more coordinate systems.
- the machine learning training system 325 can use the recorded data in the data store 320, which can include the simulated surgical data (e.g., set of synthetic images and/or synthetic video) and/or actual surgical data to generate the trained machine learning models 330.
- the trained machine learning models 330 can be defined based on a type of model and a set of hyperparameters (e.g., defined based on input from a client device).
- the trained machine learning models 330 can be configured based on a set of parameters that can be dynamically defined based on (e.g., continuous or repeated) training (i.e., learning, parameter tuning).
- Machine learning training system 325 can use one or more optimization algorithms to define the set of parameters to minimize or maximize one or more loss functions.
- the set of (learned) parameters can be stored as part of the trained machine learning models 330 using a specific data structure for a particular trained machine learning model of the trained machine learning models 330.
- the data structure can also include one or more non-learnable variables (e.g., hyperparameters and/or model definitions).
- the machine learning training system 325 can train a generative adversarial network model, such as SuGAN 332, to generate synthetic surgical data, such as synthetic data 317, as part of the data generator 315.
- the synthetic surgical data can be combined with real surgical data (e.g., real data 322) to form an enhanced training data set 324 in the data store 320.
- the machine learning training system 325 can use the enhanced training data set 324 to train one or more machine learning models 334A-334N, which can be stored in the trained machine learning models 330 for further use by the machine learning execution system 340.
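- A minimal sketch of how synthetic and real samples might be combined into such an enhanced training set is shown below; the tensor placeholders stand in for real surgical data and SuGAN-generated data, the dataset wrappers are generic PyTorch utilities, and the 50/50 mix is only one possible ratio, not a value from the patent.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder stand-ins for real and synthetic surgical data:
# each item is an (image, segmentation_label) pair.
real_data = TensorDataset(torch.randn(100, 3, 256, 256), torch.randint(0, 13, (100, 256, 256)))
synthetic_data = TensorDataset(torch.randn(100, 3, 256, 256), torch.randint(0, 13, (100, 256, 256)))

# Enhanced training set: real samples plus synthetic samples generated from
# known label maps, so the conditioning input doubles as the annotation.
enhanced_training_set = ConcatDataset([real_data, synthetic_data])
loader = DataLoader(enhanced_training_set, batch_size=32, shuffle=True)

for images, labels in loader:
    # train a downstream model (e.g., a segmentation network) on the mixed batch
    pass
```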
- Machine learning execution system 340 can access the data structure(s) of the trained machine learning models 330 and accordingly configure the trained machine learning models 330 for inference (e.g., prediction, classification, and/or the like including combinations and/or multiples thereof).
- the trained machine learning models 330 can include, for example, a fully convolutional network adaptation, an adversarial network model, an encoder, a decoder, or other types of machine learning models.
- the type of the trained machine learning models 330 can be indicated in the corresponding data structures.
- the trained machine learning models 330 can be configured in accordance with one or more hyperparameters and the set of learned parameters.
- the trained machine learning models 330 receive, as input, surgical data to be processed and subsequently generate one or more inferences according to the training.
- the video data captured by the video recording system 104 of FIG. 1 can include data streams (e.g., an array of intensity, depth, and/or RGB values) for a single image or for each of a set of frames (e.g., including multiple images or an image with sequencing data) representing a temporal window of fixed or variable length in a video.
- the video data that is captured by the video recording system 104 can be received by the data reception system 305, which can include one or more devices located within an operating room where the surgical procedure is being performed.
- the data reception system 305 can include devices that are located remotely, to which the captured video data is streamed live during the performance of the surgical procedure. Alternatively, or in addition, the data reception system 305 accesses the data in an offline manner from the data collection system 150 or from any other data source (e.g., local or remote storage device).
- any other data source e.g., local or remote storage device.
- the data reception system 305 can process the video and/or data received.
- the processing can include decoding when a video stream is received in an encoded format such that data for a sequence of images can be extracted and processed.
- the data reception system 305 can also process other types of data included in the input surgical data.
- the surgical data can include additional data streams, such as audio data, RFID data, textual data, measurements from one or more surgical instrum ents/sensors, and/or the like including combinations and/or multiples thereof, that can represent stimuli/procedural states from the operating room.
- the data reception system 305 synchronizes the different inputs from the different devices/sensors before inputting them in the machine learning processing system 310.
- the trained machine learning models 330 can analyze the input surgical data, and in one or more aspects, predict and/or characterize features (e.g., structures) included in the video data included with the surgical data.
- the video data can include sequential images and/or encoded video data (e.g., using digital video file/stream formats and/or codecs, such as MP4, MOV, AVI, WEBM, AVCHD, OGG, and/or the like including combinations and/or multiples thereof).
- the prediction and/or characterization of the features can include segmenting the video data or predicting the localization of the structures with a probabilistic heatmap.
- the one or more trained machine learning models 330 include or are associated with a preprocessing or augmentation (e.g., intensity normalization, resizing, cropping, and/or the like including combinations and/or multiples thereof) that is performed prior to segmenting the video data.
- An output of the one or more trained machine learning models 330 can include image-segmentation or probabilistic heatmap data that indicates which (if any) of a defined set of structures are predicted within the video data, a location and/or position and/or pose of the structure(s) within the video data, and/or state of the structure(s).
- the location can be a set of coordinates in an image/frame in the video data. For example, the coordinates can provide a bounding box.
- the coordinates can provide boundaries that surround the structure(s) being predicted.
- the trained machine learning models 330 are trained to perform higher-level predictions and tracking, such as predicting a phase of a surgical procedure and tracking one or more surgical instruments used in the surgical procedure.
- the machine learning processing system 310 includes a detector 350 that uses the trained machine learning models 330 to identify various items or states within the surgical procedure (“procedure”).
- the detector 350 can use a particular procedural tracking data structure 355 from a list of procedural tracking data structures.
- the detector 350 can select the procedural tracking data structure 355 based on the type of surgical procedure that is being performed.
- the type of surgical procedure can be predetermined or input by actor 112.
- the procedural tracking data structure 355 can identify a set of potential phases that can correspond to a part of the specific type of procedure as “phase predictions”, where the detector 350 is a phase detector.
- the procedural tracking data structure 355 can be a graph that includes a set of nodes and a set of edges, with each node corresponding to a potential phase.
- the edges can provide directional connections between nodes that indicate (via the direction) an expected order during which the phases will be encountered throughout an iteration of the procedure.
- the procedural tracking data structure 355 may include one or more branching nodes that feed to multiple next nodes and/or can include one or more points of divergence and/or convergence between the nodes.
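- One way to realize such a procedural tracking data structure is as a small directed graph of phase nodes; the sketch below, with hypothetical phase names, tool lists, and fields, only illustrates the node/edge layout described above and is not the patent's data structure.

```python
from dataclasses import dataclass, field

@dataclass
class PhaseNode:
    """A potential phase of a procedure, with characteristics used for matching."""
    name: str
    typical_tools: list[str] = field(default_factory=list)
    next_phases: list[str] = field(default_factory=list)  # directed edges: expected order

# Hypothetical procedural tracking structure for a simplified cholecystectomy.
procedural_tracking = {
    "preparation": PhaseNode("preparation", ["trocar"], ["dissection"]),
    "dissection": PhaseNode("dissection", ["grasper", "hook"], ["clipping", "dissection"]),
    "clipping": PhaseNode("clipping", ["clipper"], ["gallbladder_retrieval"]),
    "gallbladder_retrieval": PhaseNode("gallbladder_retrieval", ["specimen_bag"], []),
}

def candidate_next_phases(current_phase: str) -> list[str]:
    """Phases a detector could consider next, given the current node."""
    return procedural_tracking[current_phase].next_phases
```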
- a phase indicates a procedural action (e.g., surgical action) that is being performed or has been performed and/or indicates a combination of actions that have been performed.
- a phase relates to a biological state of a patient undergoing a surgical procedure.
- the biological state can indicate a complication (e.g., blood clots, clogged arteries/veins, and/or the like including combinations and/or multiples thereof), pre-condition (e.g., lesions, polyps, and/or the like including combinations and/or multiples thereof).
- the trained machine learning models 330 are trained to detect an “abnormal condition,” such as hemorrhaging, arrhythmias, blood vessel abnormality, and/or the like including combinations and/or multiples thereof.
- Each node within the procedural tracking data structure 355 can identify one or more characteristics of the phase corresponding to that node.
- the characteristics can include visual characteristics.
- the node identifies one or more tools that are typically in use or available for use (e.g., on a tool tray) during the phase.
- the node also identifies one or more roles of people who are typically performing a surgical task, a typical type of movement (e.g., of a hand or tool), and/or the like including combinations and/or multiples thereof.
- detector 350 can use the segmented data generated by machine learning execution system 340 that indicates the presence and/or characteristics of particular objects within a field of view to identify an estimated node to which the real image data corresponds.
- Identification of the node can further be based upon previously detected phases for a given procedural iteration and/or other detected input (e.g., verbal audio data that includes person-to-person requests or comments, explicit identifications of a current or past phase, information requests, and/or the like including combinations and/or multiples thereof).
- the detector 350 can output predictions, such as a phase prediction associated with a portion of the video data that is analyzed by the machine learning processing system 310.
- the phase prediction is associated with the portion of the video data by identifying a start time and an end time of the portion of the video that is analyzed by the machine learning execution system 340.
- the phase prediction that is output can include segments of the video where each segment corresponds to and includes an identity of a surgical phase as detected by the detector 350 based on the output of the machine learning execution system 340.
- phase prediction in one or more examples, can include additional data dimensions, such as, but not limited to, identities of the structures (e.g., instrument, anatomy, and/or the like including combinations and/or multiples thereof) that are identified by the machine learning execution system 340 in the portion of the video that is analyzed.
- the phase prediction can also include a confidence score of the prediction.
- Other examples can include various other types of information in the phase prediction that is output.
- other types of outputs of the detector 350 can include state information or other information used to generate audio output, visual output, and/or commands.
- the output can trigger an alert, an augmented visualization, identify a predicted current condition, identify a predicted future condition, command control of equipment, and/or result in other such data/commands being transmitted to a support system component, e.g., through surgical procedure support system 202 of FIG. 2.
- the technical solutions described herein can be applied to analyze video and image data captured by cameras that are not endoscopic (i.e., cameras external to the patient’s body) when performing open surgeries (i.e., not laparoscopic surgeries).
- the video and image data can be captured by cameras that are mounted on one or more personnel in the operating room (e.g., surgeon).
- the cameras can be mounted on surgical instruments, walls, or other locations in the operating room.
- the video can be images captured by other imaging modalities, such as ultrasound.
- the synthetic image generator 400 can be part of the data generator 315 of FIG. 3.
- the synthetic image generator 400 includes one or more encoders 404A, 404B, ..., 404N configured to be applied to one or more inputs 402A, 402B, ..., 402N to generate a shared feature space 406 of feature representations.
- the inputs 402A-402N can be user inputs that define one or more conditions for synthetic data generation.
- a generator 408 can generate synthetic data 409 based on the feature representations of the shared feature space 406.
- a discriminator 410 can compare real data 412 (e.g., a real image) and the synthetic data 409 (e.g., a synthetic image) to distinguish one or more differences between the real data 412 and the synthetic data 409.
- the discriminator 410 can output logits 414 that can be converted into probabilities indicating whether the synthetic data 409 is aligned to inputs 402A-402N (e.g., as conditional logits) and/or whether the synthetic data 409 is similar to the real data 412 (e.g., as unconditional logits) in terms of distribution.
- the synthetic image generator 400 is configured to apply adversarial training using adversarial loss information based on the one or more differences to train the one or more encoders 404A-404N, the generator 408, and the discriminator 410 to create the synthetic data 409 to more closely align with the real data 412.
- the discriminator 410 can use adversarial training in which the generator 408 and discriminator 410 compete to fool each other, which can result in realistic and diverse synthetic images.
- the synthetic data 409 can be stored in the data store 320 as synthetic data 317 for use by the machine learning training system 325. Annotations in the inputs can also be stored in the data store 320 to use as labels associated with the synthetic data during machine learning training.
- FIG. 5 depicts a block diagram of a synthetic image generator 500 according to one or more aspects.
- the synthetic image generator 500 is another example synthetic image generator that can be part of the data generator 315 of FIG. 3.
- Various types of guiding can be provided in the inputs 502A, 502B, 502C, 502D, ..., 502N, and each encoder 504A, 504B, 504C, 504D, ..., 504N can be trained to process particular types of input 502A-502N for mapping to a shared feature space 506.
- text descriptions can be used to describe medical conditions, instruments, anatomy, phases, and the like in input 502A.
- input 502B can be in the form of bounding boxes with key points and labels, sketches with labels in input 502C, segmentation regions with labels in input 502D, surgical phases in input 502N, and other such inputs 502.
- a noise generator 508, such as a Gaussian noise generator or other types of noise generation, can also be used to populate the shared feature space 506 as encoded by an encoder 505.
- An image or video generator 510 can generate a synthetic image 512 (or synthetic video) as trained to map features of the shared feature space 506 to image or video portions that are blended together.
- content can be modified in a base image and in other aspects style can be modified in a base image.
- FIG. 6 depicts a block diagram of a synthetic video generator 600 according to one or more aspects.
- the synthetic video generator 600 can be part of the data generator 315 of FIG. 3. Similar to the example of FIG. 4, the synthetic video generator 600 can include one or more encoders 604A, 604B, . . . , 604N configured to be applied to one or more inputs 602A, 602B, . . . , 602N to generate a shared feature space 606 of feature representations.
- the inputs 602A-602N can be user inputs that define one or more conditions for synthetic data generation.
- a generator 608 can generate synthetic data 609 based on the feature representations of the shared feature space 606.
- a discriminator 610 can compare real data 612 (e.g., a real video) and the synthetic data 609 (e.g., a synthetic video) to distinguish one or more differences between the real data 612 and the synthetic data 609.
- the discriminator 610 can output logits 614 that can be converted into probabilities indicating whether the synthetic data 609 is aligned to inputs 602A-602N (e.g., as conditional logits) and/or whether the synthetic data 609 is similar to the real data 612 (e.g., as unconditional logits) with respect to distribution.
- the generator 608 can be a temporal model generator. In addition to generating static frames, the synthetic video generator 600 can also generate short video clips conditioned to the input 602A- 602N.
- an encoder 604 can generate a feature representation into shared feature space 606 that contains the information for representing the expected generated video.
- the feature can be inputted to the temporal model generator which creates a synthetic video 614.
- the temporal model generator can be in the form of a convolutional neural network, transformers, long-short term memory (LSTM), convolutional LSTM, temporal convolutional network, etc.
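- As one hedged illustration of such a temporal model generator, the sketch below uses an LSTM over per-frame condition features followed by a convolutional frame decoder; this is only one of the options listed above (an LSTM), with hypothetical layer sizes and a fixed 64x64 frame resolution, and is not the specific architecture of the synthetic video generator 600.

```python
import torch
import torch.nn as nn

class TemporalVideoGenerator(nn.Module):
    """Turns a sequence of per-frame condition features into a short synthetic clip."""
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.frame_decoder = nn.Sequential(
            nn.Linear(hidden_dim, 128 * 8 * 8), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),     # 64x64
        )

    def forward(self, cond_features: torch.Tensor) -> torch.Tensor:
        # cond_features: (batch, time, feat_dim) drawn from the shared feature space
        hidden, _ = self.lstm(cond_features)             # temporal context per frame
        b, t, h = hidden.shape
        frames = self.frame_decoder(hidden.reshape(b * t, h))
        return frames.reshape(b, t, 3, 64, 64)           # (batch, time, C, H, W)

clip = TemporalVideoGenerator()(torch.randn(2, 8, 512))  # a batch of two 8-frame clips
```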
- the discriminator 610 aims at distinguishing between real and synthetic videos.
- Video clip synthesis can be conditioned to a temporal series of labels (e.g., semantic segmentation masks). This can result in producing time-consistent surgical videos to train temporal ML models.
- a video sequence can be generated from an input image and label set pairing.
- FIG. 7 depicts a block diagram of a synthetic image generator 700 according to one or more aspects.
- the synthetic image generator 700 is an example of a SuGAN generator architecture. Given a segmentation map 706 as input, an encoder 702 can embed the information into a latent representation γ.
- a decoder 704 can be composed of channel-level normalization blocks 712 and pixel-level normalization blocks 714 to enable image synthesis leveraging both latent representations as well as segmentation maps.
- SuGAN is a type of Generative Adversarial Network (GAN). In GANs, a generator and a discriminator compete with each other on antagonistic tasks, one to generate data that cannot be distinguished from real ones, and the other to tell synthetic and real data apart.
- StyleGAN represents a framework with the ability to produce realistic high-resolution images.
- StyleGAN can be configured to support multi-modal image generation, where the appearance, colors, and luminance of the objects in the image can be modified while leaving semantics unaltered.
- Other architectures besides GANs include variational autoencoders (VAE) and diffusion models, although these approaches may be unconditional.
- A conditional GAN (cGAN) can use a conditional approach where synthetic and real images are used by the discriminator along with their condition to learn a joint annotation-image mapping.
- In approaches such as SPatially Adaptive Denormalization (SPADE) and OASIS (adversarial semantic image synthesis), segmentation maps are encoded as normalization parameters at different levels in the network to serve as a base for feature translation.
- cGAN based models may produce images of limited quality. This may be intrinsic to model formulation, where models focus on translating visual features according to the annotation, rather than evaluating the image at a global scale.
- CollageGAN can employ pre-trained, class-specific StyleGAN models to improve the generation of finer details on the synthetic images.
- Other approaches can use a learnable encoder jointly with a frozen pre-trained StyleGAN to learn latent space mappings from segmentation maps.
- the quality of these models is bounded by the StyleGAN capability to fully cover the training distribution.
- the synthetic image generator 700 of FIG. 7 can be formed in consideration of multiple goals.
- Let x ∈ {0, ..., 255}^(W×H×3) be an RGB image with width W, height H, and 3 color channels,
- and let y ∈ {0, ..., C−1}^(W×H) be a pixel-wise segmentation map with C different semantic classes.
- Let G(·): y → x_s be a GAN that, given as input a segmentation map y, generates a synthetic image x_s ∼ X, where X is the distribution of real images.
- One goal is to design G( ) so that it can generate realistic multi-modal images conditioned to an input annotation y. This directly translates to the problem of preserving the image content while varying the style, where content refers to semantic features in the image, namely objects’ shape and location, while style relates to the appearance of the object such as colors, texture, and luminance.
- SuGAN is an example of an adversarially trained model that can produce novel, realistic, and diverse images conditioned to semantic segmentation maps.
- An encoder-decoder-discriminator structure is used, where the task of the encoder 702 is to extract the essential information from the input segmentation map 706 while the decoder 704 generates synthetic images 718 from the input segmentation maps 706 and the output of the encoder 702.
- a discriminator 800 of FIG. 8 determines whether generated synthetic images 718 are real or synthetic for adversarial training.
- the synthetic image generator 700 can include encoders 702 for conditional image generation.
- An encoder E(·) can be expressed according to equation 1 as:
- E(·): y → γ, (1)
- which projects segmentation map annotations y into a latent feature γ ∈ R^(M×512), where M is the number of feature vectors.
- the encoder can use a map y that is first processed through three consecutive residual blocks 708A, 708B, 708C, and the output of each block 708 is further refined by E2Style efficient heads 710A, 710B, 710C, which can include an average pooling layer followed by a dense layer, to obtain γ.
- γ is composed of three sets of latent vectors, each one produced by a different head and controlling a different set of features in the final synthetic image, namely coarse (γ_coarse), medium (γ_medium), and fine (γ_fine) features.
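- A rough PyTorch sketch of this encoder layout (residual blocks feeding pooled heads that emit coarse/medium/fine latent vectors) is given below; the channel counts, block internals, and number of latent vectors per head are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)

    def forward(self, x):
        return self.body(x) + self.skip(x)

class EfficientHead(nn.Module):
    """Average pooling followed by a dense layer, producing n latent vectors of size 512."""
    def __init__(self, in_ch: int, n_vectors: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, n_vectors * 512)
        self.n_vectors = n_vectors

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1)).view(-1, self.n_vectors, 512)

class SegmentationMapEncoder(nn.Module):
    """E(.): y -> gamma, with coarse/medium/fine latent vectors from three heads."""
    def __init__(self, num_classes: int = 13):
        super().__init__()
        self.block1 = ResidualBlock(num_classes, 64)
        self.block2 = ResidualBlock(64, 128)
        self.block3 = ResidualBlock(128, 256)
        self.head_coarse = EfficientHead(64, 2)
        self.head_medium = EfficientHead(128, 2)
        self.head_fine = EfficientHead(256, 2)

    def forward(self, y_onehot):
        f1 = self.block1(y_onehot)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        # gamma: (batch, M, 512) latent feature, here with M = 6 vectors
        return torch.cat([self.head_coarse(f1), self.head_medium(f2), self.head_fine(f3)], dim=1)

gamma = SegmentationMapEncoder()(torch.randn(2, 13, 256, 256))  # -> (2, 6, 512)
```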
- the decoder 704 of the synthetic image generator 700 can support image generation conditioned to segmentation maps 716.
- the decoder architecture D(·) is defined according to equation 2 as: D(·): (γ, y) → x_s, (2) which maps the latent representation γ and the segmentation map y to a synthetic image x_s.
- Decoder D can be built by sequentially nesting two blocks of normalization layers: the first can be a pixel-level normalization block 714, where a normalization parameter (mean and standard deviation) for each feature pixel is computed by processing the input segmentation map with 3x3 convolutional layers and then used to modulate sets of features at different depths in the model.
- the second block can be a channel-level normalization block 712 composed of two sets of 3x3 convolutional layers with weights that are modulated using two 512 latent vectors from the encoder 702 as input.
- each filter in the 3x3 convolution can be first modulated to have variance equal to 1 and then demodulated by multiplying it by one element in the latent vector.
- all layers inside a channel-level normalization block can have 512 channels. This can provide separability and manipulability of content and style features in the synthetic image using channel-level normalization, while allowing for precise matching of the generated image with the input segmentation map via pixel-level normalization layers; a simplified sketch of this nesting follows the decoder description below.
- segmentation maps 716A can be a time sequence of the segmentation map 706 at a first scale
- segmentation maps 716B can be a time sequence of the segmentation map 706 at a second scale
- segmentation maps 716C can be a time sequence of the segmentation map 706 at a third scale, where the second scale is less than the first scale and greater than the third scale (e.g., 4x4, 16x16, 256x256, etc.).
- Channel-level normalization block 712A can receive input from head 710A
- channel-level normalization block 712B can receive input from head 710B
- channel-level normalization block 712C can receive input from head 710C.
- Pixel-level normalization block 714A can operate on segmentation maps 716A
- pixel-level normalization block 714B can operate on segmentation maps 716B
- pixel-level normalization block 714C can operate on segmentation maps 716C.
- the scales are provided as examples and various resolutions can be used depending on the desired image or video resolution beyond those specifically noted in this example.
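- To make the nesting of pixel-level and channel-level normalization more concrete, the sketch below pairs a SPADE-style pixel-wise modulation block (per-pixel mean/std predicted from the segmentation map with 3x3 convolutions) with a StyleGAN2-style weight modulation/demodulation convolution driven by 512-dimensional latent vectors; it is a simplified illustration under those assumptions, not the exact SuGAN decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLevelNorm(nn.Module):
    """Pixel-wise modulation: per-pixel scale/shift predicted from the segmentation map."""
    def __init__(self, feat_ch: int, num_classes: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(num_classes, 128, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(128, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(128, feat_ch, 3, padding=1)
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)

    def forward(self, feats, segmap):
        segmap = F.interpolate(segmap, size=feats.shape[-2:], mode="nearest")
        s = self.shared(segmap)
        return self.norm(feats) * (1 + self.to_gamma(s)) + self.to_beta(s)

class ChannelLevelConv(nn.Module):
    """3x3 convolution whose weights are modulated per sample by a 512-dim latent vector."""
    def __init__(self, in_ch: int, out_ch: int, latent_dim: int = 512):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.affine = nn.Linear(latent_dim, in_ch)

    def forward(self, feats, latent):
        b, in_ch, h, w = feats.shape
        style = self.affine(latent).view(b, 1, in_ch, 1, 1)          # one element per input channel
        w_mod = self.weight.unsqueeze(0) * style                     # modulate filters
        demod = torch.rsqrt((w_mod ** 2).sum(dim=(2, 3, 4)) + 1e-8)  # rescale to unit variance
        w_mod = w_mod * demod.view(b, -1, 1, 1, 1)
        out = F.conv2d(feats.view(1, b * in_ch, h, w), w_mod.view(-1, in_ch, 3, 3),
                       padding=1, groups=b)                          # grouped conv = per-sample weights
        return out.view(b, -1, h, w)

# Nesting: a channel-level block followed by a pixel-level block, repeated per resolution.
feats = torch.randn(2, 512, 16, 16)
segmap = torch.randn(2, 13, 256, 256)     # one-hot-like segmentation map, 13 classes
latent = torch.randn(2, 512)              # one of the encoder's 512-dim latent vectors
feats = ChannelLevelConv(512, 512)(feats, latent)
feats = PixelLevelNorm(512, 13)(feats, segmap)
```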
- FIG. 8 depicts a block diagram of a discriminator 800 architecture according to one or more aspects.
- the discriminator 800 architecture is an example of a projection discriminator to support adversarial training.
- the discriminator 800 can include two feature extractors D_x and D_y (also referred to as discriminator modules), defined as D_x(·): x → d_x and D_y(·): y → d_y.
- each discriminator module can include several residual blocks in series: residual blocks 804A and 804B for D_x, and residual blocks 808A and 808B for D_y.
- the extracted features d_x and d_y can be used to determine whether x is real or synthetic and whether the image satisfies the segmentation condition or not.
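- Below is a hedged sketch of a projection-style conditional discriminator in this spirit: one feature extractor for the image and one for the segmentation condition, combined into an unconditional logit plus a conditional (projection) term; the stage depths and channel widths are assumptions, and the strided convolutional stages merely stand in for the residual blocks.

```python
import torch
import torch.nn as nn

def feature_extractor(in_ch: int, width: int = 64) -> nn.Sequential:
    """Strided convolutional stages (standing in for residual blocks) ending in a pooled feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(width * 2, width * 4, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class ProjectionDiscriminator(nn.Module):
    def __init__(self, img_ch: int = 3, num_classes: int = 13, width: int = 64):
        super().__init__()
        self.d_x = feature_extractor(img_ch, width)        # image branch -> d_x
        self.d_y = feature_extractor(num_classes, width)   # segmentation branch -> d_y
        self.unconditional = nn.Linear(width * 4, 1)

    def forward(self, image, segmap):
        dx = self.d_x(image)
        dy = self.d_y(segmap)
        # unconditional logit: is the image real?  projection term: does it match the condition?
        return self.unconditional(dx) + (dx * dy).sum(dim=1, keepdim=True)

logits = ProjectionDiscriminator()(torch.randn(2, 3, 256, 256), torch.randn(2, 13, 256, 256))
```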
- conditional adversarial loss can be the main objective function for training.
- the generator 700 and discriminator 800 can be trained on antagonistic losses defined as:
- L_Gen = S(−D(x_s)) and L_Disc = S(D(x_s)) + S(−D(x_r)),
- where x_s and x_r refer to synthetic and real images respectively, and S is the softplus function defined as S(t) = log(1 + e^t).
- An additional regularization term of the form L_R1 = (γ/2)·||W||^2 (10) can be applied, where W are the model weights. L_Gen and L_Disc can be computed and applied in an alternating fashion for the generator and the discriminator.
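- A compact sketch of these alternating softplus-based adversarial losses, plus a generic weight regularization term on the discriminator, is given below; it assumes a discriminator that returns a single logit per image and treats the regularization weight gamma as a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_logits_fake: torch.Tensor) -> torch.Tensor:
    # L_Gen = S(-D(x_s)): push the discriminator's logits on synthetic images up
    return F.softplus(-d_logits_fake).mean()

def discriminator_loss(d_logits_fake: torch.Tensor, d_logits_real: torch.Tensor,
                       weights: list[torch.Tensor], gamma: float = 1e-4) -> torch.Tensor:
    # L_Disc = S(D(x_s)) + S(-D(x_r)): real logits up, synthetic logits down
    adv = F.softplus(d_logits_fake).mean() + F.softplus(-d_logits_real).mean()
    # regularization term on the model weights (one reading of L_R1 in the text above)
    reg = gamma / 2 * sum((w ** 2).sum() for w in weights)
    return adv + reg

# Alternating updates: one step for the discriminator, then one for the generator.
fake_logits, real_logits = torch.randn(8, 1), torch.randn(8, 1)
d_loss = discriminator_loss(fake_logits, real_logits, [torch.randn(4, 4)])
g_loss = generator_loss(fake_logits)
```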
- pixel-level normalization blocks 714 can be embedded between channel -level normalization blocks 712 to inject semantic maps directly into the model at different depths.
- SuGAN can control content features via pixel-level normalization layers while supporting multi-modal image generation by manipulating latent vectors at an inference stage.
- channel-level normalization blocks 712 can support multi-modal image generation via latent space manipulation, giving the capability of generating images with different styles while maintaining unaltered the content defined by the input segmentation map 706.
- a style randomization procedure can be used.
- Example datasets can include CholecSeg8k (LC), Robotic Partial Nephrectomy (PN), and Robotic Prostatectomy (RP).
- LC is a public dataset of laparoscopic cholecystectomy surgeries focusing on the resection of the gallbladder. The dataset is composed of about 8,080 frames (17 videos) from the widely used Cholec80 dataset.
- LC provides pixel-wise segmentation annotations of about 13 classes, including background, ten anatomical classes, and two surgical instruments.
- PN is a dataset of robotic partial nephrectomy surgeries focusing on the resection of the kidney.
- the dataset is composed of about 52,205 frames from 137 videos.
- PN provides pixel-wise segmentation annotations of five anatomical classes and a background class which also includes surgical instruments.
- RP is a dataset of robotic prostatectomy surgeries for the resection of the prostate.
- the dataset is composed of about 194,393 frames from 1,801 videos.
- RP provides pixel-wise segmentation annotations of 10 anatomical classes and a background class which does not include any surgical instrument. Table 1 shows further dataset statistics.
- Performance measures of the model can be observed for two tasks. Images can be evaluated for quality and diversity of synthetic images using a variation of Frechet Inception Distance (FID). This metric is widely used for assessing image quality and diversity of synthetic images as it captures the similarity between synthetic and real data at a dataset level in terms of distribution. Notably, FID-infinity can be used as a less biased metric. Performance of segmentation models can be evaluated per each class c and image i using Intersection over Union, IoU_c^i = |y_c^i ∩ ŷ_c^i| / |y_c^i ∪ ŷ_c^i|, where y_c^i is the annotated segmentation map and ŷ_c^i is the segmentation map predicted by the model for class c on image i. Per-class IoU is computed by averaging across images, IoU_c = (1/N) Σ_i IoU_c^i, where N is the total number of images. In addition, mean Intersection over Union (mIoU) aggregates scores across classes and reports a single performance value, calculated as mIoU = (1/C) Σ_c IoU_c, where C is the total number of classes in the dataset.
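- The per-class IoU and mIoU described above can be computed roughly as in the following sketch, which assumes integer label maps of shape (N, H, W) and skips, for a given image, any class absent from both the prediction and the annotation.

```python
import numpy as np

def per_class_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU per class, averaged over the images in which the class appears (pred/target: (N, H, W))."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        per_image = []
        for p, t in zip(pred, target):
            inter = np.logical_and(p == c, t == c).sum()
            union = np.logical_or(p == c, t == c).sum()
            if union > 0:
                per_image.append(inter / union)
        if per_image:
            ious[c] = np.mean(per_image)
    return ious

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """mIoU: average of the per-class IoU values."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))

pred = np.random.randint(0, 13, (4, 64, 64))
target = np.random.randint(0, 13, (4, 64, 64))
print(mean_iou(pred, target, num_classes=13))
```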
- an optimization algorithm such as Adam (e.g., using adaptive moment estimation) can be used as an optimizer with a learning rate of 0.0025 and training for up to 50 epochs, for example.
- the batch size can be 16.
- Experiments can be performed with images at 256x256 resolution, but other resolutions can be supported (e.g., 512x512).
- vertical and horizontal flips as well as 90° rotations can be used in order to not introduce black-edge artifacts in the synthetic images.
- Non-leaking augmentations can be used to stabilize training.
- a class-balanced sampler can be used.
- the indexes can group latent space.
- the SPADE and OASIS examples can be trained with default settings. SPADE can be trained along with a VAE to support multimodal generation, following a variant of its original formulation.
- the number of components can be chosen as a trade-off between style randomization and reduction of artifacts arising from content modification.
- the models can be trained for 25 epochs using AdamW optimizer, with a OneCycle scheduler that reaches the maximum learning rate of 0.0005 after 1.25 epochs, as an example. Inference can be run using models at the last epoch for all experiments.
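- In PyTorch terms, the AdamW plus OneCycle setup mentioned above could look roughly like the sketch below; the model and steps_per_epoch values are hypothetical placeholders, and pct_start = 1.25 / 25 simply places the learning-rate peak after 1.25 of the 25 epochs.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder for a segmentation model
steps_per_epoch, epochs = 375, 25              # e.g., 12,000 sampled images / batch size 32

optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.0005,                             # maximum learning rate
    total_steps=steps_per_epoch * epochs,
    pct_start=1.25 / 25,                       # reach max_lr after 1.25 epochs
)

for _ in range(steps_per_epoch * epochs):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).sum()    # dummy forward pass and loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```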
- all trainings can use the same number of images in each batch, epoch, and full training.
- the batch size can be set to 32, with all images being real when training with only-real images, and half of the images being real and the other half synthetic, when using both types of images.
- the same augmenter can be used for all experiments, with images being resized to 256x256 resolution, maintaining the aspect ratio, performing random rotations between -20 and 20 degrees, applying motion blur 25% of the time, and modifying the hue and saturation by a random value between -10 and 10.
- Real images can be scaled randomly between 0.7 and 2.0 times, while synthetic images are not scaled.
- for LC, a third of the training dataset can be sampled in each epoch using a random sampler, while for PN and RP, 12,000 images can be sampled in each epoch with a class-balanced sampler, as an example.
- FIG. 9 depicts an example of synthetic images generated according to one or more aspects.
- the example of FIG. 9 illustrates the effects of using a pixel-level normalization (P-LN) 902 with an associated segmentation map 904 for SuGAN 332 to generate synthetic images trained over multiple epochs.
- the checkmarks indicate where P-LN 902 was used during training at 10 epochs 906 and 20 epochs 908 for a partial nephrectomy dataset.
- the partial nephrectomy dataset can delineate regions of the segmentation map 904 to highlight areas such as kidney, liver, renal vein, renal artery, and ignore and/or background.
- FIGS. 10A and 10B depict examples of synthetic image comparisons according to one or more aspects. Evaluation and comparison of the quality and diversity of the synthetic images generated by the SuGAN (e.g., SuGAN 332 of FIG. 3) with the ones generated by SPADE and OASIS can be performed. Once trained, two sets of synthetic images can be generated for each model and dataset, namely from-test (FIG. 10A) and from-train (FIG. 10B). A from-test set can be generated using segmentation maps not used for training of the SuGAN. A from-train set can be produced using segmentation maps from the training set as input and randomizing the style 10 times, thus creating 10 times larger training sets, for example.
- FIG. 10A depicts real images 1002, segmentation maps 1004, SPADE synthetic images 1006, OASIS synthetic images 1008, and SuGAN synthetic images 1010 generated using from-test maps based on LC, PN, and RP data sets.
- FIG. 10B depicts real images 1052, segmentation maps 1054, SPADE synthetic images 1056, OASIS synthetic images 1058, and SuGAN synthetic images 1060 generated using from-train maps based on LC, PN, and RP data sets.
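- A from-train set of the kind described above could be produced with a loop of the following form; the generator(seg_map, z) interface and the 512-dimensional style latent are hypothetical placeholders, not an interface defined by the source.

```python
import torch

def generate_from_train_set(generator, train_maps, styles_per_map=10, latent_dim=512):
    """Builds a "from-train" synthetic set: reuse the training segmentation maps and
    randomize the style latent 10 times per map, yielding a 10x larger training set.
    generator(seg_map, z) is a hypothetical interface for the trained model."""
    synthetic = []
    with torch.no_grad():
        for seg_map in train_maps:                  # seg_map: (K, H, W) one-hot tensor
            for _ in range(styles_per_map):
                z = torch.randn(1, latent_dim)      # random style latent (dimension assumed)
                synthetic.append(generator(seg_map.unsqueeze(0), z))
    return synthetic
```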
- both SuGAN and OASIS models manage to produce original multi-modal images, while SPADE appears constrained to mono-modal generation.
- SPADE and OASIS appeared to obtain the worst FID, 137.99 and 115.08, respectively, while SuGAN can achieve an FID of 107.70, for instance.
- This may indicate a performance advantage of SuGAN in terms of image quality and diversity, as can be appreciated in FIG. 10A (1st and 2nd rows), where SuGAN is the only model that may generate diverse images on LC when using from-test input maps.
- OASIS may obtain a favorable FID, indicating a better generalization capability compared to SPADE.
- OASIS may generate cartoon-like instance blending within images (e.g., FIG. 10A, 3rd and 4th rows). This is not the case with SuGAN, with a discriminator that appears to discard such images from the synthetic image distribution. SPADE may obtain the worst FID in the same dataset and show mono-modal images, and thus may result in reduced quality generation capabilities compared to SuGAN and OASIS. Further, on an RP dataset, SuGAN can lead with a score of 8.03, followed by OASIS (11.87) and SPADE (22.35). In aspects, possibly due to the large size of the dataset, some models can show the ability to produce images dissimilar to the training images, although with different levels of diversity, and still reflecting the issues described above for OASIS and SPADE.
- OASIS may generate diverse images in both PN and RP datasets; however, the images can contain cartoon-like artifacts, as may be seen in OASIS images 1058 in FIG. 10B.
- a potential cause of the artifacts can be related to the local approach of the OASIS discriminator, which may not allow for an evaluation of an image at a global scale, thus accepting as real images those that appear as an unnatural mosaic of different image parts.
- OASIS can also show an inability to generate diverse samples on LC, which may be due to the small size of the dataset.
- SuGAN can produce diverse, realistic, and artifact-free images in all datasets, including the small LC at SuGAN images 1060.
- Training segmentation models with synthetic images, together with real images, can boost segmentation performance when evaluated on test real images.
- Two different approaches can be used for combining synthetic images (S) with real images (R), namely S>R (synthetic > real) and SR (synthetic and real).
- An example experiment can be performed on the PN dataset. In the middle section of Table 3, results indicate that synthetic data can successfully be used to improve segmentation performance with either approach.
- FIG. 11 depicts example results of segmentation model performance according to one or more aspects in charts 1100, 1102, and 1104. Across models and datasets, a large majority of classes improved, with a small number of exceptions where performance slightly decreased.
- the results may indicate a negative correlation between the amount of real data per class and the average improvement when using synthetic images, particularly in RP. This suggests that synthetic data can effectively be used to boost segmentation training in under-represented or unbalanced datasets.
- SR experiments can use synthetic images generated by OASIS or SPADE. This experiment was performed by training DeepLab and Swin Small on the LC, PN, and RP datasets. Note that these two architectures can cover both a CNN and a transformer-based model. The results are in Table 4.
- When training Swin Small, the results can follow the same behavior, where SuGAN outperforms SPADE and OASIS on LC and PN, while showing more limited, although positive, performance on RP. In a final analysis, it appears that the images produced by SuGAN have more impact when training segmentation models on datasets of small (LC) and medium (PN) sizes, while frames from SPADE and OASIS have a more significant effect on large datasets (RP).
- As a further example, it may be determined whether SuGAN can produce images at double resolution (512x512). To achieve this, one more set of pixel-level and channel-level normalization blocks can be added on top of the architecture of decoder 704 of FIG. 7.
- the number of latent vectors can be increased to 16 to support one more up-sampling step.
- the model can be trained on PN and, similarly to previous experiments, segmentation results (DeepLab) can be compared when training in R and SR configurations. It appears that using synthetic images generated from SuGAN can enhance segmentation results (Table 5).
- FIG. 12 depicts an example of synthetic video generation 1200 applied to a simulation video according to one or more aspects.
- the example of FIG. 12 illustrates that an artificial intelligence based quality improvement 1204 can enhance simulation videos 1202 of a simulation environment to appear more realistic using synthetic video generation, such as SuGAN 332 of FIG. 3, to generate a synthetic video 1206 from a simulation.
- FIG. 13 depicts a flowchart of a method of synthetic data generation according to one or more aspects.
- the method 1300 can be executed by a system, such as system 300 of FIG. 3 as a computer-implemented method.
- the method 1300 can be implemented by the machine learning training system 325 and/or data generator 315 performed by a processing system, such as processing system 1400 of FIG. 14.
- one or more inputs can be received each defining one or more conditions for synthetic data generation.
- one or more encoders can be applied to the one or more inputs to generate a shared feature space of feature representations.
- a generator can generate synthetic data based on the feature representations of the shared feature space.
- a discriminator can compare real data and the synthetic data to distinguish one or more differences between the real data and the synthetic data.
- adversarial training using adversarial loss information can be applied based on the one or more differences to train the one or more encoders, the generator, and the discriminator, modifying one or more components to more closely align the synthetic data with the one or more conditions and to bring the distribution of the synthetic data closer to that of the real data.
- Modifications can include adjusting weights within the encoders and/or generators as the one or more components, for example.
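- One hedged sketch of a single adversarial training step consistent with the flow of FIG. 13 is shown below; the module interfaces, the use of a single optimizer covering encoder and generator parameters, and the non-saturating GAN loss are assumptions rather than the claimed method.

```python
import torch
import torch.nn.functional as F

def adversarial_step(encoders, generator, discriminator, conditions, real_images,
                     opt_g, opt_d):
    """Single training step following the flow of FIG. 13. Module interfaces are
    hypothetical; opt_g is assumed to cover both encoder and generator parameters
    so all three components are trained."""
    # Apply the encoders to the conditioning inputs to build the shared feature space.
    features = torch.cat([enc(c) for enc, c in zip(encoders, conditions)], dim=1)

    # Generate synthetic data from the shared feature representations.
    fake_images = generator(features)

    # Discriminator update: distinguish real data from synthetic data.
    d_real = discriminator(real_images, features.detach())
    d_fake = discriminator(fake_images.detach(), features.detach())
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Encoder/generator update: align synthetic data with the conditions and move its
    # distribution closer to that of the real data.
    g_loss = F.softplus(-discriminator(fake_images, features)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```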
- the synthetic data can include a synthetic surgical image.
- the synthetic data can include synthetic surgical video.
- the generator can be a temporal model generator that creates the synthetic surgical video.
- the one or more inputs can include a text description of one or more of: anatomy, a surgical instrument, and a surgical phase.
- the one or more inputs can include one or more bounding boxes and key points with labels.
- the one or more inputs can include one or more sketches with labels.
- the one or more inputs can include one or more segmentation regions with labels.
- the one or more inputs can include one or more surgical phases.
- the one or more inputs can include a temporal series of labels.
- the one or more inputs can include an input image paired with one or more labels.
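- For illustration, the kinds of conditioning inputs listed above could be gathered in a structure such as the following; the field names and types are assumptions introduced for this sketch, not an interface defined by the source.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SyntheticDataConditions:
    """Illustrative container for conditioning inputs; all fields are optional so any
    subset of conditions can be supplied."""
    text_description: Optional[str] = None              # e.g., anatomy, instrument, surgical phase
    bounding_boxes: List[Tuple[Tuple[int, int, int, int], str]] = field(default_factory=list)
    key_points: List[Tuple[Tuple[int, int], str]] = field(default_factory=list)
    sketches: List[Tuple[np.ndarray, str]] = field(default_factory=list)      # sketch image, label
    segmentation_regions: Optional[np.ndarray] = None    # labeled segmentation map
    surgical_phases: List[str] = field(default_factory=list)
    label_time_series: List[Tuple[float, str]] = field(default_factory=list)  # (timestamp, label)
    paired_image: Optional[np.ndarray] = None            # input image paired with labels
```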
- the synthetic data can be added to a data store, and one or more machine learning models can be trained using the synthetic data from the data store.
- the synthetic data can include one or more annotations.
- the feature representations can include a condition mask, and the generator can be configured to preserve content and modify a style based on a user-defined preference.
- the user-defined preference can define one or more of a degree of blood present, a degree of fatty tissue present, or other tissue variation.
- the generator can be configured to preserve a style and modify content in the synthetic data.
- the synthetic data can be generated based on an image or video of a simulation environment.
- one or more of the feature representations of the shared feature space can include random variations.
- the random variations can include content randomization.
- the random variations can include style randomization.
- a system can include a data store including video data associated with a surgical procedure.
- the system can also include a machine learning training system configured to train a generative adversarial network model to generate synthetic surgical data, combine the synthetic surgical data with real surgical data to form an enhanced training data set in the data store, and train one or more machine learning models using the enhanced training data set.
- the generative adversarial network model can include an encoder that projects a plurality of segmentation maps into latent features and a decoder including sequentially nested blocks of normalization layers including at least one pixel-level normalization block and at least one channel-level normalization block.
- a discriminator can be configured to perform adversarial training using a first feature extractor of image data and a second feature extractor of the segmentation maps.
- a computer program product can include a memory device having computer executable instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a plurality of operations.
- the operations can include applying one or more encoders to one or more inputs associated with surgical data to generate a shared feature space of feature representations and generating, by a generator, synthetic data based on the feature representations of the shared feature space, where the generator includes a decoder with sequentially nested blocks of normalization layers including at least one pixel-level normalization block and at least one channel-level normalization block.
- the at least one pixel-level normalization block can compute a normalization parameter for each feature pixel based on an input segmentation map.
- the at least one channel-level normalization block can include at least two convolutional layers with weights modulated using a plurality of vectors from the one or more encoders.
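- As a sketch only, a pixel-level normalization block and a channel-level block of the kind described above could look as follows in PyTorch; the layer sizes, the SPADE-style formulation of the pixel-level block, and the simplified scale-only approximation of weight modulation are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLevelNorm(nn.Module):
    """Computes per-pixel normalization parameters from the input segmentation map
    (a SPADE-style spatially adaptive normalization); layer sizes are illustrative."""
    def __init__(self, channels, num_classes, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Conv2d(num_classes, hidden, kernel_size=3, padding=1)
        self.gamma = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x, seg_map):
        seg_map = F.interpolate(seg_map, size=x.shape[-2:], mode="nearest")
        h = F.relu(self.shared(seg_map))
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)


class ChannelLevelBlock(nn.Module):
    """Two convolutions whose effect is modulated per sample by a style vector from
    the encoders; a simplified, scale-only stand-in for full weight modulation."""
    def __init__(self, channels, style_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.affine1 = nn.Linear(style_dim, channels)
        self.affine2 = nn.Linear(style_dim, channels)

    def forward(self, x, w):
        # Scale input channels by the style-derived factors before each convolution.
        x = self.conv1(x * self.affine1(w).unsqueeze(-1).unsqueeze(-1))
        x = F.leaky_relu(x, 0.2)
        x = self.conv2(x * self.affine2(w).unsqueeze(-1).unsqueeze(-1))
        return F.leaky_relu(x, 0.2)
```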
- Synthetic data generation can use visual data for generating conditional synthetic images that include both surgical instruments and anatomical structures without the need for a 3D simulator or real anatomical background images.
- FIG. 14 depicts a block diagram of a processing system 1400 for implementing the techniques described herein.
- processing system 1400 has one or more central processing units (“processors” or “processing resources” or “processing devices”) 1421a, 1421b, 1421c, etc. (collectively or generically referred to as processor(s) 1421 and/or as processing device(s)).
- processor(s) 1421 can include a reduced instruction set computer (RISC) microprocessor.
- Processors 1421 are coupled to system memory (e.g., random access memory (RAM) 1424) and various other components via a system bus 1433.
- Read-only memory (ROM) coupled to system bus 1433 may include a basic input/output system (BIOS), which controls certain basic functions of processing system 1400.
- I/O adapter 1427 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 1423 and/or a storage device 1425 or any other similar component.
- I/O adapter 1427, hard disk 1423, and storage device 1425 are collectively referred to herein as mass storage 1434.
- Operating system 1440 for execution on processing system 1400 may be stored in mass storage 1434.
- the network adapter 1426 interconnects system bus 1433 with an outside network 1436 enabling processing system 1400 to communicate with other such systems.
- a display 1435 (e.g., a display monitor) is connected to system bus 1433 by display adapter 1432, which may include a graphics adapter to improve the performance of graphics intensive applications and a video controller.
- adapters 1426, 1427, and/or 1432 may be connected to one or more I/O busses that are connected to system bus 1433 via an intermediate bus bridge (not shown).
- Suitable I/O buses for connecting peripheral devices, such as hard disk controllers, network adapters, and graphics adapters, typically include common protocols, such as the Peripheral Component Interconnect (PCI).
- Additional input/output devices are shown as connected to system bus 1433 via user interface adapter 1428 and display adapter 1432.
- a keyboard 1429, mouse 1430, and speaker 1431 may be interconnected to system bus 1433 via user interface adapter 1428, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit.
- processing system 1400 includes a graphics processing unit 1437.
- Graphics processing unit 1437 is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display.
- Graphics processing unit 1437 is very efficient at manipulating computer graphics and image processing, and has a highly parallel structure that makes it more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.
- processing system 1400 includes processing capability in the form of processors 1421, storage capability including system memory (e.g., RAM 1424), and mass storage 1434, input means, such as keyboard 1429 and mouse 1430, and output capability including speaker 1431 and display 1435.
- a portion of system memory (e.g., RAM 1424) and mass storage 1434 collectively store the operating system 1440 to coordinate the functions of the various components shown in processing system 1400.
- FIG. 14 is not intended to indicate that the computer system 1400 is to include all of the components shown in FIG. 14. Rather, the computer system 1400 can include any appropriate fewer or additional components not illustrated in FIG. 14 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the aspects described herein with respect to computer system 1400 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application-specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various aspects.
- the present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration.
- the computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non- exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device, such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer-readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer- readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
- Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, a high-level language such as Python, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer-readable program instructions may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instruction by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer-readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the blocks may occur out of the order noted in the Figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- exemplary is used herein to mean “serving as an example, instance or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.
- the terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc.
- the term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc.
- the term “connection” may include both an indirect “connection” and a direct “connection.”
- Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium, such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer).
- instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), graphics processing units (GPUs), microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
- the term “processor” may refer to any of the foregoing structures or any other physical structure suitable for implementation of the described techniques. Also, the techniques could be fully implemented in one or more circuits or logic elements.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23817994.9A EP4627552A1 (en) | 2022-12-02 | 2023-12-01 | Synthetic data generation |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263429786P | 2022-12-02 | 2022-12-02 | |
| US63/429,786 | 2022-12-02 | ||
| US202363462733P | 2023-04-28 | 2023-04-28 | |
| US63/462,733 | 2023-04-28 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024115777A1 true WO2024115777A1 (en) | 2024-06-06 |
Family
ID=89119601
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2023/084012 Ceased WO2024115777A1 (en) | 2022-12-02 | 2023-12-01 | Synthetic data generation |
Country Status (2)
| Country | Link |
|---|---|
| EP (1) | EP4627552A1 (en) |
| WO (1) | WO2024115777A1 (en) |
- 2023
- 2023-12-01 WO PCT/EP2023/084012 patent/WO2024115777A1/en not_active Ceased
- 2023-12-01 EP EP23817994.9A patent/EP4627552A1/en active Pending
Non-Patent Citations (3)
| Title |
|---|
| ALDAUSARI NUHA ET AL: "Video Generative Adversarial Networks: A Review", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, US, vol. 55, no. 2, 18 January 2022 (2022-01-18), pages 1 - 25, XP058675861, ISSN: 0360-0300, DOI: 10.1145/3487891 * |
| HAN LIGONG ET AL: "Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 18 June 2022 (2022-06-18), pages 3605 - 3615, XP034194455, DOI: 10.1109/CVPR52688.2022.00360 * |
| MARZULLO ALDO ET AL: "Towards realistic laparoscopic image generation using image-domain translation", COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, vol. 200, 14 November 2020 (2020-11-14), NL, pages 105834, XP093118920, ISSN: 0169-2607, DOI: 10.1016/j.cmpb.2020.105834 [retrieved on 20240112] * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119989420A (en) * | 2025-04-16 | 2025-05-13 | 北京航空航天大学 | Data privacy protection method and system |
| CN119989420B (en) * | 2025-04-16 | 2025-10-03 | 北京航空航天大学 | Data privacy protection method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4627552A1 (en) | 2025-10-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240156547A1 (en) | Generating augmented visualizations of surgical sites using semantic surgical representations | |
| US20240161497A1 (en) | Detection of surgical states and instruments | |
| EP4309142B1 (en) | Adaptive visualization of contextual targets in surgical video | |
| US20250143806A1 (en) | Detecting and distinguishing critical structures in surgical procedures using machine learning | |
| US20240206989A1 (en) | Detection of surgical phases and instruments | |
| EP4616332A1 (en) | Action segmentation with shared-private representation of multiple data sources | |
| EP4619949A1 (en) | Spatio-temporal network for video semantic segmentation in surgical videos | |
| WO2024115777A1 (en) | Synthetic data generation | |
| US20240252263A1 (en) | Pose estimation for surgical instruments | |
| CN118216156A (en) | Feature-based surgical video compression | |
| US20230326207A1 (en) | Cascade stage boundary awareness networks for surgical workflow analysis | |
| US20250014717A1 (en) | Removing redundant data from catalogue of surgical video | |
| EP4355247B1 (en) | Joint identification and pose estimation of surgical instruments | |
| WO2025252636A1 (en) | Multi-task learning for organ surface and landmark prediction for rigid and deformable registration in augmented reality pipelines | |
| WO2024224221A1 (en) | Intra-operative spatio-temporal prediction of critical structures | |
| US20250371858A1 (en) | Generating spatial-temporal features for video processing applications | |
| WO2025233489A1 (en) | Pre-trained diffusion model for downstream medical vision tasks | |
| WO2025186384A1 (en) | Hierarchical object detection in surgical images | |
| WO2024105054A1 (en) | Hierarchical segmentation of surgical scenes | |
| WO2025186372A1 (en) | Spatial-temporal neural architecture search for fast surgical segmentation | |
| WO2025252777A1 (en) | Generic encoder for text and images | |
| WO2024013030A1 (en) | User interface for structures detected in surgical procedures | |
| WO2024110547A1 (en) | Video analysis dashboard for case review | |
| WO2024189115A1 (en) | Markov transition matrices for identifying deviation points for surgical procedures | |
| WO2025078368A1 (en) | Procedure agnostic architecture for surgical analytics |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23817994; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023817994; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023817994; Country of ref document: EP; Effective date: 20250702 |
| | WWP | Wipo information: published in national office | Ref document number: 2023817994; Country of ref document: EP |