WO2025129013A1 - Machine learning based medical procedure identification and segmentation - Google Patents
- Publication number
- WO2025129013A1 (PCT/US2024/060054)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- prediction
- processors
- surgical
- procedure
- medical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/40—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mechanical, radiation or invasive therapies, e.g. surgery, laser therapy, dialysis or acupuncture
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/10—Computer-aided planning, simulation or modelling of surgical operations
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/20—Surgical navigation systems; Devices for tracking or guiding surgical instruments, e.g. for frameless stereotaxis
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/25—User interfaces for surgical systems
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/30—Surgical robots
- A61B34/37—Leader-follower robots
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/40—ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/60—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
- G16H40/63—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for local operation
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/10—Computer-aided planning, simulation or modelling of surgical operations
- A61B2034/101—Computer-aided simulation of surgical operations
- A61B2034/105—Modelling of the patient, e.g. for ligaments or bones
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/10—Computer-aided planning, simulation or modelling of surgical operations
- A61B2034/107—Visualisation of planned trajectories or target regions
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/20—Surgical navigation systems; Devices for tracking or guiding surgical instruments, e.g. for frameless stereotaxis
- A61B2034/2046—Tracking techniques
- A61B2034/2051—Electromagnetic tracking systems
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/25—User interfaces for surgical systems
- A61B2034/252—User interfaces for surgical systems indicating steps of a surgical procedure
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B34/00—Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
- A61B34/25—User interfaces for surgical systems
- A61B2034/254—User interfaces for surgical systems being adapted depending on the stage of the surgical procedure
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B90/00—Instruments, implements or accessories specially adapted for surgery or diagnosis and not covered by any of the groups A61B1/00 - A61B50/00, e.g. for luxation treatment or for protecting wound edges
- A61B90/36—Image-producing devices or illumination devices not otherwise provided for
- A61B2090/364—Correlation of different images or relation of image positions in respect to the body
- A61B2090/365—Correlation of different images or relation of image positions in respect to the body augmented reality, i.e. correlating a live optical image with another image
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- This disclosure is generally directed to technical solutions for identifying and segmenting medical procedures into surgical segments using a combination of machine learning (ML) and clinical rule-based mapping.
- This technology can use multi-modal data streams (e.g., video, kinematics, events) as inputs into different ML models trained to make specific detections or predictions of features reflected in the data stream, such as anatomical parts (e.g., a patient’s organs or tissues), medical instruments used in the procedure, or objects involved.
- ML model predictions can be combined according to clinical workflow rules or mappings based on medical ontologies to determine specific locations in the data stream where the surgical activities performed start and stop.
- this technology can facilitate explaining which of the ML models (anatomy, instrument, object, action, or any other) contributed most towards the correct or incorrect classification of the specific medical (e.g., surgical) activity or temporal boundary between such activities.
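- A minimal sketch of the flow described above (names, types and structure are assumptions for illustration, not the patent's implementation): each frame of the stream passes through task-specific models, their outputs go through a clinical mapping, and the contributing predictions are kept alongside the result so a classification can later be traced back to the model that drove it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FrameResult:
    activity: Optional[str]   # clinically named activity, or None if no rule matched
    anatomy: str              # contributing predictions retained for explainability
    instrument: str
    obj: str


def identify_activity(frame,
                      models: dict,
                      mapping: Callable[[str, str, str], Optional[str]]) -> FrameResult:
    """Run the task-specific models on one frame and combine their outputs
    through a clinical mapping function."""
    anatomy = models["anatomy"](frame)
    instrument = models["instrument"](frame)
    obj = models["object"](frame)
    return FrameResult(mapping(anatomy, instrument, obj), anatomy, instrument, obj)
```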
- the system can include one or more processors, coupled with memory.
- the one or more processors can be configured to receive a data stream that captures a procedure performed with a robotic medical system.
- the one or more processors can be configured to input the data stream into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure.
- the one or more processors can be configured to apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
- the system can include the one or more processors configured to receive the data stream comprising data in a plurality of modalities.
- the one or more processors can be configured to receive the data stream comprising at least two of a video stream, a kinematics stream or event stream.
- the one or more processors can be configured to apply the mapping function to identify the surgical activity from a hierarchical ontology of entities with increasing granularity.
- the entities of the hierarchical ontology can comprise a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
- the one or more processors can be configured to provide, for display, a graphical representation of values for each entity in the hierarchical ontology.
- the one or more processors can be configured to identify the plurality of models comprising an anatomy presence recognition model, an instrument presence model, and an object model.
- the one or more processors can be configured to use the anatomy presence recognition model, the instrument presence model, and the object model to generate the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object.
- the one or more processors can be configured to identify the plurality of models comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
- the one or more processors can be configured to determine a second one or more confidence scores associated with at least one of the first prediction, the second prediction, or the third prediction that is less than or equal to a second threshold.
- the one or more processors can be configured to generate the prompt with an indication of the at least one of the first prediction, the second prediction, or the third prediction.
- the one or more processors can be configured to receive input that indicates at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object is erroneous.
- the one or more processors can be configured to update, responsive to the input, at least one model of the plurality of models related to the at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object that is erroneous.
- the one or more processors can be configured to determine a metric indicative of performance of the surgical activity during the temporal boundary.
- the one or more processors can be configured to compare the metric with a historical benchmark established for the surgical activity and provide a second indication of performance based on the comparison.
- the one or more processors can be configured to provide the indication for overlay in the graphical user interface during performance of the procedure.
- the one or more processors can be configured to provide the indication for overlay in the graphical user interface subsequent to performance of the procedure.
- the method can include receiving, by one or more processors coupled with memory, a data stream that captures a procedure performed with a robotic medical system.
- the method can include identifying, by the one or more processors, a plurality of models trained with machine learning to make predictions related to an anatomy, an instrument, and an object associated with the procedure.
- the method can include using, by the one or more processors, the plurality of models to generate a first prediction of the anatomy, a second prediction of the instrument, and a third prediction of the object based on the data stream.
- the method can include applying, by the one or more processors, a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
- the method can include displaying, by the one or more processors alongside at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
- the method can include receiving, by the one or more processors, the data stream comprising at least two of a video stream, a kinematics stream or event stream.
- An aspect of the technical solutions is directed to a non-transitory computer-readable medium storing processor-executable instructions that, when executed by one or more processors, cause the one or more processors to identify a data file that captures a procedure performed with a robotic medical system.
- the instructions, when executed by the one or more processors, can cause the one or more processors to input the data file into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure.
- the instructions, when executed by the one or more processors, can cause the one or more processors to apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
- the instructions, when executed by the one or more processors, can cause the one or more processors to present an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
- the data file can be generated from a video stream of the procedure, a kinematics stream of the procedure, and an event stream of the procedure.
- FIG. 1 depicts an example system to implement machine learning based medical procedure identification and segmentation.
- FIG. 2 illustrates an example configuration for identifying activities and temporal boundaries of the activities using a combination of ML models and rules of a mapping function.
- FIG. 4 illustrates an example flow diagram of a method for implementing machine learning based medical procedure identification and segmentation.
- FIG. 5 illustrates an example of a surgical system, in accordance with some aspects of the technical solutions.
- FIG. 6 illustrates an example block diagram of an example computer system, in accordance with some aspects of the technical solutions.
- the technical solutions of this disclosure can be applicable to other medical treatments, sessions, environments or activities, as well as non-medical activities where object based procedure identification and segmentation is desired.
- technical solutions can be applied in any environment or application in which series or chains of activities of a process are performed, using ML modeling and rules to recognize, from one or more data streams, individual activities of the process and the temporal boundaries between such individual activities.
- the technical solutions described herein facilitate identifying segments corresponding to individual activities of a procedure, such as a medical procedure (e.g., a surgery) using a combination of machine learning and clinical rule-based mappings.
- This technology can detect a set of key features of an activity to infer, using a set of rules and ontologies describing the procedures, particular steps (e.g., clinical steps) and the intent of the person (e.g., a surgeon) by analyzing the data stream (e.g., a surgical data or video stream).
- the technical solutions can use ML models to detect any combination of a surgical patient’s specific anatomy involved in the procedure, medical tools or instruments used, objects encountered during the procedure or actions performed by a person (e.g., the surgeon) to infer the type of surgical activity performed as well as the temporal boundary of when such surgical activity starts and ends, based on the data stream (e.g., video or sensor data) input into the ML models.
- Automated recognition of individual activities can be challenging. For instance, detection or recognition of individual activities via ML models trained with labels can fail to identify temporal boundaries between the individual activities in the process when such activities are defined by a person’s (e.g., surgeon’s) intent. As a result, ML models trained to recognize boundary definitions between individual surgical tasks may lack context regarding the range of time within which to define the activity. Since decisions by deep learning models may be hard to trace or explain, it can be challenging to understand the root causes of any model errors that may occur. As mislabeling of segments can occur in ways that are not clinically expected, this can lead to erroneous temporal boundary predictions, which can be better understood with an improved understanding of the clinical context.
- the technical solutions of the present disclosure overcome these challenges by providing a series of deep learning models trained on specific surgery related tasks to recognize surgical activities (e.g., segments) and temporal boundaries between them based on ground truth labels (e.g., detected and recognized objects, tasks, anatomies or actions).
- This technology can incorporate clinical knowledge to define rules to be used on the outputs of the models to infer, detect, identify or mark the segments of interest within the surgical video.
- the technical solutions can apply a specific definition of the start and stop times of a surgical segment and recognize each of the components of these definitions with individual models trained for the particular actions.
- the technical solutions can use a variety of task-specific models to provide the context for the temporal boundary recognition.
- the technical solutions can include a machine learning (ML) model trained to identify or detect a particular anatomy of a patient (e.g., a part of a patient’s body) during a surgery.
- the technical solutions can include an ML model trained to identify or detect particular medical tools used in the surgery, as well as an ML model trained to detect objects, such as a mesh, a needle or an ultrasound tool.
- the technical solutions can include a tool-tissue interaction model to determine tool-to-tissue interaction (e.g., a touch or a cut) or an energy use model or model for other system events (e.g., clip application or staple fire).
- the combination of one or more such models need not be limited to recognizing any single anatomy, action, object, or instrument, or set thereof; rather, the models can be combined to recognize clinically relevant surgical actions or segments of surgeries within the context of surgical ontologies.
- Each of the plurality of individual ML models can include a specific machine learning neural network that can be trained offline using clinical data and human expert annotated labels.
- the trained models can take the procedure videos/system data as input.
- the models can output the predicted labels that can be converted into one or more recognized anatomies (e.g., patient body parts), surgical instruments used in relation to the anatomies, object encountered or involved in the surgery, actions taken in the course of surgery, or any other types of clinical predictions.
- an anatomy presence recognition model can involve or work with spatial and temporal correlations to identify or detect representative semantics of the surgical events and objects using an attention mechanism.
- An attention mechanism of ML models can include a selective weighting technology that applies or assigns different weights to the spatial and temporal segments or chunks of data. The weights can help emphasize different pieces of information, such as interactions between particular objects or medical instruments and particular anatomical parts of the patient, yielding the most suitable or most accurate compact representation for a particular task requirement.
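- A minimal sketch of this selective-weighting idea (the pooling function, feature dimensions, and query are assumptions, not the patent's network): a softmax over relevance scores emphasizes some spatio-temporal chunks of the data over others before they are pooled into one compact representation.

```python
import numpy as np


def attention_pool(chunk_features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """chunk_features: (num_chunks, dim) features for spatial/temporal chunks.
    query: (dim,) task-specific query vector.
    Returns a (dim,) weighted summary emphasizing the most relevant chunks."""
    scores = chunk_features @ query / np.sqrt(chunk_features.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax attention weights
    return weights @ chunk_features       # weighted compact representation


# Example: 8 chunks of 16-dimensional features (random placeholders).
rng = np.random.default_rng(0)
summary = attention_pool(rng.normal(size=(8, 16)), rng.normal(size=16))
```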
- An attention mechanism of ML models can be used to allow the use of noisy, or sometimes incorrect, training labels while still reaching correct determinations.
- network structures used for attention mechanisms can include, for example 3D vision transformers.
- the network structures can include a deep learning architecture designed for video understanding, applying a transformer’s self-attention mechanism across spatial and temporal dimensions to effectively model spatiotemporal patterns for tasks such as action recognition (e.g., TimeSformer).
- the technical solutions can utilize transformer-based attention mechanisms to combine outputs from various models (e.g., detected objects, instruments, patient’s body parts or surgeon’s movements) to identify the surgical action (e.g., segment of surgery) or infer the surgeon’s intent during the procedure.
- the technical solutions can treat the anatomy presence identification as a classification issue.
- the classification head can be created to regress the feature vectors to a length-n vector of anatomies.
- the objective can be to minimize the difference between labels and the predictions of the model.
- the learned configuration and parameters of the neural network can be transferred to the processing unit.
- a procedure video can be discretized similarly to the training data and fed into the machine learning core.
- the output can include the final predictions of which anatomies are present at any point in time. For other surgical objects and event recognitions, a similar machine learning model can be applied individually, based on the task.
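- A hedged sketch of the anatomy-presence classification head described above: a linear head regresses clip-level feature vectors to a length-n vector of anatomy-presence logits, trained to minimize the difference between predictions and expert labels. The feature dimension, anatomy list, and use of a multi-label sigmoid objective are illustrative assumptions.

```python
import torch
import torch.nn as nn

ANATOMIES = ["lung", "artery", "vein", "bronchus"]   # hypothetical label set
FEATURE_DIM = 512                                     # assumed backbone output size

head = nn.Linear(FEATURE_DIM, len(ANATOMIES))         # classification head
criterion = nn.BCEWithLogitsLoss()                    # multi-label presence objective

# Training step on one batch of clip features and 0/1 presence labels.
features = torch.randn(8, FEATURE_DIM)                # placeholder backbone features
labels = torch.randint(0, 2, (8, len(ANATOMIES))).float()
loss = criterion(head(features), labels)
loss.backward()

# Inference: sigmoid probabilities per anatomy at a given point in time.
with torch.no_grad():
    probs = torch.sigmoid(head(features[:1]))
present = [a for a, p in zip(ANATOMIES, probs[0]) if p > 0.5]
```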
- the models can be combined or integrated using predefined mappings (e.g., similar to a voting system or a decision tree) which can use rules to map individual characteristics (e.g., outputs of various models) onto a well-defined ontology with a clinical meaning. For example, in segmentectomies and lobectomies, a combination of outputs of one or more models detecting a lung, artery, vein, or bronchus combined with dissector or stapler recognition, cartridge color, staple fire or energy application can indicate the corresponding structure’s dissection and division step.
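- Purely as an illustration of how such a mapping could be expressed (the predicate structure, labels, and event names are assumptions, not the patent's rule set), a single rule for the dissection-and-division step in the lobectomy example above might look like the following.

```python
from typing import Optional

HILAR_STRUCTURES = {"artery", "vein", "bronchus"}


def dissection_and_division_rule(anatomy: str,
                                 instrument: str,
                                 events: set) -> Optional[str]:
    """Map one window of model outputs to a clinically named step, or None."""
    if anatomy in HILAR_STRUCTURES and instrument in {"stapler", "dissector"} \
            and ({"staple_fire", "energy_application"} & events):
        return f"{anatomy}_dissection_and_division"
    return None


# Example window: vein in view, stapler recognized, staple fire logged.
step = dissection_and_division_rule("vein", "stapler", {"staple_fire"})
# step == "vein_dissection_and_division"
```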
- model outputs recognizing structures, objects or applications, such as an ultrasound probe, clamp, needle, or clip application, in combination with one or more unique motions, energy applications, and certain anatomy (e.g., detected body parts), can be indicative of a procedure task, such as marking a tumor boundary using ultrasound probe recognition, or a circular path the surgeon can mark with energy around the tumor.
- a surgical task of a tumor excision step can be detected based on the detection or recognition of a clamp on the renal artery as a key start moment (e.g., due to the kidney being deprived of a blood supply), along with medical tool positioning and a freed tumor section marking the excision complete.
- an object recognition of a colpotomy ring combined with a last energy application and a colpotomy ring in view can be used to determine or identify a step of a colpotomy dissection.
- a detection of a needle, needle drivers install times, and a vaginal cuff identification can be used to detect or determine the first interaction or intention to close the vaginal cuff.
- the strategies similar to those listed above can be applied to various procedures with slight variations to model and detect the various suture steps across different procedures.
- the technical solutions can include a set of rules that can consider the history of model outputs throughout a case, as sketched below. For example, when a task cannot be identified because an anatomy of the patient is not detected, a rule can be added that considers outputs from models earlier in the same procedure to make determinations without using the anatomy model’s output. In such instances, a smaller memory footprint can be utilized, as only the final outputs from the models are retained, rather than all of the images passed through the models.
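- A sketch of this history-aware idea under assumed names and structure: only the models' final per-frame labels are kept for the case, so a later rule can fall back on earlier context (e.g., the last confidently seen anatomy) without storing any of the underlying images.

```python
from collections import deque
from typing import Optional


class CaseHistory:
    def __init__(self, max_frames: int = 100_000):
        self.anatomy_labels = deque(maxlen=max_frames)  # compact label history only

    def record(self, anatomy: Optional[str]) -> None:
        self.anatomy_labels.append(anatomy)

    def last_known_anatomy(self) -> Optional[str]:
        """Most recent non-empty anatomy prediction earlier in the same case."""
        for label in reversed(self.anatomy_labels):
            if label:
                return label
        return None


def rule_with_fallback(history: CaseHistory,
                       anatomy: Optional[str],
                       instrument: str) -> Optional[str]:
    anatomy = anatomy or history.last_known_anatomy()   # fall back on case history
    if anatomy == "cystic_artery" and instrument == "clip_applier":
        return "ligation_and_division_of_cystic_artery"
    return None
```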
- Technical solutions can include kinematic analysis to recognize gestures (e.g. suturing) which can be used as an input to recognizing surgical actions or segments.
- Technical solutions can provide final outputs based on rules.
- technical solutions can use models to recognize the usage of a medical tool (e.g., a clip applier), on a particular anatomy of a patient (e.g., cystic artery), and based on the rule identify that the stated combination of this particular medical tool and this particular anatomy corresponds to a specific surgical task (e.g., ligation and division of cystic artery).
- technical solutions can recognize an object (e.g., the entrance of mesh) in the course of a hernia repair and use a rule to detect a surgical task or a procedure of a mesh placement surgery.
- the technical solutions can more accurately trace any errors back to an exact issue with model performance. For instance, a confusion of a cystic artery with a cystic duct can be traced back to data of these body parts in an anatomy model, while a confusion of a mesh with a gauze can be traced to the object model. In doing so the technical solutions can more accurately detect and describe to end users the intended tasks performed by the surgeon, while also allowing for more granular identification of any errors or issues in the event of a failure case.
- Segment rules can be applied and displayed to an end user, allowing the user to monitor the machine annotation and understand the context in which the decision is made. In doing so, the technical solutions can allow the user to observe the specific rules that are used to allow the user to follow the decision making in the event that the user disagrees with the annotation. The technical solutions also facilitate or allow for redefining of rules or annotations without losing historical data. Therefore, if segment definitions change over time, all the component labels can still remain unchanged and can be used to recreate the new segment definitions.
- ML models 140 can include one or more anatomy models 142 for detecting, identifying or providing anatomy predictions 152 of various anatomical parts of a patient, such as a patient’s organs, bones, tissues, muscles, blood vessels, airways, urinary and reproductive systems, glands and other anatomical features.
- ML models 140 can include one or more instrument models 144 for detecting, identifying or providing instrument predictions 154 of medical instruments 112 or tools used during the procedure (e.g., a scalpel, scissors, forceps, stapler, surgical suctions or specula).
- ML models 140 can include one or more object models 146 for detecting, identifying or providing object predictions 156 of various objects encountered during the surgery (e.g., a mesh, a probe or a surgical screw).
- ML models 140 can include one or more action models 148 for detecting, identifying or providing action predictions 158 of actions performed during the procedure (e.g., a surgeon’s gestures or movements, such as suturing, ligation or clip application).
- ML models 140 can include and utilize weights 150 to prioritize or assign higher or lower significance to various predictions (e.g., 152-158) for different models (e.g., 142-148).
- Data repository 160 of the DPS 130 can include one or more data streams 162, such as video data, force or torque data, biometric data of a patient, haptic feedback data, endoscopic data, ultrasound imaging or communication and command data streams.
- Data repository 160 can include ontologies 164 for identifying surgical tasks in structured knowledge representations of procedures, instruments, and anatomical structures.
- Data repository 160 can include entities 166 corresponding to movements, gestures or actions corresponding to particular activities within a surgical procedure.
- Mapping function 170 can include rules 172 for identifying specific surgical activities 176 corresponding to, or in the context of, entities 166 or ontologies 164 as well as boundaries 174 between identified surgical activities 176. Mapping function 170 can include and assign confidence scores 178 with respect to particular detections, determinations or recognitions of various models 140 according to their corresponding thresholds 180.
- DPS 130 can include an interface 190 (e.g., user interface) to communicate with the user and provide indications 192 of the determinations made by the DPS 130, such as various ML model-based and rule-based detected surgical activities 176 and the corresponding temporal boundaries 174 between them.
- Data capture devices 110 can include any of a variety of sensors, cameras, video imaging devices, infrared imaging devices, visible light imaging devices, intensity imaging devices (e.g., black, color, grayscale imaging devices, etc.), depth imaging devices (e.g., stereoscopic imaging devices, time-of-flight imaging devices, etc.), medical imaging devices such as endoscopic imaging devices, ultrasound imaging devices, etc., non-visible light imaging devices, any combination or sub-combination of the above mentioned imaging devices, or any other type of imaging devices that can be suitable for the purposes described herein.
- Data capture devices 110 can include cameras that a surgeon can use to perform a surgery and observe manipulation components within a field of view suitable for the given task.
- Data capture devices 110 can capture, detect, or acquire sensor data, such as videos or images, including for example, still images, video images, vector images, bitmap images, other types of images, or combinations thereof.
- the data capture devices 110 can capture the images at any suitable predetermined capture rate or frequency. Settings, such as zoom settings or resolution, of each of the data capture devices 110 can vary as desired to capture suitable images from any viewpoint. For instance, data capture devices 110 can have fixed viewpoints, locations, positions, or orientations.
- the data capture devices 110 can be portable, or otherwise configured to change orientation or telescope in various directions.
- the data capture devices 110 can be part of a multi-sensor architecture including multiple sensors, with each sensor being configured to detect, measure, or otherwise capture a particular parameter (e.g., sound, images, or pressure).
- Data capture devices 110 can include any type and form of a sensor, such as a positioning sensor, a biometric sensor, a velocity sensor, an acceleration sensor, a vibration sensor, a motion sensor, a pressure sensor, a light sensor, a distance sensor, a current sensor, a focus sensor, a temperature or pressure sensor or any other type and form of sensor used for providing data on medical tools 112, or data capture devices (e.g., optical devices).
- a data capture device 110 can include a location sensor, a distance sensor or a positioning sensor providing coordinate locations of a medical tool 112 or a data capture device 110.
- Data capture device 110 can include a sensor providing information or data on a location, position or spatial orientation of an object (e.g., medical tool 112 or a lens of data capture device 110) with respect to a reference point.
- the reference point can include any fixed, defined location used as the starting point for measuring distances and positions in a specific direction, serving as the origin from which all other points or locations can be determined.
- Display 116 can show, illustrate or play data streams 162 (e.g., images or videos) in which medical tools 112 and location of the surgery are presented.
- display 116 can display a rectangular image (e.g., one or more video frames) of the surgical site along with at least a portion of medical tools 112 (e.g., instruments) being used to perform surgical activities 176 (e.g., entities 166 within the context of various surgical ontologies 164).
- Display 116 can provide compiled or composite images generated by the visualization tool 114 from a plurality of data capture devices 110 to provide a visual feedback from one or more points of view.
- the visualization tool 114 can be configured or designed to receive any number of different sensor data streams 162 from any number of data capture devices 110 and combine them into a single data stream displayed on a display 116.
- the visualization tool 114 can be configured to receive a plurality of data stream components and combine the plurality of data stream components into a single data stream 162.
- the visualization tool 114 can receive visual sensor data from one or more medical tools 112, sensors or cameras with respect to a surgical site or an area in which a surgery is performed.
- the visualization tool 114 can incorporate, combine or utilize multiple types of data (e.g., positioning data of a medical tool 112 along with sensor readings of pressure, temperature, vibration or any other data) to generate an output to present on a display 116.
- Visualization tool 114 can present locations of medical tools 112 along with locations of any reference points or surgical sites, including locations of anatomical parts of the patient (e.g., organs, glands or bones).
- Medical tools 112 can be any type and form of tool or instrument used for surgery, medical procedures or a tool in an operating room or environment. Medical tool 112 can be imaged by, associated with or include an image capture device.
- a medical tool 112 can be a tool for making incisions, a tool for suturing a wound, an endoscope for visualizing organs or tissues, an imaging device, a needle and a thread for stitching a wound, a surgical scalpel, forceps, scissors, retractors, graspers, or any other tool or instrument to be used during a surgery.
- Medical tools 112 can include hemostats, trocars, surgical drills, suction devices or any instruments for use during a surgery.
- the images (e.g., video images) captured by a medical tool 112 can be sent to the visualization tool 114.
- the robotic medical system 120 can include one or more input ports to receive direct or indirect connection of one or more auxiliary devices.
- the visualization tool 114 can be connected to the robotic medical system 120 to receive the images from the medical tool when the medical tool is installed in the robotic medical system (e.g., on a manipulator arm of the robotic medical system).
- the visualization tool 114 can combine the data stream components from the data capture devices 110 and the medical tool 112 into a single combined data stream for presenting on a display 116.
- the system 100 can include a data processing system 130.
- the data processing system 130 can be deployed in or associated with the medical environment 102, or it can be provided by a remote server or be cloud-based.
- the data processing system 130 can include an interface 190 designed, constructed and operational to communicate with one or more component of system 100 via network 101, including, for example, the robotic medical system 120.
- Data processing system 130 can be implemented using instructions stored in memory locations and processed by one or more processors, controllers or integrated circuitry.
- Data processing system 130 can include functionalities, computer codes or programs for executing or implementing ML models 140 (e.g., 142-148) to detect, identify or predict anatomies, medical instruments, objects or actions involved in the surgical activities 176 or entities 166 corresponding to surgical ontologies 164 that can specify particular surgical procedures or operations.
- ML models 140 can include any variety or combination of machine learning architectures.
- ML models 140 can include support vector machines (SVMs) that can facilitate predictions (e.g., anatomical, instrument, object, action or any other) in relation to class boundaries, random forests for classification and regression tasks, decision trees for prediction trees with respect to distinct decision points, K-nearest neighbors (KNNs) that can use similarity measures for predictions based on characteristics of neighboring data points, Naive Bayes functions for probabilistic classifications, logistic or linear regressions, or gradient boosting models.
- ML models 140 can include neural networks, such as deep neural networks configured for hierarchical representations of features, convolutional neural networks (CNNs) for image-based classifications and predictions, as well as spatial relations and hierarchies, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for determining structures and processes unfolding over time or multimodal data integration in which medical images can be combined with patient’s data or history.
- ML models 140 can include or utilize transformers or transformer-based architectures (e.g., graphical neural networks with transformers) which can be configured to make predictions related to medical imaging, including predictions of anatomies, objects, instruments, or spatial relations between different features (e.g., the spatial arrangement between anatomical parts of a patient and a medical instrument) to infer actions or activities performed.
- ML models 140 can include transformers utilizing or facilitating an attention mechanism that can allow transformers to focus on different parts of input data to make predictions, such as highlighting relevant regions of interest in images or aiding in tasks such as anomaly detection. Transformers can be used for multimodal integration in which data streams 162 from multiple types of sources (e.g., data from various detectors, sensors and cameras) can be combined for predictions.
- ML models 140 can include spatial transformer networks which can be applied on medical imaging data to facilitate spatial relations, alignment or normalization of features across different images or data sources (e.g., 110).
- Machine learning (ML) models 140 can include any one or more machine learning (e.g., deep neural network) models trained on diverse datasets to learn to recognize intricate details (e.g., objects, anatomical parts, medical instruments, actions), patterns and relationships for analysis of medical procedures.
- ML models 140 can include deep learning models for medical applications, particularly in anatomical identification, medical instrument recognition, and surgical action detection.
- ML models 140 can include computational architectures trained on extensive datasets to autonomously identify and recognize anatomical structures, medical instruments, surgical objects and medical actions (e.g., doctor’s movements or actions).
- ML models 140 can be trained, configured or designed for any medical domain, such as surgery, radiology, medical imaging, pathology, diagnostics, telemedicine, radiation therapy, rehabilitation and physical therapy, drug discovery and development, genomics or epidemiology.
- ML models 140 can utilize hierarchical ontology of any medical field to monitor, recognize or map objects, medical instruments, anatomies or activities with respect to any medical treatment or procedure in any type of a medical field.
- ML models 140 can leverage hierarchical representations in one or more layers, enabling them to discern complex patterns and relationships. For instance, one or more deep learning ML models 140 can be trained on a diverse dataset containing medical images, videos, and associated metadata to learn the distinct features of anatomies, instruments, objects, and actions.
- these one or more ML models 140 can generalize their understanding to new, unseen data, providing automated assistance in tasks such as anatomical part segmentation, instrument or object detection, and action recognition in medical settings (e.g., surgical room, epidemiological facility, medical imaging area, pathology or any other). For instance, one or more deep learning ML models 140 can identify and locate specific anatomical landmarks in medical images, identify entities 166 (e.g., actions or tasks) in a medical procedure being performed according to medical ontologies 164 and identify temporal boundaries 174 between different surgical activities 176 (e.g., segments of surgery) being performed.
- ML models 140 can be stored in a data repository 160.
- ML models 140 can be trained, established, configured, updated, or otherwise provided by a model generator or a trainer.
- the trainer can train the ML models on one or more specific training datasets, which can include labeled images or video frames of medical instruments, anatomical parts, medical objects, movements or actions by medical professionals during a medical treatment or procedure or any other related medical activity.
- ML models 140 can be configured to identify, predict, classify, categorize, or otherwise score aspects of a medical procedure, including determining confidence scores 178 with respect to particular model outputs (e.g., predictions 152-158).
- the data processing system 130 can utilize a single, multi-modal machine learning model, or multiple machine learning models.
- the multiple machine learning models can each be multi-modal, for example.
- ML models 140 can include an anatomy model 142 to make anatomy predictions 152 also referred to as anatomies 152.
- Anatomy model 142 can include, for example, a deep learning model configured to identify, detect or recognize anatomies 152, which can include or correspond to any body part of a patient undergoing medical treatment or a procedure.
- Anatomy model 142 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular anatomy 152 that is involved in a medical procedure, treatment (e.g., surgery).
- Anatomy model 142 can be configured to identify, detect and provide as output indications of detected organs, tissues, bones, muscles, nodes, parts of a cardiovascular system, respiratory system, nervous system, digestive system, endocrine system, urinary system, digestive system, integumentary system or any other anatomical part of a patient.
- ML models 140 can include an instrument model 144 to make instrument predictions 154.
- Instrument model 144 can include, for example, a deep learning model configured to identify, detect or recognize instrument predictions 154, which can include or correspond to predictions or recognition of any medical instrument 112 that can be used in medical treatment (e.g., a surgery) being captured by data capture devices 110 (e.g., cameras or other sensors).
- Instrument predictions 154 detected or recognized by the instrument model 144 can include, for example, shears, needles, threads, scalpels, clips, rings, bone screws, graspers, retractors, saws, forceps, imaging devices, or any other medical instrument 112 or a tool used in a medical procedure.
- ML models 140 can include an object model 146, also referred to as an objection model 145, to make object predictions 156, also referred to as objects 156.
- Object model 146 can include, for example, a deep learning model configured to identify, detect or recognize object 156, which can include or correspond to any structures or devices encountered during the medical procedure (e.g., surgery) as captured, reflected or indicated by the data stream 162.
- Object predictions 156 can include, for example, a mesh, a needle or a clip, a hip or a knee replacement part, a brace, a clip, a colpotomy ring or any other device, structure or instrument encountered or indicated in the images, videos or sensor readings of data stream 162.
- Object model 146 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular object 156 that is involved, indicated, recorded or detected in a medical procedure, treatment (e.g., surgery).
- ML models 140 can include an action model 148 to make action predictions 158, also referred to as actions 158.
- Action model 148 can include, for example, a deep learning model configured to identify, detect or recognize actions 158, which can include any actions (e.g., gestures or movements) performed by a surgeon.
- Action model 148 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular action 158 of the surgeon, such as ligations, sutures, clip applications, applications of a medical instrument 112 on a particular anatomy 152, or any other gestures, interactions or movements involving anatomies 152, instrument predictions 154 (e.g., medical instruments 112) or objects 156.
- Action model 148 can be configured to identify, detect and provide as output indications of tasks in a medical procedure, such as slicing an organ of a patient, suturing a surgical site, clipping a surgical site, applying a scalpel to a particular tissue or an organ, or performing any other action or activity in the course of a medical procedure.
- the data repository 160 can include one or more data files, data structures, arrays, values, or other information that facilitates operation of the data processing system 130.
- the data repository 160 can include one or more local or distributed databases and can include a database management system.
- the data repository 160 can include, maintain, or manage a data stream 162.
- the data stream 162 can include or be formed from one or more of a video stream, image stream, stream of sensor measurements, event stream, or kinematics stream.
- the data stream 162 can include data collected by one or more data capture devices 110, such as a set of 3D sensors from a variety of angles or vantage points with respect to the procedure activity (e.g., point or area of surgery).
- Data stream 162 can include an event stream which can include a stream of event data or information, such as packets, that identify or convey a state of the robotic medical system 120 or an event that occurred in association with the robotic medical system 120 or a surgery being performed with the robotic medical system.
- Data of the event stream can be captured by the robotic medical system 120 or a data capture device 110.
- Event stream can include a state of the robotic medical system 120 indicating whether the medical tool or instrument 112 is calibrated, adjusted or installed on a manipulator arm of the robotic medical system 120.
- Event stream can include data on whether a robotic medical system 120 was fully functional (e.g., without errors) during the procedure.
- a signal or data packet(s) can be generated indicating that the medical instrument 112 has been installed on the manipulator arm of the robotic medical system 120.
- Another example state of the robotic medical system 120 can indicate whether the visualization tool 114 is connected, whether directly to the robotic medical system 120 or indirectly through another auxiliary system that is connected to the robotic medical system 120.
- Data stream 162 can include a kinematics stream data which can refer to or include data associated with one or more of the manipulator arms or medical tools 112 (e.g., instruments) attached to the manipulator arms.
- Data corresponding to medical tools 112 can be captured or detected by one or more displacement transducers, orientational sensors, positional sensors, or other types of sensors and devices to measure parameters or generate kinematics information.
- the kinematics data can include sensor data along with time stamps and an indication of the medical tool 112 or type of medical tool 112 associated with the data stream 162.
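- Illustrative record types for the three modalities described above (the field names are assumptions, not the patent's schema), aligned on a shared timestamp so the models and mapping function can reason over them jointly.

```python
from dataclasses import dataclass


@dataclass
class VideoFrame:
    timestamp_s: float
    frame_index: int           # index into the endoscopic video stream


@dataclass
class KinematicsSample:
    timestamp_s: float
    tool_id: str               # which medical tool the sample belongs to
    joint_positions: list
    joint_velocities: list


@dataclass
class SystemEvent:
    timestamp_s: float
    event_type: str            # e.g. "instrument_installed", "staple_fire", "energy_on"
    detail: str = ""
```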
- Data repository 160 can store ontologies 164 of medical procedures that can include, correspond to, or relate to entities 166 of such procedures.
- Ontologies 164 can provide a structured framework for understanding the hierarchical relationships, contextual meanings, and dependencies among various procedures, anatomical structures, instruments, and actions involved in a medical procedure.
- Ontology 164 can include any knowledge representation system or a structure that organizes and defines concepts, relationships, and entities 166 within the healthcare domain.
- Entities 166 can include any features or characteristics of a task or activity in a medical procedure, such as a particular gesture, movement or action 158 applied to a particular anatomy 152 of a patient using a particular instrument 154 or an object 156.
- Ontologies 164 can define or describe various entities 166 (e.g., gestures, movements or actions of a surgeon or a doctor) with respect to any one or more (e.g., a series of) tasks in a particular medical procedure (e.g., a surgery).
- Ontologies 164 can provide structured descriptions of entities (e.g., tasks) of any medical procedures which ML models 140 can recognize, determine, detect, identify or classify.
- ML models 140 can identify and classify surgical or medical activities 176 (e.g., and boundaries 174 therebetween) by comparing detected predictions 152-158 with entities 166 within ontologies 164 to identify matching surgical activities 176 within ontologies 164 of various medical procedures or tasks.
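- A minimal sketch of how an ontology 164 and its entities 166 could be represented and matched against model predictions; the dataclass structure and the cholecystectomy fragment are illustrative assumptions rather than the patent's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    anatomy: str                      # the anatomy that, with the action, forms the step
    action: str
    instruments: set = field(default_factory=set)


@dataclass
class Phase:
    name: str
    steps: list = field(default_factory=list)


@dataclass
class ProcedureOntology:
    procedure_type: str
    phases: list = field(default_factory=list)

    def match(self, anatomy: str, action: str, instrument: str) -> Optional[Step]:
        """Return the step whose entity description matches the predictions."""
        for phase in self.phases:
            for step in phase.steps:
                if (step.anatomy == anatomy and step.action == action
                        and instrument in step.instruments):
                    return step
        return None


# Hypothetical cholecystectomy fragment, used only to illustrate the structure.
cholecystectomy = ProcedureOntology(
    procedure_type="cholecystectomy",
    phases=[Phase("dissection", [
        Step("ligation_and_division_of_cystic_artery",
             anatomy="cystic_artery", action="clip_application",
             instruments={"clip_applier"}),
    ])],
)
step = cholecystectomy.match("cystic_artery", "clip_application", "clip_applier")
```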
- Mapping function 170 can include any combination of hardware and software for processing outputs of ML models 140 according to rules 172 to determine surgical activities 176 and boundaries 174 between the surgical activities 176.
- Surgical activities 176 can include any activities of a medical procedure, such as a task or a portion of surgery or a treatment.
- Surgical activity 176 can include application of an anesthetic, ligation or hemostasis, application of a drainage or any other task in a surgery.
- Surgical activities 176 can be specific to a particular type of surgery or a procedure and can be described by the patient’s anatomy, medical instruments 112 used, objects encountered or actions of the surgeon.
- the combination of the ML based predictions, such as the anatomy, instruments, objects or actions, can be used to specify or identify a particular surgical activity 176 of a particular surgery or a treatment.
- Boundary 174 can include any indication of a start or end of a surgical activity 176.
- Boundary 174 can include an indication of a start or end of a particular surgical procedure.
- Boundaries 174 can be defined or characterized according to a particular timestamp in a data stream 162 or according to a reference point (e.g., start or end of a video file, video fragment or frame).
- Boundary 174 can indicate a point where one surgical activity 176 ends and another surgical activity 176 begins.
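- A sketch (an assumption on representation) of deriving boundaries 174 from a sequence of per-frame activity labels: a boundary is recorded wherever one surgical activity 176 ends and another begins.

```python
def find_boundaries(frame_activities: list, fps: float) -> list:
    """Return [{'activity', 'start_s', 'end_s'}, ...] for each contiguous run."""
    segments = []
    for i, activity in enumerate(frame_activities):
        t = i / fps
        if segments and segments[-1]["activity"] == activity:
            segments[-1]["end_s"] = t
        else:
            segments.append({"activity": activity, "start_s": t, "end_s": t})
    return [s for s in segments if s["activity"] is not None]


# Example at 1 frame/second: activity A for 3 s, idle for 2 s, activity B for 2 s.
labels = ["A", "A", "A", None, None, "B", "B"]
print(find_boundaries(labels, fps=1.0))
# [{'activity': 'A', 'start_s': 0.0, 'end_s': 2.0},
#  {'activity': 'B', 'start_s': 5.0, 'end_s': 6.0}]
```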
- Mapping function 170 can include the functionality to map the outputs from the ML models 140 (e.g., 152-158) to entities 166 of medical (e.g., surgical) ontologies 164 to identify the surgical activities 176.
- a mapping function 170 can utilize predictions 152-158 input into a rules engine having any number of rules 172 that can map the outputs against entities 166 of medical ontologies 164.
- mapping function 170 can determine or identify the surgical activity 176 based on the match with the given task in a particular ontology 164.
- Mapping function 170 can include the functionality (e.g., codes, algorithms or instructions) to apply anatomy predictions 152, instrument predictions 154, object predictions 156 or action predictions 158 to rules 172 to identify surgical activities 176 or boundaries 174 according to ontologies 164. Mapping function 170 can determine confidence scores 178 for any determined surgical activities 176 or boundaries 174.
- Mapping function 170 can include rules 172 that can map ML model 140 outputs (e.g., any combination of anatomy predictions 152, instrument predictions 154, object predictions 156 or action predictions 158) to particular medical procedures (e.g., as described in ontologies 164). Mapping function 170 can include a plurality of rules 172 correlating or associating particular combinations of ML model 140 outputs with particular medical procedures, surgeries or treatments.
- Mapping function 170 can include, for example, a rule 172 associating a particular combination of certain anatomical parts of a patient, certain medical instruments 112 and particular actions by a surgeon with a certain task in a surgical procedure (e.g., opening a cut on an eye with a scalpel or suturing a wound on a right knee).
- Rules 172 can include any conditional statements using ML models 140 outputs (e.g., 152-158) to map and identify matches with medical procedures within ontologies 164.
- Rules 172 can include any logical conditions using predictions 152-158 as decision criteria to map or identify matches with particular surgical activities 176 described in any one or more ontologies 164.
- Rules 172 can define logical conditions and relationships based on the predictions or outputs generated by ML models 140 and can provide a framework for associating specific medical activities, instruments or objects with relevant concepts in the ontological structure.
- technical solutions can navigate different tasks (e.g., entities 166) within the ontologies, matching identified patterns or features from the ML model 140 outputs to the corresponding medical procedures, facilitating recognition or detection of surgical activities 176 and boundaries 174.
- scores 178 can be generated or determined for object predictions 156 and action predictions 158 and can be compared with the corresponding respective thresholds 180 of the objects and actions, respectively.
- Mapping function 170 can include the functionality to generate a score 178 corresponding to a confidence score (e.g., probability that the determination is accurate) with respect to the determined surgical activity 176 that can be determined based on the predictions 152-158 applied to the rules 172.
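- A sketch of the confidence check (the threshold values and field names are hypothetical): scores 178 at or below the corresponding threshold 180 flag the prediction so a prompt can be generated for a human to review or correct it.

```python
THRESHOLDS = {"anatomy": 0.7, "instrument": 0.8, "object": 0.6}   # hypothetical values


def flag_low_confidence(predictions: dict) -> list:
    """predictions: {"anatomy": ("cystic_artery", 0.62), ...}
    Returns the prediction types whose confidence is at or below threshold."""
    return [kind for kind, (_, score) in predictions.items()
            if score <= THRESHOLDS.get(kind, 0.5)]


needs_review = flag_low_confidence({
    "anatomy": ("cystic_artery", 0.62),    # below 0.7 -> prompt for review
    "instrument": ("clip_applier", 0.93),
    "object": ("clip", 0.71),
})
# needs_review == ["anatomy"]
```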
- DPS 130 can include an interface 190 designed, constructed and operational to communicate with one or more component of system 100 via network 101, including, for example, the robotic medical system 120 or another device, such as a client’s personal computer.
- the interface 190 can include a network interface.
- the interface 190 can include or provide a user interface, such as a graphical user interface.
- Interface 190 can provide data for presentation via a display, such as a display 116, and can depict, illustrate, render, present, or otherwise provide indications 192 indicating determinations (e.g., outputs) by ML models 140 (e.g., predictions 152-158) or mapping function 170 (e.g., surgical activities 176 or boundaries 174).
- Indications 192 can include messages, indications or notifications of predictions 152-158, boundaries 174, surgical activities 176, scores 178 or thresholds 180. Indications 192 can include overlaid texts or images. For instance, indications 192 can include confidence scores 178 with respect to boundaries 174 or surgical activities 176 determined or provided by the mapping function 170. For example, indications 192 can include entities 166 matching or corresponding to surgical activities 176 or boundaries 174, along with any combination of predictions 152-158.
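- One simple way such an indication 192 could be assembled for overlay is sketched below; the helper name and the exact text layout are assumptions for illustration only, not a format specified by this disclosure.

```python
def format_indication(activity: str, start_s: float, end_s: float,
                      confidence: float, predictions: dict) -> str:
    """Compose an illustrative overlay text (indication 192) for display via interface 190."""
    lines = [
        f"Activity: {activity} ({confidence:.0%} confidence)",
        f"Boundary: {start_s:.1f}s - {end_s:.1f}s",
    ]
    # Append the predictions 152-158 that supported the determination.
    for kind, value in predictions.items():
        lines.append(f"{kind}: {value}")
    return "\n".join(lines)

print(format_indication("suturing", 12.0, 47.5, 0.93,
                        {"anatomy": "vaginal cuff", "instrument": "needle driver", "object": "needle"}))
```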
- the data processing system 130 can interface with, communicate with, or otherwise receive or provide information with one or more component of system 100 via network 101, including, for example, the robotic medical system 120.
- the data processing system 130, robotic medical system 120 and devices in the medical environment 102 can each include at least one logic device such as a computing device having a processor to communicate via the network 101.
- the data processing system 130, robotic medical system 120 or client device coupled to the network 101 can include at least one computation resource, server, processor or memory.
- the data processing system 130 can include a plurality of computation resources or processors coupled with memory.
- the data processing system 130, or components thereof can include a physical or virtual computer system operatively coupled, or associated with, the medical environment 102.
- the data processing system 130, or components thereof, can be coupled, or associated with, the medical environment 102 via a network 101, either directly or indirectly through an intermediate computing device or system.
- the network 101 can be any type or form of network.
- the geographical scope of the network can vary widely and can include a body area network (BAN), a personal area network (PAN), a local-area network (LAN) (e.g., Intranet), a metropolitan area network (MAN), a wide area network (WAN), or the Internet.
- the topology of the network 101 can assume any form such as point-to-point, bus, star, ring, mesh, tree, etc.
- the network 101 can utilize different techniques and layers or stacks of protocols, including, for example, the Ethernet protocol, the internet protocol suite (TCP/IP), the ATM (Asynchronous Transfer Mode) technique, the SONET (Synchronous Optical Networking) protocol, the SDH (Synchronous Digital Hierarchy) protocol, etc.
- the TCP/IP internet protocol suite can include application layer, transport layer, internet layer (including, e.g., IPv6), or the link layer.
- the network 101 can be a type of a broadcast network, a telecommunications network, a data communication network, a computer network, a Bluetooth network, or other types of wired and wireless networks.
- the data processing system 130 can be located at least partially at the location of the surgical facility associated with the medical environment 102 or remotely therefrom. Elements of the data processing system 130, or components thereof can be accessible via portable devices such as laptops, mobile devices, wearable smart devices, etc.
- the data processing system 130, or components thereof can include other or additional elements that can be considered desirable to have in performing the functions described herein.
- the data processing system 130, or components thereof, can include, or be associated with, one or more components or functionality of a computing system, including, for example, one or more processors coupled with memory that can store instructions, data or commands for implementing the functionalities of the DPS 130 discussed herein.
- System 100 can include one or more processors (e.g., 610) that can be coupled with one or more memories (e.g., 615) and which can be included in or deployed on any computing device, such as a server, a virtual machine, or a cloud-based system.
- a system 100 can include an analysis system of a process or a procedure to identify and delineate different activities within the process.
- Such a system can include a memory (e.g., 615 or 620) storing instructions, computer code or data for the one or more processors (e.g., 610) to implement the functionalities of a data processing system 130.
- one or more processors 610 can be configured (e.g., programmed or instructed) to access computer code, instructions or data in storage to implement the ML models 140 and the mapping function 170.
- System 100 can include a non-transitory computer-readable medium (e.g., a ROM or a solid state storage device) storing processor-executable instructions, code or data, for access and execution by the one or more processors 610 to implement various functionalities of the data processing system 130.
- system 100 can include one or more processors 610 configured to receive a data stream 162 that captures a procedure performed with a robotic medical system 120.
- the data stream 162 can include any combination of one or more video streams, control data streams, sensor measurements or description of kinematics or events.
- the data stream 162 can be received via a network 101 from a remote medical environment 102 and can be processed in real time or acquired from prior stored files in data repository 160 and processed subsequent to a procedure being performed.
- the one or more processors 610 can be configured to input the data stream 162 into a plurality of ML models 140 that can be trained with machine learning to generate a first prediction of an anatomy (e.g., 152), a second prediction of an instrument (e.g., 154) and a third prediction of an object (e.g., 156) which can be associated with the procedure.
- the first prediction can be an anatomy prediction 152 that identifies a body part of a patient at or near a surgical site, such as for example an organ, a tissue, a bone or a gland being operated on by a surgeon.
- the second prediction can be an instrument prediction 154 that identifies a medical instrument or tool (e.g., shears, scalpel or a needle) used by a surgeon at or near the surgical site.
- the third prediction can be an object prediction 156 that identifies an object (e.g., a mesh, a bone screw or an object implanted into the body of the patient) at or near the surgical site.
- An ML model 140 can be configured to generate a fourth prediction, which can be an action prediction 158 that identifies an action (e.g., a gesture or a movement of the surgeon or a medical tool or the object) at or near the surgical site.
- the one or more processors 610 can be configured to apply a mapping function 170 to the first prediction (e.g., 152), the second prediction (e.g., 154) and the third prediction (e.g., 156) to determine a temporal boundary 174 of a surgical activity 176 performed via the robotic medical system 120 in the procedure.
- the one or more processors 610 can be configured to apply the mapping function 170 to the fourth prediction (e.g., 158) to determine the temporal boundary 174 of the surgical activity 176 based at least on the action (e.g., movement or gesture) of the surgeon, medical instrument 112 or an object with respect to the predicted anatomy (e.g., body part of the patient).
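- A minimal sketch of one way such a temporal boundary 174 could be derived is shown below. It assumes (as an illustration, not a requirement of this disclosure) that the mapping function has already assigned a surgical activity label to each fixed-length window of the data stream, so a boundary is simply placed where the label changes between consecutive windows.

```python
def find_boundaries(window_activities, window_seconds=3.0):
    """Return (time, previous_activity, next_activity) tuples where the mapped
    surgical activity changes between consecutive windows of the data stream."""
    boundaries = []
    for i in range(1, len(window_activities)):
        if window_activities[i] != window_activities[i - 1]:
            boundaries.append((i * window_seconds,
                               window_activities[i - 1],
                               window_activities[i]))
    return boundaries

# Example: activities mapped per 3-second window by the mapping function.
labels = ["dissection", "dissection", "clip_application", "clip_application", "division"]
print(find_boundaries(labels))
# [(6.0, 'dissection', 'clip_application'), (12.0, 'clip_application', 'division')]
```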
- the one or more processors 610 can be configured to provide, for overlay in a graphical user interface 190 configured to display at least a portion of the data stream 162, an indication 192 of any combination of one or more of the temporal boundary 174, the surgical activity 176 and the first prediction of the anatomy 152, the second prediction of the instrument 154, and the third prediction of the object 156 used to determine, via the mapping function, the surgical activity.
- the one or more processors 610 can provide, for overlay in the user interface 190 to display the portion of the data stream 162, the indication 192 including the action prediction 158.
- the indication 192 can include an overlay of text identifying the surgical activity 176 and when or at what point (e.g., at which temporal boundary 174) the surgical activity 176 starts or ends.
- the one or more processors 610 can be configured to receive the data stream 162 comprising data in a plurality of modalities.
- the one or more processors 610 can be configured to receive the data stream comprising at least one or more (e.g., two) of a video stream, a kinematics stream or an event stream.
- the data stream 162 can include sensor data, one or more video feeds from one or more cameras or any data from data capture devices 110 in the medical environment 102.
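- For illustration, the multi-modal data stream 162 could be bundled as follows before being passed to the models; the container name and fields are assumptions, not structures defined in this disclosure.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DataStreamWindow:
    """Illustrative container for one window of the data stream 162."""
    video_frames: List[Any] = field(default_factory=list)   # e.g., 180 frames of a 3-second clip
    kinematics: List[dict] = field(default_factory=list)    # e.g., tool poses or joint positions
    events: List[dict] = field(default_factory=list)        # e.g., energy activation, staple fire
    start_time_s: float = 0.0
    end_time_s: float = 0.0

window = DataStreamWindow(kinematics=[{"tool": "stapler", "pose": (0.1, 0.2, 0.3)}],
                          events=[{"type": "staple_fire", "t": 1.4}],
                          start_time_s=0.0, end_time_s=3.0)
```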
- the one or more processors 610 can be configured to apply the mapping function 170 to identify the surgical activity 176 from a hierarchical ontology 164 of entities 166 with increasing granularity.
- an ontology 164 can include, reference, describe or indicate any number of surgical activities 176 according to a variety of entities 166.
- Each entity 166 can include a gesture or movement of a particular medical instrument 112 with respect to a patient’s anatomy or an object. Entities 166 can correspond to actions performed by the surgeon (e.g., or any other person) with respect to any predictions of the ML models 140 as a part of a procedure described or indicated by the ontology 164.
- the ontology 164 can describe activities 176 in a hierarchical order using the entities 166 that can correspond to particular arrangement or combination of outputs of ML models 140 (e.g., predictions 152-158).
- the entities 166 of the hierarchical ontology 164 can include a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
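- A hedged sketch of such a hierarchical ontology 164 is shown below, ordered from coarse (procedure type) to fine (gesture). The cholecystectomy entries are illustrative placeholders chosen for the example, not an ontology taken from this disclosure.

```python
ONTOLOGY = {
    "procedure_type": "cholecystectomy",
    "phases": [
        {
            "phase": "dissection_of_calot_triangle",
            "steps": [
                {
                    "step": "ligation_and_division_of_cystic_artery",
                    "anatomy": "cystic_artery",
                    "actions": [
                        {"action": "clip_application",
                         "gestures": ["approach", "clamp", "release"]},
                    ],
                },
            ],
        },
    ],
}

def iter_entities(ontology):
    """Walk the hierarchy with increasing granularity: procedure type, phase, step, action, gesture."""
    yield ("procedure_type", ontology["procedure_type"])
    for phase in ontology["phases"]:
        yield ("phase", phase["phase"])
        for step in phase["steps"]:
            yield ("step", step["step"])
            for action in step["actions"]:
                yield ("action", action["action"])
                for gesture in action["gestures"]:
                    yield ("gesture", gesture)

for level, name in iter_entities(ONTOLOGY):
    print(level, "->", name)
```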
- the one or more processors 610 can be configured to provide, for display, a graphical representation of values for each entity 166 in the hierarchical ontology 164.
- the one or more processors 610 can be configured to identify the plurality of models 140 comprising an anatomy presence recognition model 142, an instrument presence model 144, and an object model 146.
- the one or more processors 610 can be configured to use the anatomy presence recognition model 142, the instrument presence model 144, and the object model 146 to generate the first prediction of the anatomy 152, the second prediction of the instrument 154, and the third prediction of the object 156.
- the one or more processors 610 can be configured to identify the plurality of models 140 comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
- the anatomy state detection model can detect or identify the state of a body part of the patient, such as a state of the liver, heart or gland.
- the tool-tissue interaction model can include a model detecting or identifying the interaction between a medical instrument 112 detected as an instrument prediction 154 and a body part of the patient detected as an anatomy prediction 152.
- instrument prediction 154 and the anatomy prediction 152 can be input into the tool-tissue interaction model to determine the action taken by the medical instrument 112 (e.g., a movement made, a force applied or an action performed) with respect to the anatomical part of the patient.
- the energy use model can include a model to determine the efficiency of the movement (e.g., action prediction 158) of the surgeon in the context of the procedure.
- the mapping function 170 can be based on a surgical workflow structure.
- the one or more processors can be configured to identify a plurality of weights 150 corresponding to the plurality of models 140.
- a particular anatomy prediction 152 can be assigned or given a particular first weight while an instrument prediction 154 can be given a different weight.
- an action prediction 158 involving an interaction between an anatomy prediction 152 (e.g., recognized body part of the patient) and an instrument prediction 154 (e.g., recognized medical instrument 112) can be given or assigned a weight by the mapping function 170 based on the interaction (e.g., nature of the interaction) between the predictions.
- the one or more processors 610 can be configured to fuse outputs of the plurality of models 140 using the plurality of weights 150.
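- A minimal sketch of such weighted fusion is given below. The per-model score layout and the use of a weighted average are assumptions for illustration; the disclosure does not prescribe a specific fusion formula.

```python
def fuse_model_outputs(outputs, weights):
    """Fuse per-model scores for each candidate surgical activity using the
    weights 150 assigned to the models 140 (illustrative weighted average)."""
    fused = {}
    total_weight = sum(weights.values())
    for model_name, activity_scores in outputs.items():
        w = weights.get(model_name, 0.0)
        for activity, score in activity_scores.items():
            fused[activity] = fused.get(activity, 0.0) + w * score
    return {activity: score / total_weight for activity, score in fused.items()}

# Hypothetical per-model scores for two candidate activities.
outputs = {
    "anatomy_model":    {"cystic_artery_ligation": 0.8, "mesh_placement": 0.1},
    "instrument_model": {"cystic_artery_ligation": 0.9, "mesh_placement": 0.2},
    "object_model":     {"cystic_artery_ligation": 0.3, "mesh_placement": 0.7},
}
weights = {"anatomy_model": 0.4, "instrument_model": 0.4, "object_model": 0.2}
print(fuse_model_outputs(outputs, weights))
# {'cystic_artery_ligation': 0.74, 'mesh_placement': 0.26}
```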
- the one or more processors 610 can be configured to identify a confidence score 178 associated with a prediction (e.g., 152-158) of the surgical activity 176.
- the one or more processors 610 can be configured to determine the confidence score 178 is less than or equal to a threshold 180.
- the threshold 180 can include a threshold for the confidence score 178 for the detected or determined surgical activity 176, boundary 174 or any of the predictions 152-158.
- the one or more processors 610 can be configured to provide, responsive to the confidence score 178 being less than or equal to the threshold 180, a prompt via the graphical user interface 190.
- the interface 190 can send for display any one or more of the surgical activity 176, boundary 174 or predictions 152-158, responsive to the confidence score 178 of any of those individual determinations exceeding its corresponding individual threshold 180.
- one or more processors 610 can be configured to determine a second one or more confidence scores 178 associated with at least one of the first prediction 152, the second prediction 154, or the third prediction 156 that is less than or equal to a second threshold 180.
- the one or more processors 610 can be configured to generate the prompt with an indication 192 of the at least one of the first prediction 152, the second prediction 154, or the third prediction 156.
- the one or more processors 610 can be configured to receive input that indicates at least one of the first prediction of the anatomy 152, the second prediction of the instrument 154, or the third prediction of the object 156 is erroneous.
- the one or more processors 610 can be configured to update, responsive to the input, at least one model 140 of the plurality of models 140 related to the at least one of the first prediction of the anatomy 152, the second prediction of the instrument 154, or the third prediction of the object 156 that is erroneous.
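- The low-confidence prompt and the erroneous-prediction feedback loop described above could look roughly like the sketch below; the function names, the prompt payload and the retraining queue are hypothetical details added for illustration.

```python
from typing import List, Optional, Tuple

def build_prompt(prediction_name: str, value: str, confidence: float,
                 threshold: float = 0.8) -> Optional[dict]:
    """Return a prompt payload for the graphical user interface 190 when the
    confidence score 178 is less than or equal to the threshold 180."""
    if confidence <= threshold:
        return {"message": f"Please verify {prediction_name}: {value!r}",
                "confidence": confidence}
    return None

def record_feedback(prediction_name: str, is_erroneous: bool,
                    retraining_queue: List[Tuple[str, str]], clip_id: str) -> None:
    """If the user marks a prediction erroneous, queue the clip so the related
    model 140 can be updated (e.g., fine-tuned) later."""
    if is_erroneous:
        retraining_queue.append((prediction_name, clip_id))

queue: List[Tuple[str, str]] = []
prompt = build_prompt("anatomy", "cystic_duct", confidence=0.55)
if prompt is not None:
    # A reviewer of the overlay flags the anatomy prediction as wrong.
    record_feedback("anatomy", is_erroneous=True, retraining_queue=queue,
                    clip_id="case123_window_42")
print(queue)  # [('anatomy', 'case123_window_42')]
```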
- the one or more processors 610 can be configured to determine a metric indicative of performance of the surgical activity 176 during the temporal boundary 174.
- the one or more processors 610 can be configured to compare the metric with a historical benchmark established for the surgical activity 176.
- the one or more processors 610 can be configured to provide a second indication of performance based on the comparison.
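- The benchmark comparison could be as simple as the sketch below, which treats duration of the activity within its temporal boundary as the metric and compares it against a historical mean and spread; the metric choice and thresholds are assumptions for illustration.

```python
def performance_indication(metric_value: float, benchmark_mean: float,
                           benchmark_std: float) -> str:
    """Compare a performance metric for the surgical activity against a historical
    benchmark and return a second indication of performance."""
    z = (metric_value - benchmark_mean) / benchmark_std if benchmark_std else 0.0
    if z <= -1.0:
        return "faster than the historical benchmark"
    if z >= 1.0:
        return "slower than the historical benchmark"
    return "within the historical benchmark range"

# Example: suturing took 310 seconds; historical benchmark is 250 +/- 40 seconds.
print(performance_indication(310.0, 250.0, 40.0))  # slower than the historical benchmark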
- the one or more processors 610 can be configured to provide the indication 192 for overlay in the graphical user interface 190 during performance of the procedure.
- the one or more processors 610 can be configured to provide the indication for overlay in the graphical user interface 190 subsequent to performance of the procedure.
- FIG. 2 illustrates a system configuration 200 for identifying surgical activities 176 (e.g., segments) and temporal boundaries 174 between surgical activities 176 using a combination of ML models 140 and rules 172 of a mapping function 170.
- System configuration 200 can correspond to an aspect of system 100 and can include video frames 205 provided as a data stream 162 input into ML models 140.
- Video frames 205 can include one or more images of a video fragment (e.g., 180 video frames in a 3-second 60 frames per second video) which can indicate presence of various medical instruments or tools, objects, anatomical parts of a patient or actions performed by the surgeon.
- Instrument model 144 can generate an instrument prediction 154 corresponding to the medical instrument or tool part identified, along with a score 178 corresponding to the level of confidence that the instrument prediction is accurate.
- object model 146 can detect, within the video frames 205, a medical tool 112, such as a mesh, needle, clip, colpotomy ring, bone screw, rules or ultrasound probe.
- Object model 146 can generate an object prediction 156 corresponding to one or more objects in the vicinity of the surgical site, along with a score 178 corresponding to the level of confidence that the given object prediction is accurate.
- action model 148 can identify, within the video frames 205, an action performed by a surgeon, doctor or any medical professional.
- the action can include, for example, a ligation, suture, clip application, incision, excision, dissection, cauterization, grafting, drainage, ablation, amputation or any other movement or action by a medical professional.
- Action model 148 can generate an action prediction 158 corresponding to the action identified, along with a score 178 corresponding to the level of confidence that the given action prediction is accurate.
- Outputs (e.g., predictions 152-158) of the ML models 140 can be received and used by the mapping function 170 to apply the ML model 140 outputs to rules 172.
- Mapping function 170 can include a rules engine functionality in which rules 172 can be used along with ML model outputs (e.g., 152-158) to determine or identify surgical activities 176 corresponding to surgical or other medical ontologies 164 along with the temporal boundaries 174 in between different identified surgical activities 176.
- a data stream 162 can include video frames 205 showing a motion (e.g., between the frames of the multi-frame video) of a surgeon’s hand holding a scalpel with respect to a particular gland (e.g., anatomy prediction 152).
- Surgical output 302 can include anatomical prediction 152 of the given gland, along with the scalpel being identified as the instrument prediction 154.
- Spatial output 304 can include the motion of the scalpel with respect to the identified gland or any other reference point in the video fragment 205 or with respect to any point in medical environment 102.
- Spatial output 304 can include location identifiers (e.g., coordinate system positions) of different features with respect to a reference point.
- Spatial output 304 of various identified predictions 152-158 can be applied to rules 172 to identify surgical activities 176 and boundaries 174.
- FIG. 4 depicts an example flowchart outlining the operations of a method 400 for implementing ML and rule based medical procedure identification and segmentation.
- the method 400 can be performed by a system having one or more processors executing computer-readable instructions stored on a memory.
- the method 400 can be performed, for example, by system 100 and in accordance with any features or techniques discussed in connection with FIGS. 1-3 and 5-6.
- the method 400 can be implemented by one or more processors 610 of a computing system 600 executing non-transitory computer-readable instructions stored on a memory (e.g., the memory 615, 620 or 625) and using data from a data repository 160 (e.g., storage device 625).
- the method 400 can be used to detect, determine or identify, from a data stream (e.g., videos or sensor readings) surgical activities of a surgical procedure and boundaries between the individual surgical activities.
- a method can receive a data stream from a medical procedure.
- the method can determine predictions from one or more ML models.
- the method can apply predictions to mapping rules.
- the method can determine whether a score for rule determination using predictions from the ML models exceeds a threshold.
- the method displays surgical activity and temporal boundaries when the rule determination at 420 exceeds the threshold.
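- A minimal sketch of operations 405-425 as a processing loop is shown below. The callables, the window format and the display hook are assumed interfaces added for illustration, not APIs defined in this disclosure.

```python
def run_segmentation(stream_windows, models, mapping_function, threshold=0.8, display=print):
    """405: receive a data stream window; 410: generate ML predictions; 415: apply the
    mapping rules; 420: check the confidence threshold; 425: display the surgical
    activity and temporal boundary when the threshold is exceeded."""
    previous_activity = None
    for window in stream_windows:                                               # 405
        predictions = {name: model(window) for name, model in models.items()}   # 410
        activity, confidence = mapping_function(predictions)                    # 415
        if activity is not None and confidence > threshold:                     # 420
            if activity != previous_activity:
                display(f"[{window['start_s']:.1f}s] boundary -> {activity} "
                        f"(confidence {confidence:.0%})")                       # 425
            previous_activity = activity

# Tiny illustrative run with stand-in models and rules.
demo_windows = [{"start_s": 0.0}, {"start_s": 3.0}]
demo_models = {"instrument": lambda w: "clip_applier", "anatomy": lambda w: "cystic_artery"}
demo_mapping = lambda preds: ("ligation_and_division_of_cystic_artery", 0.9)
run_segmentation(demo_windows, demo_models, demo_mapping)
```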
- the method can include receiving the data stream that captures a procedure performed with a robotic medical system.
- the method can include the data processing system receiving the data stream comprising data in a plurality of modalities, such as the video stream, event stream, kinematics stream or stream of sensor readings.
- the received data stream can include at least two of a video stream, a kinematics stream or event stream.
- a data stream can include a video file (e.g., video recording of one or more hours), a video fragment (e.g., one or more seconds of video), or video frames (e.g., individual still images forming a video fragment).
- the data stream can include sensor data or measurements, events data, kinematics data or any other type of data captured or acquired at the medical environment 102.
- the method can determine predictions from one or more ML models.
- the method can include the one or more processors identifying a plurality of models trained with machine learning to make predictions related to an anatomy, an instrument, and an object associated with the procedure.
- data processing system can include one or more ML models configured (e.g., trained) to detect any combination of: medical instruments used in a medical procedure, anatomical parts of a patient, objects encountered during the procedure and actions (e.g., movements or gestures) of the doctor (e.g., surgeon) during the procedure.
- the method can include the one or more processors using the plurality of models to generate a first prediction of the anatomy, a second prediction of the instrument, and a third prediction of the object based on the data stream.
- a data processing system can identify the plurality of models, where the plurality includes an anatomy model for identifying presence of particular patient anatomies, an instrument model for identifying presence of medical instruments, and an object model for detecting objects at or around the surgical site.
- the data processing system can identify the plurality of models comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
- the method can include the data processing system using the anatomy presence recognition model (e.g., anatomy model), the instrument presence model (e.g., instrument model), and the object model to generate the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object.
- the method can include the data processing system using the action model to generate the action prediction of the surgeon’s actions (e.g., gestures, motions, hand or body movements or positions).
- the data processing system can input the data stream (e.g., one or more video frames, video fragments, event or kinematics streams) into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure.
- the data processing system can input the data stream into the action model to generate an action prediction associated with the procedure.
- the method can apply predictions to mapping rules.
- the method can include the one or more processors applying a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
- the method can include applying the mapping function based on a surgical workflow structure.
- the method can input the predictions from the ML models into a rules engine to identify a rule that satisfies or corresponds to the ML model outputs (e.g., a particular medical instrument at a particular identified anatomy and along with one or more objects or actions detected). Based on a rule being satisfied by the ML model predictions (e.g., matching a particular activity in an ontology of surgical activities), the mapping function can map the portion of the data stream to a particular surgical activity.
- the one or more processors can identify one or more weights corresponding to the one or more models used by the data processing system.
- the one or more processors can fuse outputs of the one or more models using the weights.
- the data processing system can apply a mapping function to the first prediction, the second prediction and the third prediction of the ML models to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
- the data processing system can apply the mapping function to identify the surgical activity from a hierarchical ontology of entities.
- the hierarchical ontology of entities can include entities corresponding to particular medical procedures.
- the mapping function can identify the surgical activity from the hierarchical ontology of entities with increasing granularity.
- the entities of the hierarchical ontology can include a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
- the data processing system can determine a metric indicative of performance of the surgical activity during the temporal boundary.
- the data processing system can determine a score or a confidence value corresponding to the similarities between features detected or identified by the ML models (e.g., anatomies, instruments, objects or actions) from the data stream and the same features in the modeled medical procedure from the ontologies.
- the mapping function can determine the level of performance (e.g., similarity) between the current surgical activity (e.g., captured by the data stream) and the model surgical activity according to the entities of the surgical ontology for the given procedure.
- the method can compare the metric with a historical benchmark established for the surgical activity and provide a second indication of performance based on the comparison.
- the system can determine whether a score for a rule-based determination that uses ML model predictions exceeds a threshold.
- the method can include determining a confidence score of a determination of one or more surgical activities and their corresponding boundaries.
- the mapping function can determine a confidence score for each of the predictions by the ML models (e.g., anatomy prediction, instrument prediction, object prediction or action prediction).
- the mapping function can determine a confidence score for the identified surgical activity and a confidence score for each temporal boundary delineating different surgical activities.
- the method can compare each of the scores with thresholds.
- the threshold can include, for example, a percentage of confidence or certainty that the prediction or determination is accurate, such as 70%, 80%, 90%, 95%, 99% or 99.9%.
- the method can include providing the determination for display in response to determining that the score exceeds the threshold. For example, a determination of the surgical activity and the corresponding boundaries can be made, established or generated in response to one or more confidence scores for one or more ML model predictions or rule-based determinations exceeding their respective thresholds.
- the method can include the one or more processors identifying a confidence score associated with a prediction of the surgical activity.
- the method can include the data processing system determining that the confidence score is less than or equal to a threshold.
- the method can include providing, responsive to the confidence score being less than or equal to the threshold, a prompt via the graphical user interface.
- the method can include the one or more processors determining a second one or more confidence scores associated with at least one of the first prediction, the second prediction, or the third prediction that is less than or equal to a second threshold.
- the method can include the data processing system generating the prompt with an indication of the at least one of the first prediction, the second prediction, or the third prediction.
- the system displays the surgical activity and temporal boundaries when the rule determination at 420 exceeds the threshold. For instance, if at operation 420 the score exceeds the threshold, the one or more processors can display, alongside at least a portion of the data stream, an indication of the temporal boundary and the surgical activity.
- the one or more processors can display the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
- the one or more processors can overlay, on the video data stream of the surgery, the indication of the boundary between two different surgical activities.
- the overlaid data displayed can include ML model determinations, along with one or more confidence scores (e.g., any combination of scores for the boundary, surgical procedures or ML model predictions).
- the method can go back to operation 405 to evaluate another part of a data stream.
- the method can take a new data set and apply operations 405-425 on the new data set independently.
- the method can take a new data set and apply operations 405-425 on the new data set along with the current data set, so as to make ML model predictions on an additional set of data stream (e.g., combining the prior data stream portion and a current data stream portion).
- FIG. 5 depicts a surgical system 500, in accordance with some embodiments.
- the surgical system 500 may be an example of the medical environment 102.
- the surgical system 500 may include a robotic medical system 505 (e.g., the robotic medical system 120), a user control system 510, and an auxiliary system 515 communicatively coupled one to another.
- a visualization tool 520 (e.g., the visualization tool 114) may be connected to the auxiliary system 515, which in turn may be connected to the robotic medical system 505.
- the visualization tool may be considered connected to the robotic medical system.
- the visualization tool 520 may additionally or alternatively be directly connected to the robotic medical system 505.
- the surgical system 500 may be used to perform a computer-assisted medical procedure on a patient 525.
- the surgical team may include a surgeon 530A and additional medical personnel 530B-530D, such as a medical assistant, a nurse, an anesthesiologist, and other suitable team members who may assist with the surgical procedure or medical session.
- the medical session may include the surgical procedure being performed on the patient 525, as well as any pre-operative (e.g., which may include setup of the surgical system 500, including preparation of the patient 525 for the procedure), and post-operative (e.g., which may include clean up or post care of the patient), and/or other processes during the medical session.
- the surgical system 500 may be implemented in a non-surgical procedure, or other types of medical procedures or diagnostics that may benefit from the accuracy and convenience of the surgical system.
- the robotic medical system 505 can include a plurality of manipulator arms 535A-535D to which a plurality of medical tools (e.g., the medical tool 112) can be coupled or installed.
- Each medical tool can be any suitable surgical tool (e.g., a tool having tissue-interaction functions), imaging device (e.g., an endoscope, an ultrasound tool, etc.), sensing instrument (e.g., a force-sensing surgical instrument), diagnostic instrument, or other suitable instrument that can be used for a computer-assisted surgical procedure on the patient 525 (e.g., by being at least partially inserted into the patient and manipulated to perform a computer-assisted surgical procedure on the patient).
- the robotic medical system 505 is shown as including four manipulator arms (e.g., the manipulator arms 535A-535D), in other embodiments, the robotic medical system can include greater than or fewer than four manipulator arms. Further, not all manipulator arms can have a medical tool installed thereto at all times of the medical session. Moreover, in some embodiments, a medical tool installed on a manipulator arm can be replaced with another medical tool as suitable.
- One or more of the manipulator arms 535A-535D and/or the medical tools attached to manipulator arms can include one or more displacement transducers, orientational sensors, positional sensors, and/or other types of sensors and devices to measure parameters and/or generate kinematics information.
- One or more components of the surgical system 500 can be configured to use the measured parameters and/or the kinematics information to track (e.g., determine poses of) and/or control the medical tools, as well as anything connected to the medical tools and/or the manipulator arms 535A-535D.
- the user control system 510 can be used by the surgeon 530A to control (e.g., move) one or more of the manipulator arms 535A-535D and/or the medical tools connected to the manipulator arms.
- the user control system 510 can include a display (e.g., the display 116 or 1130) that can provide the surgeon 530A with imagery (e.g., high-definition 3D imagery) of a surgical site associated with the patient 525 as captured by a medical tool (e.g., the medical tool 112, which can be an endoscope) installed to one of the manipulator arms 535A-535D.
- the user control system 510 can include a stereo viewer having two or more displays where stereoscopic images of a surgical site associated with the patient 525 and generated by a stereoscopic imaging system can be viewed by the surgeon 530A. In some embodiments, the user control system 510 can also receive images from the auxiliary system 515 and the visualization tool 520.
- the robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled one to another in any suitable manner.
- the robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled by way of control lines 545, which can represent any wired or wireless communication link that can serve a particular implementation.
- the robotic medical system 505, the user control system 510, and the auxiliary system 515 can each include one or more wired or wireless communication interfaces, such as one or more local area network interfaces, Wi-Fi network interfaces, cellular interfaces, etc.
- the surgical system 500 can include other or additional components or elements that can be needed or considered desirable to have for the medical session for which the surgical system is being used.
- FIG. 6 depicts an example block diagram of an example computer system 600, in accordance with some embodiments.
- the computer system 600 can be any computing device used herein and can include or be used to implement a data processing system or its components.
- the computer system 600 includes at least one bus 605 or other communication component or interface for communicating information between various elements of the computer system.
- the computer system further includes at least one processor 610 or processing circuit coupled to the bus 605 for processing information.
- the computer system 600 also includes at least one main memory 615, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 605 for storing information, and instructions to be executed by the processor 610.
- the main memory 615 can be used for storing information during execution of instructions by the processor 610.
Abstract
A system of a technical solution to implement ML based procedure identification and segmentation can receive a data stream capturing a procedure performed with a robotic medical system. The system can input the data stream into ML models trained to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure. The system can apply a mapping function to the first, the second and the third predictions to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure. The system can provide, for overlay in a display of a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first, the second and the third predictions used to determine the surgical activity.
Description
MACHINE LEARNING BASED
MEDICAL PROCEDURE IDENTIFICATION AND SEGMENTATION
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/610,340, filed December 14, 2023, which is hereby incorporated by reference herein in its entirety.
BACKGROUND
[0002] Medical procedures can be performed in an operating room. As the amount and variety of equipment in the operating room increases, or medical procedures become increasingly complex, it can be challenging to perform such medical procedures efficiently, reliably, or without incident.
SUMMARY
[0003] This disclosure is generally directed to technical solutions for identifying and segmenting medical procedures into surgical segments using a combination of machine learning (ML) and clinical rule-based mapping. This technology can use multi-modal data streams (e.g., video, kinematics, events) as inputs into different ML models trained to make specific detections or predictions of features reflected in the data stream, such as anatomical parts (e.g., patient’s organs or tissues), medical instruments used in the procedure or objects involved. These ML model predictions can be combined according to clinical workflow rules or mappings according to medical ontologies to determine specific locations in the data stream where surgical activities performed start and stop. By leveraging different ML models, this technology can facilitate explaining which of the ML models (anatomy, instrument, object, action, or any other) contributed most towards the correct or incorrect classification of the specific medical (e.g., surgical) activity or temporal boundary between such activities.
Technical solutions can generate and provide objective performance indicators for the specific temporal boundary of the surgical activity, which can facilitate various use cases from intraoperative to post-operative case review or skills training.
[0004] At least one aspect of the technical solutions is directed to a system. The system can include one or more processors, coupled with memory. The one or more processors can be configured to receive a data stream that captures a procedure performed with a robotic medical
system. The one or more processors can be configured to input the data stream into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure. The one or more processors can be configured to apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure. The one or more processors can be configured to provide, for overlay in a graphical user interface configured to display at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
[0005] The system can include the one or more processors configured to receive the data stream comprising data in a plurality of modalities. The one or more processors can be configured to receive the data stream comprising at least two of a video stream, a kinematics stream or event stream. The one or more processors can be configured to apply the mapping function to identify the surgical activity from a hierarchical ontology of entities with increasing granularity. The entities of the hierarchical ontology can comprise a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
[0006] The one or more processors can be configured to provide, for display, a graphical representation of values for each entity in the hierarchical ontology. The one or more processors can be configured to identify the plurality of models comprising an anatomy presence recognition model, an instrument presence model, and an object model. The one or more processors can be configured to use the anatomy presence recognition model, the instrument presence model, and the object model to generate the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object. The one or more processors can be configured to identify the plurality of models comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
[0007] The mapping function can be based on a surgical workflow structure. To apply the mapping function, the one or more processors can be configured to identify a plurality of weights corresponding to the plurality of models and fuse outputs of the plurality of models using the plurality of weights.
[0008] The one or more processors can be configured to identify a confidence score associated with a prediction of the surgical activity. The one or more processors can be configured to determine the confidence score is less than or equal to a threshold. The one or more processors can be configured to provide, responsive to the confidence score being less than or equal to the threshold, a prompt via the graphical user interface.
[0009] The one or more processors can be configured to determine a second one or more confidence scores associated with at least one of the first prediction, the second prediction, or the third prediction that is less than or equal to a second threshold. The one or more processors can be configured to generate the prompt with an indication of the at least one of the first prediction, the second prediction, or the third prediction.
[0010] The one or more processors can be configured to receive input that indicates at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object is erroneous. The one or more processors can be configured to update, responsive to the input, at least one model of the plurality of models related to the at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object that is erroneous.
[0011] The one or more processors can be configured to determine a metric indicative of performance of the surgical activity during the temporal boundary. The one or more processors can be configured to compare the metric with a historical benchmark established for the surgical activity and provide a second indication of performance based on the comparison. The one or more processors can be configured to provide the indication for overlay in the graphical user interface during performance of the procedure. The one or more processors can be configured to provide the indication for overlay in the graphical user interface subsequent to performance of the procedure.
[0012] An aspect of the technical solutions is directed to a method. The method can include receiving, by one or more processors coupled with memory, a data stream that captures a procedure performed with a robotic medical system. The method can include identifying, by the one or more processors, a plurality of models trained with machine learning to make predictions related to an anatomy, an instrument, and an object associated with the procedure. The method can include using, by the one or more processors, the plurality of models to generate a first prediction of the anatomy, a second prediction of the instrument, and a third prediction of the object based on the data stream. The method can include applying, by the one
or more processors, a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure. The method can include displaying, by the one or more processors alongside at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity. The method can include receiving, by the one or more processors, the data stream comprising at least two of a video stream, a kinematics stream or event stream.
[0013] An aspect of the technical solutions is directed to a non-transitory computer-readable medium storing processor-executable instructions that, when executed by one or more processors, cause the one or more processors to identify a data file that captures a procedure performed with a robotic medical system. The instructions, when executed by the one or more processors, can cause the one or more processors to input the data file into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure. The instructions, when executed by the one or more processors, can cause the one or more processors to apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure. The instructions, when executed by the one or more processors, can cause the one or more processors to present an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity. The data file can be generated from a video stream of the procedure, a kinematics stream of the procedure, and an event stream of the procedure.
[0014] These and other aspects and implementations are discussed in detail below. The foregoing information and the following detailed description include illustrative examples of various aspects and implementations and provide an overview or framework for understanding the nature and character of the claimed aspects and implementations. The drawings provide illustration and a further understanding of the various aspects and implementations and are incorporated in and constitute a part of this specification. The foregoing information and the following detailed description and drawings include illustrative examples and should not be considered as limiting.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component can be labeled in every drawing. In the drawings:
[0016] FIG. 1 depicts an example system to implement machine learning based medical procedure identification and segmentation.
[0017] FIG. 2 illustrates an example configuration for identifying activities and temporal boundaries between the activities using a combination of ML models and rules of a mapping function.
[0018] FIG. 3 illustrates an example system configuration for identifying or detecting surgical output and spatial output using ML models and rules.
[0019] FIG. 4 illustrates an example flow diagram of a method for implementing machine learning based medical procedure identification and segmentation.
[0020] FIG. 5 illustrates an example of a surgical system, in accordance with some aspects of the technical solutions.
[0021] FIG. 6 illustrates an example block diagram of an example computer system, in accordance with some aspects of the technical solutions.
DETAILED DESCRIPTION
[0022] Following below are more detailed descriptions of various concepts related to, and implementations of, systems, methods, apparatuses for machine learning based medical procedure identification and segmentation. The various concepts introduced above and discussed in greater detail below can be implemented in any of numerous ways.
[0023] Although the present disclosure is discussed in the context of a surgical procedure, in various aspects, the technical solutions of this disclosure can be applicable to other medical treatments, sessions, environments or activities, as well as non-medical activities where object based procedure identification and segmentation is desired. For instance, technical solutions can be applied in any environment or application in which series or chains of activities of a process are performed, to use ML modeling and rules to recognize, from one or more data streams, individual activities of the process and the temporal boundaries between such individual activities.
[0024] The technical solutions described herein facilitate identifying segments corresponding to individual activities of a procedure, such as a medical procedure (e.g., a surgery), using a combination of machine learning and clinical rule-based mappings. This technology can detect a set of key features of an activity to infer, using a set of rules and ontologies describing the procedures, particular steps (e.g., clinical steps) and the intent of the person (e.g., a surgeon) by analyzing the data stream (e.g., a surgical data or video stream). For instance, the technical solutions can use ML models to detect any combination of a surgical patient’s specific anatomy involved in the procedure, medical tools or instruments used, objects encountered during the procedure or actions performed by a person (e.g., the surgeon) to infer the type of surgical activity performed as well as the temporal boundary of when such surgical activity starts and ends, based on the data stream (e.g., video or sensor data) input into the ML models.
[0025] Automated recognition of individual activities (e.g., medical or surgical segments of a procedure) using, for example, video recordings of robot-assisted surgeries, can be challenging. For instance, detection or recognition of individual activities via ML models trained with labels can fail to identify temporal boundaries between the individual activities in the process when such activities are defined by a person’s (e.g., surgeon’s) intent. As a result, ML models trained to recognize boundary definitions between individual surgical tasks may lack the context into the range of time within which to define the activity. Since decisions by deep learning models may be hard to trace or explain, it can be challenging to understand the root causes of any model errors that may occur. As mislabeling of segments can occur in ways that are not clinically expected, this can lead to erroneous temporal boundary predictions, which can be better understood with an improved understanding of the clinical context.
[0026] The technical solutions of the present disclosure overcome these challenges by providing a series of deep learning models trained on specific surgery related tasks to recognize surgical activities (e.g., segments) and temporal boundaries between them based on ground truth labels (e.g., detected and recognized objects, tasks, anatomies or actions). This technology can incorporate clinical knowledge to define rules to be used on the outputs of the models to infer, detect, identify or mark the segments of interest within the surgical video. In doing so, the technical solutions can apply a specific definition of the start and stop times of a surgical segment and recognize each of the components of these definitions with individual models trained for the particular actions.
[0027] The technical solutions can use a variety of task-specific models to provide the context for the temporal boundary recognition. For instance, the technical solutions can include a machine learning (ML) model trained to identify or detect a particular anatomy of a patient (e.g., a part of a patient’s body) during a surgery. The technical solutions can include an ML model trained to identify or detect particular medical tools used in the surgery, as well as an ML model trained to detect objects, such as a mesh, a needle or an ultrasound tool. The technical solutions can include a tool-tissue interaction model to determine tool-to-tissue interaction (e.g., a touch or a cut) or an energy use model or model for other system events (e.g., clip application or staple fire). The combination of one or more such models may not be limited to recognizing any single or set of anatomies, actions, objects, or instruments, but rather can be combined to recognize clinically relevant surgical actions or segments of surgeries within the context of surgical ontologies.
[0028] Each of the plurality of individual ML models can include a specific machine learning neural network that can be trained offline using clinical data and human expert annotated labels. The trained models can take the procedure videos/system data as input. By processing through multiple layers of a neural network, the models can output the predicted labels that can be converted into one or more recognized anatomies (e.g., patient body parts), surgical instruments used in relation to the anatomies, object encountered or involved in the surgery, actions taken in the course of surgery, or any other types of clinical predictions.
[0029] For example, technical solutions can utilize an anatomy presence recognition model that can involve or work with spatial and temporal correlations to identify or detect representative semantics of the surgical events and objects using an attention mechanism. An attention mechanism of ML models can include a selective weighting technology that applies or assigns different weights to the spatial and temporal segments or chunks of data. The weights can help emphasize different pieces of information, such as interactions between particular objects or medical instruments and particular anatomical parts of the patient, resulting in finding the most suitable or most accurate compact representation that fulfills a particular task requirement. Attention mechanisms of ML models can be used to allow the use of noisy, or sometimes incorrect, training labels while still reaching correct determinations. For instance, network structures used for attention mechanisms can include, for example, 3D vision transformers. The network structures can include a deep learning architecture designed for video understanding, applying a transformer’s self-attention mechanism across spatial and temporal dimensions to effectively model spatiotemporal patterns for tasks such as action recognition (e.g., TimeSformer). In doing so, the technical solutions can utilize transformer-based attention mechanisms to combine outputs from various models (e.g., detected objects, instruments, patient’s body parts or surgeon’s movements) to identify the surgical action (e.g., segment of surgery) or infer the surgeon’s intent during the procedure.
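The selective weighting idea behind such attention mechanisms can be illustrated with a minimal, self-contained sketch of scaled dot-product self-attention over a sequence of spatiotemporal feature tokens. This is a simplification for illustration only, with assumed dimensions; it is not the TimeSformer architecture or an implementation from this disclosure.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Each output token is a weighted sum of value vectors, with weights derived
    from query-key similarity (the selective weighting described above)."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                   # (tokens, tokens) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax attention weights
    return weights @ values, weights

# Six tokens (e.g., features from different frames or patches), 8-dimensional each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
attended, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended.shape, attn.shape)  # (6, 8) (6, 6)
```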
[0030] The technical solutions can treat the anatomy presence identification as a classification issue. The classification head can be created to regress the feature vectors to a length-n vector of anatomies. The objective can be to minimize the difference between labels and the predictions of the model. For instance, in the inferencing/deployment stage, the learned configuration and parameters of the neural network can be transferred to the processing unit. A procedure video can be discretized similarly to the training data and fed into the machine learning core. The output can include the final predictions of which anatomies are present at any point in time. For other surgical objects and event recognitions, a similar machine learning model can be applied individually, based on the task.
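The classification-head idea can be sketched as follows: a feature vector is mapped to a length-n vector of per-anatomy probabilities, and training would minimize the difference between those probabilities and the annotated labels. The multi-label sigmoid formulation, the dimensions and the random parameters are assumptions made for this illustration.

```python
import numpy as np

def anatomy_head(features: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Map a feature vector to a length-n vector of per-anatomy presence probabilities
    (multi-label: an independent sigmoid per anatomy class)."""
    logits = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def multilabel_loss(probs: np.ndarray, labels: np.ndarray) -> float:
    """Binary cross-entropy between predicted anatomy presence and annotated labels;
    training would minimize this difference."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

rng = np.random.default_rng(1)
features = rng.normal(size=128)                         # feature vector from a video backbone
W, b = rng.normal(size=(128, 5)) * 0.01, np.zeros(5)    # 5 illustrative anatomy classes
labels = np.array([1, 0, 0, 1, 0])                      # expert-annotated anatomy presence
probs = anatomy_head(features, W, b)
print(probs.round(2), multilabel_loss(probs, labels))
```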
[0031] The models can be combined or integrated using predefined mappings (e.g., similar to a voting system or a decision tree) which can use rules to map individual characteristics (e.g., outputs of various models) onto a well-defined ontology with a clinical meaning. For example, in segmentectomies and lobectomies, a combination of outputs of one or more models detecting a lung, artery, vein, or bronchus, combined with dissector or stapler recognition, cartridge color, staple fire or energy application, can indicate the corresponding structure’s dissection and division step. For example, in a partial nephrectomy procedure, model outputs recognizing structures or objects, such as an ultrasound probe, clamp, needle, or clip application, in combination with one or more unique motions, energy applications, and certain anatomy (e.g., detected body parts), can be indicative of a procedure task, such as marking a tumor boundary using ultrasound probe recognition, or marking a circular path with energy around the tumor. For example, a tumor excision step can be detected based on the detection or recognition of the clamp on the renal artery as a key start moment (e.g., due to the kidney being deprived of a blood supply), along with medical tool positioning and a freed tumor section marking the excision complete. In another example, in a hysterectomy procedure, an object recognition of a colpotomy ring combined with a last energy application while the colpotomy ring is in view can be used to determine or identify a colpotomy dissection step. In another example, detection of a needle, needle driver installation times, and identification of the vaginal cuff can be used to detect or determine the first interaction or intention to close the vaginal cuff. For suture and stapler techniques, strategies similar to those listed above can be applied to various procedures, with slight variations, to model and detect the various suture steps across different procedures.
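A minimal rules-engine sketch of such predefined mappings is shown below. The rule predicates, label strings and step names are illustrative placeholders loosely following the lobectomy and hysterectomy examples above, not the actual rule set.

```python
# Illustrative rules-engine sketch: each rule is a predicate over the per-chunk
# model outputs and maps matching combinations to a named step in a procedure
# ontology (voting/decision-tree style). All names are placeholders.
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class ModelOutputs:
    anatomies: Set[str] = field(default_factory=set)
    instruments: Set[str] = field(default_factory=set)
    objects: Set[str] = field(default_factory=set)
    events: Set[str] = field(default_factory=set)   # e.g., "staple_fire", "energy_on"

@dataclass
class Rule:
    step_name: str
    predicate: Callable[[ModelOutputs], bool]

# Example rules loosely following the lobectomy and hysterectomy examples above.
RULES: List[Rule] = [
    Rule("artery_division",
         lambda o: "artery" in o.anatomies and "stapler" in o.instruments
                   and "staple_fire" in o.events),
    Rule("colpotomy_dissection",
         lambda o: "colpotomy_ring" in o.objects and "energy_on" in o.events),
]

def map_outputs_to_steps(outputs: ModelOutputs) -> List[str]:
    """Return every ontology step whose rule matches the combined outputs."""
    return [rule.step_name for rule in RULES if rule.predicate(outputs)]

chunk = ModelOutputs(anatomies={"artery", "lung"},
                     instruments={"stapler"},
                     events={"staple_fire"})
print(map_outputs_to_steps(chunk))   # ['artery_division']
```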
[0032] The technical solutions can include a set of rules that can consider the history of model outputs throughout a case. For example, when a task cannot be identified because an anatomy of the patient is not detected, a rule can be added to consider outputs from other models in the same procedure and make the determination without using the anatomy model’s output. In such instances, a smaller memory footprint can be utilized, as only the final outputs from the models are retained, rather than all of the images passed through the models. Technical solutions can include kinematic analysis to recognize gestures (e.g., suturing) which can be used as an input to recognizing surgical actions or segments.
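Building on the ModelOutputs and map_outputs_to_steps sketch above, the following illustrative example keeps only the compact per-chunk outputs as case history and lets a rule fall back to the most recently detected anatomy when the anatomy model provides no output for the current chunk; the history length and fallback policy are assumptions.

```python
# Illustrative sketch: retain only compact final outputs per chunk (not the
# frames themselves) and fall back to recent history when anatomy is missing.
# Assumes ModelOutputs and map_outputs_to_steps from the sketch above.
from collections import deque
from typing import Optional

class CaseHistory:
    def __init__(self, max_chunks: int = 200):
        # Rolling buffer of ModelOutputs entries, one per processed chunk.
        self.outputs = deque(maxlen=max_chunks)

    def add(self, outputs: "ModelOutputs") -> None:
        self.outputs.append(outputs)

    def last_seen_anatomy(self) -> Optional[str]:
        # Walk backwards to the most recent chunk with any detected anatomy.
        for past in reversed(self.outputs):
            if past.anatomies:
                return sorted(past.anatomies)[0]
        return None

def resolve_step(current: "ModelOutputs", history: CaseHistory) -> list:
    if not current.anatomies:
        fallback = history.last_seen_anatomy()
        if fallback is not None:
            current = ModelOutputs(anatomies={fallback},
                                   instruments=current.instruments,
                                   objects=current.objects,
                                   events=current.events)
    history.add(current)
    return map_outputs_to_steps(current)
```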
[0033] Technical solutions can provide final outputs based on rules. For example, technical solutions can use models to recognize the usage of a medical tool (e.g., a clip applier) on a particular anatomy of a patient (e.g., cystic artery), and, based on a rule, identify that the combination of this particular medical tool and this particular anatomy corresponds to a specific surgical task (e.g., ligation and division of the cystic artery). For example, technical solutions can recognize an object (e.g., the entrance of a mesh into view) in the course of a hernia repair and use a rule to detect a surgical task or a procedure of a mesh placement surgery. As the technical solutions are aware of the source of the outputs (e.g., which models provide which outputs), the technical solutions can more accurately trace any errors back to an exact issue with model performance. For instance, a confusion of a cystic artery with a cystic duct can be traced back to data of these body parts in an anatomy model, while a confusion of a mesh with a gauze can be traced to the object model. In doing so, the technical solutions can more accurately detect and describe to end users the intended tasks performed by the surgeon, while also allowing for more granular identification of any errors or issues in the event of a failure case.
[0034] The technical solutions can use a deterministic approach to final classification across different model versions with improved reproducibility. In a layered approach, component model outputs can be provided and used independently for different purposes, even if the rules are not useful to an end user for a particular determination. The technical solutions can incorporate clinical knowledge of best practice and feasibility, as such clinical knowledge can be useful in correctly annotating a case, thereby improving the accuracy and allowing for more complex final outputs. Computer vision (CV) models can be trained to recognize visible objects, anatomies, and other features that can be used in conjunction to imply surgical intent. As a result, the technical solutions can identify, infer, detect or define a start and stop time of a surgical task or a segment from a video recording of the surgery. Segment rules can be applied and displayed to an end user, allowing the user to monitor the machine annotation and understand the context in which the decision is made. In doing so, the technical solutions can allow the user to observe the specific rules that are used, so that the user can follow the decision making in the event that the user disagrees with the annotation. The technical solutions also facilitate or allow for redefining of rules or annotations without losing historical data. Therefore, if segment definitions change over time, all the component labels can remain unchanged and can be used to recreate segments under the new definitions.
[0035] FIG. 1 depicts an example system 100 for machine learning based medical procedure segmentation. Example system 100 can include a robotic medical system 120 which can be used by a surgeon to perform a surgery on a patient. Robotic medical system 120 can be deployed in a medical environment 102, which can include any facility for performing medical procedures, such as a surgical facility or an operating room. Medical environment 102 can include various medical instruments and tools that the robotic medical system 120 can use for performing patient procedures, whether invasive, non-invasive, in-patient, or outpatient.
[0036] The medical environment 102 can include one or more data capture devices 110 (e.g., optical devices, such as cameras or sensors, or other types of sensors or detectors) for capturing data streams 162 (e.g., images or videos of a surgery). The medical environment 102 can include one or more visualization tools 114 to gather the captured data streams 162 and process them for display to the user (e.g., a surgeon or other medical professional) at one or more displays 116. A display 116 can present a data stream 162 (e.g., images or video frames) of a medical procedure (e.g., surgery) being performed using the robotic medical system 120 handling, manipulating, holding or otherwise utilizing medical tools 112 to perform surgical tasks at the surgical site. Coupled with the robotic medical system 120, via a network 101, can be a data processing system (DPS) 130. DPS 130 can include one or more machine learning (ML) models 140, data repositories 160, mapping functions 170 and interfaces 190.
[0037] ML models 140 can include one or more anatomy models 142 for detecting, identifying or providing anatomy predictions 152 of various anatomical parts of a patient, such as a patient’s organs, bones, tissues, muscles, blood vessels, airways, urinary and reproductive systems, glands and other anatomical features. ML models 140 can include one or more instrument models 144 for detecting, identifying or providing instrument predictions 154 of medical instruments 112 or tools used during the procedure (e.g., a scalpel, scissors, forceps, stapler, surgical suctions or specula). ML models 140 can include one or more object models 146 for detecting, identifying or providing object predictions 156 of various objects encountered during the surgery (e.g., a mesh, a probe or a surgical screw). ML models 140 can include one or more action models 148 for detecting, identifying or providing action predictions 158 of actions performed during the procedure (e.g., ligations, sutures, clip applications or incisions). ML models 140 can include and utilize weights 150 to prioritize or assign higher or lower significance to various predictions (e.g., 152-158) for different models (e.g., 142-148).
[0038] Data repository 160 of the DPS 130 can include one or more data streams 162, such as video data, force or torque data, biometric data of a patient, haptic feedback data, endoscopic data, ultrasound imaging or communication and command data streams. Data repository 160 can include ontologies 164 for identifying surgical tasks in structured knowledge representations of procedures, instruments, and anatomical structures. Data repository 160 can include entities 166 corresponding to movements, gestures or actions associated with particular activities within a surgical procedure.
[0039] Mapping function 170 can include rules 172 for identifying specific surgical activities 176 corresponding to, or in the context of, entities 166 or ontologies 164, as well as boundaries 174 between identified surgical activities 176. Mapping function 170 can include and assign confidence scores 178 with respect to particular detections, determinations or recognitions of various models 140 according to their corresponding thresholds 180. DPS 130 can include an interface 190 (e.g., user interface) to communicate with the user and provide indications 192 of the determinations made by the DPS 130, such as various ML model-based and rule-based detected surgical activities 176 and the corresponding temporal boundaries 174 between them.
[0040] The system 100 can include one or more data capture devices 110 (e.g., video cameras, sensors or detectors) for collecting any data stream 162 that can be used for ML and rules based detection and identification of surgical activities 176 and temporal boundaries 174 (e.g., between two or more such surgical activities 176 of a surgical procedure). Data capture devices 110 can include cameras or other image capture devices for capturing videos or images from a particular viewpoint within the medical environment 102. The data capture devices 110 can be positioned, mounted, or otherwise located to capture content from any viewpoint that facilitates capture of various surgical tasks or actions by the data processing system.
[0041] Data capture devices 110 can include any of a variety of sensors, cameras, video imaging devices, infrared imaging devices, visible light imaging devices, intensity imaging devices (e.g., black, color, grayscale imaging devices, etc.), depth imaging devices (e.g., stereoscopic imaging devices, time-of-flight imaging devices, etc.), medical imaging devices such as endoscopic imaging devices, ultrasound imaging devices, etc., non-visible light imaging devices, any combination or sub-combination of the above mentioned imaging devices, or any other type of imaging devices that can be suitable for the purposes described herein. Data capture devices 110 can include cameras that a surgeon can use to perform a surgery and observe manipulation components within a field of view suitable for the given task performance.
[0042] Data capture devices 110 can capture, detect, or acquire sensor data, such as videos or images, including for example, still images, video images, vector images, bitmap images, other types of images, or combinations thereof. The data capture devices 110 can capture the images at any suitable predetermined capture rate or frequency. Settings, such as zoom settings or resolution, of each of the data capture devices 110 can vary as desired to capture suitable images from any viewpoint. For instance, data capture devices 110 can have fixed viewpoints, locations, positions, or orientations. The data capture devices 110 can be portable, or otherwise configured to change orientation or telescope in various directions. The data capture devices 110 can be part of a multi-sensor architecture including multiple sensors, with each sensor being configured to detect, measure, or otherwise capture a particular parameter (e.g., sound, images, or pressure).
[0043] Data capture devices 110 can include any type and form of a sensor, such as a positioning sensor, a biometric sensor, a velocity sensor, an acceleration sensor, a vibration sensor, a motion sensor, a pressure sensor, a light sensor, a distance sensor, a current sensor, a focus sensor, a temperature or pressure sensor or any other type and form of sensor used for providing data on medical tools 112, or data capture devices (e.g., optical devices). For example, a data capture device 110 can include a location sensor, a distance sensor or a positioning sensor providing coordinate locations of a medical tool 112 or a data capture device 110. Data capture device 110 can include a sensor providing information or data on a location, position or spatial orientation of an object (e.g., medical tool 112 or a lens of data capture device 110) with respect to a reference point. The reference point can include any fixed, defined location used as the starting point for measuring distances and positions in a specific direction, serving as the origin from which all other points or locations can be determined.
[0044] Display 116 can show, illustrate or play data streams 162 (e.g., images or videos) in which medical tools 112 and location of the surgery are presented. For example, display 116 can display a rectangular image (e.g., one or more video frames) of the surgical site along with at least a portion of medical tools 112 (e.g., instruments) being used to perform surgical activities 176 (e.g., entities 166 within the context of various surgical ontologies 164). Display 116 can provide compiled or composite images generated by the visualization tool 114 from a plurality of data capture devices 110 to provide a visual feedback from one or more points of view.
[0045] The visualization tool 114 can be configured or designed to receive any number of different sensor data streams 162 from any number of data capture devices 110 and combine them into a single data stream displayed on a display 116. The visualization tool 114 can be configured to receive a plurality of data stream components and combine the plurality of data stream components into a single data stream 162. For instance, the visualization tool 114 can receive visual sensor data from one or more medical tools 112, sensors or cameras with respect to a surgical site or an area in which a surgery is performed. The visualization tool 114 can incorporate, combine or utilize multiple types of data (e.g., positioning data of a medical tool 112 along with sensor readings of pressure, temperature, vibration or any other data) to generate an output to present on a display 116. Visualization tool 114 can present locations of medical tools 112 along with locations of any reference points or surgical sites, including locations of anatomical parts of the patient (e.g., organs, glands or bones).
[0046] Medical tools 112 can be any type and form of tool or instrument used for surgery, medical procedures or a tool in an operating room or environment. Medical tool 112 can be imaged by, associated with or include an image capture device. For instance, a medical tool 112 can be a tool for making incisions, a tool for suturing a wound, an endoscope for visualizing organs or tissues, an imaging device, a needle and a thread for stitching a wound, a surgical scalpel, forceps, scissors, retractors, graspers, or any other tool or instrument to be used during a surgery. Medical tools 112 can include hemostats, trocars, surgical drills, suction devices or any instruments for use during a surgery. The medical tool 112 can include other or additional types of therapeutic or diagnostic medical imaging implements. The medical tool 112 can be configured to be installed in, coupled with, or manipulated by a robotic medical system 120, such as by manipulator arms or other components for holding, using and manipulating the medical instruments or tools 112.
[0047] The robotic medical system 120 can be a computer-assisted system configured to perform a surgical or medical procedure or activity on a patient via or using or with the assistance of one or more robotic components or medical tools 112. The robotic medical system 120 can include any number of manipulator arms for grasping, holding or manipulating various medical tools 112 and performing computer-assisted medical tasks using medical tools 112 controlled by the manipulator arms.
[0048] The images (e.g., video images) captured by a medical tool 112 can be sent to the visualization tool 114. The robotic medical system 120 can include one or more input ports to receive direct or indirect connection of one or more auxiliary devices. For example, the visualization tool 114 can be connected to the robotic medical system 120 to receive the images from the medical tool when the medical tool is installed in the robotic medical system (e.g., on a manipulator arm of the robotic medical system). The visualization tool 114 can combine the data stream components from the data capture devices 110 and the medical tool 112 into a single combined data stream for presenting on a display 116.
[0049] The system 100 can include a data processing system 130. The data processing system 130 can be deployed in or associated with the medical environment 102, or it can be provided by a remote server or be cloud-based. The data processing system 130 can include an interface 190 designed, constructed and operational to communicate with one or more components of system 100 via network 101, including, for example, the robotic medical system 120. Data processing system 130 can be implemented using instructions stored in memory locations and processed by one or more processors, controllers or integrated circuitry. Data processing system 130 can include functionalities, computer codes or programs for executing or implementing ML models 140 (e.g., 142-148) to detect, identify or predict anatomies, medical instruments, objects or actions involved in the surgical activities 176 or entities 166 corresponding to surgical ontologies 164 that can specify particular surgical procedures or operations.
[0050] ML models 140 can include any variety or combination of machine learning architectures. For example, ML models 140 can include support vector machines (SVMs) that can facilitate predictions (e.g., anatomical, instrument, object, action or any other) in relation to class boundaries, random forests for classification and regression tasks, decision trees for predictions structured around distinct decision points, K-nearest neighbors (KNNs) that can use similarity measures for predictions based on characteristics of neighboring data points, Naive Bayes functions for probabilistic classifications, logistic or linear regressions, or gradient boosting models. ML models 140 can include neural networks, such as deep neural networks configured for hierarchical representations of features, convolutional neural networks (CNNs) for image-based classifications and predictions, as well as spatial relations and hierarchies, and recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for modeling structures and processes that unfold over time or for multimodal data integration in which medical images can be combined with a patient’s data or history.
[0051] ML models 140 can include or utilize transformers or transformer-based architectures (e.g., graphical neural networks with transformers) which can be configured to make predictions related to medical imaging, including predictions of anatomies, objects, instruments, or spatial relations between different features (e.g., the spatial arrangement between anatomical parts of a patient and a medical instrument) to infer actions or activities performed. ML models 140 can include transformers utilizing or facilitating an attention mechanism that can allow the transformers to focus on different parts of input data to make predictions, such as highlighting relevant regions of interest in images or aiding in tasks such as anomaly detection. Transformers can be used for multimodal integration in which data streams 162 from multiple types of sources (e.g., data from various detectors, sensors and cameras) can be combined for predictions. ML models 140 can include spatial transformer networks which can be applied to medical imaging data to facilitate spatial relations, alignment or normalization of features across different images or data sources (e.g., 110).
[0052] Machine learning (ML) models 140 can include any one or more machine learning (e.g., deep neural network) models trained on diverse datasets to learn to recognize intricate details (e.g., objects, anatomical parts, medical instruments, actions), patterns and relationships for analysis of medical procedures. ML models 140 can include deep learning models for medical applications, particularly in anatomical identification, medical instrument recognition, and surgical action detection. ML models 140 can include computational architectures trained on extensive datasets to autonomously identify and recognize anatomical structures, medical instruments, surgical objects and medical actions (e.g., doctor’s movements or actions). ML models 140 can be trained, configured or designed for any medical domain, such as surgery, radiology, medical imaging, pathology, diagnostics, telemedicine, radiation therapy, rehabilitation and physical therapy, drug discovery and development, genomics or epidemiology.
[0053] ML models 140 can utilize hierarchical ontology of any medical field to monitor, recognize or map objects, medical instruments, anatomies or activities with respect to any
medical treatment or procedure in any type of medical field. ML models 140 can leverage hierarchical representations in one or more layers, enabling them to discern complex patterns and relationships. For instance, one or more deep learning ML models 140 can be trained on a diverse dataset containing medical images, videos, and associated metadata to learn the distinct features of anatomies, instruments, objects, and actions. Once trained, these one or more ML models 140 can generalize their understanding to new, unseen data, providing automated assistance in tasks such as anatomical part segmentation, instrument or object detection, and action recognition in medical settings (e.g., surgical room, epidemiological facility, medical imaging area, pathology or any other). For instance, one or more deep learning ML models 140 can identify and locate specific anatomical landmarks in medical images, identify entities 166 (e.g., actions or tasks) in a medical procedure being performed according to medical ontologies 164 and identify temporal boundaries 174 between different surgical activities 176 (e.g., segments of surgery) being performed.
[0054] ML models 140 can be stored in a data repository 160. ML models 140 can be trained, established, configured, updated, or otherwise provided by a model generator or a trainer. The trainer can train the ML models on one or more specific training datasets, which can include labeled images or video frames of medical instruments, anatomical parts, medical objects, movements or actions by medical professionals during a medical treatment or procedure or any other related medical activity. ML models 140 can be configured to identify, predict, classify, categorize, or otherwise score aspects of a medical procedure, including determining confidence scores 178 with respect to particular model outputs (e.g., predictions 152-158). The data processing system 130 can utilize a single, multi-modal machine learning model, or multiple machine learning models. The multiple machine learning models can each be multi-modal, for example.
[0055] ML models 140 can include an anatomy model 142 to make anatomy predictions 152 also referred to as anatomies 152. Anatomy model 142 can include, for example, a deep learning model configured to identify, detect or recognize anatomies 152, which can include or correspond to any body part of a patient undergoing medical treatment or a procedure.
Anatomy model 142 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular anatomy 152 that is involved in a medical procedure or treatment (e.g., surgery). Anatomy model 142 can be configured to identify, detect and provide as output indications of detected organs, tissues, bones, muscles, nodes, parts of a cardiovascular system, respiratory system, nervous system, digestive system, endocrine system, urinary system, integumentary system or any other anatomical part of a patient.
[0056] ML models 140 can include an instrument model 144 to make instrument predictions 154. Instrument model 144 can include, for example, a deep learning model configured to identify, detect or recognize instrument predictions 154, which can include or correspond to predictions or recognition of any medical instrument 112 that can be used in medical treatment (e.g., a surgery) being captured by data capture devices 110 (e.g., cameras or other sensors). Instrument predictions 154 detected or recognized by the instrument model 144 can include, for example, shears, needles, threads, scalpels, clips, rings, bone screws, graspers, retractors, saws, forceps, imaging devices, or any other medical instrument 112 or a tool used in a medical procedure.
[0057] ML models 140 can include an object model 146 to make object predictions 156, also referred to as objects 156. Object model 146 can include, for example, a deep learning model configured to identify, detect or recognize objects 156, which can include or correspond to any structures or devices encountered during the medical procedure (e.g., surgery) as captured, reflected or indicated by the data stream 162. Object predictions 156 can include, for example, a mesh, a needle, a clip, a hip or knee replacement part, a brace, a colpotomy ring or any other device, structure or instrument encountered or indicated in the images, videos or sensor readings of data stream 162. Object model 146 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular object 156 that is involved, indicated, recorded or detected in a medical procedure or treatment (e.g., surgery).
[0058] ML models 140 can include an action model 148 to make action predictions 158, also referred to as actions 158. Action model 148 can include, for example, a deep learning model configured to identify, detect or recognize actions 158, which can include any actions (e.g., gestures or movements) performed by a surgeon. Action model 148 can be configured to receive any combination of image, video, sensor or any data stream 162 to identify a particular action 158 of the surgeon, such as ligations, sutures, clip applications, applications of a medical instrument 112 to a particular anatomy 152, or any other gestures, interactions or movements involving anatomies 152, instrument predictions 154 (e.g., medical instruments 112) or objects 156. Action model 148 can be configured to identify, detect and
provide as output indications of tasks in a medical procedure, such as slicing an organ of a patient, suturing a surgical site, clipping a surgical site, applying a scalpel to a particular tissue or an organ, or performing any other action or activity in the course of a medical procedure.
[0059] The data repository 160 can include one or more data files, data structures, arrays, values, or other information that facilitates operation of the data processing system 130. The data repository 160 can include one or more local or distributed databases and can include a database management system. The data repository 160 can include, maintain, or manage a data stream 162. The data stream 162 can include or be formed from one or more of a video stream, image stream, stream of sensor measurements, event stream, or kinematics stream. The data stream 162 can include data collected by one or more data capture devices 110, such as a set of 3D sensors from a variety of angles or vantage points with respect to the procedure activity (e.g., point or area of surgery).
[0060] Data stream 162 can include an event stream which can include a stream of event data or information, such as packets, that identify or convey a state of the robotic medical system 120 or an event that occurred in association with the robotic medical system 120 or a surgery being performed with the robotic medical system. Data of the event stream can be captured by the robotic medical system 120 or a data capture device 110. The event stream can include a state of the robotic medical system 120 indicating whether the medical tool or instrument 112 is calibrated, adjusted or installed on a manipulator arm of the robotic medical system 120. The event stream can include data on whether the robotic medical system 120 was fully functional (e.g., without errors) during the procedure. For example, when the medical instrument 112 is installed on a manipulator arm of the robotic medical system 120, a signal or data packet(s) can be generated indicating that the medical instrument 112 has been installed on the manipulator arm of the robotic medical system 120. Another example state of the robotic medical system 120 can indicate whether the visualization tool 114 is connected, whether directly to the robotic medical system 120 or indirectly through another auxiliary system that is connected to the robotic medical system 120.
[0061] Data stream 162 can include kinematics stream data, which can refer to or include data associated with one or more of the manipulator arms or medical tools 112 (e.g., instruments) attached to the manipulator arms. Data corresponding to medical tools 112 can be captured or detected by one or more displacement transducers, orientational sensors, positional sensors, or other types of sensors and devices to measure parameters or generate kinematics
information. The kinematics data can include sensor data along with time stamps and an indication of the medical tool 112 or type of medical tool 112 associated with the data stream 162.
[0062] Data repository 160 can store ontologies 164 of medical procedures that can include, correspond to, or relate to entities 166 of such procedures. Ontologies 164 can provide a structured framework for understanding the hierarchical relationships, contextual meanings, and dependencies among various procedures, anatomical structures, instruments, and actions involved in a medical procedure. Ontology 164 can include any knowledge representation system or a structure that organizes and defines concepts, relationships, and entities 166 within the healthcare domain. Entities 166 can include any features or characteristics of a task or activity in a medical procedure, such as a particular gesture, movement or action 158 applied to a particular anatomy 152 of a patient using a particular instrument 154 or an object 156. Ontologies 164 can define or describe various entities 166 (e.g., gestures, movements or actions of a surgeon or a doctor) with respect to any one or more (e.g., a series of) tasks in a particular medical procedure (e.g., a surgery). Ontologies 164 can provide structured descriptions of entities (e.g., tasks) of any medical procedures which ML models 140 can recognize, determine, detect, identify or classify. ML models 140 can identify and classify surgical or medical activities 176 (e.g., and boundaries 174 therebetween) by comparing detected predictions 152-158 with entities 166 within ontologies 164 to identify matching surgical activities 176 within ontologies 164 of various medical procedures or tasks.
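One illustrative way to represent an ontology 164 and its entities 166 as data that the mapping function 170 can traverse is sketched below; the dataclass fields, procedure name and entity names are assumptions for illustration only, not the disclosed schema.

```python
# Illustrative data structures (assumptions, not the disclosed schema) for an
# ontology 164 whose entities 166 describe tasks in terms of anatomy,
# instrument, object and action predictions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entity:                      # one task/gesture within a procedure
    name: str
    anatomy: Optional[str] = None
    instrument: Optional[str] = None
    obj: Optional[str] = None
    action: Optional[str] = None

@dataclass
class Ontology:                    # structured description of one procedure
    procedure: str
    entities: List[Entity] = field(default_factory=list)

cholecystectomy = Ontology(
    procedure="cholecystectomy",
    entities=[
        Entity(name="ligate_cystic_artery", anatomy="cystic_artery",
               instrument="clip_applier", action="clip_application"),
        Entity(name="divide_cystic_duct", anatomy="cystic_duct",
               instrument="shears", action="cut"),
    ],
)

def match_entity(ontology: Ontology, anatomy: str, instrument: str,
                 action: str) -> Optional[Entity]:
    """Return the first entity whose fields agree with the model predictions."""
    for entity in ontology.entities:
        if (entity.anatomy in (None, anatomy)
                and entity.instrument in (None, instrument)
                and entity.action in (None, action)):
            return entity
    return None
```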
[0063] Mapping function 170 can include any combination of hardware and software for processing outputs of ML models 140 according to rules 172 to determine surgical activities 176 and boundaries 174 between the surgical activities 176. Surgical activities 176 can include any activities of a medical procedure, such as a task or a portion of a surgery or a treatment. Surgical activity 176 can include application of an anesthetic, ligation or hemostasis, application of a drainage or any other task in a surgery. Surgical activities 176 can be specific to a particular type of surgery or procedure and can be described by the patient’s anatomy, the medical instruments 112 used, the objects encountered or the actions of the surgeon. The combination of the ML based predictions, such as the anatomy, instruments, objects or actions, can be used to specify or identify a particular surgical activity 176 of a particular surgery or treatment.
[0064] Boundary 174 can include any indication of a start or end of a surgical activity 176. Boundary 174 can include an indication of a start or end of a particular surgical procedure. Boundaries 174 can be defined or characterized according to a particular timestamp in a data
stream 162 or according to a reference point (e.g., start or end of a video file, video fragment or frame). Boundary 174 can indicate a point where one surgical activity 176 ends and another surgical activity 176 begins. Boundary 174 can include an indication of a time of occurrence of the boundary between two surgical activities 176 or a point in a video file (e.g., a video frame) or a range of frames (e.g., one or more video fragments over a duration of one or more seconds) during which one surgical activity 176 ends and another surgical activity 176 begins.
[0065] Mapping function 170 can include the functionality to map the outputs from the ML models 140 (e.g., 152-158) to entities 166 of medical (e.g., surgical) ontologies 164 to identify the surgical activities 176. For instance, a mapping function 170 can utilize predictions 152-158 as input into a rules engine having any number of rules 172 that can map the outputs against entities 166 of medical ontologies 164. Upon identifying a match between the predictions 152-158 and a particular task in an ontology 164, mapping function 170 can determine or identify the surgical activity 176 based on the match with the given task in a particular ontology 164. Mapping function 170 can include the functionality (e.g., codes, algorithms or instructions) to apply anatomy predictions 152, instrument predictions 154, object predictions 156 or action predictions 158 to rules 172 to identify surgical activities 176 or boundaries 174 according to ontologies 164. Mapping function 170 can determine confidence scores 178 for any determined surgical activities 176 or boundaries 174.
[0066] Mapping function 170 can include rules 172 that can map ML model 140 outputs (e.g., any combination of anatomy predictions 152, instrument predictions 154, object predictions 156 or action predictions 158) to particular medical procedures (e.g., as described in ontologies 164). Mapping function 170 can include a plurality of rules 172 correlating or associating particular combinations of ML model 140 outputs with particular medical procedures, surgeries or treatments. Mapping function 170 can include, for example, a rule 172 associating a particular combination of certain anatomical parts of a patient, certain medical instruments 112 and particular actions by a surgeon with a certain task in a surgical procedure (e.g., opening a cut on an eye with a scalpel or suturing a wound on a right knee).
[0067] Rules 172 can include any conditional statements using ML models 140 outputs (e.g., 152-158) to map and identify matches with medical procedures within ontologies 164. Rules 172 can include any logical conditions using predictions 152-158 as decision criteria to map or identify matches with particular surgical activities 176 described in any one or more ontologies 164. Rules 172 can define logical conditions and relationships based on the predictions or outputs generated by ML models 140 and can provide a framework for
associating specific medical activities, instruments or objects with relevant concepts in the ontological structure. By applying rules 172, technical solutions can navigate different tasks (e.g., entities 166) within the ontologies, matching identified patterns or features from the ML model 140 outputs to the corresponding medical procedures, facilitating recognition or detection of surgical activities 176 and boundaries 174.
[0068] Mapping function 170 can compare the scores 178 against thresholds 180 for any particular score 178. For instance, anatomy model 142 can output an anatomy prediction 152 based on the image or video input into the anatomy model 142. For instance, mapping function 170 can determine that the score 178 (e.g., confidence with respect to the level of certainty that the anatomy prediction 152 is accurate) exceeds a threshold 180 for the anatomy predictions 152. For example, when an instrument model 144 outputs an instrument prediction 154 based on the image or video input into the instrument model 144, a mapping function 170 can determine that the score 178 (e.g., confidence with respect to the level of certainty that the instrument prediction 154 is accurate) exceeds a threshold 180 for the instrument predictions 154. Likewise, scores 178 can be generated or determined for object predictions 156 and action predictions 158 and can be compared with the corresponding respective thresholds 180 of the objects and actions, respectively. Mapping function 170 can include the functionality to generate a score 178 corresponding to a confidence score (e.g., probability that the determination is accurate) with respect to the determined surgical activity 176 that can be determined based on the predictions 152-158 applied to the rules 172.
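A minimal sketch of this thresholding is shown below, assuming per-prediction-type thresholds 180 with placeholder values; only predictions whose confidence scores 178 exceed their thresholds are passed on to the rules 172.

```python
# Illustrative threshold check (values are placeholders, not from this
# disclosure): each prediction type keeps its own threshold 180; only
# predictions whose confidence score 178 exceeds the threshold are kept.
THRESHOLDS = {"anatomy": 0.7, "instrument": 0.6, "object": 0.6, "action": 0.5}

def accepted_predictions(predictions: dict) -> dict:
    """predictions: {"anatomy": ("cystic_artery", 0.83), ...} -> filtered dict."""
    accepted = {}
    for kind, (label, score) in predictions.items():
        if score > THRESHOLDS.get(kind, 1.0):
            accepted[kind] = label
    return accepted

print(accepted_predictions({
    "anatomy": ("cystic_artery", 0.83),   # above 0.7 -> kept
    "instrument": ("clip_applier", 0.55), # below 0.6 -> dropped, may prompt user
}))
```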
[0069] DPS 130 can include an interface 190 designed, constructed and operational to communicate with one or more components of system 100 via network 101, including, for example, the robotic medical system 120 or another device, such as a client’s personal computer. The interface 190 can include a network interface. The interface 190 can include or provide a user interface, such as a graphical user interface. Interface 190 can provide data for presentation via a display, such as a display 116, and can depict, illustrate, render, present, or otherwise provide indications 192 indicating determinations (e.g., outputs) by ML models 140 (e.g., predictions 152-158) or mapping function 170 (e.g., surgical activities 176 or boundaries 174). Indications 192 can include messages, indications or notifications of predictions 152-158, boundaries 174, surgical activities 176, scores 178 or thresholds 180. Indications 192 can include overlaid texts or images. For instance, indications 192 can include confidence scores 178 with respect to boundaries 174 or surgical activities 176 determined or provided by the mapping function 170. For example, indications 192 can include entities 166
matching or corresponding to surgical activities 176 or boundaries 174, along with any combination of predictions 152-158.
[0070] The data processing system 130 can interface with, communicate with, or otherwise receive or provide information with one or more components of system 100 via network 101, including, for example, the robotic medical system 120. The data processing system 130, robotic medical system 120 and devices in the medical environment 102 can each include at least one logic device such as a computing device having a processor to communicate via the network 101. The data processing system 130, robotic medical system 120 or a client device coupled to the network 101 can include at least one computation resource, server, processor or memory. For example, the data processing system 130 can include a plurality of computation resources or processors coupled with memory.
[0071] The data processing system 130 can be part of or include a cloud computing environment. The data processing system 130 can include multiple, logically-grouped servers and facilitate distributed computing techniques. The logical group of servers may be referred to as a data center, server farm or a machine farm. The servers can also be geographically dispersed. A data center or machine farm may be administered as a single entity, or the machine farm can include a plurality of machine farms. The servers within each machine farm can be heterogeneous: one or more of the servers or machines can operate according to one or more types of operating system platform.
[0072] The data processing system 130, or components thereof, can include a physical or virtual computer system operatively coupled with, or associated with, the medical environment 102. In some embodiments, the data processing system 130, or components thereof, can be coupled with, or associated with, the medical environment 102 via a network 101, either directly or indirectly through an intermediate computing device or system. The network 101 can be any type or form of network. The geographical scope of the network can vary widely and can include a body area network (BAN), a personal area network (PAN), a local-area network (LAN) (e.g., Intranet), a metropolitan area network (MAN), a wide area network (WAN), or the Internet. The topology of the network 101 can assume any form such as point-to-point, bus, star, ring, mesh, tree, etc. The network 101 can utilize different techniques and layers or stacks of protocols, including, for example, the Ethernet protocol, the internet protocol suite (TCP/IP), the ATM (Asynchronous Transfer Mode) technique, the SONET (Synchronous Optical Networking) protocol, the SDH (Synchronous Digital Hierarchy) protocol, etc. The TCP/IP internet protocol suite can include application layer, transport layer, internet layer (including,
e.g., IPv6), or the link layer. The network 101 can be a type of a broadcast network, a telecommunications network, a data communication network, a computer network, a Bluetooth network, or other types of wired and wireless networks.
[0073] The data processing system 130, or components thereof, can be located at least partially at the location of the surgical facility associated with the medical environment 102 or remotely therefrom. Elements of the data processing system 130, or components thereof, can be accessible via portable devices such as laptops, mobile devices, wearable smart devices, etc. The data processing system 130, or components thereof, can include other or additional elements that can be considered desirable to have in performing the functions described herein. The data processing system 130, or components thereof, can include, or be associated with, one or more components or functionality of a computing system including, for example, one or more processors coupled with memory that can store instructions, data or commands for implementing the functionalities of the DPS 130 discussed herein.
[0074] System 100 can include one or more processors (e.g., 610) that can be coupled with one or more memories (e.g., 615) and which can be included in or deployed on any computing device, such as a server, a virtual machine, or a cloud-based system. For example, a system 100 can include an analysis system of a process or a procedure to identify and delineate different activities within the process. Such a system can include a memory (e.g., 615 or 620) storing instructions, computer code or data for the one or more processors (e.g., 610) to implement the functionalities of a data processing system 130. For example, one or more processors 610 can be configured (e.g., programmed or instructed) to access computer code, instructions or data in storage to implement the ML models 140 and the mapping function 170. System 100 can include a non-transitory computer-readable medium (e.g., a ROM or a solid state storage device) storing processor-executable instructions, code or data, for access and execution by the one or more processors 610 to implement various functionalities of the data processing system 130.
[0075] In one aspect, system 100 can include one or more processors 610 configured to receive a data stream 162 that captures a procedure performed with a robotic medical system 120. The data stream 162 can include any combination of one or more video streams, control data streams, sensor measurements or description of kinematics or events. The data stream 162 can be received via a network 101 from a remote medical environment 102 and can be processed in real time or acquired from prior stored files in data repository 160 and processed subsequent to a procedure being performed.
[0076] The one or more processors 610 can be configured to input the data stream 162 into a plurality of ML models 140 that can be trained with machine learning to generate a first prediction of an anatomy (e.g., 152), a second prediction of an instrument (e.g., 154), and a third prediction of an object (e.g., 156) which can be associated with the procedure. The first prediction can be an anatomy prediction 152 that identifies a body part of a patient at or near a surgical site, such as, for example, an organ, a tissue, a bone or a gland being operated on by a surgeon. The second prediction can be an instrument prediction 154 that identifies a medical instrument or tool (e.g., shears, scalpel or a needle) used by a surgeon at or near the surgical site. The third prediction can be an object prediction 156 that identifies an object (e.g., a mesh, a bone screw or an object implanted into the body of the patient) at or near the surgical site. An ML model 140 can be configured to generate a fourth prediction, which can be an action prediction 158 that identifies an action (e.g., a gesture or a movement of the surgeon or a medical tool or the object) at or near the surgical site.
[0077] The one or more processors 610 can be configured to apply a mapping function 170 to the first prediction (e.g., 152), the second prediction (e.g., 154) and the third prediction (e.g., 156) to determine a temporal boundary 174 of a surgical activity 176 performed via the robotic medical system 120 in the procedure. In some examples, the one or more processors 610 can be configured to apply the mapping function 170 to the fourth prediction (e.g., 158) to determine the temporal boundary 174 of the surgical activity 176 based at least on the action (e.g., movement or gesture) of the surgeon, medical instrument 112 or an object with respect to the predicted anatomy (e.g., body part of the patient). The mapping function 170 can input or apply one or more outputs of the ML models 140 (e.g., predictions 152-158) onto one or more rules 172 (e.g., a rules engine of a plurality of rules 172). The rules 172 can be configured to use the predictions 152-158 to identify specific activities in a compilation of activities of the procedure in one or more ontologies 164. The rules 172 can identify, based on the predictions 152-158, specific entities 166 identified in the ontologies 164 to detect or identify the activity 176 being performed or a boundary 174 of such an activity.
[0078] The one or more processors 610 can be configured to provide, for overlay in a graphical user interface 190 configured to display at least a portion of the data stream 162, an indication 192 of any combination of one or more of the temporal boundary 174, the surgical activity 176 and the first prediction of the anatomy 152, the second prediction of the instrument 154, and the third prediction of the object 156 used to determine, via the mapping function, the surgical activity. The one or more processors 610 can provide, for overlay in the user interface
190 to display the portion of the data stream 162, the indication 192 including the action prediction 158. For example, the indication 192 can include an overlay of text identifying the surgical activity 176 and when or at what point (e.g., at which temporal boundary 174) the surgical activity 176 starts or ends.
[0079] The one or more processors 610 can be configured to receive the data stream 162 comprising data in a plurality of modalities. For instance, the one or more processors 610 can be configured to receive the data stream comprising at least one or more (e.g., two) of a video stream, a kinematics stream or an event stream. The data stream 162 can include sensor data, one or more video feeds from one or more cameras or any data from data capture devices 110 in the medical environment 102.
[0080] The one or more processors 610 can be configured to apply the mapping function 170 to identify the surgical activity 176 from a hierarchical ontology 164 of entities 166 with increasing granularity. For example, an ontology 164 can include, reference, describe or indicate any number of surgical activities 176 according to a variety of entities 166. Each entity 166 can include a gesture or movement of a particular medical instrument 112 with respect to a patient’s anatomy or an object. Entities 166 can correspond to actions performed by the surgeon (e.g., or any other person) with respect to any predictions of the ML models 140 as a part of a procedure described or indicated by the ontology 164. The ontology 164 can describe activities 176 in a hierarchical order using the entities 166 that can correspond to particular arrangement or combination of outputs of ML models 140 (e.g., predictions 152-158). The entities 166 of the hierarchical ontology 164 can include a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
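The hierarchy of entities 166 can be illustrated, for example, as a nested structure traversed from the coarsest level (procedure type) down to the finest (gesture); the labels in the sketch below are placeholders, not data from this disclosure.

```python
# Illustrative nesting of the hierarchy named above (gesture -> action -> step
# -> phase -> procedure type); every label is a placeholder.
hierarchy = {
    "procedure_type": "cholecystectomy",
    "phases": [
        {
            "phase": "dissection",
            "steps": [
                {
                    "step": "ligate_cystic_artery",   # formed from anatomy + action
                    "actions": [
                        {"action": "clip_application",
                         "gestures": ["approach", "squeeze", "release"]},
                    ],
                },
            ],
        },
    ],
}

def granularity_levels(tree: dict) -> list:
    """Walk the hierarchy from coarsest to finest granularity (first branch)."""
    phase = tree["phases"][0]
    step = phase["steps"][0]
    action = step["actions"][0]
    return [tree["procedure_type"], phase["phase"], step["step"],
            action["action"], action["gestures"][0]]

print(granularity_levels(hierarchy))
# ['cholecystectomy', 'dissection', 'ligate_cystic_artery',
#  'clip_application', 'approach']
```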
[0081] The one or more processors 610 can be configured to provide, for display, a graphical representation of values for each entity 166 in the hierarchical ontology 164. The one or more processors 610 can be configured to identify the plurality of models 140 comprising an anatomy presence recognition model 142, an instrument presence model 144, and an object model 146. The one or more processors 610 can be configured to use the anatomy presence recognition model 142, the instrument presence model 144, and the object model 146 to generate the first prediction of the anatomy 152, the second prediction of the instrument 154, and the third prediction of the object 156.
[0082] The one or more processors 610 can be configured to identify the plurality of models 140 comprising an anatomy state detection model, a tool-tissue interaction model, and
an energy use model. The anatomy state detection model can detect or identify the state of a body part of the patient, such as a state of the liver, heart or gland. The tool-tissue interaction model can include a model detecting or identifying the interaction between a medical instrument 112 detected as an instrument prediction 154 and a body part of the patient detected as an anatomy prediction 152. For instance, instrument prediction 154 and the anatomy prediction 152 can be input into the tool-tissue interaction model to determine the action taken by the medical instrument 112 (e.g., a movement made, a force applied or an action performed) with respect to the anatomical part of the patient. The energy use model can include a model to determine the efficiency of the movement (e.g., action prediction 158) of the surgeon in the context of the procedure.
[0083] The mapping function 170 can be based on a surgical workflow structure. For example, to apply the mapping function, the one or more processors can be configured to identify a plurality of weights 150 corresponding to the plurality of models 140. For example, a particular anatomy prediction 152 can be assigned or given a particular first weight while an instrument prediction 154 can be given a different weight. For example, an action prediction 158 involving an interaction between an anatomy prediction 152 (e.g., recognized body part of the patient) and an instrument prediction 154 (e.g., recognized medical instrument 112) can be given or assigned a weight by the mapping function 170 based on the interaction (e.g., nature of the interaction) between the predictions. The one or more processors 610 can be configured to fuse outputs of the plurality of models 140 using the plurality of weights 150.
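A minimal sketch of fusing model outputs using weights 150 is shown below; the weight values and the example scores are illustrative assumptions.

```python
# Illustrative weighted fusion of per-model confidences (the weights 150 below
# are placeholders): each contributing model's score is weighted and the fused
# score can then be compared against a threshold by the mapping function.
WEIGHTS = {"anatomy": 0.4, "instrument": 0.3, "object": 0.2, "action": 0.1}

def fuse_confidence(model_scores: dict) -> float:
    """model_scores: {"anatomy": 0.9, "instrument": 0.8, ...} -> fused score."""
    total = sum(WEIGHTS.get(kind, 0.0) * score
                for kind, score in model_scores.items())
    weight_sum = sum(WEIGHTS.get(kind, 0.0) for kind in model_scores)
    return total / weight_sum if weight_sum else 0.0

fused = fuse_confidence({"anatomy": 0.9, "instrument": 0.8, "action": 0.6})
print(round(fused, 3))   # 0.825, the weighted average of the contributing models
```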
[0084] The one or more processors 610 can be configured to identify a confidence score 178 associated with a prediction (e.g., 152-158) of the surgical activity 176. The one or more processors 610 can be configured to determine the confidence score 178 is less than or equal to a threshold 180. The threshold 180 can include a threshold for the confidence score 178 for the detected or determined surgical activity 176, boundary 174 or any of the predictions 152-158. The one or more processors 610 can be configured to provide, responsive to the confidence score 178 being less than or equal to the threshold 180, a prompt via the graphical user interface 190. For instance, the interface 190 can send for display any one or more of the surgical activity 176, boundary 174 or predictions 152-158, responsive to the confidence score 178 of any of those individual determinations exceeding its corresponding individual threshold 180.
[0085] For example, one or more processors 610 can be configured to determine a second one or more confidence scores 178 associated with at least one of the first prediction 152, the
second prediction 154, or the third prediction 156 that is less than or equal to a second threshold 180. The one or more processors 610 can be configured to generate the prompt with an indication 192 of the at least one of the first prediction 152, the second prediction 154, or the third prediction 156.
[0086] The one or more processors 610 can be configured to receive input that indicates at least one of the first prediction of the anatomy 152, the second prediction of the instrument 154, or the third prediction of the object 156 is erroneous. The one or more processors 610 can be configured to update, responsive to the input, at least one model 140 of the plurality of models 140 related to the at least one of the first prediction of the anatomy 152, the second prediction of the instrument 154, or the third prediction of the object 156 that is erroneous.
[0087] The one or more processors 610 can be configured to determine a metric indicative of performance of the surgical activity 176 during the temporal boundary 174. The one or more processors 610 can be configured to compare the metric with a historical benchmark established for the surgical activity 176. The one or more processors 610 can be configured to provide a second indication of performance based on the comparison. For example, the one or more processors 610 can be configured to provide the indication 192 for overlay in the graphical user interface 190 during performance of the procedure. The one or more processors 610 can be configured to provide the indication for overlay in the graphical user interface 190 subsequent to performance of the procedure.
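As a non-limiting illustration, a per-activity metric such as the duration within the temporal boundary 174 can be compared against a stored benchmark as sketched below; the benchmark value, activity name and timestamps are placeholders.

```python
# Illustrative comparison of a per-activity metric (duration within the
# temporal boundary) against a historical benchmark; numbers are placeholders.
from dataclasses import dataclass

@dataclass
class ActivityMetric:
    activity: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

BENCHMARK_DURATION_S = {"ligate_cystic_artery": 240.0}   # assumed historical median

def performance_indication(metric: ActivityMetric) -> str:
    benchmark = BENCHMARK_DURATION_S.get(metric.activity)
    if benchmark is None:
        return "no benchmark available"
    delta = metric.duration_s - benchmark
    return (f"{metric.activity}: {metric.duration_s:.0f}s "
            f"({'+' if delta >= 0 else ''}{delta:.0f}s vs. benchmark)")

print(performance_indication(
    ActivityMetric("ligate_cystic_artery", start_s=1200.0, end_s=1475.0)))
# ligate_cystic_artery: 275s (+35s vs. benchmark)
```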
[0088] FIG. 2 illustrates a system configuration 200 for identifying surgical activities 176 (e.g., segments) and temporal boundaries 174 between surgical activities 176 using a combination of ML models 140 and rules 172 of a mapping function 170. System configuration 200 can correspond to an aspect of system 100 and can include video frames 205 of a data stream 162 input into ML models 140. Video frames 205 can include one or more images of a video fragment (e.g., 180 video frames in a 3-second, 60 frames per second video) which can indicate presence of various medical instruments or tools, objects, anatomical parts of a patient or actions performed by the surgeon.
[0089] The anatomy model 142 can detect within the video frames 205 an anatomical feature (e.g., a body part) of a patient, such as a liver, gallbladder, colon, lung, uterus, heart, spinal cord, bone, artery or a cystic duct. Anatomy model 142 can generate an anatomy prediction 152 corresponding to the body part identified, along with a score 178 corresponding to the level of confidence (e.g., between 0-100% confidence) that the prediction is accurate.
Similarly, instrument model 144 can detect, within the video frames 205, a medical tool 112, such as a stapler, shears, needle driver, Cadiere forceps, bipolar forceps, scalpel, retractor or a grasper. Instrument model 144 can generate an instrument prediction 154 corresponding to the medical instrument or tool identified, along with a score 178 corresponding to the level of confidence that the instrument prediction is accurate. Likewise, object model 146 can detect, within the video frames 205, an object, such as a mesh, needle, clip, colpotomy ring, bone screw, ruler or ultrasound probe. Object model 146 can generate an object prediction 156 corresponding to one or more objects in the vicinity of the surgical site, along with a score 178 corresponding to the level of confidence that the given object prediction is accurate. Further, action model 148 can identify, within the video frames 205, an action performed by a surgeon, doctor or any medical professional. The action can include, for example, a ligation, suture, clip application, incision, excision, dissection, cauterization, grafting, drainage, ablation, amputation or any other movement or action by a medical professional. Action model 148 can generate an action prediction 158 corresponding to the action identified, along with a score 178 corresponding to the level of confidence that the given action prediction is accurate.
[0090] Outputs (e.g., predictions 152-158) of the ML models 140 can be received and used by the mapping function 170 to apply the ML model 140 outputs to rules 172. Mapping function 170 can include a rules engine functionality in which rules 172 can be used along with ML model outputs (e.g., 152-158) to determine or identify surgical activities 176 corresponding to surgical or other medical ontologies 164 along with the temporal boundaries 174 in between different identified surgical activities 176.
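The following is an illustrative sketch only of rule-based mapping in this spirit; the rule contents, activity names and dictionary format are invented for the example and are not elements of the disclosure:

```python
# Illustrative rules engine: map combined model outputs onto a surgical
# activity. The rules and activity names are invented examples.
from typing import Dict, List, Optional

RULES: List[Dict] = [
    {   # clip applied to the cystic duct with a clip applier
        "activity": "cystic duct clipping",
        "requires": {"anatomy": "cystic duct", "instrument": "clip applier", "object": "clip"},
    },
    {   # dissection around the gallbladder
        "activity": "gallbladder dissection",
        "requires": {"anatomy": "gallbladder", "action": "dissection"},
    },
]

def map_to_activity(predictions: Dict[str, str]) -> Optional[str]:
    """Return the first activity whose required predictions are all present."""
    for rule in RULES:
        if all(predictions.get(k) == v for k, v in rule["requires"].items()):
            return rule["activity"]
    return None  # no rule satisfied; leave this fragment unlabeled

print(map_to_activity({"anatomy": "gallbladder",
                       "instrument": "cadiere forceps",
                       "action": "dissection"}))
```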
[0091] FIG. 3 illustrates a system configuration 300 for identifying or detecting surgical output 302 and spatial output 304. Surgical output 302 can include any classification, recognition or identification of a surgical feature or characteristic, such as medical instruments 112 used in the procedure, objects 156 used, addressed, installed, removed or adjusted, or anatomical parts of the patient involved. Surgical output 302 can include actions 158 of the surgeon detected, including gestures or movements, such as swipes or squeezes of medical instruments 112, or touching, cutting or suturing of an anatomical part of a patient. Spatial output 304 can include spatial orientation between different features (e.g., instruments, anatomical parts or objects), or movements in reference to different features in the video stream 205.
[0092] For example, a data stream 162 can include video frames 205 showing a motion (e.g., between the frames of the multi-frame video) of a surgeon’s hand holding a scalpel with respect to a particular gland (e.g., anatomy prediction 152). Surgical output 302 can include anatomical prediction 152 of the given gland, along with the scalpel being identified as the instrument prediction 154. Spatial output 304 can include the motion of the scalpel with respect to the identified gland or any other reference point in the video fragment 205 or with respect to any point in medical environment 102. Spatial output 304 can include location identifiers (e.g., coordinate system positions) of different features with respect to a reference point. Spatial output 304 of various identified predictions 152-158 can be applied to rules 172 to identify surgical activities 176 and boundaries 174.
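One simple, assumed form of such spatial output is a relative-motion vector between tracked features; the sketch below uses hypothetical per-frame bounding-box centers and is not a defined implementation:

```python
# Sketch under assumed inputs: per-frame bounding-box centers for an instrument
# and an anatomical feature, from which a relative-motion vector is derived as
# one possible form of spatial output 304.
from typing import List, Tuple

Point = Tuple[float, float]

def relative_motion(instrument_track: List[Point], anatomy_track: List[Point]) -> Point:
    """Displacement of the instrument relative to the anatomy between the
    first and last frame of a fragment (image coordinates)."""
    (ix0, iy0), (ix1, iy1) = instrument_track[0], instrument_track[-1]
    (ax0, ay0), (ax1, ay1) = anatomy_track[0], anatomy_track[-1]
    return (ix1 - ix0) - (ax1 - ax0), (iy1 - iy0) - (ay1 - ay0)

# e.g. a scalpel moving 40 px toward a gland that itself shifts 5 px
print(relative_motion([(100, 100), (140, 100)], [(300, 200), (305, 200)]))
```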
[0093] FIG. 4 depicts an example flowchart outlining the operations of a method 400 for implementing ML and rule based medical procedure identification and segmentation. The method 400 can be performed by a system having one or more processors executing computer-readable instructions stored on a memory. The method 400 can be performed, for example, by system 100 and in accordance with any features or techniques discussed in connection with FIGS. 1-3 and 5-6. For instance, the method 400 can be implemented by one or more processors 610 of a computing system 600 executing non-transitory computer-readable instructions stored on a memory (e.g., the memory 615, 620 or 625) and using data from a data repository 160 (e.g., storage device 625).
[0094] The method 400 can be used to detect, determine or identify, from a data stream (e.g., videos or sensor readings) surgical activities of a surgical procedure and boundaries between the individual surgical activities. At operation 405, a method can receive a data stream from a medical procedure. At operation 410, the method can determine predictions from one or more ML models. At operation 415, the method can apply predictions to mapping rules. At operation 420, the method can determine whether a score for rule determination using predictions from the ML models exceeds a threshold. At operation 425, the method displays surgical activity and temporal boundaries when the rule determination at 420 exceeds the threshold.
[0095] At operation 405, the method can receive a data stream from a medical procedure. For instance, the method can include one or more processors of a data processing system coupled with memory receiving a data stream that captures a procedure performed with a robotic medical system. The data stream can include a video stream having a plurality of video frames forming video fragments (e.g., of 30, 45 or 60 video frames per second). Data stream
can include sensor or detector measurements. Data stream can include a kinematics stream or an event stream.
[0096] The method can include receiving the data stream that captures a procedure performed with a robotic medical system. The method can include the data processing system receiving the data stream comprising data in a plurality of modalities, such as the video stream, event stream, kinematics stream or stream of sensor readings. For example, the received data stream can include at least two of a video stream, a kinematics stream or event stream. For example, a data stream can include a video file (e.g., video recording of one or more hours), a video fragment (e.g., one or more seconds of video), or video frames (e.g., individual still images forming a video fragment). The data stream can include sensor data or measurements, events data, kinematics data or any other type of data captured or acquired at the medical environment 102.
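A possible container for one portion of such a multi-modality data stream is sketched below; the field names and record formats are assumptions for illustration only:

```python
# Sketch of a container for one portion of a multi-modality data stream
# (video frames plus kinematics, events and sensor readings).
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class StreamPortion:
    video_frames: List[bytes] = field(default_factory=list)           # e.g. 30/45/60 fps
    kinematics: List[Dict[str, float]] = field(default_factory=list)  # e.g. arm joint poses
    events: List[Dict[str, Any]] = field(default_factory=list)        # e.g. energy on/off
    sensor_readings: List[float] = field(default_factory=list)

portion = StreamPortion(
    video_frames=[b""] * 60,
    kinematics=[{"arm_1_x": 0.12, "arm_1_y": 0.40}],
    events=[{"type": "energy_activated", "t": 12.5}],
)
print(len(portion.video_frames), len(portion.events))
```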
[0097] At operation 410, the method can determine predictions from one or more ML models. The method can include the one or more processors identifying a plurality of models trained with machine learning to make predictions related to an anatomy, an instrument, and an object associated with the procedure. For example, the data processing system can include one or more ML models configured (e.g., trained) to detect any combination of: medical instruments used in a medical procedure, anatomical parts of a patient, objects encountered during the procedure and actions (e.g., movements or gestures) of the doctor (e.g., surgeon) during the procedure.
[0098] The method can include the one or more processors using the plurality of models to generate a first prediction of the anatomy, a second prediction of the instrument, and a third prediction of the object based on the data stream. For example, a data processing system can identify the plurality of models, where the plurality includes an anatomy model for identifying presence of particular patient anatomies, an instrument model for identifying presence of medical instruments, and an object model for detecting objects at or around the surgical site. The data processing system can identify the plurality of models comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
[0099] The method can include the data processing system using the anatomy presence recognition model (e.g., anatomy model 142), the instrument presence model (e.g., instrument model 144), and the object model 146 to generate the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object. The
method can include the data processing system using the action model to generate the action prediction of the surgeon’s actions (e.g., gestures, motions, hand or body movements or positions). For example, the data processing system can input the data stream (e.g., one or more video frames, video fragments, event or kinematics streams) into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure. For example, the data processing system can input the data stream into the action model to generate an action prediction associated with the procedure.
[00100] At operation 415, the method can apply predictions to mapping rules. The method can include the one or more processors applying a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure. The mapping function can be based on a surgical workflow structure. The method can input the predictions from the ML models into a rules engine to identify a rule that satisfies or corresponds to the ML model outputs (e.g., a particular medical instrument at a particular identified anatomy along with one or more objects or actions detected). Based on a rule being satisfied by the ML model predictions (e.g., matching a particular activity in an ontology of surgical activities), the mapping function can map the portion of the data stream to a particular surgical activity.
[00101] To apply the mapping function, the one or more processors can identify a plurality of weights corresponding to the plurality of models used by the data processing system. The one or more processors can fuse outputs of the plurality of models using the plurality of weights. The data processing system can apply a mapping function to the first prediction, the second prediction and the third prediction of the ML models to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure.
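One assumed way to fuse per-model outputs with such weights is a weighted average of their confidence scores, sketched below; the weight values and the averaging choice are illustrative, not prescribed by the disclosure:

```python
# Minimal sketch of weight-based fusion of per-model confidence scores before
# the rule decision; the weights and the weighted average are invented examples.
from typing import Dict

def fuse_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-model confidences, e.g. trusting the anatomy
    model more than the object model."""
    total_weight = sum(weights.get(k, 1.0) for k in scores)
    return sum(s * weights.get(k, 1.0) for k, s in scores.items()) / total_weight

scores = {"anatomy": 0.93, "instrument": 0.88, "object": 0.70}
weights = {"anatomy": 2.0, "instrument": 1.5, "object": 1.0}
print(round(fuse_scores(scores, weights), 3))
```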
[00102] The data processing system can apply the mapping function to identify the surgical activity from a hierarchical ontology of entities. The hierarchical ontology of entities can include entities corresponding to particular medical procedures. The mapping function can identify the surgical activity from the hierarchical ontology of entities with increasing granularity. The entities of the hierarchical ontology can include a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
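An illustrative sketch of such a hierarchical ontology, with invented example labels for a cholecystectomy-style procedure, is shown below; the nesting and labels are assumptions rather than a defined schema:

```python
# Sketch of a hierarchical ontology of entities with increasing granularity
# (procedure type -> phase -> step -> action -> gesture); labels are invented.
ONTOLOGY = {
    "procedure_type": "cholecystectomy",
    "phases": [
        {
            "phase": "dissection",
            "steps": [
                {
                    "step": "cystic duct dissection",   # formed from anatomy + action
                    "actions": [
                        {"action": "dissect", "gestures": ["spread", "sweep"]},
                    ],
                },
            ],
        },
    ],
}

def entities_at(ontology: dict) -> list:
    """Flatten the ontology into (level, label) pairs, coarse to fine."""
    out = [("procedure_type", ontology["procedure_type"])]
    for phase in ontology["phases"]:
        out.append(("phase", phase["phase"]))
        for step in phase["steps"]:
            out.append(("step", step["step"]))
            for action in step["actions"]:
                out.append(("action", action["action"]))
                out.extend(("gesture", g) for g in action["gestures"])
    return out

print(entities_at(ONTOLOGY))
```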
[00103] The data processing system can determine a metric indicative of performance of the surgical activity during the temporal boundary. For example, the data processing system can determine a score or a confidence value corresponding to the similarities between features detected or identified by the ML models (e.g., anatomies, instruments, objects or actions) from the data stream and the same features in the modeled medical procedure from the ontologies. Based on the similarities, the mapping function can determine the level of performance (e.g., similarity) between the current surgical activity (e.g., captured by the data stream) and the model surgical activity according to the entities of the surgical ontology for the given procedure. For example, the method can compare the metric with a historical benchmark established for the surgical activity and provide a second indication of performance based on the comparison.
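A minimal sketch of comparing such a metric against a historical benchmark follows; the z-score style comparison and the example numbers are assumptions for illustration:

```python
# Sketch of comparing a per-activity performance metric against a historical
# benchmark; the metric, benchmark values and cutoffs are illustrative.
def performance_indication(metric: float, benchmark_mean: float, benchmark_std: float) -> str:
    """Classify the current activity relative to the historical benchmark."""
    z = (metric - benchmark_mean) / benchmark_std if benchmark_std else 0.0
    if z >= 1.0:
        return "above benchmark"
    if z <= -1.0:
        return "below benchmark"
    return "within benchmark range"

# e.g. similarity score of 0.82 for the current dissection vs a 0.75 +/- 0.05 history
print(performance_indication(0.82, benchmark_mean=0.75, benchmark_std=0.05))
```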
[00104] At operation 420, the system can determine whether a score for a rule-based determination that uses ML model predictions exceeds a threshold. The method can include determining a confidence score of a determination of one or more surgical activities and their corresponding boundaries. For example, the mapping function can determine a confidence score for each of the predictions by the ML models (e.g., anatomy prediction, instrument prediction, object prediction or action prediction). The mapping function can determine a confidence score for the identified surgical activity and a confidence score for each temporal boundary delineating different surgical activities. The method can compare each of the scores with thresholds. The threshold can include, for example, a percentage of confidence or certainty that the prediction or determination is accurate, such as 70%, 80%, 90%, 95%, 99% or 99.9%. The method can include providing the determination for display in response to determining that the score exceeds the threshold. For example, a determination of the surgical activity and the corresponding boundaries can be made, established or generated in response to one or more confidence scores for one or more ML model predictions or rule-based determinations exceeding their respective thresholds.
[00105] For example, the method can include the one or more processors identifying a confidence score associated with a prediction of the surgical activity. The method can include the data processing system determining that the confidence score is less than or equal to a threshold. The method can include providing, responsive to the confidence score being less than or equal to the threshold, a prompt via the graphical user interface. The method can include the one or more processors determining a second one or more confidence scores associated with at least one of the first prediction, the second prediction, or the third prediction
that is less than or equal to a second threshold. The method can include the data processing system generating the prompt with an indication of the at least one of the first prediction, the second prediction, or the third prediction.
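A small sketch of this low-confidence branch is shown below; the thresholds, prompt text and score dictionary are assumptions, not specified by the disclosure:

```python
# Sketch of the low-confidence branch: when the activity score is at or below
# its threshold, build a prompt that also names any individual prediction
# at or below a second threshold.
from typing import Dict, Optional

def low_confidence_prompt(activity_score: float,
                          prediction_scores: Dict[str, float],
                          threshold: float = 0.9,
                          second_threshold: float = 0.8) -> Optional[str]:
    if activity_score > threshold:
        return None  # confident enough; no prompt needed
    weak = [name for name, s in prediction_scores.items() if s <= second_threshold]
    detail = f" (low-confidence predictions: {', '.join(weak)})" if weak else ""
    return f"Please review the detected surgical activity{detail}."

print(low_confidence_prompt(0.72, {"anatomy": 0.95, "instrument": 0.64, "object": 0.91}))
```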
[00106] At operation 425, the system displays surgical activity and temporal boundaries when the rule determination at 420 exceeds the threshold. For instance, if at operation 420 the score exceeds the threshold, the one or more processors can display, alongside at least a portion of the data stream, an indication of the temporal boundary and the surgical activity. The one or more processors can display the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity. For example, the one or more processors can overlay, on the video data stream of the surgery, the indication of the boundary between two different surgical activities. For example, the overlaid data displayed can include ML model determinations, along with one or more confidence scores (e.g., any combination of scores for the boundary, surgical procedures or ML model predictions).
[00107] For example, the method can include providing, for overlay in a graphical user interface configured to display at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity. The method can include providing, for display, a graphical representation of values for each entity in the hierarchical ontology.
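One assumed way to package such an overlay for a graphical user interface is sketched below; the JSON layout and field names are illustrative only:

```python
# Sketch of assembling an overlay payload for a graphical user interface:
# the temporal boundary, the identified activity and the underlying predictions.
import json

def build_overlay(boundary_s: float, activity: str, predictions: dict, scores: dict) -> str:
    return json.dumps({
        "temporal_boundary_s": boundary_s,   # where the activity starts in the video
        "surgical_activity": activity,
        "predictions": predictions,          # anatomy / instrument / object labels
        "confidence_scores": scores,
    })

print(build_overlay(125.0, "cystic duct clipping",
                    {"anatomy": "cystic duct", "instrument": "clip applier", "object": "clip"},
                    {"activity": 0.94, "anatomy": 0.93}))
```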
[00108] If the rule determination at 420 does not exceed the threshold, the method can go back to operation 405 to evaluate another part of a data stream. For example, the method can take a new data set and apply operations 405-425 on the new data set independently. For example, the method can take a new data set and apply operations 405-425 on the new data set along with the current data set, so as to make ML model predictions on an additional set of data stream (e.g., combining the prior data stream portion and a current data stream portion).
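The two re-evaluation options described above can be sketched as a simple windowing choice; the function below is a hypothetical illustration, not a defined interface:

```python
# Sketch of the two fallback options: evaluate the next portion independently,
# or combine it with the prior portion into a longer window before re-running
# the ML models.
from typing import List, Sequence

def next_window(prior: Sequence[bytes], current: Sequence[bytes], combine: bool) -> List[bytes]:
    """Return the frames to re-run the models on."""
    return list(prior) + list(current) if combine else list(current)

prior, current = [b"f"] * 180, [b"f"] * 180
print(len(next_window(prior, current, combine=False)),   # 180: independent evaluation
      len(next_window(prior, current, combine=True)))    # 360: combined window
```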
[00109] FIG. 5 depicts a surgical system 500, in accordance with some embodiments. The surgical system 500 may be an example of the medical environment 102. The surgical system 500 may include a robotic medical system 505 (e.g., the robotic medical system 120), a user control system 510, and an auxiliary system 515 communicatively coupled one to another. A visualization tool 520 (e.g., the visualization tool 114) may be connected to the auxiliary system 515, which in turn may be connected to the robotic medical system 505. Thus, when
the visualization tool 520 is connected to the auxiliary system 515 and this auxiliary system is connected to the robotic medical system 505, the visualization tool may be considered connected to the robotic medical system. In some embodiments, the visualization tool 520 may additionally or alternatively be directly connected to the robotic medical system 505.
[00110] The surgical system 500 may be used to perform a computer-assisted medical procedure on a patient 525. In some embodiments, the surgical team may include a surgeon 530A and additional medical personnel 530B-530D, such as a medical assistant, nurse, anesthesiologist, and other suitable team members who may assist with the surgical procedure or medical session. The medical session may include the surgical procedure being performed on the patient 525, as well as any pre-operative processes (e.g., which may include setup of the surgical system 500, including preparation of the patient 525 for the procedure), post-operative processes (e.g., which may include clean up or post care of the patient), and/or other processes during the medical session. Although described in the context of a surgical procedure, the surgical system 500 may be implemented in a non-surgical procedure, or other types of medical procedures or diagnostics that may benefit from the accuracy and convenience of the surgical system.
[00111] The robotic medical system 505 can include a plurality of manipulator arms 535A-535D to which a plurality of medical tools (e.g., the medical tool 112) can be coupled or installed. Each medical tool can be any suitable surgical tool (e.g., a tool having tissue-interaction functions), imaging device (e.g., an endoscope, an ultrasound tool, etc.), sensing instrument (e.g., a force-sensing surgical instrument), diagnostic instrument, or other suitable instrument that can be used for a computer-assisted surgical procedure on the patient 525 (e.g., by being at least partially inserted into the patient and manipulated to perform a computer-assisted surgical procedure on the patient). Although the robotic medical system 505 is shown as including four manipulator arms (e.g., the manipulator arms 535A-535D), in other embodiments, the robotic medical system can include greater than or fewer than four manipulator arms. Further, not all manipulator arms can have a medical tool installed thereto at all times of the medical session. Moreover, in some embodiments, a medical tool installed on a manipulator arm can be replaced with another medical tool as suitable.
[00112] One or more of the manipulator arms 535A-535D and/or the medical tools attached to manipulator arms can include one or more displacement transducers, orientational sensors, positional sensors, and/or other types of sensors and devices to measure parameters and/or generate kinematics information. One or more components of the surgical system 500 can be configured to use the measured parameters and/or the kinematics information to track (e.g.,
determine poses of) and/or control the medical tools, as well as anything connected to the medical tools and/or the manipulator arms 535A-535D.
[00113] The user control system 510 can be used by the surgeon 530A to control (e.g., move) one or more of the manipulator arms 535A-535D and/or the medical tools connected to the manipulator arms. To facilitate control of the manipulator arms 535A-535D and track progression of the medical session, the user control system 510 can include a display (e.g., the display 116 or 1130) that can provide the surgeon 530A with imagery (e.g., high-definition 3D imagery) of a surgical site associated with the patient 525 as captured by a medical tool (e.g., the medical tool 112, which can be an endoscope) installed to one of the manipulator arms 535A-535D. The user control system 510 can include a stereo viewer having two or more displays where stereoscopic images of a surgical site associated with the patient 525 and generated by a stereoscopic imaging system can be viewed by the surgeon 530A. In some embodiments, the user control system 510 can also receive images from the auxiliary system 515 and the visualization tool 520.
[00114] The surgeon 530A can use the imagery displayed by the user control system 510 to perform one or more procedures with one or more medical tools attached to the manipulator arms 535A-535D. To facilitate control of the manipulator arms 535A-535D and/or the medical tools installed thereto, the user control system 510 can include a set of controls. These controls can be manipulated by the surgeon 530A to control movement of the manipulator arms 535A-535D and/or the medical tools installed thereto. The controls can be configured to detect a wide variety of hand, wrist, and finger movements by the surgeon 530A to allow the surgeon to intuitively perform a procedure on the patient 525 using one or more medical tools installed to the manipulator arms 535A-535D.
[00115] The auxiliary system 515 can include one or more computing devices configured to perform processing operations within the surgical system 500. For example, the one or more computing devices can control and/or coordinate operations performed by various other components (e.g., the robotic medical system 505, the user control system 510) of the surgical system 500. A computing device included in the user control system 510 can transmit instructions to the robotic medical system 505 by way of the one or more computing devices of the auxiliary system 515. The auxiliary system 515 can receive and process image data representative of imagery captured by one or more imaging devices (e.g., medical tools) attached to the robotic medical system 505, as well as other data stream sources received from the visualization tool. For example, one or more image capture devices (e.g., the image capture
devices 110) can be located within the surgical system 500. These image capture devices can capture images from various viewpoints within the surgical system 500. These images (e.g., video streams) can be transmitted to the visualization tool 520, which can then pass those images through to the auxiliary system 515 as a single combined data stream. The auxiliary system 515 can then transmit the single video stream (including any data stream received from the medical tool(s) of the robotic medical system 505) to present on a display (e.g., the display 630) of the user control system 510.
[00116] In some embodiments, the auxiliary system 515 can be configured to present visual content (e.g., the single combined data stream) to other team members (e.g., the medical personnel 530B-530D) who might not have access to the user control system 510. Thus, the auxiliary system 515 can include a display 540 configured to display one or more user interfaces, such as images of the surgical site, information associated with the patient 525 and/or the surgical procedure, and/or any other visual content (e.g., the single combined data stream). In some embodiments, display 540 can be a touchscreen display and/or include other features to allow the medical personnel 530A-530D to interact with the auxiliary system 515.
[00117] The robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled one to another in any suitable manner. For example, in some embodiments, the robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled by way of control lines 545, which can represent any wired or wireless communication link that can serve a particular implementation. Thus, the robotic medical system 505, the user control system 510, and the auxiliary system 515 can each include one or more wired or wireless communication interfaces, such as one or more local area network interfaces, Wi-Fi network interfaces, cellular interfaces, etc. It is to be understood that the surgical system 500 can include other or additional components or elements that can be needed or considered desirable to have for the medical session for which the surgical system is being used.
[00118] FIG. 6 depicts an example block diagram of a computer system 600, in accordance with some embodiments. The computer system 600 can be any computing device used herein and can include or be used to implement a data processing system or its components. The computer system 600 includes at least one bus 605 or other communication component or interface for communicating information between various elements of the computer system. The computer system further includes at least one processor 610 or processing circuit coupled to the bus 605 for processing information. The computer
system 600 also includes at least one main memory 615, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 605 for storing information, and instructions to be executed by the processor 610. The main memory 615 can be used for storing information during execution of instructions by the processor 610. The computer system 600 can further include at least one read only memory (ROM) 620 or other static storage device coupled to the bus 605 for storing static information and instructions for the processor 610. A storage device 625, such as a solid-state device, magnetic disk or optical disk, can be coupled to the bus 605 to persistently store information and instructions.
[00119] The computer system 600 can be coupled via the bus 605 to a display 630, such as a liquid crystal display, or active-matrix display, for displaying information. An input device 635, such as a keyboard or voice interface can be coupled to the bus 605 for communicating information and commands to the processor 610. The input device 635 can include a touch screen display (e.g., the display 630). The input device 635 can also include a cursor control, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 610 and for controlling cursor movement on the display 630.
[00120] The processes, systems and methods described herein can be implemented by the computer system 600 in response to the processor 610 executing an arrangement of instructions contained in the main memory 615. Such instructions can be read into the main memory 615 from another computer-readable medium, such as the storage device 625. Execution of the arrangement of instructions contained in the main memory 615 causes the computer system 600 to perform the illustrative processes described herein. One or more processors in a multiprocessing arrangement can also be employed to execute the instructions contained in the main memory 615. Hard-wired circuitry can be used in place of or in combination with software instructions together with the systems and methods described herein. Systems and methods described herein are not limited to any specific combination of hardware circuitry and software.
[00121] Although an example computing system has been described in FIG. 6, the subject matter including the operations described in this specification can be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
[00122] The herein described subject matter sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are illustrative, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being “operably couplable,” to each other to achieve the desired functionality. Specific examples of operably couplable include but are not limited to physically mateable or physically interacting components or wirelessly interactable or wirelessly interacting components or logically interacting or logically interactable components.
[00123] With respect to the use of plural or singular terms herein, those having skill in the art can translate from the plural to the singular or from the singular to the plural as is appropriate to the context or application. The various singular/plural permutations can be expressly set forth herein for sake of clarity.
[00124] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.).
[00125] Although the figures and description can illustrate a specific order of method steps, the order of such steps can differ from what is depicted and described, unless specified differently above. Also, two or more steps can be performed concurrently or with partial concurrence, unless specified differently above. Such variation can depend, for example, on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations of the described methods can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.
[00126] It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation, no such intent is present. For example, as an aid to understanding, the following appended claims can contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to inventions containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” or “an” should typically be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, typically means at least two recitations, or two or more recitations).
[00127] Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general, such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
[00128] Further, unless otherwise noted, the use of the words “approximate,” “about,” “around,” “substantially,” etc., mean plus or minus ten percent.
[00129] The foregoing description of illustrative implementations has been presented for purposes of illustration and of description. It is not intended to be exhaustive or limiting with
respect to the precise form disclosed, and modifications and variations are possible in light of the above teachings or can be acquired from practice of the disclosed implementations. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.
Claims
1. A system, comprising: one or more processors, coupled with memory, to: receive a data stream that captures a procedure performed with a robotic medical system; input the data stream into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure; apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure; and provide, for overlay in a graphical user interface configured to display at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
2. The system of claim 1, wherein the one or more processors are further configured to: receive the data stream comprising data in a plurality of modalities.
3. The system of claim 1, wherein the one or more processors are further configured to: receive the data stream comprising at least two of a video stream, a kinematics stream or event stream.
4. The system of claim 1, wherein the one or more processors are further configured to: apply the mapping function to identify the surgical activity from a hierarchical ontology of entities with increasing granularity, the entities of the hierarchical ontology comprising a gesture, an action, a step formed from the anatomy and the action, a phase, and a procedure type.
5. The system of claim 4, wherein the one or more processors are further configured to: provide, for display, a graphical representation of values for each entity in the hierarchical ontology.
6. The system of claim 1, wherein the one or more processors are further configured to: identify the plurality of models comprising an anatomy presence recognition model, an instrument presence model, and an object model; and use the anatomy presence recognition model, the instrument presence model, and the object model to generate the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object.
7. The system of claim 6, wherein the one or more processors are further configured to: identify the plurality of models comprising an anatomy state detection model, a tool-tissue interaction model, and an energy use model.
8. The system of claim 1, wherein the mapping function is based on a surgical workflow structure.
9. The system of claim 1, wherein to apply the mapping function, the one or more processors are further configured to: identify a plurality of weights corresponding to the plurality of models; and fuse outputs of the plurality of models using the plurality of weights.
10. The system of claim 1, wherein the one or more processors are further configured to: identify a confidence score associated with a prediction of the surgical activity; determine the confidence score is less than or equal to a threshold; and provide, responsive to the confidence score being less than or equal to the threshold, a prompt via the graphical user interface.
11. The system of claim 10, wherein the one or more processors are further configured to: determine a second one or more confidence scores associated with at least one of the first prediction, the second prediction, or the third prediction that is less than or equal to a second threshold; and generate the prompt with an indication of the at least one of the first prediction, the second prediction, or the third prediction.
12. The system of claim 1, wherein the one or more processors are further configured to:
receive input that indicates at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object is erroneous; and update, responsive to the input, at least one model of the plurality of models related to the at least one of the first prediction of the anatomy, the second prediction of the instrument, or the third prediction of the object that is erroneous.
13. The system of claim 1, wherein the one or more processors are further configured to: determine a metric indicative of performance of the surgical activity during the temporal boundary.
14. The system of claim 13, wherein the one or more processors are further configured to: compare the metric with a historical benchmark established for the surgical activity; and provide a second indication of performance based on the comparison.
15. The system of claim 1, wherein the one or more processors are further configured to: provide the indication for overlay in the graphical user interface during performance of the procedure.
16. The system of claim 1, wherein the one or more processors are further configured to: provide the indication for overlay in the graphical user interface subsequent to performance of the procedure.
17. A method, comprising: receiving, by one or more processors coupled with memory, a data stream that captures a procedure performed with a robotic medical system; identifying, by the one or more processors, a plurality of models trained with machine learning to make predictions related to an anatomy, an instrument, and an object associated with the procedure; using, by the one or more processors, the plurality of models to generate a first prediction of the anatomy, a second prediction of the instrument, and a third prediction of the object based on the data stream; applying, by the one or more processors, a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure; and
displaying, by the one or more processors alongside at least a portion of the data stream, an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
18. The method of claim 17, comprising: receiving, by the one or more processors, the data stream comprising at least two of a video stream, a kinematics stream or event stream.
19. A non-transitory computer-readable medium storing processor executable instructions, that when executed by one or more processors, cause the one or more processors to: identify a data file that captures a procedure performed with a robotic medical system; input the data file into a plurality of models trained with machine learning to generate a first prediction of an anatomy, a second prediction of an instrument, and a third prediction of an object associated with the procedure; apply a mapping function to the first prediction, the second prediction and the third prediction to determine a temporal boundary of a surgical activity performed via the robotic medical system in the procedure; and present an indication of the temporal boundary, the surgical activity and the first prediction of the anatomy, the second prediction of the instrument, and the third prediction of the object used to determine, via the mapping function, the surgical activity.
20. The non-transitory computer-readable medium of claim 19, wherein the data file is generated from a video stream of the procedure, a kinematics stream of the procedure, and an event stream of the procedure.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363610340P | 2023-12-14 | 2023-12-14 | |
| US63/610,340 | 2023-12-14 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025129013A1 true WO2025129013A1 (en) | 2025-06-19 |
Family
ID=94323229
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/060054 Pending WO2025129013A1 (en) | 2023-12-14 | 2024-12-13 | Machine learning based medical procedure identification and segmentation |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025129013A1 (en) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220148702A1 (en) * | 2018-05-23 | 2022-05-12 | Verb Surgical Inc. | Surgery evaluation using machine-learning-based |
| US20220160433A1 * | 2020-11-20 | 2022-05-26 | Auris Health, Inc. | AI-Based Automatic Tool Presence And Workflow/Phase/Activity Recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24841347; Country of ref document: EP; Kind code of ref document: A1 |