WO2025136978A1 - Surgical instrument presence detection with noisy label machine learning
- Publication number: WO2025136978A1
- Application: PCT/US2024/060572
- Authority: WIPO (PCT)
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
A technical solution provides a framework to generate and use noisy label-tolerant ML models to detect objects in robotic procedure videos. The framework can identify a series of frames of a video of a medical procedure captured by a robotic medical system. The framework can identify a model trained based on frames of a plurality of videos captured for medical procedures labeled with data identifying installation of instruments of robotic medical systems. The framework can determine, using the model, a per-frame label for each frame of the series of frames, the per-frame label indicative of a probability of presence of one or more types of instruments. The framework can display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model.
Description
SURGICAL INSTRUMENT PRESENCE DETECTION
WITH NOISY LABEL MACHINE LEARNING
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/611,636, filed December 18, 2023, which is hereby incorporated herein by reference in its entirety.
BACKGROUND
[0002] Medical procedures can be performed in an operating room. As the amount and variety of equipment in the operating room increases, or medical procedures become increasingly complex, it can be challenging to perform such medical procedures efficiently, reliably, or without incident.
SUMMARY
[0003] The technical solutions of the present disclosure provide noisy label-tolerant machine learning (ML) models for detection of objects in robotic procedure videos. For example, the technical solutions provide a framework for identifying surgical instruments captured in surgical videos using an ML model trained on a noisy label dataset derived from robotic system instrument installation logs, rather than a supervised learning process. As a result, the technical solutions can scale up the available training dataset to provide highly accurate identification of surgical instruments with a reduced amount of training resources. For example, the technical solutions can provide a spatial-temporal transformer neural network model that is tolerant to noisy label datasets and that identifies and tags medical instruments captured in videos of robotic surgeries.
[0004] At least one aspect of the technical solutions is directed to a system. The system can include one or more processors, coupled with memory. The one or more processors can identify a series of frames of a video of a medical procedure captured by a robotic medical system. The one or more processors can identify a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled with data identifying installation of one or more instruments of one or more robotic medical systems. The one or more processors can determine, using the model, a per-frame label for each frame of the series of frames, the per-frame label indicative of a probability of presence of
one or more types of instruments. The one or more processors can display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model.
[0005] The one or more processors can be configured to receive a set of frames of the plurality of videos captured for one or more medical procedures. The one or more processors can identify, based on a final frame of the set of frames, a label for the set of frames. The label can include data indicative of a time of installation of the one or more instruments and a final time stamp of the final frame. The one or more processors can train the model using the label for the set of frames.
[0006] The one or more processors can be configured to identify, for the frames of the plurality of videos, a plurality of labels. Each label of the plurality of labels can include a vector of one or more values corresponding to one or more instruments. The one or more processors can be configured to train the model using the plurality of labels.
[0007] The one or more processors can be configured to determine, for the frames of the plurality of videos, a plurality of labels for the frames, each label of the plurality of labels having a value indicative of whether the one or more instruments is installed at the one or more robotic medical systems at a time of each respective frame of the frames. The one or more processors can be configured to train the model using the plurality of labels.
[0008] The one or more processors can be configured to identify one or more logs of the one or more robotic medical systems. Each log of the one or more logs can indicate a time of installation of the one or more instruments for a respective video of the plurality of videos. The one or more processors can be configured to assign, for each frame of the frames of the plurality of videos, a label of the plurality of labels indicative of the time of installation from a respective log of the one or more logs corresponding to the respective video of the plurality of videos.
[0009] The model can include a transformer neural network model that applies a first one or more weights to one or more spatial dimensions within an area of an image within a frame of the frames of the plurality of videos and a second one or more weights to temporal dimensions across a group of frames of the frames of the plurality of videos. The one or more processors can be configured to generate, based at least on the frames of the plurality of videos, a heat map indicative of an area within a subset of the frames for which the probability of presence of the type of instrument exceeds a threshold for the heat map.
[0010] The one or more processors can be configured to identify, based at least on the frames of the plurality of videos, a second area within the subset of the frames for which the probability of presence of the type of instrument exceeds a second threshold exceeding the first threshold. The one or more processors can be configured to display, via the graphical user interface, at least two of the subset of the frames, the heat map, the first area and the second area.
[0011] The one or more processors can be configured to compare the probability of presence for each respective frame of the series of frames with a threshold for presence of the type of instrument. The one or more processors can be configured to determine, based at least in part on the comparison, a second per-frame label for each respective frame of the series of frames indicative of whether the type of instrument is present at the robotic medical system.
[0012] The one or more processors can be configured to identify, from the series of frames of the video, a subset of the frames corresponding to a portion of the video capturing the type of the instrument used in the medical procedure. The one or more processors can be configured to determine, based on the subset of the frames input into the model, the respective per-frame label for the subset of the frames.
[0013] The one or more processors can be configured to receive, from a robotic medical system, a file comprising an indication of a time of the installation of the one or more instruments at the robotic medical system. The one or more processors can be configured to determine, based at least on the indication of the time input into the model, the respective per-frame label for at least a frame of the frames.
[0014] The one or more processors can be configured to generate, based at least on the per-frame label for each frame of the series of frames, a series of per-frame labels. The one or more processors can be configured to adjust a value of a first per-frame label for a first frame of the series of frames using at least a second value of a second per-frame label for a second frame adjacent to the first frame.
[0015] The one or more processors can be configured to determine, using the model, the per-frame label based at least on a time of installation of the one or more instruments at the one or more robotic medical systems. The one or more processors can be configured to determine, based on a comparison of the time stamp and the time of installation, the probability of the presence.
[0016] The one or more processors can be configured to identify the type of instrument based at least on the probability of presence exceeding a threshold for the type of instrument. The one or more processors can be configured to display the indication identifying the type of instrument. The indication can be overlaid over a subset of the series of frames displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds the threshold for the type of instrument.
[0017] At least one aspect of the technical solutions is directed to a method. The method can include one or more processors coupled with memory labeling frames of a plurality of videos captured for one or more medical procedures using data identifying installation of one or more types of instruments of one or more robotic medical systems. The method can include the one or more processors training a model using the labeled frames. The method can include the one or more processors determining, using the model, a per-frame label for each frame of the series of frames. The per-frame label can be indicative of a probability of presence of the one or more types of instruments. The method can include the one or more processors displaying, via a graphical user interface, an indication of a presence of a type of instrument.
[0018] The method can include the one or more processors determining the presence of the type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model. The method can include the one or more processors receiving a set of frames of the plurality of videos captured for one or more medical procedures. The method can include the one or more processors identifying, based on a final frame of the set of frames, a label for the set of frames. The label can include data indicative of a time of installation of the one or more instruments and a final time stamp of the final frame. The method can include the one or more processors training the model using the label for the set of frames.
[0019] At least one aspect of the technical solutions is directed to a non-transitory computer-readable medium storing processor executable instructions, that when executed by one or more processors, cause the one or more processors to identify a series of frames of a video of a medical procedure captured by a robotic medical system. The instructions, when executed by the one or more processors can cause the one or more processors to identify a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled with data identifying installation of one or more instruments of one or more robotic medical systems. The instructions, when executed by the one or more processors can cause the one or more processors to determine, using the model, a
per-frame label for each frame of the series of frames, the per-frame label indicative of a probability of presence of one or more types of instruments. The instructions, when executed by the one or more processors, can cause the one or more processors to display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model. The indication can be overlaid over a subset of the series of frames displayed on the graphical user interface. The subset of the series can have the probability of presence that exceeds a threshold for the type of instrument.
[0020] These and other aspects and implementations are discussed in detail below. The foregoing information and the following detailed description include illustrative examples of various aspects and implementations and provide an overview or framework for understanding the nature and character of the claimed aspects and implementations. The drawings provide illustration and a further understanding of the various aspects and implementations and are incorporated in and constitute a part of this specification. The foregoing information and the following detailed description and drawings include illustrative examples and should not be considered as limiting.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component can be labeled in every drawing. In the drawings:
[0022] FIG. 1 depicts an example system for generating and deploying noisy label-tolerant ML models to detect objects in robotic procedure videos.
[0023] FIG. 2 illustrates an example of a graphical user interface providing indications for an ML model-processed video frame capturing and identifying medical instruments.
[0024] FIG. 3 illustrates an example system configuration for generating and deploying noisy label-tolerant ML models to detect and provide instrument presence and spatial context.
[0025] FIG. 4 illustrates an example flow diagram of a method for generating and using noisy label-tolerant ML models to detect objects in robotic procedure videos.
[0026] FIG. 5 illustrates an example of a surgical system, in accordance with some aspects of the technical solutions.
[0027] FIG. 6 illustrates an example block diagram of an example computer system, in accordance with some aspects of the technical solutions.
DETAILED DESCRIPTION
[0028] Following below are more detailed descriptions of various concepts related to, and implementations of, systems, methods, and apparatuses for surgical instrument presence detection with noisy label machine learning. The various concepts introduced above and discussed in greater detail below can be implemented in any of numerous ways.
[0029] Although the present disclosure is discussed in the context of a surgical procedure, in various aspects, the technical solutions of this disclosure can be applicable to other medical treatments, sessions, environments or activities, as well as non-medical activities where object-based procedure identification is desired. For instance, the technical solutions can be applied in any environment, application or industry in which activities, operations, processes or acts are performed with tools or instruments that can be captured on video and for which ML modeling can be used to identify or recognize the tools or instruments in the videos.
[0030] Training ML models to identify surgical instruments from recorded procedure videos can be challenging and dependent on large and human-annotated training datasets and supervised learning. However, ML model training with such datasets can be time consuming and compute resource intensive. In addition, such model training processes can fail to represent the full diversity of scenarios the model could encounter and be prone to introducing annotator biases, all of which can impact the model performance. ML model training with large or human-annotated datasets can be hard to modify or scale up as annotation efforts for large datasets can introduce additional time delays adversely affecting the ability of the system to make timely corrections through data retraining.
[0031] In robotic medical systems, however, system logs can be used to record surgical instrument installation events, capturing instances in which surgical instruments are installed on a robot arm of a robotic surgical system. Using such installation logs in combination with surgical videos to provide training samples for the machine-learning model can be challenging as the system logs can be temporally mismatched (e.g., offset in time) with the instrument occurrences in the video recording. For instance, a delay can exist between the time of the surgical tool installation in an installation log and a moment in which the installed instrument appears in a video frame. Such delays can vary depending on the circumstances, making it hard to estimate the delay duration. In addition, medical procedure video recordings can include instances in which the visibility or appearance of the surgical instruments is affected, such as visual obstructions of the camera's viewing angle or occlusions (e.g., by another object, the patient's body or a surgeon's hand). As a result, instrument installation logs can introduce noise into the datasets, which can result in a noisy labeling of the training data set, including, for example, missing labels or mislabeling of data.
[0032] The technical solution overcomes these challenges by providing a noisy label-tolerant ML neural network model trained for detection of surgical instruments in robotic surgery videos using noisy label datasets having videos and instrument installation logs. The technical solutions can be implemented using a pipeline for ML model training having multiple training stages. For example, an input framing stage can include discretizing the surgical videos into sequential framed images. The frame rates can be varied, and the number of images used to form the input batch can be selected such that the training video clips capture a sufficient amount of temporal information (e.g., a sufficient time duration) to provide sufficient surgical context (e.g., the surgical task being performed in the clip). The label associated with the image fragment, clip or batch can include a vector of n binary elements representing n classes of instruments. The label vector can represent the presence of the instruments for the last frame of the image clip or batch. The binary label can come from, or can be created using, the system tool installation log and thus not include any human annotation or human labeling.
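As an illustration of this input framing stage, the following sketch shows how clips could be cut from a video and labeled with an n-element binary vector taken from an installation log; the instrument class list, log fields, frame rate, and function names are assumptions made for this example rather than details specified by this disclosure.

```python
# A minimal sketch of the input framing stage described above: a surgical video is
# discretized into fixed-length clips, and each clip receives an n-element binary
# label vector derived from the robotic system's installation log (no human annotation).
from typing import Dict, List

INSTRUMENT_CLASSES = ["needle_driver", "grasper", "scissors", "cautery"]  # example n = 4 classes


def label_for_timestamp(install_log: List[Dict], t: float) -> List[int]:
    """Binary vector: element i is 1 if the log indicates instrument i is installed at time t."""
    label = [0] * len(INSTRUMENT_CLASSES)
    for event in install_log:  # e.g., {"instrument": "grasper", "install_t": 12.4, "remove_t": 310.2}
        if event["install_t"] <= t <= event.get("remove_t", float("inf")):
            label[INSTRUMENT_CLASSES.index(event["instrument"])] = 1
    return label


def make_clips(num_frames: int, fps: float, clip_len: int, install_log: List[Dict]) -> List[Dict]:
    """Group frame indices into clips; each clip is labeled by the timestamp of its final frame."""
    clips = []
    for start in range(0, num_frames - clip_len + 1, clip_len):
        frame_ids = list(range(start, start + clip_len))
        last_t = frame_ids[-1] / fps  # timestamp of the last frame in the clip
        clips.append({"frames": frame_ids, "label": label_for_timestamp(install_log, last_t)})
    return clips
```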
[0033] The technical solution can include a spatial-temporal transformer neural network model that can include an ML core, a type of neural network that takes sequential frames as input. The model can perform numeric operations over the sequential input and convert the input into a compact feature vector that represents it. More specifically, the model can process or determine any spatial and temporal correlation identifying the representative semantics of the surgical event by applying an "attention" mechanism. The attention mechanism can include a selective weighting technology that can apply different weights to emphasize different parts of the information in the data, resulting in identifying the most effective compact representation that fulfills the task. The same weighting process can facilitate containing any errors (e.g., noise) in the labeling during the training process. For example, the attention mechanism of the model can use the assignment of weights to focus more on relevant spatial and temporal features while downplaying the impact of the noisy or mislabeled data. For example, the model can prioritize informative parts of the data and reduce the influence of the less relevant or erroneous information. The attention mechanism can include weighting capabilities on both the spatial and temporal dimensions. The ML model structures can include, for example, a neural network architecture that leverages transformer-based models to process and understand three-dimensional visual data or a convolution-free approach to video classification using self-attention over space and time.
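A minimal sketch of such divided spatial and temporal self-attention over a clip of frame patch embeddings is shown below; it is written in PyTorch purely for illustration, and the tensor layout, dimensions, and class name are assumptions rather than the specific architecture prescribed by this disclosure.

```python
# Divided spatial and temporal self-attention over patch embeddings with layout
# (batch, time, patches, dim); attention is applied over time, then over space.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: patches within the same frame attend to one another.
        xs = x.reshape(b * t, p, d)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        return xs.reshape(b, t, p, d)
```

For example, a clip of 8 frames, each split into 196 patch embeddings of dimension 256, would pass through this block as a tensor of shape (batch, 8, 196, 256) and come out with the same shape, with attention applied separately over time and over space.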
[0034] The technical solution can include classification functionality that can treat the identification of the presence of surgical instruments in videos as a classification problem. The classification functionality can be designed or configured to regress the feature vectors to a length-n vector. The classification functionality can include the learning objective of comparing the elements with the data label and minimizing the difference between the two. The technical solutions can include a model training module that can fetch the training data from a database and process the data in a format compatible with the neural network module configuration.
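A compact sketch of this learning objective is given below, assuming a 256-dimensional clip feature, four instrument classes, and a binary cross-entropy loss (reasonable defaults for multi-label presence classification, not values stated in this disclosure).

```python
# Classification stage sketch: a pooled clip feature is regressed to a length-n vector
# and compared against the (possibly noisy) binary label with a multi-label loss.
import torch
import torch.nn as nn

n_classes = 4                                 # one element per instrument class
head = nn.Linear(256, n_classes)              # regresses the compact feature vector to length n
criterion = nn.BCEWithLogitsLoss()            # objective: minimize difference from the data label

clip_feature = torch.randn(8, 256)            # batch of 8 compact clip features from the backbone
noisy_label = torch.randint(0, 2, (8, n_classes)).float()  # label vectors from installation logs
loss = criterion(head(clip_feature), noisy_label)
loss.backward()                               # a standard optimizer step would follow in a training loop
```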
[0035] In an inference or deployment stage, the learned configuration and parameters of the neural network model can be transferred to the processing unit. The inferencing procedure video can be discretized in the same or a similar way as videos in the training stage and can be fed into the ML model. The output from the model can include a length-n numerical vector representing the probability of the presence of each type of instrument. By comparing these probabilities with a threshold for determining the presence of each type of instrument, the length-n vector can be binarized element by element, indicating the existence of the corresponding instrument.
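For example, the binarization step could be as simple as the following sketch, where the per-instrument threshold values are illustrative assumptions:

```python
# Inference-stage sketch: the model's length-n output is read as per-instrument
# presence probabilities and compared against per-class thresholds.
import torch

probs = torch.sigmoid(torch.randn(4))            # stand-in for the model's length-n output (n = 4)
thresholds = torch.tensor([0.5, 0.5, 0.6, 0.7])  # one presence threshold per instrument type
present = (probs > thresholds).int()             # binarized length-n vector: 1 = instrument present
```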
[0036] The technical solutions can perform the detection or identification of the surgical instruments on each of the inferencing frames. A smoothing post-process can be applied to the sequential labels over the time span of the sequence of images. The per-frame presence label can then be converted to an instrument tag, logging the instrument type and start time or end time.
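One way such a post-process could work is sketched below, using a simple median filter for smoothing and a run-length pass to turn the per-frame labels into tags; the window size, frame rate, and helper names are assumptions made for illustration.

```python
# Smoothing and tag-conversion sketch: a median filter removes isolated flickers in the
# per-frame presence sequence, then runs of 1s become (instrument, start, end) tags.
from statistics import median


def smooth(labels, window=5):
    """Median-filter a per-frame 0/1 sequence to suppress single-frame flickers."""
    half = window // 2
    return [int(median(labels[max(0, i - half): i + half + 1])) for i in range(len(labels))]


def to_tags(labels, instrument, fps=30.0):
    """Convert a smoothed 0/1 sequence into (instrument, start_time_s, end_time_s) tags."""
    tags, start = [], None
    for i, v in enumerate(labels + [0]):          # trailing sentinel closes an open run
        if v and start is None:
            start = i
        elif not v and start is not None:
            tags.append((instrument, start / fps, (i - 1) / fps))
            start = None
    return tags


per_frame = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0]     # example raw per-frame presence labels
tags = to_tags(smooth(per_frame), "grasper")      # one tag covering the smoothed presence run
```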
[0037] Benefitting from the attention mechanism embedded in the neural network structure, at the inference stage, the technical solution can utilize the "attention" spatially in the format of a heat map. The heat map can include a heat zone on an image frame of the video clip corresponding to the most likely appearing locations of the instrument. For example, the heat map can facilitate providing additional spatial information to tag the image, such as distinguishing the left or right location or end of the instrument.
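As a rough illustration of how spatial attention could be rendered as a heat map, the sketch below reshapes per-patch attention weights onto the patch grid and upsamples them to image resolution; the 14x14 grid, 224x224 image size, and 0.6 cut-off are assumptions rather than parameters given in this disclosure.

```python
# Heat-map sketch: per-patch attention weights for a frame are laid out on the
# patch grid and upsampled to image resolution for display or thresholding.
import torch
import torch.nn.functional as F

patch_attn = torch.softmax(torch.randn(14 * 14), dim=0)        # stand-in attention over a 14x14 patch grid
heat = patch_attn.reshape(1, 1, 14, 14)                         # lay the weights out on the spatial grid
heat = F.interpolate(heat, size=(224, 224), mode="bilinear", align_corners=False)
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)   # normalize to [0, 1] for display
hot_zone = heat > 0.6                                           # boolean mask of the likely instrument location
```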
[0038] The technical solutions can include components in the neural network model that can include varied implementations. For example, alternatively or additionally, the instrument presence identification model can be trained on any combination of noisy system-log-based label data and human-annotated labels. For example, a detailed spatial-temporal neural network can be configured differently depending on the use case, such as convolutional and recurrent neural network architectures, two-stream network architectures with separate streams for spatial and temporal processing, or graph neural networks. In doing so, the technical solutions can leverage the large volume of noisy labels to train an accurate surgical tool presence identification model. The visualized attention can correspond to the active area in the spatial-temporal dimension, which can indicate the spatial location of the identified instrument. For example, the technical solutions can provide the heat map as a highlighted presence zone providing a spatial context or indication of the image in the user interface by visualizing the model attention. For example, the technical solutions can include a post-process on the visualized heat map that can facilitate distinguishing the instruments installed on the left arm or the right arm.
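One simple form such a heat-map post-process could take is comparing the attention mass on the two halves of the frame, as in the sketch below; this heuristic is an illustrative assumption, not the specific post-process prescribed by this disclosure.

```python
# Left/right arm heuristic sketch: the image half carrying more attention mass is
# taken as the side on which the identified instrument is installed.
import torch


def left_or_right(heat: torch.Tensor) -> str:
    """heat: (H, W) normalized attention map; returns which half of the frame is 'hotter'."""
    w = heat.shape[1]
    left_mass = heat[:, : w // 2].sum()
    right_mass = heat[:, w // 2:].sum()
    return "left arm" if left_mass > right_mass else "right arm"


arm = left_or_right(torch.rand(224, 224))  # example call on a stand-in heat map
```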
[0039] The technical solutions can include a user interface or a user experience functionality in which the identified instruments can show their presence time (e.g., start and end time) as well as a duration in a timeline bar. Such a functionality can provide support to navigate the user to check and review the instrument usage status. The user interface or experience functionality can include a visualized procedure video with a highlight heat map around the instrument and the identification results that can include the left arm or right arm label for the identified instrument. The instrument identification can include a detailed performance assessment, individual status, and a confidence score.
[0040] FIG. 1 depicts an example system 100 for generating and deploying noisy label-tolerant ML models to detect objects in robotic procedure videos. Example system 100 can include a robotic system for performing tasks using tools or instruments, such as a robotic medical system 120 used by a surgeon to perform a surgery on a patient. Robotic medical system 120, also referred to as an RMS 120, can be deployed in a medical environment 102. Medical environment 102 can include any space or facility for performing medical procedures, such as a surgical facility, or an operating room. Medical environment 102 can include medical instruments 112 that the RMS 120 can use for performing surgical patient procedures, whether invasive, non-invasive, in-patient, or out-patient procedures.
[0041] The medical environment 102 can include one or more data capture devices 110 (e.g., optical devices, such as cameras or sensors or other types of sensors or detectors) for capturing data streams 162 (e.g., images or videos of a surgery). The medical environment 102 can include one or more visualization tools 114 to gather the captured data streams 162 and process it for display to the user (e.g., a surgeon or other medical professional) at one or more displays 116. A display 116 can present data stream 162 (e.g., images or video frames) of a medical procedure (e.g., surgery) being performed using the robotic medical system 120 handling, manipulating, holding or otherwise utilizing medical tools 112 to perform surgical tasks at the surgical site. RMS 120 can include installation data 122 which can include system logs indicating installation times of the medical instruments 112 on various manipulator arms of the robotic medical system 120. Coupled with the RMS 120, via a network 101, can be a data processing system (DPS) 130. DPS 130 can include one or more machine learning (ML) trainers 140, data repositories 160, processing functions 170 and interfaces 180.
[0042] Machine learning (ML) trainer 140 can include or generate instrument models 144 which can be trained using training datasets 142 that can include video frames 164 and installation data 122 from the RMS 120. ML trainer 140 can use the training dataset 142 to label video frames 164 with labels 148 and improve the performance of the instrument models 144 using weights 146 to more accurately detect and identify instrument predictions 152 according to the attention mechanism 154.
[0043] Data repository 160 of the DPS 130 can include one or more data streams 162, such as a stream of video frames 164. Data streams 162 can include measurements or sensor data (e.g., force, torque or biometric data, haptic feedback data, endoscopic images or data, ultrasound images or videos, or communication and command data streams). Data repository 160 can include installation data, such as system files or logs including time stamps and data on installation, activation, calibration or use of particular medical instruments 112.
[0044] Processing functions 170 can include functionality for processing data, including for example, functionality for generating heat maps 172 and performing frame smoothing 174. Heat maps 172 can include heat zones or highlighting of areas within a video frame 164 of the data stream 162 in which medical instruments 112 are determined by the instrument model 144 to be found. Frame smoothing 174 can include correction of labels 148 for a video frame 164 based on labels 148 of other frames surrounding the given video frame 164. Interface 180 can include, for example, a graphical user interface for providing
indications 182 that can indicate features, such as labels 148, indications of instruments 112 identified in instrument predictions 152 or heat maps 172.
[0045] The system 100 can include one or more data capture devices 110 (e.g., video cameras, sensors or detectors) for collecting any data stream 162, that can be used for machine learning and detection of objects, such as medical instruments or tools. Data capture devices 110 can include cameras or other image capture devices for capturing videos or images from a particular viewpoint within the medical environment 102. The data capture devices 110 can be positioned, mounted, or otherwise located to capture content from any viewpoint that facilitates the data processing system capturing various surgical tasks or actions.
[0046] Data capture devices 110 can include any of a variety of sensors, cameras, video imaging devices, infrared imaging devices, visible light imaging devices, intensity imaging devices (e.g., black, color, grayscale imaging devices, etc.), depth imaging devices (e.g., stereoscopic imaging devices, time-of-flight imaging devices, etc.), medical imaging devices such as endoscopic imaging devices, ultrasound imaging devices, etc., non-visible light imaging devices, any combination or sub-combination of the above mentioned imaging devices, or any other type of imaging devices that can be suitable for the purposes described herein. Data capture devices 110 can include cameras that a surgeon can use to perform a surgery and observe manipulation components within a field of view suitable for the given task performance.
[0047] Data capture devices 110 can capture, detect, or acquire sensor data, such as videos or images, including for example, still images, video images, vector images, bitmap images, other types of images, or combinations thereof. The data capture devices 110 can capture the images at any suitable predetermined capture rate or frequency. Settings, such as zoom settings or resolution, of each of the data capture devices 110 can vary as desired to capture suitable images from any viewpoint. For instance, data capture devices 110 can have fixed viewpoints, locations, positions, or orientations. The data capture devices 110 can be portable, or otherwise configured to change orientation or telescope in various directions. The data capture devices 110 can be part of a multi-sensor architecture including multiple sensors, with each sensor being configured to detect, measure, or otherwise capture a particular parameter (e.g., sound, images, or pressure).
[0048] Data capture devices 110 can include any type and form of a sensor, such as a positioning sensor, a biometric sensor, a velocity sensor, an acceleration sensor, a vibration
sensor, a motion sensor, a pressure sensor, a light sensor, a distance sensor, a current sensor, a focus sensor, a temperature or pressure sensor or any other type and form of sensor used for providing data on medical tools 112, or data capture devices (e.g., optical devices). For example, a data capture device 110 can include a location sensor, a distance sensor or a positioning sensor providing coordinate locations of a medical tool 112 or a data capture device 110. Data capture device 110 can include a sensor providing information or data on a location, position or spatial orientation of an object (e.g., medical tool 112 or a lens of data capture device 110) with respect to a reference point. The reference point can include any fixed, defined location used as the starting point for measuring distances and positions in a specific direction, serving as the origin from which all other points or locations can be determined.
[0049] Display 116 can show, illustrate or play data streams 162 (e.g., video frames 164) in which medical tools 112 at or near surgical sites are shown. For example, display 116 can display a rectangular image (e.g., a video frame 164) of a surgical site along with at least a portion of medical tools 112 (e.g., instruments) being used to perform surgical tasks. Display 116 can provide compiled or composite images generated by the visualization tool 114 from a plurality of data capture devices 110 to provide a visual feedback from one or more points of view.
[0050] The visualization tool 114 can be configured or designed to receive any number of different data streams 162 from any number of data capture devices 110 and combine them into a single data stream displayed on a display 116. The visualization tool 114 can be configured to receive a plurality of data stream components and combine the plurality of data stream components into a single data stream 162. For instance, the visualization tool 114 can receive visual sensor data from one or more medical tools 112, sensors or cameras with respect to a surgical site or an area in which a surgery is performed. The visualization tool 114 can incorporate, combine or utilize multiple types of data (e.g., positioning data of a medical tool 112 along with sensor readings of pressure, temperature, vibration or any other data) to generate an output to present on a display 116. Visualization tool 114 can present locations of medical tools 112 along with locations of any reference points or surgical sites, including locations of anatomical parts of the patient (e.g., organs, glands or bones).
[0051] Medical tools 112 can be any type and form of tool or instrument used for surgery, medical procedures or a tool in an operating room or environment. Medical tool 112 can be imaged by, associated with or include an image capture device. For instance, a medical tool 112 can be a tool for making incisions, a tool for suturing a wound, an endoscope for
visualizing organs or tissues, an imaging device, a needle and a thread for stitching a wound, a surgical scalpel, forceps, scissors, retractors, graspers, or any other tool or instrument to be used during a surgery. Medical tools 112 can include hemostats, trocars, surgical drills, suction devices or any instruments for use during a surgery. The medical tool 112 can include other or additional types of therapeutic or diagnostic medical imaging implements. The medical tool 112 can be configured to be installed in, coupled with, or manipulated by an RMS 120, such as by manipulator arms or other components for holding, using and manipulating the medical instruments or tools 112.
[0052] RMS 120 can be a computer-assisted system configured to perform a surgical or medical procedure or activity on a patient via or using or with the assistance of one or more robotic components or medical tools 112. RMS 120 can include any number of manipulator arms for grasping, holding or manipulating various medical tools 112 and performing computer-assisted medical tasks using medical tools 112 controlled by the manipulator arms.
[0053] The images (e.g., video images) captured by a medical tool 112 can be sent to the visualization tool 114. The robotic medical system 120 can include one or more input ports to receive direct or indirect connection of one or more auxiliary devices. For example, the visualization tool 114 can be connected to the RMS 120 to receive the images from the medical tool when the medical tool is installed in the RMS 120 (e.g., on a manipulator arm for handling medical instruments 112). The visualization tool 114 can combine the data stream components from the data capture devices 110 and the medical tool 112 into a single combined data stream for presenting on a display 116.
[0054] The system 100 can include a data processing system 130. The data processing system 130 can be deployed in or associated with the medical environment 102, or it can be provided by a remote server or be cloud-based. The data processing system 130 can include an interface 180 designed, constructed and operational to communicate with one or more components of system 100 via network 101, including, for example, the robotic medical system 120. Data processing system 130 can be implemented using instructions stored in memory locations and processed by one or more processors, controllers or integrated circuitry. Data processing system 130 can include functionalities, computer codes or programs for executing or implementing ML trainer 140 and the instrument model 144 to identify, recognize, detect or indicate the location of medical instruments 112 in the video frames 164 of a surgical video recording.
[0055] The ML trainer 140 can include any combination of hardware and software for training ML models. ML trainer 140 can include a framework or functionality for training noisy label-tolerant machine learning models, such as a neural network spatial-temporal attention mechanism model designed for detecting medical instruments 112, or any other tools, in videos (e.g., videos of robotic surgeries). ML trainer 140 can utilize or leverage video recordings of robotic surgeries conducted with an RMS 120, paired with label data derived from instrument installation logs (e.g., installation data 122) from an RMS 120. ML trainer 140 can train ML models, such as instrument model 144, using installation data 122 that can include noise or discrepancies, such as temporal mismatches between the installation times of medical instruments 112 and the timing of appearances of the medical instruments 112 in the video files.
[0056] The ML trainer 140 can include an attention mechanism 154 that can be used to address the noise challenges in the data. Attention mechanism 154 can include a scheme or functionality that utilizes weights 146 to configure an instrument model 144 to selectively focus on certain parts of input data (e.g., video frames 164), assigning varying degrees of importance to each part of the input data during the learning process. For example, the attention mechanism 154 can include a spatial-temporal attention mechanism 154 within the neural network architecture that configures the model to focus selectively on relevant spatial and temporal features in the video data. By assigning weights 146 to different segments of the input videos, the attention mechanism 154 allows the model to attenuate the impact of noisy labels 148, emphasizing more reliable cues for more accurate instrument detection. In doing so, the attention mechanism 154 can mitigate or reduce the effects of noise in the training dataset 142, enhancing the ability of the instrument model 144 to generalize and make accurate predictions on previously unseen surgical videos.
[0057] For instance, the attention mechanism 154 can facilitate assigning of higher weights 146 to particular portions of the surgical procedure where instrument presence is unambiguous, reducing the influence of potential mislabeling during less distinctive phases. For instance, the attention mechanism 154 can assign higher weights 146 to portions of data stream 162 whose video frames 164 include labels 148 having time stamps that are within the time range corresponding to the installation time identified in the installation data 122 for the RMS 120. The result of such weight assignments can be a noise-tolerant ML neural network capable of accurately discerning medical instruments 112. Such a training strategy shows the adaptability of ML methodologies to the intricacies of noisy training dataset 142.
[0058] Instrument model 144 can include any variety or combination of machine learning architectures. For example, instrument model 144 can include support vector machines (SVMs) that can facilitate predictions (e.g., anatomical, instrument, object, action or any other) in relation to class boundaries, random forests for classification and regression tasks, decision trees for prediction trees with respect to distinct decision points, K-nearest neighbors (KNNs) that can use similarity measures for predictions based on characteristics of neighboring data points, Naive Bayes functions for probabilistic classifications, logistic or linear regressions, or gradient boosting models. Instrument model 144 can include neural networks, such as deep neural networks configured for hierarchical representations of features, convolutional neural networks (CNNs) for image-based classifications and predictions, as well as spatial relations and hierarchies, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for determining structures and processes unfolding over time or multimodal data integration in which medical images can be combined with patient's data or history.
[0059] Instrument model 144 can include or utilize transformers or transformer-based architecture, such as a spatial-temporal transformer or graph neural networks with transformers, which can be configured to make instrument predictions 152. Instrument predictions 152 can include any identifications, predictions, determinations or recognitions of a medical instrument 112 (e.g., instrument type) captured by a video. The spatial-temporal transformer can facilitate determinations of heat maps 172 or highlighting of particular regions of interest in video frames 164 corresponding to locations in which medical instruments 112 are being identified or detected. For example, transformers can be used for multimodal integration in which data streams 162 from multiple types of sources (e.g., data from various detectors, sensors and cameras) can be combined for predictions. A spatial-temporal transformer neural network can be applied to video frames 164 to facilitate determining spatial relations of features across different images or data sources (e.g., 110). Instrument model 144 can include any one or more machine learning (e.g., deep neural network) models trained on diverse datasets to learn to recognize intricate details of objects, instruments or tools, such as edges or shapes of instruments or specific instrument types.
[0060] Instrument model 144 can be stored in a data repository 160, along with training data sets 142, video frames 164 or installation data 122. Instrument model 144 can be trained, established, configured, updated, or otherwise provided by a ML trainer 140. Instrument model 144 can be configured to identify, predict, classify, categorize, or otherwise score various performance aspects. For example, instrument model 144 can be configured to determine a
confidence score with respect to an instrument prediction 152 (e.g., instrument type). For example, a confidence score can indicate a score (e.g., percentage of confidence from 0 to 100%) indicative of the level of certainty or confidence that the instrument model 144 has with respect to a particular instrument prediction 152.
[0061] Instrument model 144 can be configured to make an instrument prediction 152 (e.g., prediction of any object for which the model is trained to identify). Instrument prediction 152 can include any determination, recognition, identification or prediction of an object, such as a medical instrument 112 (e.g., instrument type), or any other that the model may be trained to recognize. Instrument prediction 152 can include or correspond to a label 148. Label 148 can be used to indicate presence or absence of the recognized or identified object (e.g., medical instrument 112). Label 148 can be used, for example, to indicate a location in the video frame 164 at which medical instrument 112 is located. Label 148 can indicate a probability that an object (e.g., a medical instrument 112) is identified within a video frame 164 of an incoming (e.g., real-time streamed) video that can be input into the instrument model 144 to determine instrument predictions 152. Label 148 can include a vector of a plurality of values, each of which can correspond to a probability that a particular medical instrument 112 (e.g., instrument type) is present, identified or recognized.
[0062] Instrument model 144 can include, for example, a deep learning model configured to identify, detect or recognize instrument predictions 152 of a particular medical or surgical tool used in a surgery, such as any one or more of shears, needles, threads, scalpels, clips, rings, bone screws, graspers, retractors, saws, forceps, imaging devices, or any other medical instrument 112 or a tool used in a medical procedure. Instrument model 144 can be configured to detect or recognize any tool or an object, depending on a design, such as a machine tool, an electrical or a mechanical tool, a robotic machine or a feature or any other object or device.
[0063] The data repository 160 can include one or more data files, data structures, arrays, values, or other information that facilitates operation of the data processing system 130. The data repository 160 can include one or more local or distributed databases and can include a database management system. The data repository 160 can include, maintain, or manage a data stream 162. The data stream 162 can include or be formed from one or more of a video stream, image stream, stream of sensor measurements, event stream, or kinematics stream. The data stream 162 can include data collected by one or more data capture devices 110, such as a set of
3D sensors from a variety of angles or vantage points with respect to the procedure activity (e.g., point or area of surgery).
[0064] Data stream 162 can include any stream of data. Data stream 162 can include a video stream, including a series of video frames 164. Video frames 164 can be formed or organized into video fragments, such as video fragments of about 1, 2, 3, 4, 5, 10 or 15 seconds of a video. The video can include, for example, 30, 45, 60, 90 or 120 video frames 164 per second. Data stream 162 can include an event stream which can include a stream of event data or information, such as packets, that identify or convey a state of the robotic medical system 120 or an event that occurred in association with the robotic medical system 120. For example, data stream 162 can include any portion of installation data 122, including information or data on installation, uninstallation, calibration, set up, attachment, detachment or any other action performed by or on an RMS 120 with respect to a medical instrument 112.
[0065] Data stream 162 can include data about an event, such as a state of the robotic medical system 120 indicating whether the medical tool or instrument 112 is calibrated, adjusted or installed on a manipulator arm of the robotic medical system 120. The event stream can include data on whether a robotic medical system 120 was fully functional (e.g., without errors) during the procedure. For example, when a medical instrument 112 is installed on a manipulator arm of the robotic medical system 120, a signal or data packet(s) can be generated indicating that the medical instrument 112 has been installed on the manipulator arm of the robotic medical system 120. The signal can be recorded in the installation data 122 along with a time stamp of the event occurrence.
[0066] Data stream 162 can include a kinematics stream data which can refer to or include data associated with one or more of the manipulator arms or medical tools 112 (e.g., instruments) attached to the manipulator arms, such as arm locations or positioning. Data corresponding to medical tools 112 can be captured or detected by one or more displacement transducers, orientational sensors, positional sensors, or other types of sensors and devices to measure parameters or generate kinematics information. The kinematics data can include sensor data along with time stamps and an indication of the medical tool 112 or type of medical tool 112 associated with the data stream 162.
[0067] Data repository 160 can store video frames 164. Video frame 164 can include a single static image extracted from a sequence of images of a video file. Video frame 164 can
represent a specific moment in time and can be identified by metadata including a time stamp. Video frame 164 can display the visual content of the video at a particular instant. For example, in a video file capturing a robotic surgical procedure, a video frame 164 can depict a snapshot of the surgical task, illustrating a movement or usage of a medical instrument 112 such as a robotic arm manipulating a surgical tool within the patient's body.
[0068] Data repository 160 can store installation data 122. Installation data 122 for an RMS 120 can include any data or information documenting a setup, calibration, configuration, or attachment of a medical instrument 112 by an RMS 120. Installation data 122 can include a system or an installation file that can include or indicate events of installation, attachment, calibration, connection or setup of a medical instrument 112 to a manipulator arm of an RMS 120. For instance, an installation data file can include one or more listings with timestamps and details indicating the timing (e.g., seconds, minutes, hours or dates) when a medical instrument 112 was attached, calibrated, or configured by an RMS 120.
[0069] Processing function 170 can include any combination of hardware and software for processing data or outputs of instrument model 144. Processing function 170 can include the functionality or framework for handling data generated or determined by instrument model 144 and can serve to refine and enhance the information determined by the model. Processing function 170 can provide post-processing adjustments or operate concurrently and together with the instrument model 144. For instance, a processing function 170 can produce a heat map 172 illustrating the predicted locations of medical instruments 112 as determined by the instrument model 144. The heat map 172 can provide a visual representation, highlighting areas where the medical instruments 112 are predicted by the instrument prediction 152 to be present in the video frame 164. Processing function 170 can generate the heat map 172 with respect to particular threshold levels of certainty or confidence that the medical instrument 112 is to be found in a given location. The heat map 172 can include multiple layers corresponding to multiple threshold (e.g., confidence) levels being met.
[0070] Processing function 170 can include the framework or functionality to generate or implement a post-processing frame smoothing 174 for the model outputs. Frame smoothing 174 can include a function or functionality to adjust values of features or characteristics (e.g., labels 148 or instrument predictions 152) of particular video frames 164 based on values of the same features or characteristics on preceding or following video frames 164. For example, the frame smoothing function 174 can make corrections across multiple frames so that a given video frame 164 is labeled coherently with the labeling the model determined for other video frames 164. For example, the frame smoothing 174 function can determine that multiple preceding video frames 164 and multiple following video frames 164 have a particular determination (e.g., label 148 or instrument prediction 152), while the video frame 164 in between them differs from all the other video frames 164 in that regard. In response to this determination, the frame smoothing 174 can correct the instrument prediction 152 or label 148 of the given video frame 164 to conform it to the neighboring video frames 164. In doing so, frame smoothing 174 can improve performance by reducing noise or inconsistencies in predictions across frames, leading to a more coherent and accurate depiction of the presence and movements of medical instruments over a sequence of video frames 164.
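As a non-limiting illustration, frame smoothing 174 could be implemented as a sliding median over the per-frame probabilities output by instrument model 144. The sketch below assumes the probabilities are held in a NumPy array with one row per video frame 164 and one column per instrument type; the function name and window size are assumptions made for this example rather than required elements.

    # Illustrative sketch of frame smoothing 174; array layout, function name
    # and window size are assumptions for this example only.
    import numpy as np

    def smooth_per_frame_labels(probs: np.ndarray, window: int = 5) -> np.ndarray:
        # probs: (num_frames, num_instrument_types) probabilities of presence.
        half = window // 2
        padded = np.pad(probs, ((half, half), (0, 0)), mode="edge")
        smoothed = np.empty_like(probs)
        for i in range(probs.shape[0]):
            # Replace each frame's scores with the median of its neighborhood,
            # so a single outlier frame is conformed to the surrounding frames.
            smoothed[i] = np.median(padded[i:i + window], axis=0)
        return smoothed

Under this sketch, an isolated frame whose label disagrees with its neighbors is pulled toward the neighboring values, consistent with the correction described above.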
[0071] DPS 130 can include an interface 180 designed, constructed and operational to communicate with one or more components of system 100 via network 101, including, for example, the robotic medical system 120 or another device, such as a client’s personal computer. The interface 180 can include a network interface. The interface 180 can include or provide a user interface, such as a graphical user interface. The graphical user interface can include, for example, a window for displaying video frames 164 of a video. Interface 180 can provide data for presentation via a display, such as a display 116, and can depict, illustrate, render, present, or otherwise provide indications 182 indicating determinations (e.g., outputs) of the instrument model 144, such as instrument prediction 152 or labels 148 identifying medical instruments 112.
[0072] The data processing system 130 can interface with, communicate with, or otherwise exchange information with one or more components of system 100 via network 101, including, for example, the robotic medical system 120. The data processing system 130, robotic medical system 120 and devices in the medical environment 102 can each include at least one logic device such as a computing device having a processor to communicate via the network 101. The data processing system 130, robotic medical system 120 or client device coupled to the network 101 can include at least one computation resource, server, processor or memory. For example, the data processing system 130 can include a plurality of computation resources or processors coupled with memory.
[0073] The data processing system 130 can be part of or include a cloud computing environment. The data processing system 130 can include multiple, logically grouped servers and facilitate distributed computing techniques. The logical group of servers may be referred to as a data center, server farm or a machine farm. The servers can also be geographically
dispersed. A data center or machine farm may be administered as a single entity, or the machine farm can include a plurality of machine farms. The servers within each machine farm can be heterogeneous: one or more of the servers or machines can operate according to one or more types of operating system platforms.
[0074] The data processing system 130, or components thereof can include a physical or virtual computer system operatively coupled, or associated with, the medical environment 102. In some embodiments, the data processing system 130, or components thereof can be coupled, or associated with, the medical environment 102 via a network 101, either directly or indirectly through an intermediate computing device or system. The network 101 can be any type or form of network. The geographical scope of the network can vary widely and can include a body area network (BAN), a personal area network (PAN), a local-area network (LAN) (e.g., Intranet), a metropolitan area network (MAN), a wide area network (WAN), or the Internet. The topology of the network 101 can assume any form such as point-to-point, bus, star, ring, mesh, tree, etc. The network 101 can utilize different techniques and layers or stacks of protocols, including, for example, the Ethernet protocol, the internet protocol suite (TCP/IP), the ATM (Asynchronous Transfer Mode) technique, the SONET (Synchronous Optical Networking) protocol, the SDH (Synchronous Digital Hierarchy) protocol, etc. The TCP/IP internet protocol suite can include application layer, transport layer, internet layer (including, e.g., IPv6), or the link layer. The network 101 can be a type of a broadcast network, a telecommunications network, a data communication network, a computer network, a Bluetooth network, or other types of wired and wireless networks.
[0075] The data processing system 130, or components thereof, can be located at least partially at the location of the surgical facility associated with the medical environment 102 or remotely therefrom. Elements of the data processing system 130, or components thereof can be accessible via portable devices such as laptops, mobile devices, wearable smart devices, etc. The data processing system 130, or components thereof, can include other or additional elements that can be considered desirable to have in performing the functions described herein. The data processing system 130, or components thereof, can include, or be associated with, one or more components or functionality of a computing system including, for example, one or more processors coupled with memory that can store instructions, data or commands for implementing the functionalities of the DPS 130 discussed herein.
[0076] In one aspect, the technical solutions can include a system 100 that can include one or more processors (e.g., 610) that can be coupled with memory (e.g., 615 or 620). The
memory 615 or 620 can store instructions, computer code or data that can cause the one or more processors 610 to implement any functionality of a DPS 130, including, for example, any functionality of a ML trainer 140, instrument model 144, processing functions 170 or interface 180. For example, instructions stored in memory 615 or 620 can configure or cause the one or more processors to perform various operations or tasks of the DPS 130.
[0077] The one or more processors 610 can identify a series of video frames 164 of a video of a medical procedure captured by a robotic medical system 120. For example, DPS 130 can receive a real-time stream of incoming video from an ongoing medical procedure performed by a surgeon via an RMS 120. The series of video frames 164 can include frames of a video fragment that can include, for example, 30, 45, 60, 90 or 120 video frames 164 per second. The video fragment can have a length sufficient to determine or recognize an instrument prediction 152 (e.g., a medical instrument 112) or an action or activity performed by the identified instrument. For example, the video fragment can be 1, 2, 3, 4, 5 or 10 seconds long.
[0078] The one or more processors 610 can identify an instrument model 144 trained based at least in part on video frames 164 of a plurality of videos (e.g., videos of previously performed surgeries). Such videos can be captured for one or more (e.g., previously performed) medical procedures that are labeled with data identifying installation (e.g., installation data 122) of one or more medical instruments 112 of one or more robotic medical systems 120. For example, a training dataset 142 can include hundreds, thousands, or more than tens of thousands of various videos that can last many hours and include 30, 45, 60, 90, 120 or more than 120 video frames per second. The videos can be labeled with labels 148 indicative of installation data 122. Labels 148 can indicate, for example, time (e.g., seconds, minutes, hours and dates) of installation, attachment, configuration, setup or other use of medical instruments 112 on an RMS 120.
[0079] The one or more processors 610 can determine, using the instrument model 144, a per-frame label (e.g., 148) for each video frame 164 of the series of video frames 164. The per-frame label 148 can be indicative of a probability of presence of one or more types of medical instruments 112. For example, instrument model 144 can receive an input real-time video stream (e.g., 162) with video frames 164 input into the model for detection or identification of medical instruments 112. Each video frame 164 can be processed or determined individually. In some implementations, a plurality of video frames 164 forming a video fragment can have a single video frame 164 to label. For example, a video fragment of a plurality of video frames 164 spanning, for example, 3 seconds, can include every third, fourth, 10th, 30th, 45th, 90th, 120th, 240th, 360th or any other video frame 164 that can be selected as the representative video frame 164 for the video fragment to label with a label 148. The label 148 can indicate the probability of presence (e.g., a confidence score or certainty level) that any particular medical instrument 112, of a plurality of medical instruments 112, is identified within the video fragment or a video frame 164.
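For illustration only, one way to pick a representative video frame 164 for each fragment is sketched below; the fragment length (90 frames, i.e., about 3 seconds at 30 frames per second) and the choice of the middle frame as the representative are assumptions of this sketch, not requirements of the approach.

    # Hypothetical sketch of choosing one representative frame per fragment.
    def representative_frames(num_frames: int, frames_per_fragment: int = 90):
        # Yields (fragment_index, representative_frame_index) pairs; the middle
        # frame of each fragment is used as the frame to label in this sketch.
        for fragment_index, start in enumerate(range(0, num_frames, frames_per_fragment)):
            end = min(start + frames_per_fragment, num_frames)
            yield fragment_index, start + (end - start) // 2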
[0080] The one or more processors 610 can facilitate or trigger the system 100 to display the video frame 164 or the video fragment on a display, such as a display 116. For example, the system 100 can display, via a graphical user interface (e.g., 180), an indication 182. The indication 182 can include or illustrate a presence of a type of medical instrument 112. The presence of the medical instrument 112 can be determined based at least in part on a time stamp in the video on the per-frame label 148 of the series of video frames 164 determined via the instrument model 144.
[0081] The one or more processors 610 can be configured to receive a set of video frames 164 of the plurality of videos captured for one or more medical procedures. The video frames 164 can include, for example, video frames 164 from various cameras from various locations in medical environment 102. Video frames 164 can be captured by medical instruments 112, such as, for example, an endoscope, which can include an endoscopic camera. The video frames 164 can be used as a training dataset 142 along with each video’s corresponding installation data 122 (e.g., system instrument installation log) with events (e.g., installations, setups, configurations) of various medical instruments 112.
[0082] The one or more processors 610 can be configured to identify, based on a final video frame 164 of the set of video frames 164, a label 148 for the set of video frames 164. The label 148 can include data indicative of a time of installation (e.g., 122) of the one or more medical instruments 112 and a final time stamp of the final video frame of the set of video frames 164. The one or more processors can be configured to facilitate or trigger training of the instrument model 144 using the label 148 for the set of video frames 164.
[0083] The one or more processors 610 can be configured to identify, for the video frames 164 of the plurality of videos, a plurality of labels 148. Each label of the plurality of labels 148 can have a vector of one or more values corresponding to one or more medical instruments 112. For example, each value in the vector of the label 148 can correspond to a probability or confidence that a particular medical instrument type (e.g., 112) is present or identified in the
video frame 164 or a video fragment. The one or more processors 610 can be configured to train the instrument model 144 using the plurality of labels 148.
[0084] The one or more processors 610 can be configured to determine, for the video frames 164 of the plurality of videos, a plurality of labels 148 for the video frames 164. Each label 148 of the plurality of labels 148 can include a value indicative of whether the one or more medical instruments 112 are installed at the one or more robotic medical systems 120 at a time of each respective video frame 164 of the video frames 164. The one or more processors 610 can be configured to train the instrument model 144 using the plurality of labels 148.
[0085] The one or more processors 610 can be configured to identify one or more installation logs (e.g., 122) of the one or more robotic medical systems 120. Each log of the one or more logs (e.g., 122) can indicate a time of installation of the one or more medical instruments 112 for a respective video of the plurality of videos. The one or more processors 610 can be configured to assign, for each video frame 164 of the frames of the plurality of videos, a label 148 of the plurality of labels 148 indicative of the time of installation from a respective log (e.g., 122) of the one or more logs corresponding to the respective video of the plurality of videos.
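A minimal sketch of how such noisy per-frame labels could be derived from an installation log 122 is shown below; the log field names (instrument, installed_at, removed_at) are assumed for illustration and do not reflect a defined log format. A frame receives a positive label for an instrument type if the log indicates the instrument was installed at or before the frame's time stamp, even though the instrument may not actually be visible in the frame, which is the source of the label noise.

    # Hedged sketch: build noisy multi-hot labels from an installation log.
    # The event field names below are assumptions for this example.
    from typing import Dict, List, Sequence

    def labels_from_log(frame_timestamps: Sequence[float],
                        log_events: List[Dict],
                        instrument_types: Sequence[str]) -> List[List[int]]:
        labels = []
        for t in frame_timestamps:
            vector = []
            for instrument in instrument_types:
                # Positive if an installation event precedes the frame and no
                # removal event has occurred yet at the frame's time stamp.
                installed = any(
                    e["instrument"] == instrument
                    and e["installed_at"] <= t
                    and e.get("removed_at", float("inf")) > t
                    for e in log_events
                )
                vector.append(1 if installed else 0)
            labels.append(vector)
        return labels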
[0086] The instrument model 144 can include a transformer neural network model that can apply a first one or more weights 146 to one or more spatial dimensions within an area of an image within a video frame 164 of the video frames 164 of the plurality of videos and a second one or more weights to temporal dimensions across a group of video frames 164 of the video frames 164 of the plurality of videos. For example, weights 146 can emphasize the importance or prioritize one region of a video frame 164 over other regions when particular conditions are met, such as when a particular timing occurs, or a particular feature is detected.
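By way of a non-limiting illustration, a factorized spatial-temporal transformer of the kind described could be sketched in PyTorch as below; the layer sizes, the use of pre-extracted patch features, and the pooling scheme are assumptions made for this example rather than elements of the disclosed model.

    # Hedged sketch of a spatial-temporal transformer; dimensions and layout
    # are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class SpatialTemporalClassifier(nn.Module):
        def __init__(self, num_instruments: int, feat_dim: int = 768,
                     dim: int = 256, heads: int = 8):
            super().__init__()
            self.proj = nn.Linear(feat_dim, dim)
            # Attention over spatial positions within a frame, then attention
            # over time across the frames of a fragment.
            self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.head = nn.Linear(dim, num_instruments)

        def forward(self, patches: torch.Tensor) -> torch.Tensor:
            # patches: (batch, frames, patches_per_frame, feat_dim) features.
            b, t, p, _ = patches.shape
            x = self.proj(patches)
            x = self.spatial(x.reshape(b * t, p, -1)).reshape(b, t, p, -1)
            x = x.mean(dim=2)      # pool spatial tokens for each frame
            x = self.temporal(x)   # attend across frames of the fragment
            return self.head(x)    # per-frame logits, one per instrument type

In this sketch the spatial attention weights act within the area of each frame while the temporal attention weights act across the group of frames, mirroring the two sets of weights 146 described above.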
[0087] The one or more processors 610 can be configured to generate, based at least on the video frames 164 of the plurality of videos, a heat map 172. The heat map 172 can be indicative of an area within a subset of the video frames 164 for which the probability of presence of the type of medical instruments 112 exceeds a threshold for the heat map 172. For example, a heat map 172 can include a highlighted portion of a video frame 164 for which the probability of presence of the medical instrument 112 exceeds a particular threshold (e.g., 75% or 90%). The one or more processors 610 can be configured to identify, based at least on the video frames 164 of the plurality of videos, a second area within the subset of the video frames 164 for which the probability of presence of the type of medical instruments 112 exceeds a
second threshold exceeding the first threshold. The second threshold can correspond to a second heat map 172 to be displayed. For example, the second area can be a region within the first area and the second heat map 172 can cover a region within the first heat map 172. The first or the second areas of the heat map 172 can be overlaid over the video frames being displayed. For instance, the second heat map 172 can be highlighted in a darker or a more pronounced shade than the first heat map 172. The second area can include a certainty or confidence level (e.g., threshold) that is higher than the threshold of the first heat map 172. The one or more processors 610 can be configured to display, via the graphical user interface, the subset of the frames along with an overlay of the heat map 172, which can include any combination of the first area and the second area.
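As one possible illustration, the nested heat-map areas could be computed from a per-pixel presence probability map as sketched below; the threshold values of 0.75 and 0.90 correspond to the example percentages mentioned above, and the integer overlay coding is an assumption of this sketch.

    # Illustrative sketch of a two-level heat map 172; thresholds and the
    # integer overlay coding are assumptions for this example.
    import numpy as np

    def layered_heat_map(presence_map: np.ndarray,
                         low: float = 0.75, high: float = 0.90) -> np.ndarray:
        # presence_map: (H, W) per-pixel probability that the instrument type
        # is present at that location in the video frame.
        overlay = np.zeros(presence_map.shape, dtype=np.uint8)
        overlay[presence_map >= low] = 1   # outer, lower-confidence area
        overlay[presence_map >= high] = 2  # inner area, nested in the first
        return overlay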
[0088] The one or more processors 610 can be configured to compare the probability of presence of each respective video frame 164 of the series of video frames 164 with a threshold for presence of the type of medical instruments 112. The one or more processors 610 can be configured to determine, based at least in part on the comparison, a second per-frame label 148 for each respective frame of the series of video frames 164. The second per-frame label 148 can be indicative of whether the type of medical instruments 112 is present at the robotic medical system 120.
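A compact illustration of deriving the second, binary per-frame label 148 from the probability label is given below; the per-instrument thresholds are assumed values and, as noted elsewhere, could differ for each instrument type.

    # Minimal sketch of comparing per-frame probabilities against thresholds
    # to produce binary presence labels; threshold values are assumptions.
    import numpy as np

    def binarize_labels(probs: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
        # probs: (num_frames, num_instrument_types); thresholds: one value per
        # instrument type. Returns 0/1 labels for presence of each type.
        return (probs >= thresholds).astype(np.uint8)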
[0089] The one or more processors 610 can be configured to identify, from the series of video frames 164 of the video, a subset of the video frames 164 corresponding to a portion of the video capturing the type of the medical instruments 112 used in the medical procedure. The one or more processors 610 can be configured to determine, based on the subset of the video frames 164 input into the model, the respective per-frame label 148 for the subset of the video frames 164.
[0090] The one or more processors 610 can be configured to receive, from a robotic medical system 120, a file comprising an indication of a time of the installation of the one or more medical instruments 112 at the robotic medical system 120. The one or more processors 610 can be configured to determine, based at least on the indication of the time input into the instrument model 144, the respective per-frame label 148 for at least a video frame 164 of the video frames 164.
[0091] The one or more processors 610 can be configured to generate, based at least on the per-frame label for each video frame 164 of the series of video frames 164, a series of per-frame labels. The one or more processors 610 can be configured to adjust a value of a first per-frame label 148 for a first video frame 164 of the series of video frames 164 using at least a second value of a second per-frame label 148 for a second video frame 164 adjacent to the first video frame 164. The one or more processors 610 can be configured to determine, using the instrument model 144, the per-frame label based at least on a time of installation of the one or more medical instruments 112 at the one or more robotic medical systems 120.
[0092] The one or more processors 610 can be configured to determine, based on a comparison of the time stamp and the time of installation, the probability of the presence. The one or more processors 610 can be configured to identify the type of medical instruments 112 based at least on the probability of presence exceeding a threshold for the type of medical instruments 112. The one or more processors 610 can be configured to display the indication identifying the type of medical instruments 112. The indication can be overlaid over a subset of the series of video frames 164 displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds a threshold for the type of medical instruments 112.
[0093] An aspect of the technical solutions can be directed to a non-transitory computer-readable medium storing processor-executable instructions. The instructions can be such that, when executed by one or more processors, they cause the one or more processors to identify a series of frames of a video of a medical procedure captured by a robotic medical system. When executed by one or more processors, the instructions can identify a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled with data identifying installation of one or more instruments of one or more robotic medical systems. When executed by one or more processors, the instructions can determine, using the model, a per-frame label for each frame of the series of frames, the per-frame label indicative of a probability of presence of one or more types of instruments. When executed by one or more processors, the instructions can display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video on the per-frame label of the series of frames determined via the model. The indication can be overlaid over a subset of the series of frames displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds a threshold for the type of instrument.
[0094] FIG. 2 illustrates an example 200 of a graphical user interface 202 providing indications 182 for an ML model processed video frame capturing and identifying medical instruments 112. Graphical user interface 202 can be a type of an interface 180 in which one or
more indications 182 can be provided or illustrated for a user (e.g., a surgeon utilizing an RMS 120). Graphical user interface 202 can include a location or a window for displaying one or more video frames 164, which can include frames of an input video file processed by instrument model 144. The video frames 164 can be frames processed in real-time during an ongoing surgical procedure or correspond to a prior-recorded medical procedure.
[0095] Graphical user interface 202 can include indications 182 of a heat map (e.g., 172) indicative of locations within the video frame 164 in which the instrument model 144 determined medical instruments 112 are present. Video frame 164 can identify the manipulator arms 206 attached to the RMS 120 as the medical instruments 112 used in this instance. A left-side manipulator arm 206A can be shown performing a task alongside a right-side manipulator arm 206B. Duration time 204 can include a window indicating to the user the current time (e.g., time stamp) of the video frame 164 within the surgical video being viewed. Graphical user interface 202 can include options or buttons for user selection, such as a segments 220 button or a select instruments 222 button. Segments 220 can provide the user with additional information on the surgical segments performed. Select instruments 222 can allow the user (e.g., surgeon) to select a particular instrument to manipulate, hold or maneuver.
[0096] Indications 182 can be illustrated in various formats, such as a line or a bar indicative of the presence or absence of a particular instrument type (e.g., manipulator arm 206 or any other medical instrument 112) with respect to the time duration of the video recording. Indications 182 can indicate or identify, such as via color-coded fillings of the line or the bar of the indication, the type of medical instrument 112 identified or present within particular video frames 164 (e.g., time portions of the video). Indications 182 can include or identify phases 208, which can correspond to various phases of the medical procedure, allowing the user to select or scroll to a start of a particular phase. Indications 182 can include or identify steps 210, such as particular tasks or steps in the medical procedure. Video timer 212 can include a time bar allowing the user to temporally scroll along between the video frames 164 of the video, such as for example, select a particular moment in the video by clicking on a point along the time bar.
[0097] FIG. 3 illustrates a system configuration 300 for generating and deploying noisy label-tolerant ML models to detect and provide instrument presence 302 and spatial context 304. Instrument presence 302 can include any output corresponding to classification, recognition or identification of a medical instrument 112, whether a robotic manipulator arm 206 or any tool handled or manipulated by the arm. Instrument presence 302 can include a probability output
that an instrument is present, or a definitive output with or without a probability or confidence score associated. Spatial context 304 can include any information or data on spatial position or orientation of the identified medical instrument or a tool. For example, spatial context 304 can be indicated with respect to a reference point (e.g., a location in a video frame 164 or a medical environment 102) and can correspond to the location of medical instrument 112 (e.g., instrument type) identified in the image or frame.
[0098] For example, a data stream 162 having input video frames 164 can be received by a data processing system 130. Data stream 162 can include video frames 164 of the input video along with installation data 122 of the RMS 120 indicating time stamped events, such as time stamped occurrences of instrument installation, uninstallation, attachment, detachment, calibration or use. The data processing system 130 can be deployed on a server, a computer, on a cloud or across any number of devices or systems (e.g., as a distributed system). DPS 130 can utilize or execute an instrument model 144, such as a spatial-temporal transformer-based neural network model with weights implemented and applied to particular model features to provide an attention mechanism for analyzing input data.
[0099] Using the data stream 162 input into the instrument model 144, DPS 130 can provide outputs of determinations of instrument presence 302, such as instrument predictions 152 that can include per-frame labels 148 indicative of the presence of a particular instrument type. For example, instrument presence 302 can include a value in a vector of per-frame labels 148 indicative of whether a particular type of instrument is present in the video frame 164. Instrument model 144 can also provide spatial context 304, which can include a location of the instrument type determined to be present. For example, spatial context 304 can include a heat map 172 indication that can identify or highlight locations in which the instrument type is determined to be present.
[00100] Turning now to FIG. 4, an example flow diagram of a method 400 for generating and using noisy label-tolerant ML models to detect objects in robotic procedure videos is illustrated. The method 400 can be performed by a system having one or more processors executing computer-readable instructions stored on a memory. The method 400 can be performed, for example, by system 100 and in accordance with any features or techniques discussed in connection with FIGS. 1-3 and 5-6. For instance, the method 400 can be implemented by one or more processors 610 of a computing system 600 executing non-transitory computer-readable instructions stored on a memory (e.g., the memory 615, 620 or 625) and using data from a data repository 160 (e.g., storage device 625).
[00101] The method 400 can be used to train an ML model using data from instrument system logs of a robotic medical system to detect objects in a video recorded procedure performed via a robotic medical system. At operation 405, the method can label video frames using noisy data. At operation 410, the method can train a ML model with the labeled frames. At operation 415, the method can determine a per-frame label for one or more input video frames. At 420, the method can determine whether a per-frame label exceeds a threshold. At 425, based on the operation at 420, the method can determine that the instrument is present. At 430, based on the operation at 420, the method can determine that the instrument is not present. At 435, the method can modify the presence determination per post-processing. At 440, the method can display the video frames with indications of instrument presence.
[00102] At operation 405, the method can label video frames using noisy data. The noisy data can include mismatches or discrepancies between the labels for frames from the system logs indicating the presence or installation of a particular instrument (e.g., instrument type) and the image frame for the same label not depicting the particular instrument (e.g., instrument type). For example, the method can label video frames using labels for the series of video frames, wherein a label of a frame of the labeled frames indicates the presence of the type of instrument in the frame and the frame does not include an image of the type of instrument.
[00103] The method can include using data of system logs (e.g., instrument installation and use log) to label video frames of a training dataset for training an ML model for identification and detection of medical instruments. For example, a data repository can store one or more training data sets including a plurality of videos of a plurality of procedures in which a robotic medical system is used to perform medical operations using medical instruments. Training data sets can include installation data that can include installation files or logs for various medical instruments used in connection with any of the medical procedures captured on the plurality of videos used for training of the ML model.
[00104] The method can include identifying one or more labels (e.g., a plurality of labels) for frames of the plurality of videos used for training of the ML model. Each label of the plurality of labels can include or correspond to a vector of one or more values corresponding to one or more instruments. Each label can correspond to one or more video frames in the one or more video frames used for training of the ML model. For example, a label can correspond to a video fragment providing a plurality of seconds of a video and including a plurality of video frames. The label can indicate presence or non-presence (e.g., absence) of a medical instrument. For example, the label can include a vector with a plurality of entries, each entry corresponding to a medical instrument of a plurality of medical instruments. The label can identify, via the corresponding value for each medical instrument, whether that medical instrument is present or not present in the video.
[00105] The method can include identifying, based on a final frame of the set of frames, a label for the set of frames. The label can include data indicative of a time of installation of the one or more instruments. The label can include a final time stamp of the final frame of the set of frames. The label can indicate an event and a timing of the event, such as the installation, uninstallation, configuration, deconfiguration, attachment, detachment, movement or use of a given medical instrument of the robotic medical system.
[00106] For example, the method can include determining, for the frames of the plurality of videos for training of the ML model, a plurality of labels for the frames. Each label of the plurality of labels can include a value indicative of whether the one or more instruments is installed, configured, attached to or otherwise being used by or at the one or more robotic medical systems at a time of each respective frame of the video frames used for training the ML model.
[00107] At operation 410, the method can train a ML model with the labeled frames. The method can include using the training data set of video frames and the installation data (e.g., logged events of installation, configuration, attachment or use of the medical instruments) to train the ML model for detection, identification and recognition of medical instruments from video data. The method can include, for example, identifying a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled using data. The data can include, for example, information identifying installation, configuration, attachment, calibration or use of one or more instruments of one or more robotic medical systems. The data can include the noisy data, such as data from the labels for frames that are taken out from the system logs of the robotic system. Such data can include the information that mismatches or includes discrepancies between the timestamp indicating the presence or installation of a particular instrument (e.g., instrument type) onto a robotic arm and the image frame in which the particular instrument (e.g., instrument type) may not be shown or displayed. For example, the method can train a ML model using labels for the series of video frames, in which labels of some of the labeled frames indicate the presence of the type of instrument in the frame, whereas those specific labeled frames do not include an image of the type of instrument indicated in the label. Such mismatches or noise in the data can be overcome
using the spatial-temporal transformer-based neural network model in which specific weights can be applied to portions of data having greater importance than other data.
[00108] The data processing system can train the ML model utilizing any ML architecture or framework. For example, the ML model can include a transformer neural network model, a graphical neural network model or any other attention mechanism machine learning model. The ML model can be configured or trained to include, utilize or apply a first one or more weights to one or more spatial dimensions within an area of an image within a frame of the frames of the plurality of videos. The ML model can be configured or trained to include, utilize or apply a second one or more weights to temporal dimensions across a group of frames of the frames of the plurality of videos. The weights can configure or set up the model to focus on, emphasize or otherwise implement an attention mechanism, on a particular set or combination of features encountered in the input data (e.g., video frames or installation data) of the video to be processed.
[00109] The method can include identifying and using one or more logs of the one or more robotic medical systems to train the ML model. Each log of the one or more logs can indicate a time of installation, attachment, configuration or use of the one or more instruments for a respective video of the plurality of videos. The method can assign, for each frame of the frames of the plurality of videos for training the ML model, a label of the plurality of labels indicative of the time of installation, configuration, attachment or use of the medical instrument from a respective log of the one or more logs corresponding to the respective video of the plurality of videos. The method can include receiving a set of frames of the plurality of videos captured for one or more medical procedures and training the model using the label for the set of frames identified at operation 405. The method can include training the model using the plurality of labels identified or determined at operation 405.
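For illustration, training the model on such noisy multi-hot labels could follow an ordinary multi-label training loop as sketched below; the optimizer, loss function, and data layout are common choices assumed for this example (for instance, applied to a model such as the spatial-temporal classifier sketched earlier) rather than elements prescribed by the method.

    # Hedged sketch of one training epoch on noisy multi-hot per-frame labels.
    # Optimizer, loss, and data layout are assumptions for this example.
    import torch
    import torch.nn as nn

    def train_one_epoch(model, loader, optimizer, device="cuda"):
        criterion = nn.BCEWithLogitsLoss()  # multi-label presence per frame
        model.train()
        for patches, labels in loader:
            # patches: (batch, frames, patches_per_frame, feat_dim) features;
            # labels: (batch, frames, num_instrument_types) noisy 0/1 entries.
            patches = patches.to(device)
            labels = labels.to(device).float()
            logits = model(patches)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()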
[00110] At operation 415, the method can determine a per-frame label for one or more input video frames. The method can include using the ML model trained at operation 410 to determine a per-frame label for one or more input video frames of a video of a medical operation to be processed for presence and recognition of the medical instruments using the ML model. For example, a data processing system can receive a video of a medical procedure to be processed for identification and detection of medical instruments. The data processing system can receive or access a video of a medical procedure stored in a data repository. The method can include identifying a series of frames of a video of a medical procedure captured by a robotic medical system. The series of frames can correspond to a video of a previously
performed medical procedure or a video streamed in real-time and corresponding to an ongoing surgical procedure.
[00111] The method can include using the ML model to determine a per-frame label for each frame of the series of frames. The per-frame label can be indicative of a probability of presence of one or more types of instruments. The per-frame label can include an individual label for each individual frame of an input video received for processing or can include a label for a plurality of consecutive video frames of the input video. The method can use the ML model to determine the per-frame label based at least on a time of installation, configuration, attachment or use of the one or more instruments at the one or more robotic medical systems. The method can use the ML model to determine, based on a comparison of the time stamp and the time of installation, the probability of the presence.
[00112] The method can include identifying, from the series of frames of the video, a subset of the frames corresponding to a portion of the video capturing the type of the instrument used in the medical procedure. For example, the ML model can identify a video fragment comprising any number of consecutive video frames (e.g., 30, 45, 60, 90, 120, 150, 180, 210 video frames) that can span any number of seconds of a video (e.g., 2, 3, 4, 5 seconds or more than 5 seconds). The method can determine a per-frame label for the video fragment. The per- frame label can identify, based on the attention mechanism (e.g., weights) of the ML model, the per-frame label for the video frame of the plurality of video frames. The ML model can determine, based on the subset of the frames input into the model, the respective per frame label for the subset of the frames.
[00113] At operation 420, the method can determine whether a per-frame label exceeds a threshold. The method can include comparing a value of the per-frame label corresponding to a probability or a confidence level that a given medical instrument is present in the video against a threshold of the probability of presence. The threshold can be any threshold for the probability of presence acceptable to determine that the medical instrument is present in the video (e.g., 90% certainty or confidence, 95%, 99%, 99.5% or greater than 99.5%). For example, the per-frame label can include a value indicative of a probability of presence that is greater than a threshold or a value indicative of a probability that is not greater than the threshold (e.g., less than the threshold).
[00114] The method can include comparing the probability of presence for each respective frame of the series of frames with a threshold for presence of the given type of instrument. The
threshold can be the same for all instrument types or can vary based on the instrument type. The threshold can be the same for per-frame labels for each of the frames or can vary for different frames. For example, a higher threshold can be used for a per-frame label for a video fragment of a plurality of video frames than for per-frame labels for each video frame individually. The method can receive, from a robotic medical system, a file comprising an indication of a time of the installation of the one or more instruments at the robotic medical system. The method can determine, based at least on the indication of the time input into the model, the respective per-frame label for at least a frame of the frames.
[00115] At 425, based on the operation at 420, the method can determine that the instrument is present. Based on a determination that the per-frame label exceeds a threshold at operation 420, the method can determine that a particular type of instrument is present in the video frame or video fragment. For example, if a confidence score or a probability of determination of the presence of the medical instrument in a video frame or a fragment represented by a per-frame label is greater than a threshold, the method can create a value for the vector of the per-frame label indicative of the instrument presence.
[00116] At 430, based on the operation at 420, the method can determine that the instrument is not present. For example, if a value in the per-frame label at operation 420 does not exceed the threshold, the ML model can determine that the instrument is not present. For example, if a confidence score or a probability of determination of the presence of the medical instrument in a video frame or a fragment represented by a per-frame label is not greater than the threshold at operation 420, the method can create a value for the vector of the per-frame label indicative that the instrument is not present.
[00117] At 435, the method can modify the presence determination per post-processing. The method can include adjustments or corrections to determinations at operations 425 or 430, based on post-processing functions, such as frame smoothing or filtering of outliers. For example, the method can monitor a series of per-frame labels of a series of consecutive video frames in which each of the frames, except one, has a particular determination of presence or non-presence of a medical instrument exceeding a particular threshold. The method can identify, within such a series of frames, a single frame that includes a determination that contradicts (e.g., opposite to) determinations of all surrounding frames. A processing function can, in response to this identification, make a correction to the outlier per-frame label to conform the determination to the determinations of all surrounding per-frame labels.
[00118] For example, the method can include determining, based at least in part on the comparison at operation 420, a second per-frame label for each respective frame of the series of frames indicative of whether the type of instrument is present at the robotic medical system. The method can generate, based at least on the per-frame label for each frame of the series of frames, a series of per-frame labels. The method can adjust a value of a first per-frame label for a first frame of the series of frames using at least a second value of a second per-frame label for a second frame adjacent to the first frame.
[00119] The method can generate heat maps for identifying or highlighting areas in the video frames corresponding to locations in which medical instruments are detected. For example, the method can generate, based at least on the frames of the plurality of videos, a heat map indicative of an area within a subset of the frames for which the probability of presence of the type of instrument exceeds a threshold for the heat map. The method can identify, based at least on the frames of the plurality of videos, a second area within the subset of the frames for which the probability of presence of the type of instrument exceeds a second threshold exceeding the first threshold. The second area of the second heat map can be within the area of the first heat map, thereby indicating a higher probability area (e.g., more pronounced highlighting) for the video frame to be displayed.
[00120] At 440, the method can display the video frames with indications of instrument presence. The method can include displaying, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video on the per- frame label of the series of frames determined via the model. For example, the method can include the one or more processors displaying the indication identifying the type of instrument determined at operation 425. The indication can be overlaid over a subset of the series of frames displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds a threshold for the type of instrument. The indication can include heat maps that can be displayed to highlight locations of the instruments.
[00121] FIG. 5 depicts a surgical system 500, in accordance with some embodiments. The surgical system 500 may be an example of the medical environment 102. The surgical system 500 may include a robotic medical system 505 (e.g., the robotic medical system 120), a user control system 510, and an auxiliary system 515 communicatively coupled one to another. A visualization tool 520 (e.g., the visualization tool 114) may be connected to the auxiliary system 515, which in turn may be connected to the robotic medical system 505. Thus, when the visualization tool 520 is connected to the auxiliary system 515 and this auxiliary system is
connected to the robotic medical system 505, the visualization tool may be considered connected to the robotic medical system. In some embodiments, the visualization tool 520 may additionally or alternatively be directly connected to the robotic medical system 505.
[00122] The surgical system 500 may be used to perform a computer-assisted medical procedure on a patient 525. In some embodiments, the surgical team may include a surgeon 530A and additional medical personnel 530B-530D, such as a medical assistant, nurse, anesthesiologist, and other suitable team members who may assist with the surgical procedure or medical session. The medical session may include the surgical procedure being performed on the patient 525, as well as any pre-operative (e.g., which may include setup of the surgical system 500, including preparation of the patient 525 for the procedure), and post-operative (e.g., which may include clean up or post care of the patient), or other processes during the medical session. Although described in the context of a surgical procedure, the surgical system 500 may be implemented in a non-surgical procedure, or other types of medical procedures or diagnostics that may benefit from the accuracy and convenience of the surgical system.
[00123] The robotic medical system 505 can include a plurality of manipulator arms 535A-535D to which a plurality of medical tools (e.g., the medical tool 112) can be coupled or installed. Each medical tool can be any suitable surgical tool (e.g., a tool having tissue-interaction functions), imaging device (e.g., an endoscope, an ultrasound tool, etc.), sensing instrument (e.g., a force-sensing surgical instrument), diagnostic instrument, or other suitable instrument that can be used for a computer-assisted surgical procedure on the patient 525 (e.g., by being at least partially inserted into the patient and manipulated to perform a computer-assisted surgical procedure on the patient). Although the robotic medical system 505 is shown as including four manipulator arms (e.g., the manipulator arms 535A-535D), in other embodiments, the robotic medical system can include greater than or fewer than four manipulator arms. Further, not all manipulator arms can have a medical tool installed thereto at all times of the medical session. Moreover, in some embodiments, a medical tool installed on a manipulator arm can be replaced with another medical tool as suitable.
[00124] One or more of the manipulator arms 535A-535D and/or the medical tools attached to manipulator arms can include one or more displacement transducers, orientational sensors, positional sensors, and/or other types of sensors and devices to measure parameters and/or generate kinematics information. One or more components of the surgical system 500 can be configured to use the measured parameters and/or the kinematics information to track (e.g.,
determine poses of) and/or control the medical tools, as well as anything connected to the medical tools and/or the manipulator arms 535A-535D.
[00125] The user control system 510 can be used by the surgeon 530A to control (e.g., move) one or more of the manipulator arms 535A-535D and/or the medical tools connected to the manipulator arms. To facilitate control of the manipulator arms 535A-535D and track progression of the medical session, the user control system 510 can include a display (e.g., the display 116 or 1130) that can provide the surgeon 530A with imagery (e.g., high-definition 3D imagery) of a surgical site associated with the patient 525 as captured by a medical tool (e.g., the medical tool 112, which can be an endoscope) installed to one of the manipulator arms 535A-535D. The user control system 510 can include a stereo viewer having two or more displays where stereoscopic images of a surgical site associated with the patient 525 and generated by a stereoscopic imaging system can be viewed by the surgeon 530A. In some embodiments, the user control system 510 can also receive images from the auxiliary system 515 and the visualization tool 520.
[00126] The surgeon 530A can use the imagery displayed by the user control system 510 to perform one or more procedures with one or more medical tools attached to the manipulator arms 535A-535D. To facilitate control of the manipulator arms 535A-535D and/or the medical tools installed thereto, the user control system 510 can include a set of controls. These controls can be manipulated by the surgeon 530A to control movement of the manipulator arms 535A- 535D and/or the medical tools installed thereto. The controls can be configured to detect a wide variety of hand, wrist, and finger movements by the surgeon 530A to allow the surgeon to intuitively perform a procedure on the patient 525 using one or more medical tools installed to the manipulator arms 535A-535D.
[00127] The auxiliary system 515 can include one or more computing devices configured to perform processing operations within the surgical system 500. For example, the one or more computing devices can control and/or coordinate operations performed by various other components (e.g., the robotic medical system 505, the user control system 510) of the surgical system 500. A computing device included in the user control system 510 can transmit instructions to the robotic medical system 505 by way of the one or more computing devices of the auxiliary system 515. The auxiliary system 515 can receive and process image data representative of imagery captured by one or more imaging devices (e.g., medical tools) attached to the robotic medical system 505, as well as other data stream sources received from the visualization tool. For example, one or more image capture devices (e.g., the image capture
devices 110) can be located within the surgical system 500. These image capture devices can capture images from various viewpoints within the surgical system 500. These images (e.g., video streams) can be transmitted to the visualization tool 520, which can then pass through those images to the auxiliary system 515 as a single combined data stream. The auxiliary system 515 can then transmit the single video stream (including any data stream received from the medical tool(s) of the robotic medical system 505) to present on a display (e.g., the display 116) of the user control system 510.
[00128] In some embodiments, the auxiliary system 515 can be configured to present visual content (e.g., the single combined data stream) to other team members (e.g., the medical personnel 530B-530D) who might not have access to the user control system 510. Thus, the auxiliary system 515 can include a display 540 configured to display one or more user interfaces, such as images of the surgical site, information associated with the patient 525 and/or the surgical procedure, and/or any other visual content (e.g., the single combined data stream). In some embodiments, display 540 can be a touchscreen display and/or include other features to allow the medical personnel 530A-530D to interact with the auxiliary system 515.
[00129] The robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled one to another in any suitable manner. For example, in some embodiments, the robotic medical system 505, the user control system 510, and the auxiliary system 515 can be communicatively coupled by way of control lines 545, which can represent any wired or wireless communication link that can serve a particular implementation. Thus, the robotic medical system 505, the user control system 510, and the auxiliary system 515 can each include one or more wired or wireless communication interfaces, such as one or more local area network interfaces, Wi-Fi network interfaces, cellular interfaces, etc. It is to be understood that the surgical system 500 can include other or additional components or elements that can be needed or considered desirable to have for the medical session for which the surgical system is being used.
[00130] FIG. 6 depicts an example block diagram of an example computer system 600, in accordance with some embodiments. The computer system 600 can be any computing device used herein and can include or be used to implement a data processing system or its components. The computer system 600 includes at least one bus 605 or other communication component or interface for communicating information between various elements of the computer system. The computer system further includes at least one processor 610 or processing circuit coupled to the bus 605 for processing information. The computer
system 600 also includes at least one main memory 615, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 605 for storing information, and instructions to be executed by the processor 610. The main memory 615 can be used for storing information during execution of instructions by the processor 610. The computer system 600 can further include at least one read only memory (ROM) 620 or other static storage device coupled to the bus 605 for storing static information and instructions for the processor 610. A storage device 625, such as a solid-state device, magnetic disk or optical disk, can be coupled to the bus 605 to persistently store information and instructions.
[00131] The computer system 600 can be coupled via the bus 605 to a display 630, such as a liquid crystal display, or active-matrix display, for displaying information. An input device 635, such as a keyboard or voice interface can be coupled to the bus 605 for communicating information and commands to the processor 610. The input device 635 can include a touch screen display (e.g., the display 630). The input device 635 can also include a cursor control, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 610 and for controlling cursor movement on the display 630.
[00132] The processes, systems and methods described herein can be implemented by the computer system 600 in response to the processor 610 executing an arrangement of instructions contained in the main memory 615. Such instructions can be read into the main memory 615 from another computer-readable medium, such as the storage device 625. Execution of the arrangement of instructions contained in the main memory 615 causes the computer system 600 to perform the illustrative processes described herein. One or more processors in a multiprocessing arrangement can also be employed to execute the instructions contained in the main memory 615. Hard-wired circuitry can be used in place of or in combination with software instructions together with the systems and methods described herein. Systems and methods described herein are not limited to any specific combination of hardware circuitry and software.
[00133] Although an example computing system has been described in FIG. 6, the subject matter including the operations described in this specification can be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
[00134] The herein described subject matter sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are illustrative, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being “operably couplable,” to each other to achieve the desired functionality. Specific examples of operably couplable include but are not limited to physically mateable or physically interacting components or wirelessly interactable or wirelessly interacting components or logically interacting or logically interactable components.
[00135] With respect to the use of plural or singular terms herein, those having skill in the art can translate from the plural to the singular or from the singular to the plural as is appropriate to the context or application. The various singular/plural permutations can be expressly set forth herein for sake of clarity.
[00136] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims), are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.).
[00137] Although the figures and description can illustrate a specific order of method steps, the order of such steps can differ from what is depicted and described, unless specified differently above. Also, two or more steps can be performed concurrently or with partial concurrence, unless specified differently above. Such variation can depend, for example, on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations of the described methods can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.
[00138] It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation, no such intent is present. For example, as an aid to understanding, the following appended claims can contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to inventions containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” or “an” should typically be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, typically means at least two recitations, or two or more recitations).
[00139] Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general, such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
[00140] Further, unless otherwise noted, the use of the words “approximate,” “about,” “around,” “substantially,” etc., means plus or minus ten percent.
[00141] The foregoing description of illustrative implementations has been presented for purposes of illustration and of description. It is not intended to be exhaustive or limiting with
respect to the precise form disclosed, and modifications and variations are possible in light of the above teachings or can be acquired from practice of the disclosed implementations. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.
Claims
1. A system, comprising: one or more processors, coupled with memory, to: identify a series of frames of a video of a medical procedure captured by a robotic medical system; identify a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled with data identifying installation of one or more instruments of one or more robotic medical systems; determine, using the model, a per-frame label for each frame of the series of frames, the per-frame label indicative of a probability of presence of one or more types of instruments; and display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model.
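By way of illustration only (this sketch is not part of the claims and is not asserted as the actual implementation), the identify, infer, and display flow of claim 1 could look roughly as follows in Python; the instrument list, the `model` callable returning per-type probabilities, and the 0.5 display threshold are hypothetical assumptions introduced solely for this example.

```python
import numpy as np

# Hypothetical instrument types; the disclosure does not fix a specific list.
INSTRUMENT_TYPES = ["needle_driver", "grasper", "scissors"]

def per_frame_labels(frames, model):
    """Run the trained model on each frame; each per-frame label is a vector of
    presence probabilities, one entry per instrument type."""
    return np.stack([model(frame) for frame in frames])   # (num_frames, num_types)

def indications(labels, timestamps, threshold=0.5):
    """Yield (time stamp, instrument type, probability) for display whenever the
    per-frame probability exceeds the display threshold."""
    for ts, probs in zip(timestamps, labels):
        for idx, prob in enumerate(probs):
            if prob > threshold:
                yield ts, INSTRUMENT_TYPES[idx], float(prob)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.random((224, 224, 3)) for _ in range(4)]        # stand-in video frames
    timestamps = [0.0, 0.033, 0.067, 0.100]                       # seconds into the video
    fake_model = lambda frame: rng.random(len(INSTRUMENT_TYPES))  # stand-in for the trained model
    for ts, name, prob in indications(per_frame_labels(frames, fake_model), timestamps):
        print(f"t={ts:.3f}s: {name} present (p={prob:.2f})")
```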
2. The system of claim 1, wherein the one or more processors are further configured to: receive a set of frames of the plurality of videos captured for one or more medical procedures; identify, based on a final frame of the set of frames, a label for the set of frames, the label including data indicative of a time of installation of the one or more instruments and a final time stamp of the final frame; and train the model using the label for the set of frames.
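A minimal sketch of the set-level labeling of claim 2, under the assumption that the installation log can be reduced to a mapping from instrument type to installation time in seconds; the dictionary structure and values are hypothetical, not taken from the disclosure. A set of frames is labeled from its final frame: an instrument counts as present when the log shows it installed at or before the final time stamp.

```python
def label_for_frame_set(final_timestamp, installation_times):
    """Label a set of frames from its final frame: an instrument is marked present
    when the log shows it installed at or before the final time stamp."""
    return {name: int(t_install <= final_timestamp)
            for name, t_install in installation_times.items()}

# Hypothetical installation log entries, in seconds from the start of the video.
log = {"needle_driver": 12.0, "grasper": 95.5, "scissors": 300.0}
print(label_for_frame_set(final_timestamp=120.0, installation_times=log))
# -> {'needle_driver': 1, 'grasper': 1, 'scissors': 0}
```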
3. The system of claim 1, wherein the one or more processors are further configured to: identify, for the frames of the plurality of videos, a plurality of labels, each label of the plurality of labels having a vector of one or more values corresponding to one or more instruments; and train the model using the plurality of labels.
4. The system of claim 1, wherein the one or more processors are further configured to: determine, for the frames of the plurality of videos, a plurality of labels for the frames, each label of the plurality of labels having a value indicative of whether the one or more instruments is installed at the one or more robotic medical systems at a time of each respective
frame of the frames; and train the model using the plurality of labels.
5. The system of claim 1, wherein the one or more processors are further configured to: identify one or more logs of the one or more robotic medical systems, each log of the one or more logs indicating a time of installation of the one or more instruments for a respective video of the plurality of videos; and assign, for each frame of the frames of the plurality of videos, a label of a plurality of labels indicative of the time of installation from a respective log of the one or more logs corresponding to the respective video of the plurality of videos.
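Claim 5 extends this to per-frame labels drawn from the robotic system logs. A hedged sketch, again assuming a simple instrument-to-installation-time mapping: every frame whose time stamp falls at or after an instrument's logged installation time receives a positive label for that instrument, which is exactly why the labels are noisy: an installed instrument need not be visible in every such frame.

```python
def per_frame_noisy_labels(frame_timestamps, installation_log):
    """Assign each frame a label vector derived from the robotic system's log.
    The labels are noisy: an installed instrument need not be visible in a frame."""
    names = sorted(installation_log)
    return [[int(installation_log[name] <= ts) for name in names]
            for ts in frame_timestamps]

installation_log = {"grasper": 30.0, "scissors": 200.0}   # hypothetical times (seconds)
print(per_frame_noisy_labels([0.0, 60.0, 240.0], installation_log))
# -> [[0, 0], [1, 0], [1, 1]]
```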
6. The system of claim 1, wherein the model includes a spatial-temporal neural network model that applies a first one or more weights to one or more spatial dimensions within an area of an image within a frame of the frames of the plurality of videos and a second one or more weights to temporal dimensions across a group of frames of the frames of the plurality of videos.
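Claim 6 describes separate spatial and temporal weights. One plausible (but purely illustrative) realization is divided space-time attention in the style of video transformers; the block below is a generic PyTorch sketch, with all dimensions, layer counts, and normalization choices assumed rather than taken from the disclosure.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative divided space-time attention: one attention module (spatial weights)
    mixes patches within each frame; a second (temporal weights) mixes the same patch
    position across a group of frames."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # Spatial attention within each frame.
        s = x.reshape(b * t, p, d)
        s_n = self.norm_s(s)
        s = s + self.spatial(s_n, s_n, s_n)[0]
        x = s.reshape(b, t, p, d)
        # Temporal attention across frames at each patch position.
        m = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        m_n = self.norm_t(m)
        m = m + self.temporal(m_n, m_n, m_n)[0]
        return m.reshape(b, p, t, d).permute(0, 2, 1, 3)

# Toy input: 2 clips, 8 frames, 16 patches per frame, 64-dim patch embeddings.
block = DividedSpaceTimeBlock()
print(block(torch.randn(2, 8, 16, 64)).shape)  # torch.Size([2, 8, 16, 64])
```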
7. The system of claim 1, wherein the one or more processors are further configured to: generate, based at least on the model, a heat map indicative of an area within a subset of the frames for which the probability of presence of the type of instrument exceeds a threshold for the heat map.
8. The system of claim 7, wherein the one or more processors are further configured to: identify, based at least on the model, a second area within the subset of the frames for which the probability of presence of the type of instrument exceeds a second threshold exceeding the threshold; and display, via the graphical user interface, the subset of the frames with an overlay of the heat map of at least one of the area and the second area.
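For the heat map of claims 7 and 8, assuming a per-pixel (or per-region) presence probability map has already been derived from the model (how it is derived is not shown here), a two-threshold overlay mask could be computed as in the following sketch; the threshold values are arbitrary assumptions.

```python
import numpy as np

def two_level_heat_map(prob_map, threshold=0.5, second_threshold=0.8):
    """Build an overlay mask from a per-pixel presence probability map:
    0 = below threshold, 1 = exceeds the threshold, 2 = exceeds the higher threshold."""
    overlay = np.zeros_like(prob_map, dtype=np.uint8)
    overlay[prob_map > threshold] = 1
    overlay[prob_map > second_threshold] = 2
    return overlay

rng = np.random.default_rng(1)
prob_map = rng.random((4, 4))        # stand-in per-pixel probabilities for one frame
print(two_level_heat_map(prob_map))
```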
9. The system of claim 1, wherein the one or more processors are further configured to: compare the probability of presence for each respective frame of the series of frames with a threshold for presence of the type of instrument; and determine, based at least in part on the comparison, a second per-frame label for each respective frame of the series of frames indicative of whether the type of instrument is present at the robotic medical system.
10. The system of claim 1, wherein the one or more processors are further configured to: identify, from the series of frames of the video, a subset of the frames corresponding to a portion of the video capturing the type of instrument used in the medical procedure; and determine, based on the subset of the frames input into the model, the respective per-frame label for the subset of the frames.
11. The system of claim 1, wherein the one or more processors are further configured to: receive, from a robotic medical system, a file comprising an indication of a time of the installation of the one or more instruments at the robotic medical system; and determine, based at least on the indication of the time input into the model, the respective per-frame label for at least a frame of the frames.
12. The system of claim 1, wherein the one or more processors are further configured to: generate, based at least on the per-frame label for each frame of the series of frames, a series of per-frame labels; and adjust a value of a first per-frame label for a first frame of the series of frames using at least a second value of a second per-frame label for a second frame adjacent to the first frame.
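Claim 12 recites adjusting a frame's label value using an adjacent frame's value. A minimal sketch of one way to do that is a weighted average over immediate neighbours, which damps single-frame flicker in the per-frame probabilities; the specific 1-2-1 weighting is an assumption for illustration only.

```python
def smooth_adjacent(per_frame_probs, weight=0.25):
    """Adjust each frame's probability using its neighbours: a simple weighted
    average that damps single-frame flicker while keeping the endpoints fixed."""
    smoothed = list(per_frame_probs)
    for i in range(1, len(per_frame_probs) - 1):
        smoothed[i] = (weight * per_frame_probs[i - 1]
                       + (1 - 2 * weight) * per_frame_probs[i]
                       + weight * per_frame_probs[i + 1])
    return smoothed

# A spurious single-frame dip is partially pulled back toward its neighbours.
print(smooth_adjacent([0.9, 0.9, 0.1, 0.9, 0.9]))   # approximately [0.9, 0.7, 0.5, 0.7, 0.9]
```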
13. The system of claim 1, wherein the one or more processors are further configured to: determine, using the model, the per-frame label based at least on a time of installation of the one or more instruments at the one or more robotic medical systems; and determine, based on a comparison of the time stamp and the time of installation, the probability of the presence.
14. The system of claim 1, wherein the one or more processors are further configured to: identify the type of instrument based at least on the probability of presence exceeding a threshold for the type of instrument; and display the indication identifying the type of instrument.
15. The system of claim 1, wherein the indication is overlaid over a subset of the series of frames displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds a threshold for the type of instrument.
16. A method, comprising: labeling, by one or more processors coupled with memory, frames of a plurality of videos captured for one or more medical procedures using data identifying installation of one or more types of instruments of one or more robotic medical systems; training, by the one or more processors, a model using the labeled frames; determining, using the model, a per-frame label for each frame of a series of frames of a video of a medical procedure, the per-frame label indicative of a probability of presence of the one or more types of instruments; and displaying, via a graphical user interface, an indication of a presence of a type of instrument.
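The training step of claim 16 can be pictured with a deliberately tiny stand-in: a per-frame multi-label classifier fitted to noisy presence labels with a binary cross-entropy loss. The feature dimensions, the random stand-in data, and the optimizer settings below are all assumptions; in the disclosure the model would be the spatial-temporal network operating on video frames rather than this toy classifier.

```python
import torch
import torch.nn as nn

# Stand-in data: 16 clips x 8 frames, each frame reduced to a 32-dim feature vector,
# with noisy multi-label targets (3 instrument types) as would come from installation logs.
torch.manual_seed(0)
features = torch.randn(16, 8, 32)                          # (clips, frames, feature_dim)
noisy_labels = torch.randint(0, 2, (16, 8, 3)).float()     # (clips, frames, instrument types)

# Deliberately tiny per-frame classifier standing in for the full video model.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                           # multi-label presence per frame

for epoch in range(3):
    optimizer.zero_grad()
    logits = model(features)                               # (clips, frames, instrument types)
    loss = loss_fn(logits, noisy_labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```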
17. The method of claim 16, comprising: determining, by the one or more processors, the presence of the type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model, wherein a label of a frame of the labeled frames indicates the presence of the type of instrument and the frame does not include an image of the type of instrument.
18. The method of claim 16, comprising: receiving, by the one or more processors, a set of frames of the plurality of videos captured for one or more medical procedures; identifying, by the one or more processors, based on a final frame of the set of frames, a label for the set of frames, the label including data indicative of a time of installation of the one or more instruments and a final time stamp of the final frame; and training, by the one or more processors, the model using the label for the set of frames.
19. A non-transitory computer-readable medium storing processor executable instructions, that when executed by one or more processors, cause the one or more processors to: identify a series of frames of a video of a medical procedure captured by a robotic medical system; identify a model trained based at least in part on frames of a plurality of videos captured for one or more medical procedures that are labeled with data identifying installation of one or more instruments of one or more robotic medical systems; determine, using the model, a per-frame label for each frame of the series of frames, the
per-frame label indicative of a probability of presence of one or more types of instruments; and display, via a graphical user interface, an indication of a presence of a type of instrument based at least in part on a time stamp in the video and on the per-frame label of the series of frames determined via the model.
20. The non-transitory computer-readable medium of claim 19, wherein the indication is overlaid over a subset of the series of frames displayed on the graphical user interface, the subset of the series having the probability of presence that exceeds a threshold for the type of instrument.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363611636P | 2023-12-18 | 2023-12-18 | |
| US63/611,636 | 2023-12-18 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025136978A1 (en) | 2025-06-26 |
Family
ID=94384236
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/060572 (WO2025136978A1, pending) | Surgical instrument presence detection with noisy label machine learning | 2023-12-18 | 2024-12-17 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025136978A1 (en) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210059758A1 (en) * | 2019-08-30 | 2021-03-04 | Avent, Inc. | System and Method for Identification, Labeling, and Tracking of a Medical Instrument |
| US20230177703A1 (en) * | 2021-12-08 | 2023-06-08 | Verb Surgical Inc. | Tracking multiple surgical tools in a surgical video |
Non-Patent Citations (3)
| Title |
|---|
| ANEEQ ZIA ET AL: "Surgical tool classification and localization: results and methods from the MICCAI 2022 SurgToolLoc challenge", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 May 2023 (2023-05-11), XP091506060 * |
| CHINEDU INNOCENT NWOYE ET AL: "CholecTriplet2022: Show me a tool and tell me the triplet -- an endoscopic vision challenge for surgical action triplet detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 July 2023 (2023-07-14), XP091563862, DOI: 10.1016/J.MEDIA.2023.102888 * |
| RODRIGUES MARK ET AL: "Surgical Tool Datasets for Machine Learning Research: A Survey", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 130, no. 9, 1 September 2022 (2022-09-01), New York, pages 2222 - 2248, XP093231833, ISSN: 0920-5691, Retrieved from the Internet <URL:https://link.springer.com/content/pdf/10.1007/s11263-022-01640-6.pdf> [retrieved on 20250225], DOI: 10.1007/s11263-022-01640-6 * |
Similar Documents
| Publication | Title |
|---|---|
| US12114949B2 | Surgical system with training or assist functions |
| US20240225447A1 | Dynamic self-learning medical image method and system |
| Padoy | Machine and deep learning for workflow recognition during surgery |
| US20200226751A1 | Surgical workflow and activity detection based on surgical videos |
| CN112784672B | Computer vision-based surgical scene assessment |
| CN116075901A | System and method for processing medical data |
| US20220392084A1 | Scene perception systems and methods |
| US20230410499A1 | Visibility metrics in multi-view medical activity recognition systems and methods |
| EP4355247B1 | Joint identification and pose estimation of surgical instruments |
| WO2025136978A1 | Surgical instrument presence detection with noisy label machine learning |
| Srinivasan et al. | A Real-Time AI-Driven Surgical Monitoring Platform Using Robotics, 3D Convolutional Neural Networks (3D-CNNs), and Bayesian Optimization for Enhanced Precision |
| US20250325335A1 | User interface framework for annotation of medical procedures |
| WO2025129013A1 | Machine learning based medical procedure identification and segmentation |
| WO2025194117A1 | Interaction detection between robotic medical instruments and anatomical structures |
| Torres et al. | Deep EYE-CU (decu): Summarization of patient motion in the ICU |
| US20250006372A1 | Machine learning based medical procedure analysis with interpretable model confidence rankings |
| WO2025255531A2 | Performance based guidance platform for robotic medical systems |
| WO2025184409A1 | End to end automated data quality checking |
| WO2025006777A1 | Systems and methods for universal phase recognition for intraoperative and postoperative applications |
| WO2025006764A1 | System and method for detecting and removing non-surgical data |
| WO2025184368A1 | Anatomy based force feedback and instrument guidance |
| WO2025245005A1 | Endoscopic surgical navigation with a 3d surgical workspace |
| WO2025085441A1 | Systems and methods of machine learning based spatial layout optimization via 3d reconstructions |
| CN116508070A | Visibility measurement in multi-view medical activity recognition system and method |
| CN114245911A | Video-Based Continuous Product Inspection |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24846837; Country of ref document: EP; Kind code of ref document: A1 |