WO2025194117A1

WO2025194117A1 - Interaction detection between robotic medical instruments and anatomical structures

Info

Publication number: WO2025194117A1
Application number: PCT/US2025/020056
Authority: WO
Inventors: Rui Guo; Marzieh ERSHAD LANGROODI; Xi Liu; Conor Perreault; Benjamin Mueller; Anthony M. Jarc
Original assignee: Intuitive Surgical Operations Inc
Current assignee: Intuitive Surgical Operations Inc
Priority date: 2024-03-15
Filing date: 2025-03-14
Publication date: 2025-09-18
Anticipated expiration: 2026-09-15

Abstract

This disclosure is directed to a machine learning framework to detect and analyze interactions between medical instruments anatomies of a patient in a robotic medical procedure. A system identifies, from a data stream of a medical procedure with a robotic medical system, a movement of an instrument used in the medical procedure over the plurality of frames. The system identifies, using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure. The system detects, based at least on a comparison of the movement of the instrument and the pattern of motion of the anatomical structure, an interaction between the instrument and the anatomical structure. The system provides, via an interface, an indication of the interaction.

Description

INTERACTION DETECTION BETWEEN ROBOTIC MEDICAL INSTRUMENTS

AND ANATOMICAL STRUCTURES

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/566,086, filed March 15, 2024, which is incorporated by reference herein in its entirety.

BACKGROUND

[0002] Medical procedures can be performed in an operating room. As the amount and variety of equipment in the operating room increases, or medical procedures become increasingly complex, it can be challenging to perform such medical procedures efficiently, reliably, or without incident.

SUMMARY

[0003] The technical solutions of the present disclosure provide machine learning (ML) based detection and analysis of interactions between medical instruments used in robotic surgeries and anatomical features (e.g., tissues) of patients. Detecting and monitoring tooltissue interactions and in robotic medical systems can be challenging. As various robotic instruments (e.g., surgical tools) can be used to interact with various types of patient anatomies, there can be a risk of unintended injuries and tissue tears unless precautions are taken to perform such tasks within an acceptable range of motion and applied force. However, depending on the tissue location and types of actions performed or types of instruments used, it can be difficult to detect and monitor such interactions, increasing the risk of an injury or error. The technical solutions overcome these challenges by providing ML based detection, recognition and analysis of robotic surgical tool-tissue interactions, allowing for real-time alerts and indications to reduce the risks and improve the surgical performance.

[0004] At least one aspect of the technical solutions is directed to a system. The system can include one or more processors, coupled with memory. The one or more processors can identify, from a data stream of a medical procedure with a robotic medical system, a movement of an instrument used in the medical procedure over the plurality of frames. The one or more processors can identify, using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure. The one or more processors can detect, based at least on a comparison of the movement of the instrument and the pattern of motion of the anatomical structure, an interaction between the instrument and the anatomical structure. The one or more processors can provide, via an interface, an indication of the interaction.

[0005] The one or more processors can be configured to determine a type of the interaction using a model trained using machine learning and based at least on the comparison of the movement of the instrument and the pattern of motion of the anatomical structure. The one or more processors can provide, via the interface, an indication of the type of interaction.

[0006] The one or more processors can be configured to determine a metric indicative of a degree of the interaction and provide, via an interface, an indication of the metric. The one or more processors can identify, based at least on the plurality of frames of a video stream, a type of the instrument used in the medical procedure. The one or more processors can identify, based at least on the plurality of frames, a type of the anatomical structure. The one or more processors can detect, based at least on the type of the instrument and the type of the anatomical structure, a type of the interaction.

[0007] The one or more processors can be configured to identify, from the data stream, kinematics data indicative of the movement of the instrument and video stream data of the anatomical structure. The one or more processors can identify one or more machine learning (ML) models having one or more spatial attention mechanisms and one or more temporal attention mechanisms trained on a dataset of a plurality of interactions between a plurality of instruments and a plurality of anatomical structures in a plurality of medical procedure. The one or more processors can detect the interaction based at least on the kinematics data and the video stream data applied to the one or more spatial attention mechanisms and the one or more temporal attention mechanisms.

[0008] The one or more processors can be configured to determine, based at least on a movement of a portion of the instrument and a pattern of motion of a portion of the anatomical structure, that a level of consistency of the movement and the pattern of motion exceeds a threshold. The one or more processors can detect, based at least on the determination, the interaction. The one or more processors can determine, using the plurality of frames of a video stream input into an encoder of a machine learning model, a plurality of anatomical features indicative of the anatomical structure. The one or more processors can detect the anatomical structure based at least on the plurality of anatomical features. [0009] The one or more processors can be configured to determine, using a first time stamp of a kinematics data of the data stream indicative of a movement of the instrument and a second time stamp of a force data indicative of a force corresponding to the instrument, the movement of the instrument over a time period. The one or more processors can determine, using a third time stamp of the plurality of frames; the pattern of motion over the time period. The one or more processors can detect the interaction based at least on a correlation of the movement of the instrument and the pattern of motion during the time period.

[0010] The one or more processors can identify one or more machine learning (ML) models having a temporal attention mechanism trained on timing of actions captured by a plurality of video streams of a plurality of medical procedures. The one or more processors can determine, based at least on the plurality of frames and kinematics data on movement of the instrument applied to the temporal attention mechanism, one or more locations of the instrument over the plurality of frames.

[0011] The one or more processors can be configured to identify one or more machine learning (ML) models having a spatial attention mechanism trained on spatial arrangement of a plurality of instruments and a plurality of anatomical structures captured by a plurality of video streams of a plurality of medical procedures. The one or more processors can determine, based at least on the plurality of frames applied to the spatial attention mechanism, the movement of the instrument and the pattern of motion of the anatomical structure.

[0012] The one or more processors can be configured to identify a time period corresponding to the plurality of frames. The one or more processors can determine, based at least on kinematics data of the instrument and the plurality of frames, a plurality of locations of the instrument corresponding to the time period. The one or more processors can determine, based at least on the plurality of locations, a velocity of the instrument during the time period. The one or more processors can determine, based at least on the velocity of the instrument and the interaction between the instrument and the anatomical structure a performance metric of a task corresponding to the interaction. The one or more processors can provide for display the performance metric overlaid over the plurality of frames.

[0013] The one or more processors can be configured to identify, using machine learning, a plurality of locations of the instrument within a time period corresponding to the plurality of frames. The one or more processors can determine, based at least on the plurality of locations, one or more velocities of one or more objects within the time period. The one or more processors can identify, using the plurality of frames and one or more machine learning (ML) models, a first one or more vectors corresponding to the movement of the instrument. The one or more processors can identify, using the plurality of frames and the one or more ML models, a second one or more vectors corresponding to the pattern of motion of the anatomical structure. The one or more processors can detect the interaction between the instrument and the anatomical structure based at least on the comparison of the first one or more vectors and the second one or more vectors.

[0014] At least one aspect of the technical solutions is directed to a method. The method can include identifying, by one or more processors coupled with memory from a data stream of a medical procedure implemented using a robotic medical system, a movement of an instrument used in the medical procedure over the plurality of frames. The method can include identifying, by the one or more processors using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure. The method can include comparing, by the one or more processors, the movement of the instrument and the pattern of motion of the anatomical structure. The method can include detecting, by the one or more processors based at least on the comparison, an interaction between the instrument and the anatomical structure. The method can include providing, by the one or more processors via an interface, an indication of the interaction.

[0015] The method can include determining, by the one or more processors, a type of the interaction using a model trained using machine learning and based at least on the comparison of the movement of the instrument and the pattern of motion of the anatomical structure. The method can include providing, by the one or more processors, via the interface, an indication of the type of interaction. The method can include determining, by the one or more processors, a metric indicative of a degree of the interaction. The method can include providing, by the one or more processors, via an interface, an indication of the metric.

[0016] The method can include identifying, by the one or more processors, based at least on the plurality of frames of a video stream, a type of the instrument used in the medical procedure. The method can include identifying, by the one or more processors, based at least on the plurality of frames, a type of the anatomical structure. The method can include detecting, by the one or more processors, based at least on the type of the instrument and the type of the anatomical structure, a type of the interaction. [0017] The method can include identifying, by the one or more processors from the data stream, kinematics data indicative of the movement of the instrument and video stream data of the anatomical structure. The method can include identifying, by the one or more processors, one or more machine learning (ML) models having one or more spatial attention mechanisms and one or more temporal attention mechanisms trained on a dataset of a plurality of interactions between a plurality of instruments and a plurality of anatomical structures in a plurality of medical procedures. The method can include detecting, by the one or more processors, the interaction based at least on the kinematics data and the video stream data applied to the one or more spatial attention mechanisms and the one or more temporal attention mechanisms.

[0018] An aspect of the technical solutions is directed to a non-transitory computer readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to identify, from a data stream of a medical procedure with a robotic medical system, a movement of an instrument used in the medical procedure over the plurality of frames. The instructions, when executed, can cause the one or more processors to identify, using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure. The instructions, when executed, can cause the one or more processors to detect, based at least on a comparison of the movement of the instrument and the pattern of motion of the anatomical structure, an interaction between the instrument and the anatomical structure. The instructions, when executed, can cause the one or more processors to provide, via an interface, an indication of the interaction.

[0019] These and other aspects and implementations are discussed in detail below. The foregoing information and the following detailed description include illustrative examples of various aspects and implementations and provide an overview or framework for understanding the nature and character of the claimed aspects and implementations. The drawings provide illustration and a further understanding of the various aspects and implementations and are incorporated in and constitute a part of this specification. The foregoing information and the following detailed description and drawings include illustrative examples and should not be considered as limiting. BRIEF DESCRIPTION OF THE DRAWINGS

[0020] The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component can be labeled in every drawing. In the drawings:

[0021] FIG. 1 depicts an example system for using an ML framework to detect and analyze interactions between medical instruments of a robotic medical system and anatomical parts of a patient’s body during an ongoing medical procedure.

[0022] FIG. 2 illustrates an example of a system configuration ML-based detection of interactions using an image encoder and a time series processor.

[0023] FIG. 3 illustrates an example of a video frame showing anatomy features and instrument features indicated and marked for motion pattern analysis in a motion pattern map.

[0024] FIG. 4 illustrates an example of a system configuration for using ML framework to determine interactions between detected anatomy features and instrument features.

[0025] FIG. 5 illustrates an example flow diagram of a method for ML-based detection of interactions between instruments and anatomy of a patient in medical robotic systems is illustrated.

[0026] FIG. 6 illustrates an example of a surgical system, in accordance with some aspects of the technical solutions.

[0027] FIG. 7 illustrates an example block diagram of an example computer system is shown, in accordance with some aspects of the technical solutions.

DETAILED DESCRIPTION

[0028] Following below are more detailed descriptions of various concepts related to, and implementations of, systems, methods, apparatuses for detection and analyses of interactions between medical instruments and anatomies of patients. The various concepts introduced above and discussed in greater detail below can be implemented in any of numerous ways.

[0029] Although the present disclosure is discussed in the context of a surgical procedure, in various aspects, the technical solutions of this disclosure can be applicable to other medical or non-medical applications, treatments, sessions, environments or activities, in which detection and analyses of interactions between robot-controlled instruments and parts of an anatomy of a body can be sought. For instance, technical solutions can be applied in any environment, application or industry in which activities, operations, processes or acts by robots or robotic tools are captured on video and other data stream to apply to ML modeling for tooltissue interaction recognition and analysis.

[0030] The technical solutions relate to a framework for detecting tool-tissue manipulation within robotic surgery videos using an ensembled machine learning (ML) modelling. The framework can facilitate detection and recognition surgical instruments, anatomical tissues, and their mutual interaction, including any physical or on-contact relationship. The technical solutions can facilitate improved understanding of the process of endoscopic or other medical surgeries or procedures and facilitate generation of various tool-tissue interaction-related performance metrics.

[0031] Tool-tissue on-contact manipulation detection tasks can involve a combination of three challenges: a surgical instrument detection and localization, an anatomical tissue detection, and an interactive on-contact recognition of the surgical instrument and the anatomical tissue. In robot-assisted endoscopic surgeries, it can be difficult to provide sensory information from kinematic and signal data of the robotic system, leaving users (e.g., surgeons utilizing robotic medical systems) to instead rely on their subjective perception which can be affected by various factors. The technical solutions utilize machine learning technology along with the video and system kinematics data to improve the ability to detect, monitor and analyze tool-tissue on-contact manipulation by combining the ML based detection and monitoring of the instrument (subject) to the tissue (object) with the clinically valid interactive operation (action).

[0032] The technical solutions can take advantage of a clue provided in video stream in which a surgical tool manipulating a tissue has a motion that is consistent with a motion of at least a portion of the tissue. For example, the technical solutions can generate a field of vectors (e.g., a motion field) in which the magnitude and the direction of a velocity of a detected portion of a medical instrument can align or be consistent with the magnitude and the direction of a velocity of the detected portion of an anatomical structure (e.g., a tissue) that is in contact with the instrument.

[0033] FIG. 1 depicts an example system 100 for using an ML framework to detect and analyze interactions between medical instruments of a robotic medical system and anatomical parts of a patient’s body during an ongoing medical procedure. Example system 100 can include a surgical robotic system for performing tasks using medical instruments, such as a robotic medical system 120 used by a surgeon to perform a surgery on a patient. Robotic medical system 120, also referred to as an RMS 120, can be deployed in a medical environment 102. Medical environment 102 can include any space or facility for performing medical procedures, such as a surgical facility, or an operating room. Medical environment 102 can include medical instruments 112 that the RMS 120 can use for performing surgical patient procedures, whether invasive, non-invasive, in-patient, or out-patient procedures.

[0034] The medical environment 102 can include one or more data capture devices 110 (e.g., optical devices, such as cameras or sensors or other types of sensors or detectors) for capturing data streams 162, that can include video data 178 of images or a video stream of a surgery as well as other sensor data 174, events data 176 and kinematics data 172. The medical environment 102 can include one or more visualization tools 114 to gather the captured data streams 162 and process it for display to the user (e.g., a surgeon or other medical professional) at one or more displays 116. A display 116 can present data stream 162 (e.g., video frames, kinematics or sensor data) of an ongoing medical procedure (e.g., an ongoing surgery) performed using the robotic medical system 120 handling, manipulating, holding or otherwise utilizing medical instruments or tools 112 to perform surgical tasks at the surgical site. Coupled with the RMS 120, via a network 101, can be a data processing system (DPS) 130. DPS 130 can include one or more machine learning (ML) frameworks 140, data repositories 160, motion pattern analyzer units 166, interfaces 180 and segmentation functions 184.

[0035] Data repository 160 can include various data streams 162 generated by the robotic medical system (RMS) 120, including kinematics data 172, sensor data 174, events data 176 and video data 178. ML framework 140 can use data streams 162 as inputs into one or more ML models, such as anatomy models 142 for detecting anatomy features 152, instrument models 144 for detecting instrument features 154 and interaction models 146 for detecting interactions 156 between the detected instrument features 154 (e.g., detected medical instruments 112) and anatomy features 152 (e.g., detected tissues or organs of the patient). ML framework 140 can include one or more ML model trainers 150 for training the ML models (e.g., 142, 144 and 146) along with attention mechanisms 164 that can be utilized by the ML models for detection of anatomies, instruments and their interactions. ML framework 140 can include one or more image encoders 148 for processing and detection of image features and time series processors for processing data streams 162 for data relevant to ML framework determinations. [0036] Machine learning (ML) framework 140 can include any combination of hardware and software for providing a system that integrates ML-based anatomy and instrument models alongside attention mechanisms and rule-based modeling to detect and recognize interactions between medical instruments 112 detected by the ML models as instrument features 154 and anatomical parts (e.g., detected anatomy features 152). ML framework 140 can include one or more ML modules and functions for implementing various tasks in detection, recognition and analysis of detected an analyzed features (e.g., anatomy features 152 or instrument features 154) and detection and recognition of the tool -tissue interactions 156. ML trainers 150 can be used for training anatomy models 142, instrument models 144 or interaction models 146, as well as any related functions or components, such as image encoders 148 or time series processors 158. ML framework 140 can include and utilize motion pattern analyzer unit 166 for generating motion pattern maps 168 and segmentation functions 184 for creating labels for segments of video image frames.

[0037] Anatomy models 142 can be designed and trained to identify and delineate anatomical features 152 (e.g., tissues, organs, glands, arteries, or other parts of a patient’s body) using video frames or images of the video data 178, thereby facilitating localization of tissues or organs. Instrument models 144 can be trained and designed to detect and recognize characteristics of various medical instruments 112 utilized during surgical procedures, facilitating identification and tracking of the recognized instrument features 154 (e.g., detected medical instruments 112) throughout the video stream. Interaction models 146 can utilize attention spatial and temporal attention mechanisms 164 to facilitate detection and characterization of interactions between the detected instruments features 154 and detected anatomy features 152. In doing so, the interaction models 146 can facilitate insight and analysis of the interactions 156 and provide any resulting interaction metrics 186 that may be determined from the ongoing video streamed surgical interventions.

[0038] ML framework 140 can include attention mechanisms 164, implemented as neural networks, which enable the extraction of spatial and temporal features from the input data steams 162. Attention mechanisms 164 can facilitate or improve the capacity of the ML models (e.g., 142, 144 and 146) to discern, detect or recognize specific details within the surgical context, thereby improving the accuracy of detection and recognition tasks. ML framework 140 can include and provide rule-based modeling to determine and quantify the consistency of motion (e.g., correlation between the velocity vectors) between detected instrument features 154 and detected anatomical features 152. By integrating image encoders 148 for extracting image features and time-series processors 158 for using kinematics data and sensor measurements (including force data), the ML framework 140 improves the quality of the detection and recognition by the ML models. For example, the ML framework 140 can may utilize attention mechanisms 164 to focus on relevant regions of interest within the video data 178 of medical procedure, while simultaneously using rule-based modeling to assess the coherence or correlation of motion between a detected medical instrument (e.g., a scalpel) and the surrounding anatomical part (e.g., a tissue moving along with the scalpel as it is being cut), thereby facilitating an improved accuracy of the determination by the interaction model 146 that the interaction 156 corresponds to a cutting action of the given tissue.

[0039] Data repository 160 of the DPS 130 can include one or more data streams 162, such as video data 178 including a stream of video frames. Data streams 162 can include measurements from sensors, which can be referred to as sensor data 174 and which can include various force, torque or biometric data, haptic feedback data, pressure or temperature data, vibration, tension or compression data, endoscopic images or data, ultrasound images or videos or communication and command data streams. Data repository 160 can include installation data, such as system files or logs including time stamps and data on installation, activation, calibration or use of particular medical instruments 112. ML models, including anatomy model 142, instrument model 144 and interaction model 146 can each be stored in a data repository 160, along with training data sets and data streams 162.

[0040] Motion pattern analyzer unit 166 can include any combination of hardware and software for determining, generating and providing representation of movement or motions in a space, such as a space corresponding to video frames of a streamed video data 178. Motion pattern analyzer unit 166 can include a representation or a distribution of velocities of objects or features in the video frames, including velocity vectors (e.g., directions and magnitudes) of image frame areas corresponding to detected instrument features 154 and anatomy features 152. Motion pattern analyzer unit 166 can implement an optical flow analysis, such as a computer vision analysis corresponding to motion of objects or features within a visual scene of a video frame. Motion pattern analyzer unit 166 can include the functionality to generate motion pattern or a map of motion pattern (e.g., motion pattern map 168) providing, mapping or illustrating velocities of various objects in the frames of video data 178.

[0041] Motion pattern map 168 can include any two-dimensional representation of velocities of various objects, such as detected instrument features 154 (e.g., detected medical instruments 112) and detected anatomy features 152 (e.g., detected tissues, organs, muscles or other body parts of the patient interacted with by the medical instruments). Motion pattern map 168 can include an optical flow map providing results of a computer vision analysis of movement or motion of objects within a series of video frames. Motion pattern map 168 can include a color-coded map of instrument objects indicative of the velocities of the objects over a time period (e.g., over a prior one or more seconds and through present time). Motion pattern map 168 can indicate, map or illustrate velocity vectors for various portions of image or video frames, highlighting portions of the images in which velocities match (e.g., have their velocities or directions coincide).

[0042] The system 100 can include one or more data capture devices 110 (e.g., video cameras, sensors or detectors) for collecting any data stream 162, that can be used for machine learning and detection of objects, such as medical instruments or tools or anatomical parts of a patient subjected to the medical procedure. Data capture devices 110 can include cameras or other image capture devices for capturing video data 178 (e.g., videos or images) from a particular viewpoint within the medical environment 102. The data capture devices 110 can be positioned, mounted, or otherwise located to capture content from any viewpoint that facilitates the data processing system 130 capturing various surgical tasks or actions.

[0043] Data capture devices 110 can include any of a variety of sensors, cameras, video imaging devices, infrared imaging devices, visible light imaging devices, intensity imaging devices (e.g., black, color, grayscale imaging devices, etc.), depth imaging devices (e.g., stereoscopic imaging devices, time-of-flight imaging devices, etc.), medical imaging devices such as endoscopic imaging devices, ultrasound imaging devices, etc., non-visible light imaging devices, any combination or sub-combination of the above mentioned imaging devices, or any other type of imaging devices that can be suitable for the purposes described herein. Data capture devices 110 can include cameras that a surgeon can use to perform a surgery and observe manipulation components within a purview of field of view suitable for the given task performance.

[0044] Data capture devices 110 can capture, detect, or acquire sensor data, such as videos or images, including for example, still images, video images, vector images, bitmap images, other types of images, or combinations thereof. The data capture devices 110 can capture the images at any suitable predetermined capture rate or frequency. Settings, such as zoom settings or resolution, of each of the data capture devices 110 can vary as desired to capture suitable images from any viewpoint. For instance, data capture devices 110 can have fixed viewpoints, locations, positions, or orientations. The data capture devices 110 can be portable, or otherwise configured to change orientation or telescope in various directions. The data capture devices 110 can be part of a multi-sensor architecture including multiple sensors, with each sensor being configured to detect, measure, or otherwise capture a particular parameter (e.g., sound, images, or pressure).

[0045] Data capture devices 110 can include any type and form of a sensor for providing sensor data 174, including a positioning sensor, a biometric sensor, a velocity sensor, an acceleration sensor, a vibration sensor, a motion sensor, a pressure sensor, a light sensor, a distance sensor, a current sensor, a focus sensor, a temperature sensor, a haptic or tactile sensor or any other type and form of sensor used for providing data on medical tools 112, or data capture devices (e.g., optical devices). For example, a data capture device 110 can include a location sensor, a distance sensor or a positioning sensor providing coordinate locations of a medical tool 112 or a data capture device 110. Data capture device 110 can include a sensor providing information or data on a location, position or spatial orientation of an object (e.g., medical tool 112 or a lens of data capture device 110) with respect to a reference point. The reference point can include any fixed, defined location used as the starting point for measuring distances and positions in a specific direction, serving as the origin from which all other points or locations can be determined.

[0046] Display 116 can show, illustrate or play data streams 162, including video data 178, in which medical tools 112 at or near surgical sites are shown. For example, display 116 can display a rectangular image (e.g., a frame of a video data 178) of a surgical site along with at least a portion of medical instruments 112 being used to perform surgical tasks. Display 116 can provide compiled or composite images generated by the visualization tool 114 from a plurality of data capture devices 110 to provide visual feedback from one or more points of view.

[0047] The visualization tool 114 that can be configured or designed to receive any number of different data streams 162 from any number of data capture devices 110 and combine them into a single data stream displayed on a display 116. The visualization tool 114 can be configured to receive a plurality of data stream components and combine the plurality of data stream components into a single data stream 162. For instance, the visualization tool 114 can receive a visual sensor data from one or more medical tools 112, sensors or cameras with respect to a surgical site or an area in which a surgery is performed. The visualization tool 114 can incorporate, combine or utilize multiple types of data (e.g., positioning data of a medical tool 112 along sensor readings of pressure, temperature, vibration or any other data) to generate an output to present on a display 116. Visualization tool 114 can present locations of medical tools 112 along with locations of any reference points or surgical sites, including locations of anatomical parts of the patient (e.g., organs, glands or bones).

[0048] Medical instruments or tools 112 can be any type and form of tool or instrument used for surgery, medical procedures or a tool in an operating room or environment. Medical tool 112 can be imaged by, associated with or include an image capture device. For instance, a medical tool 112 can be a tool for making incisions, a tool for suturing a wound, an endoscope for visualizing organs or tissues, an imaging device, a needle and a thread for stitching a wound, a surgical scalpel, forceps, scissors, retractors, graspers, or any other tool or instrument to be used during a surgery. Medical tools 112 can include hemostats, trocars, surgical drills, suction devices or any instruments for use during a surgery. The medical tool 112 can include other or additional types of therapeutic or diagnostic medical imaging implements. The medical tool 112 can be configured to be installed in, coupled with, or manipulated by an RMS 120, such as by manipulator arms or other components for holding, using and manipulating the medical instruments or tools 112.

[0049] RMS 120 can be a computer-assisted system configured to perform a surgical or medical procedure or activity on a patient via or using or with the assistance of one or more robotic components or medical tools 112. RMS 120 can include any number of manipulator arms for grasping, holding or manipulating various medical tools 112 and performing computer-assisted medical tasks using medical tools 112 controlled by the manipulator arms.

[0050] Video data 178, including any images or videos captured by a medical tool 112 (e.g., endoscopic camera) can be sent to the visualization tool 114. The robotic medical system 120 can include one or more input ports to receive direct or indirect connection of one or more auxiliary devices. For example, the visualization tool 114 can be connected to the RMS 120 to receive the images from the medical instrument 112 when the medical instrument 112 is installed in the RMS 120 (e.g., on a manipulator arm of the RMS 120 that is used for moving, managing or otherwise handing medical instruments 112). The visualization tool 114 can combine the data streams 162 from the data capture devices 110 and the medical tool 112 into a single combined data stream 162 for use by the ML framework 140 (e.g., ML models 142, 144 or 146 or associated attention mechanisms 164, image encoders 148, time series processors 158 and motion pattern analyzer units 166). [0051] The system 100 can include a data processing system 130. The data processing system 130 can be deployed in or associated with the medical environment 102, or it can be provided by a remote server or be cloud-based. The data processing system 130 can include an interface 180 designed, constructed and operational to communicate with one or more component of system 100 via network 101, including, for example, the robotic medical system 120. Data processing system 130 can be implemented using instructions stored in memory locations and processed by one or more processors, controllers or integrated circuitry. Data processing system 130 can include functionalities, computer codes or programs for executing or implementing any functionality of ML framework 140, including any ML models (e.g., 142- 146) along with any associated functions or features (e.g., 164, 148, 158 166 or 184) for identification, detection and analysis of anatomy features 152, instrument features 154 and their interactions 156.

[0052] The ML trainer 150 can any combination of hardware and software for training ML models. Machine learning (ML) trainer 150 can include or generate ML models 142, 144 and 146, each of which can be trained using training datasets that can include various data streams 162 corresponding to various medical procedures using the RMS 120. ML trainer 150 can include a framework or functionality for training different machine learning models, such as a neural network spatial-temporal attention mechanism model designed for detecting medical instruments 112 (e.g., instrument features 154) as well as detecting anatomical parts of a patient (e.g., anatomy features 152) using various data from data streams 162, including video data 178, kinematics data 172 and sensor data 174 (e.g., force data). ML trainer 150 can train ML models, such as an anatomy model 142, an instrument model 144, or interaction model 146 using a dataset of any number of data streams 162 corresponding to any number of medical procedures utilizing various medical instruments 112 to interact with various types of patient anatomies.

[0053] The ML trainer 150 can include an attention mechanism 164 that can be used to address the noise challenges in the data. Attention mechanism 164 can include a neural network with spatial and temporal attentions performed as an image encoder 148 to learn to identify anatomy features 152 and instrument features 154 representing locations of medical instruments 112 (e.g.., instrument features 154) and patient body tissues (e.g., anatomy features 152). Attention mechanism 164 can utilize weights to emphasize different types of information in the data stream 162, such as movements in a region of an image frame that corresponds to a prior image frame in which a particular type of movement was detected. Such spatial and temporal weights used in the attention mechanism 164 can facilitate an improved or selective focus of the ML functions onto particular features in the data stream 162, assigning varying degrees of importance to each part of the input data during the learning process. For example, the attention mechanism 164 can include spatial -temporal attention mechanism 164 within the neural network architecture that configures an ML model to focus selectively on relevant spatial and temporal features in the video data, thereby improving the accuracy of the detection. By assigning weights to selected segments of the input videos data 178, the attention mechanism 164 allows the model to attenuate the impact less relevant portions of the data, emphasizing the importance of the more relevant cues (e.g., focus on a detected medical instrument 112 or a detected anatomical tissue of a patient) for more accurate interaction 156 determinations.

[0054] Anatomy model 142 can include any combination of hardware and software for utilizing machine learning for detection of anatomical parts of a patient undergoing a medical procedure. Anatomy model 142 can include a neural network model that can utilize an image encoder 148 to detect features of a video data 178 that correspond to a portion of an anatomy or anatomical feature 152 of a patient. Anatomy model 142 can utilize a time series processor 158 to utilize sensor data 174 (e.g., temperature, pressure or tactile force) to detect help detect or recognize a particular anatomical feature 152. Anatomy model 142 can utilize motion pattern map 168 generated by a motion pattern analyzer unit 166 to identify a particular portion of a patient’s body corresponding to an anatomy feature 152.

[0055] Anatomy model 142 can include or utilize transformers or transformer-based architectures, such as a spatial-temporal transformers or a graphical neural network with transformers to detect or recognize anatomy features 152. Anatomy features 152 can include any identifications, predictions, determinations or recognitions of a portion of a patient’s body captured by a video. Spatial-temporal transformer can facilitate determinations of the anatomy features 152 using motion pattern map 168 in which a particular region of interest can be highlighted, such as based on a velocity direction and magnitude that correlates or coincides with the direction and magnitude of the velocity of a detected instrument feature 154 or another anatomy feature 152. Spatial temporal transformer can identify regions in the video frame that are of interest and that correspond to locations in which anatomy features 152 are being identified or detected. For example, transformers can be used for multimodal integration in which data streams 162 from multiple types of sources (e.g., data from various detectors, sensors and cameras) can be combined to predict anatomy features 152. Spatial -temporal transformer neural network can be applied to the frames of video data 178 to facilitate spatial relations of features across different images or data sources (e.g., 110). Anatomy model 142 can include any one or more machine learning (e.g., deep neural network) models trained on diverse datasets to learn to recognize various details of different anatomy features 152.

[0056] Anatomy features 152 determined by the anatomy models 142 can include any portion of a patient’s body interacted with by a medical instrument 112. Anatomy features 152 can include various tissues, organs, and glands of a patient’s body. Anatomy features 152 can include skeletal, smooth, and cardiac muscles for movement and organ function, vital organs such as the heart, lungs, liver, kidneys, and spleen, or any organs facilitating blood circulation, respiration, and metabolism. Anatomy features 152 can include, glandular structures, such as the thyroid, adrenal, and pituitary glands for regulating hormone production, as well as vascular structures such as arteries, veins, and capillaries. Anatomy features 152 can include connective tissues, adipose tissue, nervous tissue, and epithelial tissue further contribute to bodily functions and homeostasis.

[0057] Anatomy features 152, taken together, can form anatomical structure that can be detected by the anatomy model 142 to more accurately identify an anatomy feature 152 within the recognized anatomical structure. For instance, anatomy model 142 can detect anatomy features 152 corresponding to bones, joints, and cartilage which, along with the surrounding skeletal muscles and nearby glands can provide an overall anatomical structure of the given region being imaged, allowing the anatomy model 142 to recognize the relative arrangement and orientation of these anatomy features 152 to more precisely narrow down the scope of possible anatomy feature 152 being interacted with by the detected instrument feature 154, thus improving the accuracy of the anatomy model 142.

[0058] Instrument model 144 can include any combination of hardware and software, including machine learning features and architectures for detecting and recognizing an instrument feature 154, such as any medical instrument 112 being used in the frames of the video data 178. Instrument model 144 can utilize an image encoder 148 to detect objects or features in the frames of the video data 178 to detect particular medical instruments 112 being used. Instrument model 144 can utilize time series processor 158 to process kinematics data 172 with timestamped movements or motion of various medical instruments 112 to identify the instrument features 154. Instrument model 144 can utilize motion pattern maps 168 generated by motion pattern analyzer units 166 to identify and detect instrument features 154 in the video frames of the video data 178. [0059] Instrument model 144 can include support vector machines (SVMs) that can facilitate predictions (e.g., anatomical, instrument, object, action or any other) in relation to class boundaries, random forests for classification and regression tasks, decision trees for prediction trees with respect to distinct decision points, K-nearest neighbors (KNNs) that can use similarity measures for predictions based on characteristics of neighboring data points, Naive Bayes functions for probabilistic classifications, logistic or linear regressions, or gradient boosting models. Instrument model 144 can include neural networks, such as deep neural networks configured for hierarchical representations of features, convolutional neural networks (CNNs) for image-based classifications and predictions, as well as spatial relations and hierarchies, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for determining structures and processes unfolding over time.

[0060] Instrument model 144 can include or utilize transformers or transformer-based architectures, such as a spatial-temporal transformers or a graphical neural network with transformers to detect or recognize instrument features 154. Instrument features 154 can include any identifications, predictions, determinations or recognitions of a portion of a medical instrument 112 captured by a video. Spatial -temporal transformer can facilitate determinations of the instrument features 154 using motion pattern map 168 in which a particular region of interest can be highlighted, such as based on a velocity direction and magnitude that correlates or coincides with the direction and magnitude of the velocity of a detected anatomy feature 152 or another instrument feature 154. Spatial temporal transformer can identify regions in the video frame that are of interest and that correspond to locations in which instrument features 154 are being identified or detected. For example, transformers can be used for multimodal integration in which data streams 162 from multiple types of sources (e.g., data from various detectors, sensors and cameras) can be combined to predict instrument features 154. Spatial -temporal transformer neural network can be applied to the frames of video data 178 to facilitate spatial relations of features across different images or data sources (e.g., 110). Instrument model 144 can include any one or more machine learning (e.g., deep neural network) models trained on diverse datasets to learn to recognize various details of different instrument features 154.

[0061] Instrument features 154 detected or recognized by the instrument model 144 include, for example, any medical or surgical tool used in a surgery, such as any one or more of: shears, needles, threads, scalpels, clips, rings, bone screws, graspers, retractors, saws, forceps, imaging devices, or any other medical instrument 112 or a tool used in a medical procedure. Instrument features 154 can include any tools or systems utilized in non-medical applications, such as tools used by industrial robots in handling or manipulating objects in manufacturing or assembly, or robots handling tools or components in other applications, such as agricultural applications, drone applications or service applications.

[0062] Interaction model 146 can include any combination of hardware and software for detecting interactions 156 between anatomy features 152 detected by anatomy models 142 and instrument features 154 detected by instrument model 144. Interaction model 146 can include a rule-based model that can utilize rules for various arrangements and configurations of anatomy features 152 and instrument features 154 to determine interactions 156. Interaction model 146 can include a neural network model that can utilize an image encoder 148 to detect various imaged objects (e.g., 152 or 154) along with time series processor 158 using sensor data 174 and kinematics data 172 to discern, determine or recognize the type of interaction 156 taking place. Interaction model 146 can utilize motion pattern map 168 generated by a motion pattern analyzer unit 166 to identify anatomy features 152 and instrument features 154 and determine, based on the velocities and directions of movements, the interaction 156 taking place.

[0063] Interaction model 146 can include ML functionality to determine, detect or recognize a level of interaction between the instrument feature 154 and anatomy feature 152. For example, interaction model 146 can identify, detect or determine an amount of force an instrument feature 154 (e.g., a recognized or detected medical instrument 112) is applying to a particular anatomy feature 152 (e.g., a particular tissue). The interaction model 146 can determine or detect a consistency of the force applied to the tissue, such as smoothness of force or flow vectors The interaction model 146 can determine the level of stress that the instrument feature 154 is applying on the anatomy feature 152, including any amount of force, such as tension force pulling or stretching a tissue, compression force to compress or push into a tissue, shear force, a duration of hold time (e.g., time for which the tissue is being held), drag or pull force, torsion, heat or cooling or any other stress that can be applied to the anatomy features 152. The interaction model 146 can determine amounts, levels or scales of interaction, such as levels from 1-10, levels such as low, medium or high, percentage of force that a tissue can withstand without a risk of damage (e.g., 0-100% where 100% is a threshold for damaging the tissue) or any other level. The interaction model 146 can trigger alarms or indications 182 via the interface 180 in response to levels of interactions 156 exceeding threshold levels for particular type of tissue (e.g., threshold for anatomy feature 152). For example, interaction model 146 can trigger an indication 172 in response to the level of interaction 156 exceeding thresholds for pulling on a tissue, pushing a tissue, stretching a tissue beyond a particular range, ripping a tissue, compressing a tissue, cutting a tissue or otherwise affecting the tissue.

[0064] Interaction model 146 can include or utilize transformers or transformer-based architectures, such as a spatial-temporal transformers or a graphical neural network with transformers to detect or recognize interactions 156. Interactions 156 can include any identifications, predictions, determinations or recognitions of an action taken by the instrument feature 154 with respect to the anatomy feature 152. A spatial -temporal transformer can facilitate determinations of the interactions 156 using motion pattern map 168 in which a particular region of interest can be highlighted for a given instrument feature 154 and an anatomy 152, such as when the velocities of the two features share the same or a similar direction and magnitude (e.g., correlating or coinciding velocities), which can be indicative of an interaction 156. Spatial temporal transformer can identify relative changes in motion between the anatomy feature 152 and instrument feature 154, thereby identifying motions indicative of a particular interaction 156.

[0065] Interactions 156 detected by an interaction model 146 can include any interactions or manipulations of an anatomical part of a patient’s body by a medical instrument 112. Examples of detected interactions 156 can include actions, such as suturing a wound using a surgical needle and thread, making an incision with a scalpel to access underlying tissues or organs, inserting an endoscopic device into a body cavity for visualization or treatment, retracting tissues using surgical retractors to expose the surgical site, grasping and manipulating tissues or organs with surgical forceps or graspers, cauterizing tissue using electrocautery or laser devices to control bleeding or remove tissue, ligating blood vessels or other structures using surgical clips or ligatures to occlude them, aspirating fluids or debris using suction devices to clear the surgical field, and irrigating a surgical site with saline or other solutions to clean and maintain visibility. Interactions 156 can include any task or a phase of a medical procedure, such as a robotic surgery.

[0066] Interaction model 146 can be configured to identify, predict, classify, categorize, or otherwise score various performance aspects of the interaction 156. Interaction model 146 can identify performance metrics with respect to a particular task detected as an interaction 156, based on the amount of force applied by the detected instrument feature 154 (e.g., detected instrument 112) on a particular type of anatomy feature 152 (e.g., identified tissue). For example, interaction model 146 can be configured to determine a performance metric 186 for a given interaction 156, such as a scalpel made incision, wound suturing or an endoscopic insertion. For example, interaction model 146 can make a determination that an interaction 156 between a detected anatomy feature 152 (e.g., a wound on an arm) and a detected instrument feature 154 (e.g., a surgical needle) are involved in an interaction 156 of suturing the wound using a thread and a needle. Interaction model 146 can determine that the amount of force used to pull on the thread is within acceptable range, thereby providing an interaction metric 186 of 100% that is indicative of a well performed interaction 156.

[0067] Segmentation function 184 can include any combination of hardware and software for segmenting objects or features recognized in the frames of video data 178. Segmentation function 184 can include the functionality for implementing anatomy segmentation to identify specific anatomy features 152 from an anatomy structure of a plurality of anatomy features 152 of the patient’s body. Segmentation function 184 can include the functionality for segmenting instrument features 154 to identify specific portions of the medical instruments 112. Segmentation function 184 can convert encoded feature vectors into a segmentation map, which can include, represent, highlight or label spatial locations around particular detected anatomy features 152 or particular detected instrument features 154. Segmentation function 184 can function or operate together with attention mechanism 164 and the motion pattern analyzer unit 166 to provide velocity distributions in motion pattern maps 168 for detected labeled objects in the segmentation maps.

[0068] Time series processor 158 can include any combination of hardware and software for processing kinematics and sensor data to facilitate improved performance of ML models. Time series processor 158 can utilize time stamped data of the data streams 162 to temporally line up the data stream data with video data 178 and use the data stream inputs to supplement determinations of ML models 142, 144 or 146. Time series processor 158 can utilize time stamped sensor data 174 (e.g., time stamped sensor measurements), time stamped kinematics data 172 (e.g., time stamped data on motion or movements of medical instruments 112) and time stamped events data 176 to provide additional information which the ML models 142, 144 or 146 can use to determine anatomy features 152, instrument features 154 or detect interactions 156.

[0069] The data repository 160 can include one or more data files, data structures, arrays, values, or other information that facilitates operation of the data processing system 130. The data repository 160 can include one or more local or distributed databases and can include a database management system. The data repository 160 can include, maintain, or manage a data stream 162. The data stream 162 can include or be formed from one or more of a video stream, image stream, stream of sensor measurements, event stream, or kinematics stream. The data stream 162 can include data collected by one or more data capture devices 110, such as a set of 3D sensors from a variety of angles or vantage points with respect to the procedure activity (e.g., point or area of surgery).

[0070] Data stream 162 can include video data 178, which can include a series of video frames formed or organized into video fragments, such as video fragments of about 1, 2, 3, 4, 5, 10 or 15 seconds of a video. Each second of the video can include, for example, 30, 45, 60, 90 or 120 video frames 308 per second. Data stream 162 can include a stream of events data 176 which can include a stream of event data or information, such as packets, which identify or convey a state of the robotic medical system 120 or an event that occurred in association with the robotic medical system 120. Events data 176 can include information on a state of the RMS 120 indicating whether a medical instrument 112 is calibrated, adjusted or includes a manipulator arm installed on an RMS 120. Event data 176 can include data on whether an RMS 120 is fully functional (e.g., without errors) during the procedure. For example, when a medical instrument 112 is installed on a manipulator arm of the RMS 120, a signal or data packet(s) can be generated indicating that the medical instrument 112 has been installed on the manipulator arm of the RMS 120.

[0071] Data stream 162 can include a stream of kinematics data 172, which can refer to or include data associated with one or more of the manipulator arms or medical tools 112 (e.g., instruments) attached to the manipulator arms, such as arm movements, locations or positioning. Data corresponding to medical tools 112 can be captured or detected by one or more displacement transducers, orientational sensors, positional sensors, or other types of sensors and devices to measure parameters or generate kinematics information. The kinematics data 172 can include sensor data along with time stamps and an indication of the medical tool 112 or type of medical tool 112 associated with the data stream 162.

[0072] DPS 130 can include an interface 180 designed, constructed and operational to communicate with one or more component of system 100 via network 101, including, for example, the RMS 120 or another device, such as a client’s personal computer. The interface 180 can include a network interface. The interface 180 can include or provide a user interface, such as a graphical user interface. The graphical user interface can include, for example, a window for displaying video data 178, or indications 182 that can be overlaid or displayed instead of, along with, or on top of the video data 178. Interface 180 can provide data for presentation via a display, such as a display 116, and can depict, illustrate, render, present, or otherwise provide indications 182 indicating determinations (e.g., outputs) of the ML models, such as anatomy features 152, instrument features 154 and interactions 156.

[0073] The data processing system 130 can interface with, communicate with, or otherwise receive or provide information with one or more component of system 100 via network 101, including, for example, the RMS 120. The data processing system 130, RMS 120 and devices in the medical environment 102 can each include at least one logic device such as a computing device having a processor to communicate via the network 101. The DPS 130, any portion of the ML framework 140, the RMS 120 or a client device that can be communicatively coupled with the DPS or the RMS 120 via the network 101, can each include at least one computation resource, server, processor or memory for processing data. For example, the data processing system 130 can include a plurality of computation resources or processors coupled with memory.

[0074] The data processing system 130, as well as any of its components (e.g., ML framework 140, motion pattern analyzer unit 166 or a segmentation function 184) can each be a part of or include a cloud computing environment functionality or features. The data processing system 130 can include multiple, logically grouped servers and facilitate distributed computing techniques. The logical group of servers may be referred to as a data center, server farm or a machine farm. The servers can also be geographically dispersed. A data center or machine farm may be administered as a single entity, or the machine farm can include a plurality of machine farms. The servers within each machine farm can be heterogeneous - one or more of the servers or machines can operate according to one or more type of operating system platform.

[0075] The data processing system 130, or components thereof can include a physical or virtual computer system operatively coupled, or associated with, the medical environment 102. In some embodiments, the data processing system 130, or components thereof can be coupled, or associated with, the medical environment 102 via a network 101, either directly or directly through an intermediate computing device or system. The network 101 can be any type or form of network. The geographical scope of the network can vary widely and can include a body area network (BAN), a personal area network (PAN), a local-area network (LAN) (e.g., Intranet), a metropolitan area network (MAN), a wide area network (WAN), or the Internet. The topology of the network 101 can assume any form such as point-to-point, bus, star, ring, mesh, tree, etc. The network 101 can utilize different techniques and layers or stacks of protocols, including, for example, the Ethernet protocol, the internet protocol suite (TCP/IP), the ATM (Asynchronous Transfer Mode) technique, the SONET (Synchronous Optical Networking) protocol, the SDH (Synchronous Digital Hierarchy) protocol, etc. The TCP/IP internet protocol suite can include application layer, transport layer, internet layer (including, e.g., IPv6), or the link layer. The network 101 can be a type of a broadcast network, a telecommunications network, a data communication network, a computer network, a Bluetooth network, or other types of wired and wireless networks.

[0076] The data processing system 130, or components thereof, can be located at least partially at the location of the surgical facility associated with the medical environment 102 or remotely therefrom. Elements of the data processing system 130, or components thereof can be accessible via portable devices such as laptops, mobile devices, wearable smart devices, etc. The data processing system 130, or components thereof, can include other or additional elements that can be considered desirable to have in performing the functions described herein. The data processing system 130, or components thereof, can include, or be associated with, one or more components or functionality of a computing including, for example, one or more processors coupled with memory that can store instructions, data or commands for implementing the functionalities of the DPS 130 discussed herein.

[0077] FIG. 2 illustrates an example 200 of a system configuration ML based detection of interactions 156 using an image encoder 148 and a time series processor 158. Example 200 can correspond to a system 100 in which machine learning based modelized processing units can be used to recognize and localize anatomies of interest (e.g., anatomy features 152 and medical instruments 112 (e.g., instrument features 154) being used during a surgical procedure.

Example 200 can also include ML based determinations of the on-contact interaction 156 (e.g., tool-tissue manipulation) between the instruments and the anatomies through a multi-modality input stream.

[0078] Example 200 can include a processing pipeline that includes a data stream 162 of a surgical video data 178 input into an image encoder 148 and a kinematics data 172 input into a time series processor 158. The image encoder 148 can include an ML module configured to extract from input surgical videos features that can be used for recognizing medical instruments 112 (e.g., instrument features 154) and anatomies (e.g., anatomy features 152) of interest, along with base features 202. Base features 202 can include data on instrument movement (e.g., instrument jaw open or closed), sensor readings, locations or representations of objects other than medical instruments 112 interacting with anatomical features 154, as well as any other contextual information that can be used to determine interactions between the instrument features 154 and anatomical features 152. The image encoder 148 can localize the instrument features 154 and anatomy features 152 them in the format of mask map (e.g., anatomy mask 204) or a bounding box (e.g., instrument box 206) which can be displayed or provided in a two- dimensional representation, such as video frame 302 or motion pattern map 168. For instance, outputs from image encoder 148 and time series processor 158 can be combined or fused together to generate or identify anatomy features 152, instrument features 154 or any base features 202.

[0079] Anatomy mask 204 can include any indication of an anatomy feature 152 that can be used to represent an anatomy feature 152 in a space, such as in an image frame of a video data 178 or motion pattern map 168. Anatomy mask 204 can be used to label a region of interest, such as a region corresponding to an anatomy feature 152 or an instrument feature 154 detected using the ML based infrastructure (e.g., ML models, image encoder 148, attention mechanisms 164 and motion pattern maps 168). Anatomy mask 204 can include a highlight of the recognized anatomy feature region on a video frame. Anatomy mask 204 can include an outline or a contour of the anatomy feature 152.

[0080] Instrument box 206 can include any indication of an instrument feature 154 that can be used to represent any instrument feature 154 in a two-dimensional space. For example, instrument box 206 can include an outline indication of an instrument feature 154 that can be used to represent an instrument feature 154 in a space, such as in an image frame of a video data 178 or motion pattern map 168. Instrument box 206 can be used to label a region of interest, such as a region corresponding to an instrument feature 154 detected using the ML models, image encoder 148, attention mechanisms 164 and motion pattern maps 168. Instrument feature 154 can include a highlight of the recognized instrument feature 154 region on a video frame or an outline or a contour of the instrument feature 154.

[0081] In example 200, the time series processor 158 can process the system and kinematics data 172 streams from the robotic medical system 120, including for example force feedback data (e.g., sensor data 174) providing sensory information about the interaction to be determined. The time series processor 158 can parse the data and functions as a time series analyzer to extract temporal features. For example, the time series processor 158 can analyze data of the time period coinciding with a video segment of the video data 178 in which the interaction 156 is recorded. The image encoder 148 and the time series processor 158 can output various features, including recognized anatomy features 152, instrument features 154 and the base feature 202 that can represent semantics learned from the input data streams.

[0082] Base features 202 can include any features or determinations determined by one or more ML models trained to analyze data streams 162 according to time stamped sensory or kinematics data to facilitate determination of the anatomy feature 152 or instrument feature 154 used for the interaction 156. Base features 202 can be determined using kinematics data 172, sensor data 174 or events data 176, such as information indicative that a jaw or a grasper of a medical instrument 112 is open or closed, that a particular instrument 112 was installed or activated, or that a pressure has been detected on a particular instrument 112 approaching an anatomy feature 152. Base features 202 can include information on trajectory of a medical instrument 112, including flow vectors pertaining to movement. The example system 200 can generate anatomy class (e.g., anatomy mask 204) or instrument box 206 including a box for the instrument type detected to highlight, indicate or display the anatomy features 152 and instrument features 154 detected as output (e.g., an overlay on a display of the video fragment), or for further processing. The interaction model 146 can act as an interaction identifier utilizing the features 152 and 154 to determine and generate tool -tissue interaction outcomes between detected instruments and recognized anatomies, such as particular tasks or actions performed by the instrument feature 154 on a particular anatomy feature 152.

[0083] FIG. 3 illustrates an example 300 of a video frame 308 showing anatomy features 152 and instrument features 154 indicated and marked for optical flow or a motion pattern analysis in a motion pattern map 168. Example 300 can show marking anatomy regions in a video frame 308 using anatomy masks and marking instrument regions of the video frame 308 using instrument boxes 206, as applied, for example in a motion pattern map 168. Example 300 can include views 302, 304 and 306 showing various stages of indications 182 or overlays in the video frame 308 of a video data 178 of the medical procedure.

[0084] At view 302, a video frame 308 can show a pair of medical instruments 112A and 112B being manipulated by manipulator arms of the RMS 120. Medical instruments 112A and 112B can be used in an interaction with an anatomy region (e.g., a tissue) displayed in the video frame 308. The medical instruments 112 can be detected by the ML framework 140 as instrument features 154 within an instrument region 312 of the video frame 308. Similarly, the anatomical part on which the medical instruments 112 take action can be identified as anatomy region 310 of the video frame 308. [0085] At view 304, the video frame 308 can show the anatomy region 310 that is marked by an anatomy mask 204, which can serve as a form of an indication 182. The anatomy mask 204 can indicate or highlight the anatomy region 310 or it can provide a contour of the outer edges of the anatomy region 310 identified by the ML framework 140. As shown in view 304, a grasper or a jaw of a medical instrument 112A can grasp or pull onto anatomy feature 154 corresponding to the anatomy region 310 in the video frame 308. The interaction 156 can be identified, for example, responsive to the ML framework 140 detecting a portion of the jaw of the medical instrument 112A overlapping a portion of the anatomy feature 152 in the anatomy region 310. Detection of the overlap between the instrument 112A and the anatomy feature 152 can be made in response to the contour of the anatomy region 310 or anatomy mask 204 being interrupted by the surface of the medical instrument 112 handling the anatomy feature 154. As shown in this example, the contour of the anatomy mask 204 indicating the anatomy feature 154 can include a line that is interrupted or overlaid by the medical instrument 112A, which along with the action of pulling or moving the portion of the anatomy feature 154 (e.g., indicated by changed in the shape of the anatomy mask 204 reflecting the outer edges of the feature 154) provides indication which ML framework 140 can use to detect and analyze the interaction 156.

[0086] At view 306 of a video anatomy region 310 corresponding to the anatomy feature 154 can be marked by an anatomy mask 204 showing outer contour of the anatomy feature 154 and instrument region 312 corresponding to the instrument feature 154 can be marked by an instrument box 206 outlining the outer contour of the medical instrument 112. At view 306, outer contours of the anatomy region 310 and the instrument region 312 can be used to indicate interaction between the medical instrument and a patient tissue. For instance, overlap between the instrument region 312 and the anatomy region 310 or an interruption of the contour of the anatomy region 310 by the instrument region 312 can indicate an interaction 156. For instance, a combination of the overlap of the anatomy region 310 and the instrument region 312 together with correlation in velocity vectors of the overlapped portion of the anatomy region 310 and the overlapped portion of the instrument region 312 can indicate the interaction 156. The motion pattern map 168 can include color coded regions of the video frame 308 distinguishing the instrument region 312 from the anatomy region 310. Motion pattern map 168 can include optical flow map and indicate the velocities of various components or objects, such as the instrument region 312 and the anatomy region 310. When velocities of the relative two regions 310 and 312 coincide to indicate a particular movement, either in unison (e.g., towards the same direction and at the same speed) or at different directions or different speeds, the motion pattern map 168 can indicate such velocity distributions, facilitating for more accurate determination of the interaction 156 by the interaction model 146. For example, the color blue can represent a direction of the motion vector that is towards the anatomical structure, whereas the color red can represent a direction of the motion vector away from the anatomical structure. The shades of the color, or color gradient, can vary based on the direction of movement. The intensity of the color can represent the magnitude of the vector, such as the force, acceleration or speed of movement.

[0087] FIG. 4 illustrates an example 400 of a system configuration for using ML framework to determine interactions 156 between detected anatomy features 152 and instrument features 154. In the configuration 400, a ML system 100 can take sequential images and time series of kinematics data 172 and sensor data 174 that are recorded during the procedure. A motion pattern analyzer unit 166 can provide motion pattern-based motion fields. The system can utilize a neural network with spatial and temporal attention mechanisms 164 performed as an image encoder 148 to learn the features (e.g., 152) that represent the possible location of each relative organ and structure. The ML framework 140 can utilize ML models and segmentation function 184 generate anatomy mask 204. The system can utilize a rulebased ML interaction model 146 that can check the motions of the identified instrument features 154 and anatomy features 152 for consistency, alignment or correlation in order to determine the on-contact manipulation or otherwise interaction 156.

[0088] The surgical video data 178 can be discretized as the sequential framed images or video frames 308. One or more ML models 142 can process the images or video frames 308 of the vide stream and generate the anatomy masks 204 to label out the anatomy region 310 of interest in the analysis. The ML framework 140 can utilize a segmentation function 184 to implement anatomy segmentation as a type of neural network for adopting sequential frames to take numeric operations over the sequential inputs and convert them into a compact feature vector that represents the input. For example, the ML network can utilize the attention mechanism 164 to provide a spatial and temporal correlation disentangling of the data. Attention mechanism 164 can be utilize selective weights, such that it can apply different weights to emphasize different parts of the information, resulting in finding out the best and compressed representation that fulfills the tasks. Attention mechanism 164 can include both of weighting capabilities on spatial and temporal dimensions, resulting in the neural network structure including a segmentation head that converts encoded feature vectors into a segmentation map. The segmentation map can label the spatial location around the anatomies, such as by providing anatomy masks 204 for anatomy regions 310 or instrument boxes 206 for instrument regions 312 in video frames 308.

[0089] Motion pattern analyzer unit 166 can provide the distribution of the apparent velocities of objects in a video frame 308. By estimating motion pattern between different video frames 308 in the sequence of video frames 308 of a video data 178 stream, the ML framework 140 can measure the velocities of objects (e.g., 152 and 154) in the video. The motion pattern analyzer unit 166 can characterize and quantify the motion field (e.g., motion pattern map 168) of the surgical scene.

[0090] Interactions model 146 can determine interactions 156 involving the instrument features 154 and anatomy features 152. Base features 202, such as time stamped force data that can be synchronized with the video frames 308 of the video stream can be utilized as inputs to the ML models 142, 144 and 146 for improved determinations. ML framework 140 can act as an expert system to quantitatively calculate the consistency of the motion pattern in the surgical instrument and anatomy overlapped area. The consistency metric can include autocorrelation or other numerical measurements that evaluate the smoothness of the flow vectors in the area. By thresholding the metric, such as by determining correlation or coinciding velocity vector according to a threshold level of similarity (e.g., cosine similarity within a threshold range), the technical solutions can determine the type of the interactions 156. The force data (e.g., sensor data 174) can include force feedback that can be used to validate the location or motion of the detected instrument feature 154. The force feedback can be utilized as an auxiliary threshold to validate the manipulating dynamics and improve the accuracy of the determinations.

[0091] In one aspect, the technical solutions can include a system 100 that can include one or more processors (e.g., 710) that can be coupled with memory (e.g., 715 or 720). The memory 715 or 720 can store instructions, computer code or data that can cause the one or more processors 720 to implement any functionality of a DPS 130, including for example any functionality of a ML framework 140, ML models 142, 144 or 146, attention mechanism 164, image encoder 148, time series processor 158, motion pattern analyzer unit 166, segmentation function 184 or interfaces 180 providing indications 182. Indications 182 can include alerts, messages or notifications of anatomy features 152, instrument features 154 or interactions 156 along with any interaction metrics 186 which can be sounded via a sound alarm or displayed or overlaid on a display 116. For example, instructions stored in memory 715 or 720 can configure or cause the one or more processors to perform various operations or tasks of the DPS 130.

[0092] The one or more processors 710 can identify, from a data stream 162 of a medical procedure with a robotic medical system 120, a movement of an instrument feature 154 (e.g., a medical instrument 112 used in the medical procedure) over the plurality of video frames (e.g., 302). The movement can be identified using an image encoder 148 implemented with an attention mechanism 164 to identify a series of instrument features 154 in a series of video frames 308 of a video data 178 of the medical procedure implemented using the RMS 120. The movement can be identified using a motion pattern map 168 generated with a motion pattern analyzer unit 166 generating instrument boxes 206 around instrument regions 312 of a motion pattern map 168.

[0093] The one or more processors 710 can identify, using the data stream 162, a pattern of motion of an anatomical feature 152 or a structure over at least a portion of the medical procedure. The pattern of motion can be identified using an image encoder 148 implemented with an attention mechanism 164 to identify a series of anatomy features 152 in a series of video frames 308 of a video data 178 of the medical procedure implemented using the RMS 120. The pattern of motion can include a series of locations of the anatomy region 310 in a series of video frames 308 over a time period. The pattern of motion can be identified using a motion pattern map 168 generated with a motion pattern analyzer unit 166 generating an anatomy mask 204 around an anatomy region 310 of a motion pattern map 168.

[0094] The one or more processors 710 can detect, based at least on a comparison of the movement of the detected instrument feature 154 (e.g., the medical instrument 112) and the pattern of motion of the anatomical structure (e.g., 152), an interaction 156 between the instrument (e.g., 112) and the anatomical structure (e.g., 152). An interaction model 146 can utilize one or more rules to correlate or compare the movement of the instrument feature 154 with the pattern of motion or movement of the anatomical feature 152. For example, in a motion pattern map 168 one or more vectors (e.g., directions and magnitudes) of velocity of the instrument feature 154 can be compared with, or correlated with, one or more vectors of velocity of the movement or pattern of motion of the anatomical feature 152 detected by the ML framework 140. The correlation or comparison can be implemented based on rules of a rule-based ML interaction model 146 which can utilize a plurality of rules to match a plurality of correlations of velocities between anatomy features 152 and instrument features 154 with any one of a plurality of trained interactions 156. The interactions 156 to be detected by the ML framework 140 can include, for example, any particular task used in a medial procedure, such as using tweezers to grab and hold onto a tissue of a patient’s body, utilize a needle and a thread to suture a wound, or utilize a scalpel to make an incision.

[0095] The one or more processors 710 can provide, via an interface 180, an indication 182 of the interaction 156. The interface can include, for example, a graphical user interface in which a series of video frames 308 of a video file (e.g., 178) can be displayed. The interface 180 can include one or more indications 182, such as an anatomy mask 204 marking, highlighting, outlining, contouring or indicating an anatomy region 310 corresponding to a detected anatomy feature 152. The interface 180 can include one or more indications 182, such as an instrument box 206 marking, highlighting, outlining, contouring or indicating an instrument region 312 corresponding to a detected instrument feature 154. The indication 182 can include a location of the contact between the identified instrument feature 154 and the anatomy feature 152, or an overlaid indication or highlight of the location of the tool -tissue interaction.

[0096] The one or more processors 710 can determine a type of the interaction 156 using one or more ML models (e.g., 142, 144 or 146) that can be trained using machine learning. The one or more ML models 142, 144 or 146 can be based at least on the comparison of the movement of the instrument (e.g., 112) and the pattern of motion of the anatomical structure (e.g., 152). For instance, the ML models 142-146 can utilize a motion pattern map 168 mapping the velocities of the anatomy features 152 and instrument features 154 to correlations of the mapped velocities associated with certain interactions 156 (e.g., tasks performed during the medical procedure). The one or more processors 710 can provide, via the interface 180, an indication 182 of the type of interaction 156, such as by including it as an indication 182 to be displayed or by overlaying it over the video frames 308 displayed.

[0097] The one or more processors 710 can determine an interaction metric 186. The interaction metric can be determined by the ML framework 140, including for example by interaction model 146. The interaction metric 184 can be indicative of a degree of the interaction 156, such as the amount of force (e.g., compression or compression) applied by the medical instrument 112 identified as the instrument feature 154 onto the tissue identified as the anatomy feature 152. For example, the degree of interaction can include levels of force, such as tension or compression, applied by the medical instrument 112 onto a tissue. The level of force can be quantified in terms of levels, scores or percentages, such as 0% indicative of no force applied to 100% indicative of the force that can damage the tissue and should be avoided. The degree of interaction can include or be indicated as low, medium or high, a level scale from 1 to 10, a letter grade or a color-coded symbol. The degree of interaction can include thresholds that can trigger safety alarms or indications 182 to the user (e.g., surgeon) that there is a danger or risk of injury. For instance, an interaction metric 184 can indicate that a level of interaction exceeds a threshold level of force for a particular type of anatomy feature 152 beyond which this particular anatomy feature 152 can be cut, pierced, bruised or damaged.

[0098] In response to the level of interaction exceeding a threshold level, DPS 130 can take action. For instance, responsive to the level of interaction exceeding a threshold level for a particular anatomy feature 152 or a particular anatomy feature type, the interface 180 can trigger or issue an alarm or indication 182 warning the user (e.g., surgeon) of the threshold being exceeded. In response to the level of interaction exceeding a threshold, DPS 130 can trigger an instruction for the RMS 120 to cease an action, such as release the anatomy feature 152 from the hold. For instance, a first level of interaction at a lower level can trigger an alarm or an indication that a damage to the tissue can be done, while a second level of interaction at a level that exceeds the first level of interaction can trigger an automatic release of the tissue by the instrument, stopping the movement of the instrument or retraction of the instrument from applying additional force on the tissue. Responsive to the threshold being exceeded, the RMS 120 can provide a dynamic force feedback, such as haptic feedback, to the user (e.g., surgeon) providing haptic indication that interaction level has exceeded a threshold for the level of dragging, stretching, pulling or holding the tissue. Interaction metric 186 can include a consistency metric indicative of the smoothness of force or vectors of flow of movement with respect to the tissue. The one or more processors 710 can provide, via an interface 180, an indication of the metric (e.g., 186) by displaying or overlaying the metric over the video frames 308. Interaction metrics 186 can be provided or displayed via indications 182, using an interface 180, which can be displayed on a display, such as a display 116.

[0099] The one or more processors 710 can identify, based at least on the plurality of video frames 308 of a video stream (e.g., 178), a type of the medical instrument 112 used in the medical procedure. For instance, the instrument model 144 can detect and identify the type of the medical instrument 112 identified as the instrument feature 154. For example, the instrument model 144 can utilize the image encoder 148 to identify the instrument feature 154 as any medical instrument 112, such as a grasper, a pair scissors, a surgical stapler, a dissector, a needle or a scalpel. [00100] The one or more processors 710 can identify, based at least on the plurality of video frames 308, a type of the anatomical structure or an anatomy feature 152. The type of anatomical structure can include a plurality of detected anatomy features 152 arranged in a particular way to indicate a particular portion of the patient’s body and facilitating a more accurate identification of the anatomy feature 152 interacted by the instrument feature 154. The one or more processors 710 can detect, based at least on the type of the medical instrument 112 and the type of the anatomical structure (e.g., one or more anatomy features 152) a type of the interaction 156.

[00101] The one or more processors 710 can identify, from the data stream 162, kinematics data 172 indicative of the movement of the instrument (e.g., 112) and video stream data (e.g., 178) of the anatomical structure. The one or more processors 710 can identify one or more machine learning (ML) models (e.g., 142, 144 and 146) having one or more spatial attention mechanisms 164 and one or more temporal attention mechanisms 164 trained on a dataset of a plurality of interactions 156 between a plurality of instruments 112 (e.g., instrument features 154) and a plurality of anatomical structures (e.g., anatomy features 152) in a plurality of medical procedures. The one or more processors 710 can detect the interaction 156 based at least on the kinematics data 172 and the video stream data (e.g., 178) applied to the one or more spatial attention mechanisms 164 and the one or more temporal attention mechanisms 164.

[00102] The one or more processors 710 can determine, based at least on a movement of a portion of the instrument 112 (e.g., instrument feature 154) and a pattern of motion of a portion of the anatomical structure (e.g., anatomy feature 152), that a level of consistency of the movement and the pattern of motion exceeds a threshold. The threshold can include, for example, a threshold for a level of consistency in the motion or correlation of the features 152 and 154. The threshold can include, for example, a threshold range of a similarity function, such as a cosine similarity function between the movement of the feature 154 and the pattern of motion of the feature 152. The one or more processors 710 can detect the interaction 156 based at least on the determination of the level of consistency exceeding the threshold.

[00103] The one or more processors 710 can determine, using the plurality of frames of a video stream input into an image encoder 148 of a machine learning model (e.g., 142), a plurality of anatomical features (e.g.., anatomy features 152) indicative of the anatomical structure. The anatomical structure can correspond to a particular organ or a portion of a person’s body that can be detected or determined using the plurality of anatomy features 152 of the organ and its surroundings identified in the video frames 308. The one or more processors 710 can detect the anatomical structure based at least on the plurality of anatomical features (e.g., 152).

[00104] The one or more processors 710 can determine, using a first time stamp of a kinematics data 172 of the data stream 162 indicative of a movement of the instrument 112 (e.g., instrument feature 154) and a second time stamp of a force data (e.g., 174) indicative of a force corresponding to the instrument 112, the movement of the instrument 112 over a time period. The one or more processors 710 can determine, using a third time stamp of the plurality of video frames 308; the pattern of motion over the time period. The one or more processors 710 can detect the interaction 156 based at least on a correlation of the movement of the instrument 112 and the pattern of motion of the anatomy feature 152 during the time period.

[00105] The one or more processors 710 can identify one or more machine learning (ML) models having a temporal attention mechanism 164 trained on timing of actions captured by a plurality of video streams (e.g., 178) of a plurality of medical procedures. The one or more processors 710 can determine, based at least on the plurality of video frames 308 and kinematics data 172 on movement of the instrument applied to the temporal attention mechanism 164, one or more locations of the medical instrument 112 (e.g., instrument feature 154) over the plurality of video frames 308.

[00106] The one or more processors 710 can identify one or more machine learning (ML) models having a spatial attention mechanism 164 trained on spatial arrangement of a plurality of instruments and a plurality of anatomical structures (e.g., 152) captured by a plurality of video streams of a plurality of medical procedures. The one or more processors 710 can determine, based at least on the plurality of video frames 308 applied to the spatial attention mechanism 164, the movement of the medical instrument 112 (e.g., 154) and the pattern of motion of the anatomical structure (e.g., 152).

[00107] The one or more processors 710 can identify a time period corresponding to the plurality of frames. The one or more processors 710 can determine, based at least on kinematics data 172 of the medical instrument 112 and the plurality of video frames 308, a plurality of locations of the medical instrument 112 corresponding to the time period. The one or more processors 710 can determine, based at least on the plurality of locations, a velocity of the medical instrument 112 (e.g., instrument feature 154) during the time period. [00108] The one or more processors 710 can determine, based at least on the velocity of the medical instrument 112 and the interaction 156 between the instrument 112 (e.g., 154) and the anatomical structure (e.g., 152) a performance metric (e.g., interaction metric 186) directed to a performance quality measurement of a task corresponding to the interaction 156. The performance metric 186 can include analysis of the quality of the surgical task performance determined based at least on the interaction 156 detected. The one or more processors 710 can provide for display the performance metric 186 overlaid over the plurality of video frames 308.

[00109] The one or more processors 710 can identify, using machine learning, a plurality of locations of the instrument (e.g., instrument feature 154) within a time period corresponding to the plurality of video frames 308. The one or more processors 710 can determine, based at least on the plurality of locations, one or more velocities of one or more objects within the time period. The one or more processors 710 can identify, using the plurality of video frames 308 and one or more machine learning (ML) models (e.g., 142-146), a first one or more vectors corresponding to the movement of the instrument 112 (e.g., 154). The one or more processors 710 can identify, using the plurality of video frames 308 and the one or more ML models (e.g., 142-146), a second one or more vectors corresponding to the pattern of motion of the anatomical structure (e.g., 152). The one or more processors 710 can detect the interaction between the instrument (e.g., 154) and the anatomical structure (e.g., 152) based at least on the comparison of the first one or more vectors and the second one or more vectors.

[00110] FIG. 5 depicts an example flow diagram of a method 500 for machine learning based detection of interactions between instruments and anatomy of a patient in medical robotic systems. The method 500 can be performed by a system having one or more processors configured to perform operations of the system 100 by executing computer-readable instructions stored on a memory. The method can be implemented using a non-transitory computer readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to implement operations of the method 500. The method 500 can be performed, for example, by system 100 and in accordance with any features or techniques discussed in connection with FIGS. 1-4 and 6-7. For instance, the method 500 can be implemented one or more processors 710 of a computing system 700 executing non- transitory computer-readable instructions stored on a memory (e.g., the memory 715, 720 or 725) and using data from a data repository 160 (e.g., storage device 725).

[00111] The method 500 can be used to detect and analyze interactions between anatomy and instrument features via a ML framework using image encoders and attention mechanisms to detect such features in a video stream. Method 500 can include operations 505-530. At operation 505, the method can identify instrument features. At operation 510, the method can identify anatomy feature. At operation 515, the method can identify movements of anatomy and instrument features. At operation 520, the method can determine if the movements of the anatomy and instrument features correlate. At operation 525, the method can detect interaction between the instrument and the anatomy features. At operation 530, the method can provide an indication of the interaction.

[00112] At operation 505, the method can identify instrument features. The method can include the one or more processors determining or detecting the instrument feature based at least on the plurality of frames of a video stream. The instrument feature can be detected or determined based at least on kinematics data on movement of the instrument or sensor data, such as force feedback data, on the instrument. The kinematics or force data can be timestamped and temporally aligned with the time period of the vide frames of the video stream used to determine or detect the presence of the instrument features. The instrument features can include any portion, or entirety, of any medical instrument utilized by a robotic medical system. The instrument can include, for example, a medical tool for making incisions, such as a scalpel, a tool for suturing a wound, such as a needle and a thread, an endoscope for visualizing organs or tissues, an imaging device, a forceps, a pair of scissors, one or more retractors, graspers, or any other tool or instrument used by the robotic medical system during the medical operation.

[00113] The method can include the one or more processors determining a type of the instrument used in the medical procedure. The method can determine the instrument feature from a motion pattern map of one or more video frames of a video data stream. The method can determine the instrument feature using an instrument model. The instrument model can utilize an image encoder to detect objects or features in one or more image or video frames. The image encoder can be implemented using one or more attention mechanisms, which can include a spatial attention mechanism that can apply weights to consider more strongly features of a particular spatial orientation or arrangement than other spatial features. For instance, activity within a particular portion or region of a video frame can be identified for more close attention for locating a medical instrument based at least on location of the medical instrument in one or more preceding video frames. An attention mechanism can include a temporal attention mechanism to apply weights to consider more strongly features of a particular temporal arrangement than other temporal features. For instance, instrument features can be identified based at least on the timing of the data (e.g., video frame time stamp, sensor data time stamp or kinematics data time stamp).

[00114] The method can include identify, using machine learning, a plurality of locations of the instrument within a time period corresponding to the plurality of video frames. For instance, the method can detect locations of the instrument feature identified in a motion pattern map over a time period corresponding to a plurality of video frames. The locations of the instrument feature in subsequent video frames can be determined based at least on the locations of the instrument feature in preceding video frames.

[00115] At operation 510, the method can identify anatomy feature. The method can include the one or more processors determining or detecting the anatomy feature based at least on the plurality of frames of a video stream. The anatomy features can include any portion, or entirety, of any medical instrument utilized by a robotic medical system. The anatomy feature can include, for example, any portion of anatomy of a patient, such as an organ or a portion of an organ, a tissue, a gland, a muscle, an artery, a cartilage, a bone, a nerve tissue or any other portion of anatomy of a patient subjected to a medical procedure using a robotic medical system.

[00116] The method can include the one or more processors determining a type of the anatomy used in the medical procedure. The method can determine the anatomy feature from a motion pattern map of one or more video frames of a video data stream. The method can determine the anatomy feature using an anatomy model. The anatomy model can utilize an image encoder to detect objects or features in one or more image or video frames. The image encoder can be implemented using one or more attention mechanisms, which can include a spatial attention mechanism that can apply weights to consider more strongly features of a particular spatial orientation or arrangement than other spatial features. For instance, activity within a particular portion or region of a video frame can be identified for more close attention for locating an anatomy feature based at least on a location of the anatomy feature in one or more preceding video frames. An attention mechanism can include a temporal attention mechanism to apply weights to consider more strongly features of a particular temporal arrangement than other temporal features.

[00117] Anatomy features can be detected or identified based at least on the timing of the data (e.g., video frame time stamp, sensor data time stamp or kinematics data time stamp). The method can include identify, using machine learning, a plurality of locations of the anatomy feature within a time period corresponding to the plurality of video frames. For instance, the method can detect locations of the anatomy feature identified in a motion pattern map over a time period corresponding to a plurality of video frames. The locations of the anatomy feature in subsequent video frames can be determined based at least on the locations of the anatomy feature in preceding video frames.

[00118] The one or more processors can identify a type of the anatomical structure. The anatomical structure can include one or more anatomical features. The anatomical structure can include an arrangement, such a location or spatial arrangement, of one or more anatomical features with respect to other anatomical features or frames of reference. The anatomical structure or one or more anatomical features can be identified by the one or more processors, based at least on the plurality of video frames. For instance, the method can include determining, using the plurality of frames of a video stream input into an encoder of a machine learning model, a plurality of anatomical features indicative of the anatomical structure. The method can include detecting the anatomical structure based at least on the plurality of anatomical features.

[00119] At operation 515, the method can identify movements of one or more anatomy features or one or more instrument features. The method can include the one or more processors identifying, from a data stream of a medical procedure implemented using a robotic medical system, a movement of an instrument or an instrument feature that can be moved or used in the medical procedure. Identification of the movement of the instrument can be determined or identified using one or more video frames of the video stream data. The method can determine or detect the movement of the anatomy feature or the instrument feature, or both, based on changes to locations of these respective features over a series of video frames of the video stream.

[00120] The method can identify, using the plurality of frames and one or more machine learning (ML) models, a first one or more vectors corresponding to the movement of the instrument. The first one or more vectors can include one or more velocity vectors that can include direction and magnitude of the velocity of the instrument feature identified in a video frame or a motion pattern map. The method can include identifying, using the plurality of frames and the one or more ML models, a second one or more vectors corresponding to the pattern of motion of the anatomical structure or one or more anatomical features. The second one or more vectors can include one or more velocity vectors that can include direction and magnitude of the velocity of the anatomical feature identified in the video frame or the motion pattern map.

[00121] The method can include the one or more processors identifying, using the data stream (e.g., video data stream, kinematics data, sensor data or events data), a pattern of motion of an anatomical structure over at least a portion of the medical procedure. For instance, the method can include the one or more processors identifying, from the data stream, kinematics data indicative of the movement of the instrument and video stream data of the anatomical structure. The movement can be identified by the motion pattern analyzer unit that can apply velocity vectors to instrument and tissue features in a two-dimensional space corresponding to a video frame.

[00122] The method can include the one or more processors identifying one or more machine learning (ML) models having one or more spatial attention mechanisms and one or more temporal attention mechanisms. The attention mechanisms can be trained on a dataset of a plurality of interactions between a plurality of instruments and a plurality of anatomical structures in a plurality of medical procedures. The method can determine, based at least on a movement of a portion of the instrument and a pattern of motion of a portion of the anatomical structure, that a level of consistency or correlation of the movement and the pattern of motion exceeds a threshold. The threshold can include a threshold level or range of similarity that can be determined based on a similarity function (e.g., cosine similarity or Euclidean distance function) that can be applied to vector representations of the movement of the features, such as the velocity vectors of the instrument feature and the velocity vectors of the anatomy features over one or more video frames. Based on determination of the level of consistency or correlation, the method can detect or identify the interaction, or interaction type, between the instrument feature and the anatomy feature.

[00123] The method can include identifying one or more ML models having a temporal attention mechanism trained on timing of actions captured by a plurality of video streams of a plurality of medical procedures. The method can determine, based at least on the plurality of frames and kinematics data on movement of the instrument applied to the temporal attention mechanism, one or more locations of the instrument over the plurality of frames. The method can identify one or more ML models having a spatial attention mechanism trained on spatial arrangement of a plurality of instruments and a plurality of anatomical structures captured by a plurality of video streams of a plurality of medical procedures. The method can determine, based at least on the plurality of frames applied to the spatial attention mechanism, the movement of the instrument and the pattern of motion of the anatomical structure.

[00124] The method can identify a time period corresponding to the plurality of video frames that are used to determine or identify the movements of the instrument feature or the anatomy feature. The method can determine, based at least on kinematics data of the instrument and the plurality of frames, a plurality of locations of the instrument corresponding to the time period. The method can determine, based at least on the plurality of locations, a velocity of the instrument during the time period. The method can determine, based at least on the plurality of locations, one or more velocities of one or more objects (e.g., anatomy features, instrument features or other objects determined within video frames) within the time period.

[00125] At operation 520, the method can determine if the movements of the anatomy features and the instrument features correlate or coincide. The method can include the one or more processors comparing the movement of the instrument feature and the pattern of motion of the anatomical structure. For instance, the one or more processors can compare the movement (e.g., velocity vector) of the instrument feature detected by the instrument ML model with the movement (e.g., velocity vector) of the anatomy feature detected by the anatomy ML model over a series of video frames corresponding to a period of time.

[00126] The method can determine, based at least on a movement of a portion of the instrument and a pattern of motion of a portion of the anatomical structure, that a level of consistency of the movement and the pattern of motion exceeds a threshold; and detect, based at least on this determination, the interaction. For instance, the one or more processors can identify the level of consistency exceeding a threshold in response to a similarity function (e.g., a cosine similarity or a Euclidean distance) between two velocity vectors of the anatomy feature and the instrument feature having similarity or correlation that is beyond a threshold level indicative of a sufficiently high similarity.

[00127] The method can include the one or more processors determining the movement of the instrument feature over a time period using any data stream data. For example, the method can determine the movement of the instrument feature over a time period using a first time stamp of a kinematics data of the data stream indicative of a movement of the instrument and a second time stamp of a force data indicative of a force corresponding to the instrument being used by the RMS during the same time period. The method can include determine, using a third time stamp of the plurality of frames; the pattern of motion over the time period. The method can identify or detect the interaction between the instrument feature and the anatomy feature based at least on a correlation of the movement of the instrument and the pattern of motion during the time period.

[00128] At operation 525, the method can detect interaction between the instrument and the anatomy features. The interaction can include any action or activity implemented using a medical instrument on an anatomical part or a feature of a patient. The interaction can include any task of a medical procedure, such as an action of making an incision using a scalpel, an action of suturing a wound, an action of using a pair of scissors on a thread for suturing, an action of inserting or removing an object (e.g., endoscope) from an anatomical part or feature or any other action or activity involving the instrument feature (e.g., detected medical instrument) and the anatomy feature (e.g., detected part of patient’s anatomy).

[00129] The one or more processors can determine a type of the interaction using a model trained using machine learning and based at least on the comparison of the movement of the instrument and the pattern of motion of the anatomical structure. The one or more processors can detect an interaction between the instrument and the anatomical structure based at least on a comparison of the movement of the instrument feature and the movement of the anatomy feature. For example, if a velocity vector of the instrument feature in a motion pattern map coincides or corresponds to a velocity vector of the anatomy feature in the motion pattern map in accordance with an expected relationship between the two vectors, then a rule can be triggered to identify a particular interaction corresponding to such relationship between the two vectors. For instance, a first velocity vector can be directed in one direction and at one magnitude of velocity while another velocity vector can be directed in another direction at another magnitude of velocity. In response to the two directions and magnitudes being within particular expected ranges, instrument model can trigger a rule to detect a particular interaction corresponding to such directions and magnitudes of the two velocity vectors.

[00130] The method can include the one or more processors detecting, based at least on the type of the instrument and the type of the anatomical structure, a type of the interaction. The one or more processors can determine a metric indicative of a degree of the interaction. For example, the metric can include an interaction metric indicative of the amount of force (e.g., tension or pressure) applied to an anatomy feature by the instrument feature. The metric can indicate the level at which the applied amount of force is within the acceptable range. The metric can, for example, indicate that too much force is being applied to trigger an alert or indication to reduce the force being applied. [00131] The method can include the one or more processors detecting the interaction based at least on the kinematics data and the video stream data applied to the one or more spatial attention mechanisms and the one or more temporal attention mechanisms. The method can detect the interaction based at least on a correlation of the movement of the instrument and the pattern of motion during the time period. For instance, the correlation can include a comparison (e.g., similarity function) being applied to the vector indicative of the movement of the instrument (e.g., instrument feature) and the vector indicative of the pattern of motion of the anatomy feature.

[00132] The method can determine, based at least on the velocity of the instrument and the interaction between the instrument and the anatomical structure a performance metric of a task corresponding to the interaction. The performance metric can be indicative of the quality of the action performed. The method can detect the interaction between the instrument and the anatomical structure based at least on the comparison of the first one or more vectors and the second one or more vectors.

[00133] At operation 530, the method can provide an indication of the interaction. The indication can include an alert, a message, an alarm sound or an overlay of information over a displayed content (e.g., video stream). The indication can be generated to indicate a performance metric, such as an interaction metric indicative of the amount of force applied to an anatomical part corresponding to the anatomy feature detected by the model. The indication can, for example, indicate that too much force is being applied and request the user to reduce the force. The indication can, for example, identify a score corresponding to an action or movement by the medical instrument with respect to an anatomical part of the patient’s body.

[00134] The method can include the one or more processors providing, via an interface, an indication of the interaction. The indication can state or indicate that an interaction between the medical instrument and an anatomy feature took place. The indication can indicate that a contact is made between an instrument and an anatomical part that can be designated as sensitive or unintended for instrument action. The method can include the one or more processors providing, via the interface, an indication of the type of interaction being made. The type of interaction can include pulling on a tissue, pushing a tissue, dragging or stretching a tissue beyond a particular range, ripping a tissue, compressing a tissue, cutting a tissue or otherwise affecting the tissue, a time period for which the tissue is being held or grasped by an instrument. The one or more processors can provide, via an interface, an indication of the metric, such as a performance metric or an interaction metric. The performance metric can indicate a performance of the action performed. The interaction metric can indicate a level of interaction between the instrument and the tissue (e.g., amount of force applied). The method can provide for display the performance metric or the interaction metric overlaid over the plurality of video frames.

[00135] FIG. 6 depicts a surgical system 600, in accordance with some embodiments. The surgical system 600 may be an example of the medical environment 102. The surgical system 600 may include a robotic medical system 605 (e.g., the robotic medical system 120), a user control system 610, and an auxiliary system 615 communicatively coupled one to another. A visualization tool 620 (e.g., the visualization tool 114) may be connected to the auxiliary system 615, which in turn may be connected to the robotic medical system 605. Thus, when the visualization tool 620 is connected to the auxiliary system 615 and this auxiliary system is connected to the robotic medical system 605, the visualization tool may be considered connected to the robotic medical system. In some embodiments, the visualization tool 620 may additionally or alternatively be directly connected to the robotic medical system 605.

[00136] The surgical system 600 may be used to perform a computer-assisted medical procedure on a patient 625. In some embodiments, surgical team may include a surgeon 630 A and additional medical personnel 630B-630D such as a medical assistant, nurse, and anesthesiologist, and other suitable team members who may assist with the surgical procedure or medical session. The medical session may include the surgical procedure being performed on the patient 625, as well as any pre-operative (e.g., which may include setup of the surgical system 600, including preparation of the patient 625 for the procedure), and post-operative (e.g., which may include clean up or post care of the patient), or other processes during the medical session. Although described in the context of a surgical procedure, the surgical system 600 may be implemented in a non-surgical procedure, or other types of medical procedures or diagnostics that may benefit from the accuracy and convenience of the surgical system.

[00137] The robotic medical system 605 can include a plurality of manipulator arms 635 A- 635D to which a plurality of medical tools (e.g., the medical tool 112) can be coupled or installed. Each medical tool can be any suitable surgical tool (e.g., a tool having tissueinteraction functions), imaging device (e.g., an endoscope, an ultrasound tool, etc.), sensing instrument (e.g., a force-sensing surgical instrument), diagnostic instrument, or other suitable instrument that can be used for a computer-assisted surgical procedure on the patient 625 (e.g., by being at least partially inserted into the patient and manipulated to perform a computer- assisted surgical procedure on the patient). Although the robotic medical system 605 is shown as including four manipulator arms (e.g., the manipulator arms 635A-635D), in other embodiments, the robotic medical system can include greater than or fewer than four manipulator arms. Further, not all manipulator arms can have a medical tool installed thereto at all times of the medical session. Moreover, in some embodiments, a medical tool installed on a manipulator arm can be replaced with another medical tool as suitable.

[00138] One or more of the manipulator arms 635A-635D and/or the medical tools attached to manipulator arms can include one or more displacement transducers, orientational sensors, positional sensors, and/or other types of sensors and devices to measure parameters and/or generate kinematics information. One or more components of the surgical system 600 can be configured to use the measured parameters and/or the kinematics information to track (e.g., determine poses of) and/or control the medical tools, as well as anything connected to the medical tools and/or the manipulator arms 635A-635D.

[00139] The user control system 610 can be used by the surgeon 630A to control (e.g., move) one or more of the manipulator arms 635A-635D and/or the medical tools connected to the manipulator arms. To facilitate control of the manipulator arms 635A-635D and track progression of the medical session, the user control system 610 can include a display (e.g., the display 116) that can provide the surgeon 630A with imagery (e.g., high-definition 3D imagery) of a surgical site associated with the patient 625 as captured by a medical tool (e.g., the medical tool 112, which can be an endoscope) installed to one of the manipulator arms 635A-635D. The user control system 610 can include a stereo viewer having two or more displays where stereoscopic images of a surgical site associated with the patient 625 and generated by a stereoscopic imaging system can be viewed by the surgeon 630 A. In some embodiments, the user control system 610 can also receive images from the auxiliary system 615 and the visualization tool 620.

[00140] The surgeon 630A can use the imagery displayed by the user control system 610 to perform one or more procedures with one or more medical tools attached to the manipulator arms 635A-635D. To facilitate control of the manipulator arms 635A-635D and/or the medical tools installed thereto, the user control system 610 can include a set of controls. These controls can be manipulated by the surgeon 630 A to control movement of the manipulator arms 63 SA- 635D and/or the medical tools installed thereto. The controls can be configured to detect a wide variety of hand, wrist, and finger movements by the surgeon 630A to allow the surgeon to intuitively perform a procedure on the patient 625 using one or more medical tools installed to the manipulator arms 635A-635D. [00141] The auxiliary system 615 can include one or more computing devices configured to perform processing operations within the surgical system 600. For example, the one or more computing devices can control and/or coordinate operations performed by various other components (e.g., the robotic medical system 605, the user control system 610) of the surgical system 600. A computing device included in the user control system 610 can transmit instructions to the robotic medical system 605 by way of the one or more computing devices of the auxiliary system 615. The auxiliary system 615 can receive and process image data representative of imagery captured by one or more imaging devices (e.g., medical tools) attached to the robotic medical system 605, as well as other data stream sources received from the visualization tool. For example, one or more image capture devices (e.g., the image capture devices 110) can be located within the surgical system 600. These image capture devices can capture images from various viewpoints within the surgical system 600. These images (e.g., video streams) can be transmitted to the visualization tool 620, which can then passthrough those images to the auxiliary system 615 as a single combined data stream. The auxiliary system 615 can then transmit the single video stream (including any data stream received from the medical tool(s) of the robotic medical system 605) to present on a display (e.g., the display 116) of the user control system 610.

[00142] In some embodiments, the auxiliary system 615 can be configured to present visual content (e.g., the single combined data stream) to other team members (e.g., the medical personnel 630B-630D) who might not have access to the user control system 610. Thus, the auxiliary system 615 can include a display 640 configured to display one or more user interfaces, such as images of the surgical site, information associated with the patient 625 and/or the surgical procedure, and/or any other visual content (e.g., the single combined data stream). In some embodiments, display 640 can be a touchscreen display and/or include other features to allow the medical personnel 630A-630D to interact with the auxiliary system 615.

[00143] The robotic medical system 605, the user control system 610, and the auxiliary system 615 can be communicatively coupled one to another in any suitable manner. For example, in some embodiments, the robotic medical system 605, the user control system 610, and the auxiliary system 615 can be communicatively coupled by way of control lines 645, which can represent any wired or wireless communication link that can serve a particular implementation. Thus, the robotic medical system 605, the user control system 610, and the auxiliary system 615 can each include one or more wired or wireless communication interfaces, such as one or more local area network interfaces, Wi-Fi network interfaces, cellular interfaces, etc. It is to be understood that the surgical system 600 can include other or additional components or elements that can be needed or considered desirable to have for the medical session for which the surgical system is being used.

[00144] FIG. 7 depicts an example block diagram of an example computer system 700 is shown, in accordance with some embodiments. The computer system 700 can be any computing device used herein and can include or be used to implement a data processing system or its components. The computer system 700 includes at least one bus 705 or other communication component or interface for communicating information between various elements of the computer system. The computer system further includes at least one processor 710 or processing circuit coupled to the bus 705 for processing information. The computer system 700 also includes at least one main memory 715, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 705 for storing information, and instructions to be executed by the processor 710. The main memory 715 can be used for storing information during execution of instructions by the processor 710. The computer system 700 can further include at least one read only memory (ROM) 720 or other static storage device coupled to the bus 705 for storing static information and instructions for the processor 710. A storage device 725, such as a solid-state device, magnetic disk or optical disk, can be coupled to the bus 705 to persistently store information and instructions.

[00145] The computer system 700 can be coupled via the bus 705 to a display 730, such as a liquid crystal display, or active-matrix display, for displaying information. An input device 735, such as a keyboard or voice interface can be coupled to the bus 705 for communicating information and commands to the processor 710. The input device 735 can include a touch screen display (e.g., the display 730). The input device 735 can also include a cursor control, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 710 and for controlling cursor movement on the display 730.

[00146] The processes, systems and methods described herein can be implemented by the computer system 700 in response to the processor 710 executing an arrangement of instructions contained in the main memory 715. Such instructions can be read into the main memory 715 from another computer-readable medium, such as the storage device 725. Execution of the arrangement of instructions contained in the main memory 715 causes the computer system 700 to perform the illustrative processes described herein. One or more processors in a multiprocessing arrangement can also be employed to execute the instructions contained in the main memory 715. Hard-wired circuitry can be used in place of or in combination with software instructions together with the systems and methods described herein. Systems and methods described herein are not limited to any specific combination of hardware circuitry and software.

[00147] Although an example computing system has been described in FIG. 7, the subject matter including the operations described in this specification can be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.

[00148] The herein described subject matter sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are illustrative, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being “operably couplable,” to each other to achieve the desired functionality. Specific examples of operably couplable include but are not limited to physically mateable or physically interacting components or wirelessly interactable or wirelessly interacting components or logically interacting or logically interactable components.

[00149] With respect to the use of plural or singular terms herein, those having skill in the art can translate from the plural to the singular or from the singular to the plural as is appropriate to the context or application. The various singular/plural permutations can be expressly set forth herein for sake of clarity.

[00150] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). [00151] Although the figures and description can illustrate a specific order of method steps, the order of such steps can differ from what is depicted and described, unless specified differently above. Also, two or more steps can be performed concurrently or with partial concurrence, unless specified differently above. Such variation can depend, for example, on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations of the described methods can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.

[00152] It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation, no such intent is present. For example, as an aid to understanding, the following appended claims can contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to inventions containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” or “an” should typically be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, typically means at least two recitations, or two or more recitations).

[00153] Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general, such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”

[00154] Further, unless otherwise noted, the use of the words “approximate,” “about,” “around,” “substantially,” etc., mean plus or minus ten percent.

[00155] The foregoing description of illustrative implementations has been presented for purposes of illustration and of description. It is not intended to be exhaustive or limiting with respect to the precise form disclosed, and modifications and variations are possible in light of the above teachings or can be acquired from practice of the disclosed implementations. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

Claims

CLAIMS What is claimed is:

1. A system, comprising: one or more processors, coupled with memory, to: identify, from a data stream of a medical procedure with a robotic medical system, a movement of an instrument used in the medical procedure; identify, using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure; detect, based at least on a comparison of the movement of the instrument and the pattern of motion of the anatomical structure, an interaction between the instrument and the anatomical structure; and provide, via an interface, an indication of the interaction.

2. The system of claim 1, comprising the one or more processors to: determine a type of the interaction using a model trained using machine learning and based at least on the comparison of the movement of the instrument and the pattern of motion of the anatomical structure; and provide, via the interface, an indication of the type of interaction.

3. The system of claim 1, comprising the one or more processors to: determine a metric indicative of a degree of the interaction; and provide, via an interface, an indication of the metric.

4. The system of claim 1, comprising the one or more processors to: identify, based at least on a plurality of frames of a video stream, a type of the instrument used in the medical procedure; identify, based at least on the plurality of frames, a type of the anatomical structure; and detect, based at least on the type of the instrument and the type of the anatomical structure, a type of the interaction.

5. The system of claim 1, comprising the one or more processors to: identify, from the data stream, kinematics data indicative of the movement of the instrument, and video stream data of the anatomical structure; identify one or more machine learning models having one or more spatial attention mechanisms and one or more temporal attention mechanisms trained on a dataset of a plurality of interactions between a plurality of instruments and a plurality of anatomical structures in a plurality of medical procedures; and detect the interaction based at least on the kinematics data and the video stream data applied to the one or more spatial attention mechanisms and the one or more temporal attention mechanisms.

6. The system of claim 1, comprising the one or more processors to: determine, based at least on a movement of a portion of the instrument and a pattern of motion of a portion of the anatomical structure, that a level of consistency of the movement and the pattern of motion exceeds a threshold; and detect, based at least on the determination, the interaction.

7. The system of claim 1, comprising the one or more processors to: determine, using a plurality of frames of a video stream input into an encoder of a machine learning model, a plurality of anatomical features indicative of the anatomical structure; and detect the anatomical structure based at least on the plurality of anatomical features.

8. The system of claim 1, comprising the one or more processors to: determine, using a first time stamp of a kinematics data of the data stream indicative of a movement of the instrument and a second time stamp of a force data indicative of a force corresponding to the instrument, the movement of the instrument over a time period; determine, using a third time stamp of a plurality of frames of the data stream, the pattern of motion over the time period; and detect the interaction based at least on a correlation of the movement of the instrument and the pattern of motion during the time period.

9. The system of claim 1, comprising the one or more processors to: identify one or more machine learning models having a temporal attention mechanism trained on timing of actions captured by a plurality of video streams of a plurality of medical procedures; and determine, based at least on a plurality of frames of the plurality of video streams and kinematics data on movement of the instrument applied to the temporal attention mechanism, one or more locations of the instrument over the plurality of frames.

10. The system of claim 1, comprising the one or more processors to: identify one or more machine learning models having a spatial attention mechanism trained on spatial arrangement of a plurality of instruments and a plurality of anatomical structures captured by a plurality of video streams of a plurality of medical procedures; and determine, based at least on a plurality of frames of the data stream applied to the spatial attention mechanism, the movement of the instrument and the pattern of motion of the anatomical structure.

11. The system of claim 1, comprising the one or more processors to: identify a time period corresponding to a plurality of frames of the data stream; determine, based at least on kinematics data of the instrument and the plurality of frames, a plurality of locations of the instrument corresponding to the time period; and determine, based at least on the plurality of locations, a velocity of the instrument during the time period.

12. The system of claim 11, comprising the one or more processors to: determine, based at least on the velocity of the instrument and the interaction between the instrument and the anatomical structure a performance metric of a task corresponding to the interaction; and provide for display the performance metric overlaid over the plurality of frames.

13. The system of claim 1, comprising the one or more processors to: identify, using machine learning, a plurality of locations of the instrument within a time period corresponding to a plurality of frames of the data stream; and determine, based at least on the plurality of locations, one or more velocities of one or more objects within the time period.

14. The system of claim 1, comprising the one or more processors to: identify, using a plurality of frames of the data stream and one or more machine learning (ML) models, a first one or more vectors corresponding to the movement of the instrument; identify, using the plurality of frames and the one or more ML models, a second one or more vectors corresponding to the pattern of motion of the anatomical structure; and detect the interaction between the instrument and the anatomical structure based at least on the comparison of the first one or more vectors and the second one or more vectors.

15. A method, comprising: identifying, by one or more processors coupled with memory from a data stream of a medical procedure implemented using a robotic medical system, a movement of an instrument used in the medical procedure over a plurality of frames of the data stream; identifying, by the one or more processors using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure; comparing, by the one or more processors, the movement of the instrument and the pattern of motion of the anatomical structure; detecting, by the one or more processors based at least on the comparison, an interaction between the instrument and the anatomical structure; and providing, by the one or more processors via an interface, an indication of the interaction.

16. The method of claim 15, comprising: determining, by the one or more processors, a type of the interaction using a model trained using machine learning and based at least on the comparison of the movement of the instrument and the pattern of motion of the anatomical structure; and providing, by the one or more processors, via the interface, an indication of the type of interaction.

17. The method of claim 15, comprising: determining, by the one or more processors, a metric indicative of a degree of the interaction; and providing, by the one or more processors, via an interface, an indication of the metric.

18. The method of claim 15, comprising: identifying, by the one or more processors, based at least on the plurality of frames of a video stream, a type of the instrument used in the medical procedure; identifying, by the one or more processors, based at least on the plurality of frames, a type of the anatomical structure; and detecting, by the one or more processors, based at least on the type of the instrument and the type of the anatomical structure, a type of the interaction.

19. The method of claim 15, comprising: identifying, by the one or more processors from the data stream, kinematics data indicative of the movement of the instrument and video stream data of the anatomical structure; identifying, by the one or more processors, one or more machine learning (ML) models having one or more spatial attention mechanisms and one or more temporal attention mechanisms trained on a dataset of a plurality of interactions between a plurality of instruments and a plurality of anatomical structures in a plurality of medical procedures; and detecting, by the one or more processors, the interaction based at least on the kinematics data and the video stream data applied to the one or more spatial attention mechanisms and the one or more temporal attention mechanisms.

20. A non-transitory computer readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to: identify, from a data stream of a medical procedure with a robotic medical system, a movement of an instrument used in the medical procedure over a plurality of frames of the data stream; identify, using the data stream, a pattern of motion of an anatomical structure over at least a portion of the medical procedure; detect, based at least on a comparison of the movement of the instrument and the pattern of motion of the anatomical structure, an interaction between the instrument and the anatomical structure; and provide, via an interface, an indication of the interaction.