
WO2025226595A1 - Generation of user-specific captions of video via multimodal data for medical procedures - Google Patents

Generation of user-specific captions of video via multimodal data for medical procedures

Info

Publication number
WO2025226595A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
medical procedure
caption
segment
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/025606
Other languages
English (en)
Inventor
Omid MOHARERI
Muhammad Abdullah Jamal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intuitive Surgical Operations Inc
Original Assignee
Intuitive Surgical Operations Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intuitive Surgical Operations Inc filed Critical Intuitive Surgical Operations Inc
Publication of WO2025226595A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images

Definitions

  • the present implementations relate generally to medical devices, including but not limited to generation of user-specific captions of video via multimodal data for medical procedures.
  • captions for medical procedure videos can reduce or even eliminate the need for live in-person classroom presentations of medical procedure videos for training purposes by enabling medical staff being trained to consume such medical procedure videos in accordance with their own schedules.
  • medical procedure videos depict complicated scenes in which complex tasks and activities occur concurrently.
  • the training needs of medical staff in different roles are drastically different.
  • a bed-side assist staff member of a robotically-assisted surgical operation team needs to be trained on setup and docking of the robotically-assisted surgical system.
  • a surgeon of the same team needs to be trained on performing operations using the robotically-assisted surgical system.
  • entirely different captions need to be generated to train different staff members having different roles.
  • different captions may need to be generated for medical personnel having different skill levels.
  • captions can be generated using multi-modal data of the medical procedure.
  • the multi-modal data can include video data and non-video data relating to the medical procedure.
  • the video data can include endoscopic video data captured by an endoscopic imaging device (e.g., an endoscope supported by or attached to a robotic surgical system) and/or medical environment videos captured of the medical environment during the medical procedure.
  • automatic caption generation for medical procedures that is tailored to specific viewers can be readily used in medical training contexts to generate captions that explain and provide contextual information of the tasks and activities being performed during the medical procedure videos.
  • any medical procedure being performed can be turned into training material for consumption by medical staff members or other personnel for training purposes.
  • automatic caption generation can be deployed in contexts outside of medical training. Captions for medical procedures can be generated to formulate reports for various purposes. As one example, captions generated with a caption viewer input specifying a role of billing personnel can be used to formulate a billing or insurance report that provides information needed to submit a billing or insurance claim. As another example, captions generated with a caption viewer input specifying a role of a hospital administrator can be used to formulate reports detailing the efficiency of the activities and tasks being performed during the medical procedure.
  • Automatic caption generation can also be performed live during a medical procedure. Such live caption generation can be useful in both medical training and non-training contexts.
  • live captions can be generated for consumption by trainees, and different live captions can be automatically generated for different trainees (e.g., based on role, skill level, language preference, etc.).
  • live captions of a medical training session can be provided for a proctor of a medical training session.
  • Live captions can also be used in non-training contexts. For instance, a new medical staff member entering the medical environment in which a medical procedure is being performed can be provided automatically generated captions summarizing the medical procedure (or portions thereof) that are relevant to the role of the new medical staff member.
  • the system includes or implements a descriptor generation model (or a descriptor generation layer) and a translation model (or a translation layer).
  • the descriptor generation model can generate a plurality of descriptors.
  • the plurality of descriptors generated for the particular segment can provide a comprehensive set of descriptors that are descriptive and/or representative of, for example, the tasks, events, and/or analytics information relevant to the particular segment.
  • the plurality of descriptors can be descriptive and/or representative of medical environment tasks being performed in the medical environment by medical staff during the particular segment, surgical tasks being performed by the surgeon during the particular segment, system events or data relating to the computer-assisted medical system during the particular segment, surgical analytics information relevant to the particular segment, operating room analytics information relevant to the particular segment, and the like.
  • multiple tasks may be performed concurrently by different members of the medical staff during the particular segment and the plurality of descriptors generated for the particular segment can include corresponding descriptors that are descriptive and/or representative of each of the multiple tasks being performed during the segment.
  • the descriptors can be in the form of text or natural language descriptors or can be generated in the form of machine learning embeddings or vectors.
  • the translation model can generate natural language captions for the particular segment based on the plurality of descriptors and a viewer role input. In this manner, the caption generated for the particular segment can be specifically tailored to one or more of: the role of the viewer, the skill level of the viewer, and/or the language preference of the viewer.
  • the descriptor generation model receives as input video data (e.g., endoscopic video(s) and/or medical environment video(s)) to generate the plurality of descriptors that are descriptive and/or representative of the scene(s) depicted during the video(s) of the particular segment of the medical procedure.
  • the translation model receives as input the plurality of descriptors, other modalities of data associated with the medical procedure (e.g., robotic system data generated by a computer-assisted medical system, analytics information associated with the medical procedure, etc.), and the caption viewer input to generate a natural language caption of the particular segment of the medical procedure that is specifically tailored for the viewer of the caption.
  • the descriptor generation model receives as input multi-modal data associated with the medical procedure to generate the plurality of descriptors that are descriptive and/or representative of the particular segment of the medical procedure. At least some of the plurality of descriptors can be descriptive and/or representative of scene(s) depicted in the video(s) of the medical procedure during the particular segment (e.g., endoscopic video(s) and/or medical environment video(s)). Furthermore, some of the plurality of descriptors of the particular segment can be descriptive and/or representative of contextual or other information relevant to the particular segment that is not readily discernible from the video(s) of the particular segment.
  • the translation model receives as input the plurality of descriptors and the caption viewer input to generate a natural language caption of the particular segment of the medical procedure that is specifically tailored for the viewer of the caption.
  • a system is configured to receive multi-modal data of a medical procedure, the multi-modal data comprising video data captured during the medical procedure and system data generated by a computer-assisted medical system, identify, based at least in part on the multi-modal data of the medical procedure, a first segment of the medical procedure, generate, by a descriptor generation model based at least in part on the multi-modal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure, and generate, by a translation model based at least in part on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
  • a method includes receiving multi-modal data of a medical procedure, the multi-modal data comprising video data captured during the medical procedure and system data generated by a computer-assisted medical system, identifying, based at least in part on the multi-modal data of the medical procedure, a first segment of the medical procedure, generating, by a descriptor generation model based at least in part on the multimodal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure, and generating, by a translation model based at least in part on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
  • At least one non-transitory computer readable medium comprising one or more instructions stored thereon and executable by a processor to receive multi-modal data of a medical procedure, the multi-modal data comprising video data captured during the medical procedure and system data generated by a computer-assisted medical system, identify, based at least in part on the multi-modal data of the medical procedure, a first segment of the medical procedure, generate, via a descriptor generation model based at least in part on the multi-modal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure, and generate, via a translation model based at least in part on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
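  • As a rough illustration of the flow summarized above (receive multi-modal data, identify a first segment, generate descriptors, then generate a viewer-tailored caption), the following Python sketch wires together hypothetical stand-ins for the descriptor generation model and the translation model; all class, field, and function names are illustrative assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class MultiModalData:
    # Video data captured during the medical procedure plus system data
    # generated by a computer-assisted medical system; field names are
    # illustrative only.
    video_frames: List[Dict[str, Any]]
    system_data: List[Dict[str, Any]]

@dataclass
class CaptionViewerInput:
    role: str             # e.g., "surgeon" or "OR staff"
    skill_level: str      # e.g., "novice" or "expert"
    language: str = "en"  # preferred caption language

class DescriptorGenerationModel:
    def generate_descriptors(self, data: MultiModalData, segment: Dict[str, Any]) -> List[str]:
        # Placeholder: a real model would analyze the segment's video and
        # system data and return text descriptors or embeddings.
        return [f"descriptor for {segment['label']}"]

class TranslationModel:
    def generate_caption(self, descriptors: List[str], viewer: CaptionViewerInput) -> str:
        # Placeholder: a real model (e.g., an LLM) would condition on both
        # the descriptors and the caption viewer input.
        return f"[{viewer.role}/{viewer.skill_level}] " + "; ".join(descriptors)

def caption_first_segment(data: MultiModalData, viewer: CaptionViewerInput) -> str:
    # Identify a first segment of the medical procedure from the multi-modal
    # data (here, trivially, the first labeled span of system data).
    segment = {"label": data.system_data[0].get("task", "unknown task")}
    # Generate a plurality of descriptors for that segment.
    descriptors = DescriptorGenerationModel().generate_descriptors(data, segment)
    # Generate a caption tailored to the viewer from descriptors + viewer input.
    return TranslationModel().generate_caption(descriptors, viewer)
```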
  • FIG. 1A depicts an example architecture of a system according to some arrangements of this disclosure.
  • FIG. 1B depicts an example environment of a system according to some arrangements of this disclosure.
  • FIG. 2 depicts an example computer system according to some arrangements of this disclosure.
  • FIG. 3 depicts an example layer model architecture according to some arrangements of this disclosure.
  • FIG. 4 depicts an example caption transformer architecture according to some arrangements of this disclosure.
  • FIG. 5A depicts an example user interface presenting scene captions according to some arrangements of this disclosure.
  • FIG. 5B depicts an example user interface presenting surgeon captions according to some arrangements of this disclosure.
  • FIG. 5C depicts an example user interface presenting medical staff captions according to some arrangements of this disclosure.
  • FIG. 6 depicts an example method of generation of user-specific captions of video via multimodal data for medical procedures according to some arrangements of this disclosure.
  • FIG. 7 depicts an example method of generation of user-specific captions of video via multimodal data for medical procedures according to some arrangements of this disclosure.
  • a system can output a set of captions that is tailored to a given viewer role.
  • a language model can receive one or more captions, robotic system data, and an indication of a viewer role, to generate the set of captions for the viewer role.
  • a viewer role can be surgeon-specific, OR staff-specific, customer trainer-specific or consultant-specific, clinical engineer-specific, or any combination thereof.
  • a system can provide a prompt to a user to request a viewer role or a characteristic of the user associated with a viewer role (e.g., identify user as a surgeon, nurse, or OR staff member).
  • a caption is a natural language description of a portion or an entirety of media such as videos and images.
  • the caption essentially provides a textual description of the content and context of the media, and thus can be used to provide insights by summarizing and improving a viewer's understanding of the media.
  • a caption can describe a scene captured in the media, objects (e.g., a patient, a medical team member, a technician, a surgical tool or instrument, a robotic system, and so on) in the scene (e.g., the environment 100B), relationships among the objects, actions and interactions depicted in the scene, and so on.
  • the caption can also provide contextual or background information that may not be obvious by observing or analyzing the scene itself.
  • captions can be presented (e.g., visually, audibly, etc.) concurrently with the media for consumption by a viewer of the media.
  • captions can be presented independently of the media to provide, for example, a summary or report of the media to be provided for downstream consumption separate from the provision of the media itself.
  • this technical solution can provide captions for different medical team members (e.g., surgeon, bed-side staff, OR staff) who can perform multiple different actions during various phases of medical procedures, by leveraging the multi-modal data and a translation model for captioning.
  • a system can generate one or more captions for each activity detected in each data modality, and select one or more captions that describe activities linked with a given viewer role.
  • a system generates a first caption for activities done by the surgeon, a second caption for bedside staff, and a third caption for OR staff.
  • the system can augment one or more of the first, second and third captions with additional multi-modal data such as the non-video data, surgeon skill data, or workflow efficiency, or any combination thereof, but is not limited thereto.
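  • A minimal sketch of this per-role selection and augmentation step follows; the role keys, example captions, and augmentation fields are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical per-activity captions keyed by the role they describe.
activity_captions: Dict[str, List[str]] = {
    "surgeon": ["Surgeon dissects the tissue plane with monopolar scissors."],
    "bedside_staff": ["Bedside assistant exchanges the instrument on arm 3."],
    "or_staff": ["OR staff repositions the patient cart for docking."],
}

def captions_for_viewer(viewer_role: str, extra_context: Dict[str, str]) -> List[str]:
    """Select captions linked to the viewer role and append non-video context
    (e.g., skill or workflow-efficiency notes) as additional sentences."""
    selected = list(activity_captions.get(viewer_role, []))
    for key, value in extra_context.items():
        selected.append(f"{key.replace('_', ' ').capitalize()}: {value}")
    return selected

# Example: augment surgeon-facing captions with workflow-efficiency data.
print(captions_for_viewer("surgeon", {"workflow_efficiency": "docking took 4 minutes"}))
```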
  • FIG. 1A depicts an example architecture of a system according to this disclosure.
  • an architecture of a system 100A can include at least a data processing system 110, a communication bus 120, and robotic systems 130.
  • the data processing system 110 can include a computer system operatively coupled or that can be coupled with one or more components of the system 100A, either directly or indirectly through an intermediate computing device, system, connection, or network.
  • the data processing system 110 can include a virtual computing system, an operating system, and a communication bus to effect communication and processing.
  • the data processing system 110 can include a system processor 112 and a system memory 114.
  • the system processor 112 can execute one or more instructions associated with the system processor 112.
  • the system processor 112 can include an electronic processor, an integrated circuit, or the like including one or more of digital logic, analog logic, digital sensors, analog sensors, communication buses, volatile memory, nonvolatile memory, and the like.
  • the system processor 112 can include, but is not limited to, at least one microcontroller unit (MCU), microprocessor unit (MPU), central processing unit (CPU), graphics processing unit (GPU), physics processing unit (PPU), embedded controller (EC), or the like.
  • the system processor 112 can include a memory operable to store or storing one or more instructions for operating components of the system processor 112 and operating components operably coupled to the system processor 112.
  • the one or more instructions can include at least one of firmware, software, hardware, operating systems, embedded operating systems, and the like.
  • the system processor 112 can include at least one communication bus controller to effect communication between the system processor 112 and the other elements of the system 100A.
  • the system memory 114 can store data associated with the data processing system 110.
  • the system memory 114 can include one or more hardware memory devices to store binary data, digital data, or the like.
  • the system memory 114 can include one or more electrical components, electronic components, programmable electronic components, reprogrammable electronic components, integrated circuits, semiconductor devices, flip-flops, arithmetic units, or the like.
  • the system memory 114 can include at least one of a non-volatile memory device, a solid-state memory device, a flash memory device, or a NAND memory device.
  • the system memory 114 can include one or more addressable memory regions disposed on one or more physical memory arrays.
  • a physical memory array can include a NAND gate array disposed on, for example, at least one of a particular semiconductor device, integrated circuit device, and printed circuit board device.
  • the system memory 114 can correspond to a non-transitory computer-readable medium as discussed herein.
  • the non-transitory computer readable medium can include one or more instructions executable by the system processor 112.
  • the system processor 112 can filter, by the second machine learning model based on the input and the indication of the role, the caption data into the set of the one or more captions.
  • the network 120 can communicatively couple the data processing system 110 with the robotic systems 130.
  • the network 120 can communicate one or more instructions, signals, conditions, states, or the like between the data processing system 110 and the robotic systems 130.
  • the communication bus 120 can include one or more digital, analog, or like communication channels, lines, traces, or the like. As an example, the communication bus 120 can include at least one serial or parallel communication line among multiple communication lines of a communication interface.
  • the network 120 can include one or more wireless communication devices, systems, protocols, interfaces, or the like.
  • the network 120 can include one or more logical or electronic devices, including but not limited to integrated circuits, logic gates, flip-flops, gate arrays, programmable gate arrays, and the like.
  • the network 120 can include one or more telecommunication devices, including but not limited to antennas, transceivers, packetizers, and wired interface ports.
  • Each robotic system 130 can include one or more robotic devices configured to perform one or more actions of a medical procedure (e.g., a surgical procedure).
  • a robotic device can include, but is not limited to, a surgical device that can be manipulated by a robotic device.
  • a surgical device can include, but is not limited to, a scalpel or a cauterizing tool.
  • the robotic system 130 can include various motors, actuators, or electronic devices whose position or configuration can be modified according to input at one or more robotic interfaces.
  • a robotic interface can include a manipulator with one or more levers, buttons, or grasping controls that can be manipulated by pressure or gestures from one or more hands, arms, fingers, or feet.
  • the robotic system 130 can include a surgeon console in which the surgeon can be positioned (e.g., standing or seated) to operate the robotic system 130.
  • the robotic system 130 is not limited to a surgeon console co-located or on-site with the robotic system 130.
  • FIG. IB depicts an example environment of a system according to this disclosure.
  • an environment 100B in which a system 100A can be deployed can include at least the robotic system 130, a first sensor system 140, a second sensor system 150, persons 160, and objects 170.
  • the robotic system 130 can include one or more sensors located at one or more corresponding locations on the robotic system 130.
  • the one or more sensors can be arranged either fixedly or movably in one or more orientations relative to the environment 100B.
  • such a sensor can include a visual sensor (e.g., a camera) mounted on the robotic system 130 to have a particular orientation of the environment 100B, which corresponds to the field of view 132.
  • the visual sensor is operable to capture one or more images in a visible light spectrum within the field of view 132.
  • the environment 100B is illustrated by way of example as a plan view of an OR having the robotic system 130, the first sensor system 140, the second sensor system 150, the persons 160, and the objects 170 disposed therein or thereabout.
  • the presence, placement, orientation, and configuration, for example, of one or more of the robotic system 130, the first sensor system 140, the second sensor system 150, the persons 160, and the objects 170 can correspond to a given medical procedure or given type of medical procedure that is being performed, is to be performed, or can be performed in the OR corresponding to the environment 100B.
  • the field of view 132 of the robotic system 130 can correspond to a physical volume within the environment 100B that is within the range of detection of one or more sensors of the robotic system 130.
  • the field of view 132 is positioned above a surgical site of a patient.
  • the field of view 132 is oriented toward a surgical site of a patient.
  • the first sensor system 140 can include one or more sensors oriented to a first portion of the environment 100B.
  • the first sensor system 140 can include one or more cameras configured to capture images or video in visual or near-visual spectra and/or one or more depth-acquiring sensors for capturing depth data (e.g., three-dimensional point cloud data).
  • the first sensor system 140 can include a plurality of cameras configured to collectively capture images or video in a stereoscopic view.
  • the first sensor system 140 can include a plurality of cameras configured to collectively capture images or video in a panoramic view.
  • the first sensor system 140 can include a field of view 142.
  • the field of view 142 can correspond to a physical volume within the environment 100B that is within the range of detection of one or more sensors of the first sensor system 140.
  • the field of view 142 is oriented toward a surgical site of a patient.
  • the field of view 152 is located behind a surgeon at the surgical site of a patient.
  • the second sensor system 150 can include one or more sensors oriented to a second portion of the environment 100B.
  • the second sensor system 150 can include one or more cameras configured to capture images or video in visual or near-visual spectra and/or one or more depth-acquiring sensors for capturing depth data (e.g., three-dimensional point cloud data).
  • the second sensor system 150 can include a plurality of cameras configured to collectively capture images or video in a stereoscopic view.
  • the second sensor system 150 can include a plurality of cameras configured to collectively capture images or video in a panoramic view.
  • the second sensor system 150 can include a field of view 152.
  • the field of view 152 can correspond to a physical volume within the environment 100B that is within the range of detection of one or more sensors of the second sensor system 150.
  • the field of view 152 is oriented toward the robotic system 130.
  • the field of view 152 is located adjacent to the robotic system 130.
  • the persons 160 can include one or more individuals present in the environment 100B.
  • the persons can include, but are not limited to, assisting surgeons, supervising surgeons, specialists, nurses, or any combination thereof.
  • the objects 170 can include, but are not limited to, one or more pieces of furniture, instruments, or any combination thereof.
  • the objects 170 can include tables and surgical instruments.
  • FIG. 2 depicts an example computer system 200 according to this disclosure.
  • a computer system 200 can include at least a case data storage 210, a descriptor generation model 220, a caption data storage 230, a translation model 240, and a caption output data storage 250.
  • the computer system 200 can obtain (e.g., receive) via a suitable user interface a caption viewer input 228.
  • the computer system 200 can be realized using the system 100 A.
  • one or more of the components, models, or processors can be implemented by the system processor 112.
  • one or more of the case data storage 210, the caption data storage 230, and the caption outputs storage 250 can be implemented by the system memory 114.
  • the processes, methods, and algorithms described relative to the computer system 200 can be at least partially embodied as one or more machine-readable instructions stored at the system memory 114.
  • the computer system 200 can generate customized captions from multi-modal medical procedure data based on attributes of the viewer. For example, the computer system 200 can identify a viewer role requesting to view a medical procedure according to one or more aspects of the viewer. For example, the computer system 200 can identify a viewer role as a “surgeon” or a “nurse” according to various metrics associated with the viewer, but is not limited thereto. For example, a viewer role can include a skill level of the viewer.
  • the computer system 200 can identify a viewer role as a “skilled surgeon,” a “novice surgeon,” an “expert surgeon,” a “skilled nurse,” a “novice nurse,” or an “expert nurse,” according to various metrics associated with the viewer, but is not limited thereto.
  • the computer system 200 can identify a preferred language (e.g., English, Spanish, French, etc.) of the reviewer.
  • the computer system 200 can generate or select one or more captions associated with a medical procedure that correspond to the viewer role, skill level, and preferred language. That is, the caption is generated to be most relevant to the role and skill level of the viewer, and in the preferred language of the viewer.
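  • A short sketch of how a caption viewer input combining role, skill level, and language preference might be represented is shown below; the field names and example values are assumptions.

```python
from typing import NamedTuple

class ViewerProfile(NamedTuple):
    role: str         # e.g., "surgeon" or "nurse"
    skill_level: str  # e.g., "novice", "skilled", or "expert"
    language: str     # e.g., "en", "es", or "fr"

def viewer_label(profile: ViewerProfile) -> str:
    # Combine skill level and role into a single label such as
    # "novice surgeon" or "expert nurse".
    return f"{profile.skill_level} {profile.role}"

profile = ViewerProfile(role="surgeon", skill_level="novice", language="es")
print(viewer_label(profile))  # -> "novice surgeon"
```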
  • example roles of a viewer include a surgeon (e.g., a surgeon operating a robotically-assisted surgical system such as the robotic systems 130), an anesthesiologist, an operating room staff member (e.g., a staff member in the operating room responsible for setting up, docking, and monitoring the robotically-assisted surgical system), a hospital administrator, medical billing personnel, a clinical engineer (e.g., an engineer responsible for troubleshooting medical equipment such as the robotically-assisted surgical system), training personnel, a consultant, a student, and so on.
  • the computer system 200 can generate customized and different captions from multi-modal medical procedure data for different roles of the medical staff. For example, the computer system 200 can generate, based on a set of at least one video, a first caption for the role of a bed-side assist staff member (with a specific skill level) of a robotically-assisted surgical operation team, where the first caption includes or is related to setup and docking of the robotically-assisted surgical system.
  • the computer system 200 can generate, based on the same set of at least one video, a second caption for a surgeon (with a specific skill level) of the same robotically-assisted surgical operation team, where the second caption includes or is related to performing operations using the robotically-assisted surgical system. Therefore, based on a particular training video of the same medical team, entirely different captions can be generated for different viewer roles (including for different skill levels).
  • the case data storage 210 can obtain (e.g., receive) and store multi-modal data having a variety of sources or formats.
  • the multi-modal data include video data depicting one or more medical procedures from one or more viewpoints associated with corresponding medical procedures.
  • the video data can include still images or frames of videos that depict at least a portion of a medical procedure, medical environment, or patient site from a given viewpoint.
  • visual sensors (e.g., cameras) of the sensor systems 140 and 150 can be used to generate videos captured in the medical environment during the medical procedure.
  • the video data can include endoscopic video data captured by an endoscopic imaging device (e.g., an endoscope such as a laparoscopic endoscope supported by or attached to a robotic surgical system or a robotically-assisted surgical system).
  • the video data can further include egocentric video data captured using visual sensors (e.g., cameras) located on (e.g., worn by) the medical team member (e.g., a surgeon, nurse, or technician) who is performing the medical procedure in the medical environment.
  • the multi-modal data also includes non-video data which can provide additional information such as contextual or background information.
  • the non-video data includes or can be converted into strings of characters or natural language.
  • Examples of the non-video data can include robotic system data generated by computer-assisted medical systems (e.g., robotically-assisted medical or surgical systems), analytics information (e.g., performance indicators of the medical procedure), and depth or three-dimensional point cloud data captured by depth sensors in the medical environment.
  • the depth or three- dimensional point cloud data can be captured by the depth sensors in the first sensor system 140 and/or the second sensor system 150.
  • the robotic system data (or system data), such as robotic system data 232, includes kinematics data of a robotic system 130, system events data of the robotic system 130, input received by the console of the robotic system 130 from a user, and timestamps associated therewith.
  • the robotic system data of a robotic system 130 can be generated by the robotic system 130 (e.g., in the form of a robotic system log) in its normal course of operations.
  • the kinematics data can indicate configuration(s) of one or more manipulators or manipulator assemblies of the robotic system 130 over time throughout the medical procedure.
  • the system events data can be generated by the robotic system 130 and can indicate system events of the robotic system 130.
  • Examples of system events can include, for example, a docking event (e.g., in which manipulator arms are docked to cannulas inserted into a patient anatomy), an operator (e.g., surgeon) head-in or head-out event (e.g., indicating a surgeon’s head being present or absent at a viewer on an input or control console of the robotic system), an instrument attachment or removal event (e.g., indicating attachment or removal of an instrument, such as a medical instrument or an imaging instrument, on a manipulator of the robotic system), an instrument change or tool exchange event (e.g., indicating performance of an exchange of one instrument for another instrument for attachment on a manipulator of the robotic system), a draping-start event or a sterile adapter attachment event (e.g., which may indicate the beginning of a sterile draping process), and the like.
  • Such robotic system data can be in natural language (e.g., a string of characters).
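  • As a hedged illustration, the sketch below renders a hypothetical robotic system log entry as a natural-language string; the record fields and wording are assumptions, not the actual log format of any robotic system.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class SystemEvent:
    # Hypothetical record mirroring an entry of a robotic system log.
    name: str            # e.g., "docking", "head_in", "instrument_attach"
    arm: int             # manipulator arm involved
    offset: timedelta    # time offset from the start of the procedure

def event_to_text(event: SystemEvent) -> str:
    """Render a system event as a string of characters that can feed the
    caption-generation models."""
    minutes, seconds = divmod(int(event.offset.total_seconds()), 60)
    return (f"At {minutes:02d}:{seconds:02d}, event '{event.name}' "
            f"occurred on manipulator arm {event.arm}.")

print(event_to_text(SystemEvent("docking", arm=2, offset=timedelta(minutes=5, seconds=27))))
```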
  • the robotic system data can be indicative of one or more states of one or more components of the robotic system 130.
  • Components of the robotic system 130 can include, but are not limited to, actuators of the robotic system 130 as discussed herein.
  • the robotic system data can include one or more data points indicative of one or more of an activation state (e.g., activated or deactivated), a position, or orientation of a component of the robotic system 130.
  • the robotic system data can be linked with or correlated with one or more medical procedures, one or more phases of a given medical procedure, or one or more tasks of a given phase of a given medical procedure.
  • a type of robotic system data can be indicative of a position of one or more actuators of a given arm of the robotic system 130 at a given time or over a given time interval.
  • Such time interval can be associated with a given task or phase of a workflow as occurring during that task or phase.
  • the analytics information (e.g., performance indicators 234 of the medical procedure) is indicative of one or more actions during one or more medical procedures.
  • the performance indicators include metrics (e.g., a metric value or a range of metric values) determined via the workflow analytics using the multimodal data. The metrics are indicative of the spatial and temporal efficiency of the at least one medical procedure performed within the environment 100B (e.g., using at least one robotic system 130) for which the multimodal data is collected.
  • the metrics include an “efficiency” metric, “consistency” metric, “adverse event” metric, “case volume” metric, a “first case turnovers” metric, “delay” metric, “headcount to complete tasks” metric, “OR Traffic” metric, “room layout” metric, or a “modality conversion” metric.
  • the analytics information can be in natural language (e.g., a string of characters), in the format of a type or description of a metric (e.g., “efficiency metric”) followed by a value (e.g., “6”).
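  • The “metric description followed by a value” form mentioned above can be produced with a one-line formatter, sketched here with illustrative metric names and values.

```python
def metric_to_text(metric_name: str, value) -> str:
    # Render analytics information as natural language in the
    # "<metric description> <value>" form described above.
    return f"{metric_name} metric {value}"

analytics = {"efficiency": 6, "delay": -3, "headcount to complete tasks": 4}
descriptors = [metric_to_text(name, value) for name, value in analytics.items()]
# -> ["efficiency metric 6", "delay metric -3", "headcount to complete tasks metric 4"]
```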
  • the “efficiency” metric includes each of or combines two or more nonoperative metrics that measure temporal workflow efficiency in a medical environment for a duration of one or more temporal intervals such as periods, phases, and tasks. To combine two or more nonoperative metrics, such nonoperative metrics can be averaged, as a mean or median, over all cases collected from a team, a medical environment, hospital, or region.
  • the “consistency” metric includes each of or combines (e.g., is a sum, mean, or median of) the standard deviations of two or more nonoperative metrics across all cases collected from a team, a medical environment, or hospital.
  • the “adverse event” metric includes each of or combines (e.g., is a sum of) negative outliers of nonoperative metrics.
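  • A minimal sketch of how the “efficiency,” “consistency,” and “adverse event” metrics could be aggregated over cases, assuming hypothetical per-case nonoperative durations and an illustrative outlier threshold:

```python
import statistics

# Hypothetical per-case nonoperative durations (minutes) collected for a team.
turnover_minutes = [22, 25, 31, 19, 48, 27]
adverse_threshold = 40  # durations beyond this count as negative outliers

# "Efficiency": combine a nonoperative metric by averaging over all cases.
efficiency = statistics.mean(turnover_minutes)

# "Consistency": the standard deviation of the nonoperative metric across cases.
consistency = statistics.stdev(turnover_minutes)

# "Adverse event": count (or sum) of negative outliers of the metric.
adverse_events = sum(1 for m in turnover_minutes if m > adverse_threshold)

print(efficiency, consistency, adverse_events)
```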
  • the nonoperative metrics correspond to a length of time for a temporal interval in which non-operative, non-productive activities or idleness occurs.
  • the “case volume” metric includes the mean or median number of cases operated per medical environment, per day, for a team, medical environment, or hospital, normalized by the expected case volume for a typical medical environment.
  • a “first case turnovers” metric is a ratio of first cases in an operating day that were turned over compared to the total number of first cases captured from a team, medical environment, or hospital.
  • a more general “case turnovers” metric is the ratio of all cases that were turned over compared to the total number of cases as performed by a team, in a medical environment, or in a hospital.
  • a “delay” metric is a mean or median positive (behind a scheduled start time of an action) or negative (before a scheduled start time of an action) departure from a scheduled time in minutes for each case, normalized by the acceptable delay (e.g., a historical mean or median benchmark).
  • the negative or positive definition may be reversed (e.g., wherein starting late is instead negative and starting early is instead positive) if other contextual parameters are likewise adjusted.
  • the “headcount to complete tasks” metric combines the mean or median headcount (the largest number of detected personnel throughout the procedure in the OR at one time) over all cases collected for the team, medical environment, or hospital needed to complete each of the temporal nonoperative tasks for each case, normalized by the recommended headcount for each task (e.g., a historical benchmark median or mean).
  • An “OR Traffic” metric measures the mean amount of motion in the OR during each case, averaged (itself as a median or mean) over all cases collected for the team, theater, or hospital, normalized by the recommended amount of traffic (e.g., based upon a historical benchmark as described above).
  • this metric may receive (two or three-dimensional) optical flow, and convert such raw data to a single numerical value, e.g., an entropy representation, a mean magnitude, a median magnitude, etc.
  • the “room layout” scoring metric includes a ratio of robotic cases with multi-part rollups or roll-backs, normalized by the total number of robotic cases for the team, medical environment, or hospital. That is, ideally, each roll up or back of the robotic system would include a single motion. When, instead, the team member moves the robotic system back and forth, such a “multi-part” roll implies an inefficiency, and so the number of such multi-part rolls relative to all the roll up and roll back events may provide an indication of the proportion of inefficient attempts.
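  • A sketch of the “room layout” ratio under the assumption that each case is represented by the number of motions per roll-up/roll-back event:

```python
from typing import List

def room_layout_metric(roll_events_per_case: List[List[int]]) -> float:
    """Ratio of robotic cases whose roll-up or roll-back involved multiple
    motions ("multi-part" rolls) to the total number of robotic cases.
    The input layout is an illustrative assumption."""
    multi_part_cases = sum(
        1 for case in roll_events_per_case if any(motions > 1 for motions in case)
    )
    return multi_part_cases / len(roll_events_per_case) if roll_events_per_case else 0.0

# Three cases: the second needed a back-and-forth (2 motions) during roll-up.
print(room_layout_metric([[1, 1], [2, 1], [1]]))  # -> 0.333...
```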
  • At least one metric value or range of metric values can be determined for the entire medical procedure, for a period of the medical procedure, for a phase of the medical procedure, for a task of the medical procedure, for a surgeon, for a care team, for a medical staff, for a medical environment, and so on.
  • at least one metric value or range of metric values can be determined for temporal workflow efficiency, for a number of medical staff members, for time duration of each segment (e.g., phase or task) of the medical procedure, for motion, for room size and layout, for timeline, for non-operative periods or adverse events, and so on.
  • the metrics can be provided for each temporal segment (e.g., period, phase, task, and so on) of a medical procedure. Accordingly, for a given medical procedure, a metric value or a range of metric values can be provided for each of two or more multiple temporal segments (e.g., periods, phases, and tasks) of a medical procedure.
  • the descriptor generation model 220 can generate one or more text objects based on the multi-modal data of the case data storage 210.
  • the descriptor generation model 220 can correspond to a video-to-text processor including a machine learning model structured to receive the video data and the non-video data (e.g., robotic system data) as input, to generate output corresponding to or including text objects indicative of the non-video data and the video data corresponding to the non-video data.
  • the descriptor generation model 220 can include a video segmenter 222, a multimodal feature processor 224, and a text embedding processor 226.
  • the video segmenter 222 can divide the video data for a given medical procedure into one or more segments according to one or more metrics. For example, the video segmenter 222 can obtain the multi-modal data for a given medical procedure. The video segmenter 222 can determine, based on the robot system data, at least one of a workflow for the medical procedure, a phase of the workflow, or a task of the phase. The video segmenter 222 can correlate the workflow, task, or phase, or any combination thereof, to one or more portions of the video data for the medical procedure at one or more given times.
  • the video segmenter 222 can determine that Task A occurs between 00:05:27 (e.g., 0 hours, five minutes, and 27 seconds from the start of a medical procedure), and 00:54:01 of the medical procedure. In response, the video segmenter 222 can associate all frames of the video data for the given medical procedure with an identifier (e.g., metadata) indicative of Task A.
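  • The timestamp-based association described in this example can be sketched as follows; the interval values and helper names are assumptions.

```python
from datetime import timedelta
from typing import List

def hms(text: str) -> timedelta:
    """Parse an 'HH:MM:SS' offset from the start of the procedure."""
    h, m, s = (int(part) for part in text.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)

# Hypothetical task intervals derived from the robotic system data.
task_intervals = [("Task A", hms("00:05:27"), hms("00:54:01"))]

def tag_frame(frame_time: timedelta) -> List[str]:
    """Return the identifiers (metadata) of all tasks whose interval covers
    the frame's timestamp, so the frame can be associated with those tasks."""
    return [task for task, start, end in task_intervals if start <= frame_time <= end]

print(tag_frame(hms("00:10:00")))  # -> ['Task A']
print(tag_frame(hms("01:00:00")))  # -> []
```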
  • the multimodal feature processor 224 can generate one or more features based on one or more types or modalities of data and metrics of the case data storage 210.
  • the multimodal feature processor 224 can generate image features that identify one or more depictions in an image or across a plurality of images. Each image can, for example, be associated with a given task or phase of a workflow as occurring during that task or phase.
  • the depictions can include portions of a patient site, one or more medical instruments, or any combination thereof, but are not limited thereto.
  • the multimodal feature processor 224 can identify one or more edges, regions, or a structure within an image and associated with the depictions.
  • an edge can correspond to a line in an image that separates two depicted objects (e.g., a delineation between an instrument and a patient site).
  • the multimodal feature processor 224 can generate multimodal features for one or more of the non-video data, such as the robotic system data, the depth or three-dimensional point cloud data, or the analytics information (e.g., the metrics).
  • the multimodal feature processor 224 can generate multimodal features that combine the image features with one or more non-video features based on the non-video data.
  • combining image features with non-video features can include appending the features or generating hashes based on multiple types of features to generate multimodal features.
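  • Two simple combination strategies hinted at above (appending features, or hashing multiple feature types into one key) are sketched here with made-up feature values.

```python
import hashlib
from typing import List, Sequence

def append_features(image_features: Sequence[float],
                    non_video_features: Sequence[float]) -> List[float]:
    # Combine by appending non-video features to image features.
    return list(image_features) + list(non_video_features)

def hash_features(*feature_sets: Sequence[float]) -> str:
    # Alternative: derive a single hash over multiple feature types,
    # usable as a compact multimodal key.
    digest = hashlib.sha256()
    for features in feature_sets:
        digest.update(",".join(f"{value:.6f}" for value in features).encode())
    return digest.hexdigest()

multimodal = append_features([0.12, 0.80], [1.0, 0.0, 6.0])
key = hash_features([0.12, 0.80], [1.0, 0.0, 6.0])
```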
  • the multimodal feature processor 224 can obtain metrics determined using data of different modalities.
  • the different modalities of data can include, but are not limited to, data formats (e.g., text, video, binary), encodings, sources or types (e.g., operating room state data, endoscopic video data, robotic system data, or analytics information), or any combination thereof.
  • Examples of such metrics include surgeon performance metrics (e.g., a metric as described herein for a surgeon) and OR staff performance metrics (e.g., a metric as described herein for OR staff).
  • the text embedding processor 226 can receive one or more features from the multimodal feature processor 224, and can generate one or more text objects associated with one more frames of the video data.
  • the text embedding processor 226 can be structured to receive multimodal features based on the multi-modal data, and can generate text that is indicative of the non-video data based on the multimodal features.
  • the text embedding processor 226 can provide a technical solution of compatibility with multimodal features beyond image features, to provide a technical improvement to generate text objects (e.g., captions) for video data that are not limited to information depicted in the video data.
  • the text embedding processor 226 can receive one or more features from the multimodal feature processor 224 in the form of a data structure including one or more data elements that correspond to a given scene (e.g., a segment of a medical procedure such as a task or phase).
  • the text embedding processor 226 can receive a data structure corresponding to a triplet (e.g., a structured text object including, but not limited to a JSON object).
  • a triplet includes three data elements.
  • a triplet can include three data elements each identifying a respective one of an instrument of the robotic system, an action associated with the scene, and an anatomical feature implicated in the scene.
  • an anatomical feature implicated in the scene can include, but is not limited to, an anatomical structure visible in the scene, an anatomical structure located at a given position within an image or a video of the scene, an anatomical structure located at a given position or a given distance from a person in the scene, an anatomical structure located at a given position or a given distance from an instrument in the scene, or a weighted combination of any permutation of the above.
  • the triplet can have a structure of (INSTRUMENT, ACTION, ANATOMY).
  • the INSTRUMENT data element identifies the instrument of the robotic system.
  • the ACTION data element identifies an action associated with the scene.
  • the ANATOMY data element identifies the anatomical feature implicated in the scene.
  • Data structures other than a triplet, having any number of data elements, can likewise be implemented, where each data element of such a data structure can correspond to at least one metric as discussed herein.
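  • A sketch of the (INSTRUMENT, ACTION, ANATOMY) triplet as a structured text object follows; the concrete field values are illustrative.

```python
import json
from typing import NamedTuple

class Triplet(NamedTuple):
    # (INSTRUMENT, ACTION, ANATOMY) as described above.
    instrument: str
    action: str
    anatomy: str

triplet = Triplet(instrument="monopolar scissors",
                  action="dissecting",
                  anatomy="gallbladder bed")

# A structured text object (e.g., a JSON object) carrying the triplet.
print(json.dumps(triplet._asdict()))
```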
  • the caption data storage 230 can include text objects (e.g., text) associated with one or more viewer roles and one or more portions of video data.
  • the caption data storage 230 can store text structured in a natural language (e.g., English-language text).
  • the caption data can be associated with one or more frames of video data of a given medical procedure, or a task or phase thereof.
  • the text can be associated with one or more timestamps of a video.
  • the text can be indicative of one or more actions or states of one or more people or objects in the medical environment, or the medical environment, at or between the one or more timestamps.
  • the caption viewer input 228 can include one or more of: (i) a role of a viewer for which the caption is being generated, also referred to as a viewer role input, (ii) a skill level of the viewer, and/or (iii) a language preference of the viewer.
  • the viewer role input can include or be determined based on user input or selection provided by a viewer that corresponds to a label indicative of a role (e.g., surgeon, nurse) of a plurality of roles for a medical procedure.
  • a role can correspond to a specialized role particular to a medical procedure, a medical environment, or a robotic system, but is not limited thereto.
  • the skill levels of viewers having the same role can differ; thus, the skill level can also be specified as part of the input 228.
  • the language preference of the viewer allows the viewer to select a language in which the viewer prefers to view captions, thus allowing deployment of the systems described herein in regions where different languages are spoken.
  • any text for the non-video data can also be in any language, allowing a more diversified source of input data.
  • the data processing system 110 can receive the caption viewer input 228 based on user input or selection received at an application via a suitable user input device such as a keyboard, touchscreen, microphone, etc., coupled to the system 200.
  • the caption data storage 230 can store one or more text objects or data structures descriptive of one or more triplets as discussed herein.
  • the translation model 240 can generate one or more captions based on the caption viewer input 228 and one or more captions of the caption data storage 230.
  • the translation model 240 can associate one or more captions with one or more viewer roles, via a training mode of an artificial intelligence including a large language model.
  • the translation model 240 can provide as output one or more captions associated with a given caption viewer input 228, via a deployment mode of a trained artificial intelligence including a large language model.
  • the translation model 240 can receive a triplet as input, and can generate a caption having a natural language structure including content from at least one of the data elements of the triplet.
  • the large language model is configured to receive as input the multi-modal data in the case data storage 210 in the training mode and the deployment mode.
  • the translation model 240 can generate caption output in one or more formats, including text and audio.
  • the translation model 240 can include a text- to-speech processor configured to obtain at least one of the caption data 230 or the caption outputs 250, and to generate text output that can be presented at a display of a computing device, or audio output that can be presented at a speaker of a computing device.
  • the translation model 240 can include a caption input tokenizer 242, a multimodal data tokenizer 244, and a large language model processor 246.
  • the caption input tokenizer 242 can modify one or more text objects to include, or augment one or more text objects with, one or more descriptive tokens.
  • the descriptive tokens can correspond to the content of the text objects, timestamps of the text objects, correlations of the text objects with one or more medical procedures, tasks, or phases, or any combination thereof.
  • the caption input tokenizer 242 can tokenize one or more captions to be compatible with an input format of a large language model.
  • the caption input tokenizer 242 can tokenize one or more non-text components of the captions (e.g., timestamp data) into one or more tokens compatible with a text-based large-language model processor.
  • the caption input tokenizer 242 can tokenize a triplet into one or more tokens respectively corresponding to the data elements of the triplet.
  • the caption input tokenizer 242 can provide the tokenized triplet as input to the multimodal data tokenizer 244.
  • the caption input tokenizer 242 can convert multi-modal data from any multi-modal format as discussed herein into a tokenized format compatible with input to a large language model.
  • the caption input tokenizer 242 can provide a technical improvement to expand compatibility of an artificial intelligence system including a large language model beyond natural language, by a technical solution to tokenize non-text caption data.
  • the caption input tokenizer 242 can tokenize one or more non-text components of the captions (e.g., the nonvideo data such as robot system data or metrics) into one or more tokens compatible with a text-based large-language model processor.
  • the multimodal data tokenizer 244 can provide a technical improvement to expand compatibility of an artificial intelligence system including a large language model beyond natural language, by a technical solution to tokenize non-text data corresponding to a medical procedure.
  • the large language model processor 246 can generate a trained LLM model that links one or more captions with one or more viewer roles via a training mode of the large language model processor 246.
  • the large language model processor 246 can provide as output one or more captions based on one or more tokens corresponding to the data elements of the triplet.
  • the caption input tokenizer 242 can generate the output based on the tokenized triplet.
  • the large language model processor 246 can provide as output one or more captions with one or more viewer roles via the trained LLM model.
  • the large language model processor 246 can generate captions having grammatical structures, topics, or content that differ based on a viewer role. For example, the large language model processor 246 can generate a caption based on a given triplet that describes actions by a nurse as recommendations, for a nurse viewer type. For example, the large language model processor 246 can generate a caption based on the same given triplet that describes actions by a nurse as observations, for a surgeon viewer type.
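  • One way such role-dependent framing could be achieved is by conditioning the prompt given to the large language model on the viewer role; the prompt wording and the recommendation-versus-observation rule below are assumptions for illustration.

```python
from typing import Tuple

def build_caption_prompt(triplet: Tuple[str, str, str], viewer_role: str,
                         skill_level: str, language: str) -> str:
    """Assemble a role-conditioned prompt so that the resulting caption's
    framing (recommendations vs. observations) depends on the viewer role."""
    instrument, action, anatomy = triplet
    framing = "recommendations" if viewer_role == "nurse" else "observations"
    return (
        f"Write the caption as {framing} for a {skill_level} {viewer_role}, "
        f"in {language}. Scene triplet -- instrument: {instrument}; "
        f"action: {action}; anatomy: {anatomy}."
    )

triplet = ("monopolar scissors", "dissecting", "gallbladder bed")
prompt_for_nurse = build_caption_prompt(triplet, "nurse", "novice", "English")
prompt_for_surgeon = build_caption_prompt(triplet, "surgeon", "expert", "English")
```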
  • the caption output data storage 250 can include one or more caption objects, each caption object including one or more captions.
  • one or more captions can each include a tag (e.g., label) indicative of a given viewer role.
  • the caption data storage can store one or more captions and correlations of various captions with one or more viewer roles.
  • the large language model processor 246 can store one or more captions, links between viewer roles and one or more captions corresponding to various ones of the viewer roles, trained models to link viewer roles with captions, or any combination thereof.
  • FIG. 3 depicts an example layer model architecture according to this disclosure.
  • a layer model architecture 300 can include at least a first layer 310, a second layer 312, a third layer 314, a fourth layer 316, and a mixer 350.
  • the layer model architecture 300 can generate one or more of the features of the multimodal features processor 224 as discussed herein.
  • the layer model architecture 300 can generate one or more image features as discussed herein, one or more non-video features as discussed herein, or any combination thereof.
  • the layer model architecture 300 can fuse one or more image features, one or more non-video features, one or more image features with one or more non-video features, or any combination thereof.
  • the multimodal features processor 224 can include the layer model architecture 300.
  • the layer model architecture 300 can be integrated with or included in the video segmenter 222.
  • the layer model architecture 300 can implement one or more computer vision models to identify one or more segments of video data based on activities in the video data, to identify one or more anatomical structures of an open patient site or surgical site, to identify one or more individuals or objects in a medical environment, to identify poses of one or more individuals in a medical environment, to identify faces of one or more individuals in a medical environment, or any combination thereof.
  • the layer model architecture 300 can identify such attributes with respect to a given segment of video data that corresponds to or can be linked or associated with a corresponding task or phase of a medical procedure.
  • the layer model architecture 300 can provide image recognition that can be converted into one or more inputs to the computer system 200.
  • the layer model architecture 300 can generate one or more image features that can be tokenized as input to a large language model as discussed herein, but is not limited thereto.
  • the first layer 310 can correspond to a first portion of the multimodal feature processor 224 as discussed herein.
  • the first layer 310 can include a first clip model 320, a first layer processor 330, and a first feature processor 340, and can provide output to a layer output 354 that is also provided to the mixer 350.
  • the first clip model 320 can include one or more instructions to receive a video divided into one or more frames, and to identify one or more timestamps or times of capture associated with those one or more frames.
  • the first layer processor 330 can include a first recursive neural network (RNN) to identify one or more image features or non-video features as input to the first feature processor 340.
  • the first feature processor 340 can generate one or more of the image features or non-video features for a portion of the data of the case video data storage 210 input to the first layer 310 (e.g., video data).
  • the second layer 312 can correspond to a second portion of the multimodal features processor 224 as discussed herein.
  • the second layer 312 can include a second clip model 322, a second layer processor 332, and a second feature processor 342, and can provide output to a layer output that is provided to the mixer 350.
  • the second clip model 322 can include one or more instructions to receive a video divided into one or more frames, and to identify one or more timestamps or times of capture associated with those one or more frames.
  • the second layer processor 332 can include a second recurrent neural network (RNN) to identify one or more image features or non-video features as input to the second feature processor 342.
  • the second feature processor 342 can generate one or more of the image features or non-video features for a portion of the data of the case video data storage 210 input to the second layer 312 (e.g., video data).
  • the third layer 314 can correspond to a third portion of the multimodal features processor 224 as discussed herein.
  • the third layer 314 can include a third clip model 324, a third layer processor 334, and a third feature processor 344, and can provide output to a layer output that is provided to the mixer 350.
  • the third clip model 324 can include one or more instructions to receive a video divided into one or more frames, and to identify one or more timestamps or times of capture associated with those one or more frames.
  • the third layer processor 334 can include a third recurrent neural network (RNN) to identify one or more image features or non-video features as input to the third feature processor 344.
  • the third feature processor 344 can generate one or more of the image features or non-video features for a portion of the data of the case video data storage 210 input to the third layer 314 (e.g., video data).
  • the fourth layer 316 can correspond to a fourth portion of the multimodal features processor 224 as discussed herein.
  • the fourth layer 316 can include a fourth clip model 326, a fourth layer processor 336, and a fourth feature processor 346, and can provide output to a layer output that is provided to the mixer 350.
  • the fourth clip model 326 can include one or more instructions to receive a video divided into one or more frames, and to identify one or more timestamps or times of capture associated with those one or more frames.
  • the fourth layer processor 336 can include a fourth recurrent neural network (RNN) to identify one or more image features or non-video features as input to the fourth feature processor 346.
  • the fourth feature processor 346 can generate one or more of the image features or non-video features for a portion of the data of the case video data storage 210 input to the fourth layer 316 (e.g., video data).
  • the mixer 350 can aggregate output from each of the first, second, third, and fourth layers 310, 312, 314 and 316.
  • the mixer 350 can fuse one or more of the image features, the non-video features, or any combination thereof, as discussed herein.
  • the mixer 350 can provide a fused output 352 by fusing the predictions output by the first, second, third, and fourth layers 310, 312, 314, and 316.
  • the layer output 354 can correspond to an output of the first layer 310.
  • the layer output 354 can correspond to a prediction output by the first layer processor 330.
  • the layer output 354 is not limited to the example illustrated herein.
  • one or more of the second, third and fourth layers 312, 314 and 316 can provide layer outputs that correspond at least partially in one or more of structure and operation to the layer output 354.
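As a concrete illustration of the layer/mixer structure described above for FIG. 3, the following minimal Python sketch pairs, in each layer, a clip step (time-ordered frames), a recurrent processor, and a feature processor, and then fuses the per-layer predictions in a mixer. All class names, dimensions, and the toy per-frame embedding are illustrative assumptions and do not reproduce identifiers or models from this disclosure.

```python
# Minimal sketch of the FIG. 3 layer/mixer structure. Names such as
# FeatureLayer and mixer are illustrative assumptions, not identifiers
# from the disclosure.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Frame:
    timestamp: float          # time of capture associated with the frame
    pixels: np.ndarray        # H x W x C image data


class FeatureLayer:
    """One layer: clip model -> recurrent processor -> feature processor."""

    def __init__(self, feature_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.standard_normal((feature_dim, feature_dim)) * 0.1
        self.w_rec = rng.standard_normal((feature_dim, feature_dim)) * 0.1
        self.feature_dim = feature_dim

    def clip(self, frames: List[Frame]) -> List[Frame]:
        # "Clip model": order the frames by their capture timestamps.
        return sorted(frames, key=lambda f: f.timestamp)

    def recurrent(self, embeddings: np.ndarray) -> np.ndarray:
        # Simple recurrent pass over per-frame embeddings (stand-in for an RNN).
        h = np.zeros(self.feature_dim)
        for x in embeddings:
            h = np.tanh(self.w_in @ x + self.w_rec @ h)
        return h

    def features(self, frames: List[Frame]) -> np.ndarray:
        clip = self.clip(frames)
        # Toy per-frame embedding: mean pixel intensity broadcast to feature_dim.
        emb = np.stack([np.full(self.feature_dim, f.pixels.mean()) for f in clip])
        return self.recurrent(emb)   # layer output / prediction


def mixer(layer_outputs: List[np.ndarray]) -> np.ndarray:
    # Fuse the per-layer predictions into a single fused output.
    return np.mean(np.stack(layer_outputs), axis=0)


frames = [Frame(t, np.random.rand(8, 8, 3)) for t in range(4)]
layers = [FeatureLayer(feature_dim=16, seed=i) for i in range(4)]
fused = mixer([layer.features(frames) for layer in layers])
```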
  • FIG. 4 depicts an example caption transformer architecture 400 according to this disclosure.
  • a caption transformer architecture 400 can include a data input layer 402, a tokenization layer 404, a language model layer 406, and a caption output layer 408.
  • the translation model 240 can include the caption transformer architecture 400.
  • the data input layer 402 can correspond to an input layer of the caption transformer architecture 400.
  • the data input layer 402 can receive as input at least caption data 230, robotic system data 232, performance indicators 234, and the caption type indicators 236.
  • the caption input tokenizer 242 can include the data input layer 402, and can receive the caption data from the caption data storage 230, robotic system data 232, performance indicators 234, and the caption type indicators 236.
  • the robotic system data 232 can correspond at least partially in one or more of structure and operation to the robotic system data discussed herein.
  • the robotic system data 232 can include the robotic system data and metadata associated with the robotic system data indicative of timestamps or ranges of timestamps correlated with the robotic system data.
  • the caption input tokenizer 242 can obtain the robotic system data 232 from the case data storage 210 via the data input layer 402.
  • the performance indicators 234 can correspond at least partially in one or more of structure and operation to the metrics discussed herein.
  • performance indicators 234 can include the metrics and metadata associated with the metrics indicative of timestamps or ranges of timestamps correlated with the metrics.
  • the caption input tokenizer 242 can obtain the performance indicators 234 from the case data storage 210 via the data input layer 402.
  • the caption type indicators 236 can correspond at least partially in one or more of structure and operation to the viewer roles, skill levels, language preferences, captions, or trained models that link captions with viewer roles, as discussed herein.
  • the caption type indicators 236 can correspond to the caption viewer input 228.
  • the caption input tokenizer 242 can obtain the caption viewer input 228 from the case data storage 210 via the data input layer 402.
  • the tokenization layer 404 can correspond to a compatibility layer of the caption transformer architecture 400, to translate input data into one or more formats compatible with a large language model as discussed herein.
  • the multimodal data tokenizer 244 can correspond at least partially in one or more of structure and operation to the tokenization layer 404.
  • the tokenization layer 404 can include a text embedding layer 410, a robotic state embedding layer 420, a performance state embedding layer 430, and a word embedding layer 440.
  • the text embedding layer 410 can translate or augment text objects from the caption data storage 230 into tokenized output compatible with a large language model.
  • the text embedding layer 410 can modify one or more of the text objects into one or more tokens including data indicative of the text objects and compatible as input to a large language model.
  • the text embedding layer 410 can augment one or more of the text objects to include one or more tokens including data indicative of the text objects and compatible as input to a large language model.
  • the tokenization layer 404 can advantageously include the robotic state embedding layer 420 and the performance state embedding layer 430 to provide multimodal input to a large language model.
  • the robotic state embedding layer 420 can translate or augment robotic system data from the case data storage 210 into tokenized output compatible with a large language model.
  • the robotic state embedding layer 420 can modify one or more of the robotic system data into one or more tokens including data indicative of the robotic system data and compatible as input to a large language model.
  • the robotic state embedding layer 420 can augment one or more of the robotic system data to include one or more tokens including data indicative of the robotic system data and compatible as input to a large language model.
  • the performance state embedding layer 430 can translate or augment metrics from the case data storage 210 into tokenized output compatible with a large language model. For example, the performance state embedding layer 430 can modify one or more of the metrics into one or more tokens including data indicative of the metrics and compatible as input to a large language model. For example, the performance state embedding layer 430 can augment one or more of the metrics to include one or more tokens including data indicative of the metrics and compatible as input to a large language model.
  • the word embedding layer 440 can translate or augment words of text objects from the case data storage 210 into tokenized output indicative of a viewer role and compatible with a large language model. For example, the word embedding layer 440 can modify one or more of the words into one or more tokens including data indicative of the viewer role and compatible as input to a large language model. For example, the word embedding layer 440 can augment one or more of the words to include one or more tokens including data indicative of the viewer role and compatible as input to a large language model.
  • the word embedding layer 440 can include word embeddings 442.
  • the word embeddings 442 can each correspond to words tokenized or augmented with tokens indicative of viewer roles.
  • the word embeddings 442 can correspond to portions of captions (e.g., words or phrases) that are tokenized as associated with a given viewer role (e.g., one or more captions for a surgeon, one or more captions for a nurse, or one or more captions common to all medical staff).
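The tokenization layer just described can be pictured as four embedding paths that map heterogeneous inputs into a common token space. The Python sketch below is a hedged illustration under that reading; the function names, dimensions, and role tags are assumptions for illustration only.

```python
# Minimal sketch of the FIG. 4 tokenization layer: four embedding paths that
# map heterogeneous inputs into a shared token space for a language model.
# All names and dimensions are illustrative assumptions.
from typing import Dict, List
import numpy as np

EMB_DIM = 32
rng = np.random.default_rng(0)
vocab: Dict[str, np.ndarray] = {}          # lazily built word embeddings
role_tags = {"surgeon": 0, "nurse": 1, "general": 2}


def embed_word(word: str) -> np.ndarray:
    if word not in vocab:
        vocab[word] = rng.standard_normal(EMB_DIM)
    return vocab[word]


def text_embedding(caption_text: str) -> List[np.ndarray]:
    # Text embedding layer: one token per word of the raw caption data.
    return [embed_word(w) for w in caption_text.lower().split()]


def robotic_state_embedding(state: Dict[str, float]) -> np.ndarray:
    # Robotic state embedding layer: project numeric system data
    # (e.g., arm angles, clutch flags) into the same token space.
    values = np.array(list(state.values()))
    proj = rng.standard_normal((EMB_DIM, values.size))
    return proj @ values


def performance_embedding(metrics: Dict[str, float]) -> np.ndarray:
    # Performance state embedding layer: tokenize metrics such as docking time.
    values = np.array(list(metrics.values()))
    proj = rng.standard_normal((EMB_DIM, values.size))
    return proj @ values


def role_embedding(role: str) -> np.ndarray:
    # Word/role embedding layer: a token marking the viewer role for the caption.
    token = np.zeros(EMB_DIM)
    token[role_tags[role]] = 1.0
    return token


tokens = (
    text_embedding("bedside staff docking robotic system")
    + [robotic_state_embedding({"arm_3_clutch": 1.0, "arm_3_angle": 42.0})]
    + [performance_embedding({"docking_minutes": 8.0})]
    + [role_embedding("nurse")]
)
```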
  • the language model layer 406 can correspond at least partially in one or more of structure and operation to the large language model processor 246 as discussed herein.
  • the language model layer 406 can include a large language token model 450.
  • the large language token model 450 can receive tokens from the tokenization layer 404, including one or more of the text embedding layer 410, the robotic state embedding layer 420, the performance state embedding layer 430, and the word embedding layer 440.
  • the language model layer 406 can generate one or more captions, and provide the captions to the caption output layer 408.
  • the caption output layer 408 can provide captions to a user interface or a remote device.
  • the caption output layer 408 can include text caption outputs 460.
  • the caption output layer 408 can store the text caption outputs 460 to the caption outputs storage 250.
  • the large language model processor 246 can include the caption output layer 408.
  • because the translation model 240 (e.g., the caption transformer architecture 400) operates on multi-modal input such as the caption data 230, the robotic system data 232, the performance indicators 234, and the caption type indicators 236, the text caption outputs 460 not only benefit from capturing multiple concurrent activities in the video data, but can also map different captions to different roles, skill levels, and language preferences for the same video data. Accordingly, the translation model 240 (e.g., the translation layer) converts raw caption data 230 into a variety of targeted text caption outputs 460 based on viewer attributes.
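One way to picture the language model layer and caption output layer is as prompt assembly over the multi-modal inputs followed by a call to a large language model. The sketch below is a simplified assumption of that flow; `build_caption_prompt` and `call_llm` are hypothetical stand-ins rather than interfaces defined by this disclosure, and a system could instead feed the embedded tokens of the tokenization layer directly to the model.

```python
# Hedged sketch of the language model / caption output stage: assemble the
# multi-modal inputs and the viewer role into a single prompt and hand it to
# a large language model. `call_llm` is a placeholder for whatever model
# interface is actually available; it is not an API from the disclosure.
from typing import Dict, List


def build_caption_prompt(
    segment_descriptors: List[str],
    robotic_system_data: Dict[str, str],
    performance_indicators: Dict[str, str],
    viewer_role: str,
) -> str:
    lines = [
        f"Viewer role: {viewer_role}",
        "Segment descriptors: " + "; ".join(segment_descriptors),
        "Robotic system state: "
        + "; ".join(f"{k}={v}" for k, v in robotic_system_data.items()),
        "Performance indicators: "
        + "; ".join(f"{k}={v}" for k, v in performance_indicators.items()),
        "Write one caption for this video segment tailored to the viewer role.",
    ]
    return "\n".join(lines)


def call_llm(prompt: str) -> str:          # placeholder model interface
    return "Bedside staff are docking the robotic system on the patient."


prompt = build_caption_prompt(
    ["docking phase", "bedside staff at patient left side"],
    {"arm_3": "clutch mode"},
    {"docking_minutes": "8 (below hospital average)"},
    viewer_role="general",
)
caption = call_llm(prompt)
```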
  • FIG. 5A depicts an example user interface presenting one or more captions associated with a scene according to this disclosure.
  • a user interface 500A presenting one or more captions associated with a scene, can include at least a video data caption 510, a robot system data caption 512, and a performance indicator caption 514.
  • the user interface 500A can correspond to a video display device of a client system (e.g., desktop computer, tablet computer) configured to present one or more of video and text as discussed herein.
  • the client system can be coupled with the data processing system 110 via a network connection (e.g., Internet or intranet).
  • the user interface 500A can correspond to a viewer role for a general surgical viewer.
  • the translation model 240 can receive the caption viewer input 228 having a value indicative of “general” that is associated with a general surgical viewer.
  • the translation model 240 can provide one or more captions for the general surgical viewer to the client system to be presented at the user interface 500A.
  • the user interface 500A can include a presentation of one or more frames of video corresponding to a portion of a medical procedure associated with the video data caption 510, the robot system data caption 512 and the performance indicator caption 514.
  • the video data caption 510 can present text that is descriptive of the one or more frames of video corresponding to the portion of the medical procedure of the user interface 500A.
  • the video data caption 510 can include text describing that “Bedside staff are docking the robotic system on the patient.”
  • the robotic system data caption 512 can present text that is descriptive of robotic system data during the one or more frames of video corresponding to the portion of the medical procedure of the user interface 500A.
  • the robotic system data caption 512 can include text describing that “Robot instrument/arm 3 is in clutch mode.”
  • the performance indicator caption 514 can present text that is descriptive of one or more metrics during the one or more frames of video corresponding to the portion of the medical procedure of the user interface 500A.
  • the performance indicator caption 514 can include text describing that a “Docking procedure took 8 minutes, which is below average for this hospital.”
  • FIG. 5B depicts an example user interface presenting surgeon captions according to this disclosure.
  • a user interface presenting surgeon captions 500B can include at least a surgeon caption 520.
  • the user interface 500B can correspond to a video display device of a client system (e.g., desktop computer, tablet computer) configured to present one or more of video and text as discussed herein.
  • the client system can be coupled with the data processing system 110 via a network connection (e.g., Internet or intranet).
  • the user interface 500B can correspond to a viewer role for a surgeon viewer.
  • the translation model 240 can receive the caption viewer input 228 having a value indicative of “surgeon,” e.g., the viewer role is a surgeon.
  • the translation model 240 can provide one or more captions for the surgeon viewer to the client system to be presented at the user interface 500B.
  • the user interface 500B can include the presentation of the one or more frames of the video of the user interface 500A, and is associated with the surgeon caption 520.
  • the surgeon caption 520 can present text that is descriptive of the one or more frames of video corresponding to the portion of the medical procedure of the user interface 500A, and is associated with the “surgeon” viewer role.
  • FIG. 5C depicts an example user interface presenting medical staff captions according to this disclosure.
  • a user interface presenting medical staff captions 500C can include at least a staff caption 530.
  • the user interface 500C can correspond to a video display device of a client system (e.g., desktop computer, tablet computer) configured to present one or more of video and text as discussed herein.
  • the client system can be coupled with the data processing system 110 via a network connection (e.g., Internet or intranet).
  • the user interface 500C can correspond to a viewer role for a nurse viewer.
  • the translation model 240 can receive the caption viewer input 228 having a value indicative of “nurse,” e.g., the viewer role is a nurse.
  • the translation model 240 can provide one or more captions for the nurse viewer to the client system to be presented at the user interface 500C.
  • the user interface 500C can include the presentation of the one or more frames of the video of the user interface 500A, and is associated with the staff caption 530.
  • the staff caption 530 can present text that is descriptive of the one or more frames of video corresponding to the portion of the medical procedure of the user interface 500A, and is associated with the “nurse” viewer role.
  • the staff caption 530 can include text summarizing a nurse action to “Prep patient for [table] with [medical staff],” which indicates a location associated with the caption within the medical environment depicted in the one or more frames, and indicates one or more roles associated with the action in addition to the nurse viewer role, based on one or more image features.
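The role-specific behavior of FIGS. 5A-5C can be summarized as caption objects tagged with viewer roles and filtered by the caption viewer input. The sketch below illustrates that data shape; the field names, example captions, and roles are illustrative assumptions.

```python
# Minimal sketch of role-tagged caption objects and viewer-based filtering,
# mirroring the FIG. 5A-5C behavior. Field names and role labels are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionObject:
    text: str
    roles: List[str]          # viewer roles this caption is tagged for
    segment_id: int


captions = [
    CaptionObject("Bedside staff are docking the robotic system on the patient.",
                  ["general"], segment_id=3),
    CaptionObject("Robot instrument/arm 3 is in clutch mode.",
                  ["general", "surgeon"], segment_id=3),
    CaptionObject("Docking took 8 minutes, below average for this hospital.",
                  ["general"], segment_id=3),
    CaptionObject("Prep patient for table with medical staff.",
                  ["nurse"], segment_id=3),
]


def captions_for_viewer(all_captions: List[CaptionObject],
                        viewer_role: str,
                        segment_id: int) -> List[str]:
    # Filter the stored caption data down to the captions tagged for this
    # viewer role and this segment of the medical procedure.
    return [c.text for c in all_captions
            if c.segment_id == segment_id and viewer_role in c.roles]


print(captions_for_viewer(captions, "nurse", segment_id=3))
```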
  • FIG. 6 depicts an example method of generation of user-specific captions of video via multimodal data for medical procedures according to this disclosure.
  • the data processing system 110 can perform method 600.
  • the method 600 can receive multimodal data of a medical procedure.
  • the method 600 can receive multi-modal data including video data captured during the medical procedure.
  • the method 600 can receive multi-modal data including system data generated by a computer-assisted medical system.
  • the method 600 can include segmenting, based on the multi-modal data, the video data into one or more portions that each correspond to respective segments of the workflow of the medical procedure.
  • the method 600 can include generating, by the descriptor generation model based on an input including the portions of the video data, the output.
  • the method 600 can identify a first segment of the medical procedure.
  • the method 600 can identify the first segment based on the multi-modal data of the medical procedure.
  • the text includes one or more captions including the caption, where each of the one or more captions corresponds to at least one of a plurality of segments including the first segment.
  • the multi-modal data includes robotic system data indicative of a robotic system to perform at least a portion of the medical procedure.
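A hedged sketch of the method 600 flow is shown below: frame timestamps from the video data are grouped into workflow segments bounded by system events from the computer-assisted medical system. The event names and the boundary rule are assumptions for illustration; the disclosure does not prescribe this particular segmentation rule.

```python
# Hedged sketch of the method 600 flow: align video frame timestamps with
# system events from the computer-assisted medical system to carve the video
# into workflow segments. Event names and fields are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SystemEvent:
    timestamp: float
    name: str                 # e.g. "docking_start", "docking_end"


def segment_video(frame_times: List[float],
                  events: List[SystemEvent]) -> List[Tuple[str, List[float]]]:
    """Group frame timestamps into segments bounded by consecutive events."""
    events = sorted(events, key=lambda e: e.timestamp)
    segments = []
    for start, end in zip(events, events[1:]):
        frames = [t for t in frame_times if start.timestamp <= t < end.timestamp]
        segments.append((start.name, frames))
    return segments


frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [SystemEvent(0.0, "setup"), SystemEvent(2.5, "docking"),
          SystemEvent(5.5, "procedure")]
for name, frames in segment_video(frame_times, events):
    print(name, frames)
```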
  • FIG. 7 depicts an example method of generation of user-specific captions of video via multimodal data for medical procedures according to this disclosure.
  • the data processing system 110 can perform method 700.
  • the method 700 can generate a plurality of descriptors for the first segment of the medical procedure.
  • the method 700 can generate the plurality of descriptors by a descriptor generation model.
  • the method 700 can generate the plurality of descriptors based on the multi-modal data of the medical procedure.
  • the method 700 can include converting the descriptors into one or more tokens corresponding to the output, the tokens having a structure corresponding to the translation model.
  • the method 700 can include converting, by a plurality of tokenizers each configured to receive portions of the multi-modal data having different formats, one or more of the descriptors into the one or more tokens.
  • the method 700 can generate a caption that is descriptive of the first segment of the medical procedure.
  • the method 700 can generate the caption by a translation model.
  • the method 700 can generate the caption based on the plurality of descriptors for the first segment of the medical procedure.
  • the method 700 can generate the caption based on a caption viewer input.
  • the method 700 can include filtering, by the translation model based on the indication of the role according to the caption viewer input, caption data into the caption, where the caption data can include one or more captions descriptive of a plurality of segments of the medical procedure including the first segment.
  • the method 700 can include filtering the caption data according to one or more data elements of a triplet based on the multi-modal data. In an aspect, the method 700 can include filtering the caption data according to a portion of the multi-modal data indicative of a performance indicator associated with at least one of the medical procedure, or the first segment of the medical procedure.
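The method 700 steps can be strung together as the skeleton below: descriptors are generated for a segment, converted into tokens alongside a viewer-role token, and translated into a role-filtered caption. Every function here is an illustrative stand-in rather than an implementation of the descriptor generation or translation models.

```python
# Hedged skeleton of the method 700 steps: generate descriptors for a segment,
# convert them (plus the other modalities) into tokens, then produce a caption
# filtered by the viewer role. All functions here are illustrative stand-ins.
from typing import Dict, List


def generate_descriptors(segment_frames: List[str],
                         system_data: Dict[str, str]) -> List[str]:
    # Descriptor generation model stand-in: emit short descriptors per modality.
    descriptors = [f"video: {f}" for f in segment_frames]
    descriptors += [f"system: {k}={v}" for k, v in system_data.items()]
    return descriptors


def tokenize(descriptors: List[str], viewer_role: str) -> List[str]:
    # Tokenizer stand-in: one token per descriptor word plus a role token.
    tokens = [w for d in descriptors for w in d.split()]
    return tokens + [f"<role:{viewer_role}>"]


def translate_to_caption(tokens: List[str], viewer_role: str) -> str:
    # Translation model stand-in: keep only content relevant to the role.
    if viewer_role == "nurse":
        return "Prep patient and assist with docking the robotic system."
    return "Bedside staff are docking the robotic system on the patient."


segment_frames = ["staff at bedside", "robot arms over patient"]
system_data = {"arm_3": "clutch mode"}
role = "nurse"
caption = translate_to_caption(
    tokenize(generate_descriptors(segment_frames, system_data), role), role)
```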
  • At least one aspect is directed to a system.
  • the system can include one or more processors coupled with memory.
  • the system can receive multi-modal data of a medical procedure, the multi-modal data including video data captured during the medical procedure and system data generated by a computer-assisted medical system.
  • the system can identify, based on the multi-modal data of the medical procedure, a first segment of the medical procedure.
  • the system can generate, by a descriptor generation model based on the multimodal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure.
  • the system can generate, by a translation model based on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
  • the system can filter, by the translation model based on the indication of the role according to the caption viewer input, caption data into the caption, where the caption data can include one or more captions descriptive of a plurality of segments of the medical procedure including the first segment.
  • the system can filter the caption data according to one or more data elements of a triplet based on the multi-modal data.
  • the system can filter the caption data according to a portion of the multi-modal data indicative of a performance indicator associated with at least one of the medical procedure, or the first segment.
  • the system can convert the descriptors into one or more tokens corresponding to the output, the tokens having a structure corresponding to the translation model.
  • the system can convert, by a plurality of tokenizers each configured to receive portions of the multi-modal data each having different formats, one or more of the descriptors into the one or more tokens.
  • the system can segment, based on the multi-modal data, the video data into one or more portions that each correspond to respective segments of the medical procedure.
  • the system can generate, by the descriptor generation model based on an input including the portions of the video data, the output.
  • the text includes one or more captions including the caption, where each of the one or more captions corresponds to at least one of a plurality of segments including the first segment.
  • the multi-modal data includes robotic system data indicative of a robotic system to perform at least a portion of the medical procedure.
  • At least one aspect is directed to a method.
  • the method can include receiving multi-modal data of a medical procedure, the multi-modal data including video data captured during the medical procedure and system data generated by a computer-assisted medical system.
  • the method can include identifying, based on the multi-modal data of the medical procedure, a first segment of the medical procedure.
  • the method can include generating, by a descriptor generation model based on the multi-modal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure.
  • the method can include generating, by a translation model based on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
  • At least one aspect is directed to a non-transitory computer readable medium including one or more instructions stored thereon and executable by a processor.
  • the processor can receive multi-modal data of a medical procedure, the multi-modal data including video data captured during the medical procedure and system data generated by a computer-assisted medical system.
  • the processor can identify, based on the multi-modal data of the medical procedure, a first segment of the medical procedure.
  • the processor can generate, via a descriptor generation model based on the multi-modal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure.
  • the processor can generate, via a translation model based on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.
  • the non-transitory computer readable medium can further include one or more instructions executable by the processor.
  • the processor can filter, via the translation model and based on the indication of the role according to the caption viewer input, caption data into the caption, where the caption data can include one or more captions descriptive of a plurality of segments of the medical procedure including the first segment.
  • the text includes one or more captions, where the captions include instances of the caption data and instances of the caption type indicator.
  • the multi-modal data includes robotic system data indicative of a state of a robotic system when the robotic system is used to perform at least a portion of the medical procedure.
  • references to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms. References to at least one of a conjunctive list of terms may be construed as an inclusive OR to indicate any of a single, more than one, and all of the described terms. For example, a reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items. References to “is” or “are” may be construed as nonlimiting to the implementation or action referenced in connection with that term. The terms “is” or “are” or any tense or derivative thereof, are interchangeable and synonymous with “can be” as used herein, unless stated otherwise herein.
  • Directional indicators depicted herein are example directions to facilitate understanding of the examples discussed herein, and are not limited to the directional indicators depicted herein. Any directional indicator depicted herein can be modified to the reverse direction, or can be modified to include both the depicted direction and a direction reverse to the depicted direction, unless stated otherwise herein. While operations are depicted in the drawings in a particular order, such operations are not required to be performed in the particular order shown or in sequential order, and all illustrated operations are not required to be performed. Actions described herein can be performed in a different order. Where technical features in the drawings, detailed description or any claim are followed by reference signs, the reference signs have been included to increase the intelligibility of the drawings, detailed description, and claims. Accordingly, neither the reference signs nor their absence have any limiting effect on the scope of any claim elements.


Abstract

Aspects of this technical solution provide for receiving multi-modal data of a medical procedure. The multi-modal data includes video data captured during the medical procedure and system data generated by a computer-assisted medical system. Based at least in part on the multi-modal data of the medical procedure, a first segment of the medical procedure is identified. A descriptor generation model generates, based at least in part on the multi-modal data of the medical procedure, a plurality of descriptors for the first segment of the medical procedure. A translation model generates, based at least in part on the plurality of descriptors for the first segment of the medical procedure and a caption viewer input, a caption that is descriptive of the first segment of the medical procedure.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463637113P 2024-04-22 2024-04-22
US63/637,113 2024-04-22

Publications (1)

Publication Number Publication Date
WO2025226595A1 (fr) 2025-10-30
