WO2025174815A1

WO2025174815A1 - Intraoperative surgical guidance using three-dimensional reconstruction

Info

Publication number: WO2025174815A1
Application number: PCT/US2025/015501
Authority: WO
Inventors: Chandra Jonelagadda; Dmitrii BYCHKOV; Bipul Kumar; Aneesh JONELAGADDA
Original assignee: Kaliber Labs Inc
Current assignee: Kaliber Labs Inc
Priority date: 2024-02-12
Filing date: 2025-02-12
Publication date: 2025-08-21
Anticipated expiration: 2026-08-12

Abstract

A method for providing three-dimensional surgical assistance is disclosed. Three-dimensional models may be constructed from a patient's magnetic resonance imaging (MRI) and computerized tomography (CT) data. Anatomies, surgical tools, pathologies, or other objects that are included within a real-time surgical video can be identified. The identified objects may be replaced with the three-dimensional models in a reconstructed view of the surgical region.

Description

INTRAOPERATIVE SURGICAL GUIDANCE USING THREE-DIMENSIONAL

RECONSTRUCTION

CLAIM OF PRIORITY

[0001] This patent application claims priority to U.S. provisional patent application no. 63/552,650, titled “INTRAOPERATIVE SURGICAL GUIDANCE USING THREE- DIMENSIONAL RECONSTRUCTION,” and filed on February 12, 2024, which is herein incorporated by reference in its entirety.

INCORPORATION BY REFERENCE

[0002] All publications and patent applications mentioned in this specification are herein incorporated by reference in their entirety to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.

FIELD

[0003] The present disclosure relates generally to surgical procedures and more specifically to providing real-time assistance for surgical procedures.

BACKGROUND

[0004] Many minimally invasive surgical procedures, such as arthroscopic joint surgical procedures, involve a surgical camera that provides real-time surgical video data. The surgical video data may be shown on a display and may be used to guide actions of the surgeon, doctor, or other clinician performing the surgery.

[0005] The surgeon’s view, however, is indirect and may offer only a limit field of view. In some surgeries, the patients’ preoperative diagnostic radiological imaging may be used to plan surgeries. The planning could include deciding resection margins, estimate bone losses, etc. However, when these preoperative images are not available in the surgical field of view.

[0006] Providing real-time surgical guidance and three-dimensional views of the patient’s anatomy and surgical tools may be beneficial to the surgeon increasing positive outcomes and reducing procedure times.

SUMMARY OF THE DISCLOSURE

[0007] Described herein are apparatuses, systems, and methods to provide surgical guidance to a surgeon operating on a patient. The surgical guidance may include a display showing a reconstructed view of the surgical area. The reconstructed view may include three- dimensional (3D) models associated with the patient. For example, the reconstructed view can include 3D models of anatomies, pathologies, and implants associated with the patient. The 3D models may be based, at least in part, on the patient's magnetic resonance imaging (MRI) and/or computerized tomography (CT) data.

[0008] A real-time surgical video of an operation or procedure is received. The surgical video is processed, and any recognized objects are identified. For example, a processor can execute one or more trained neural networks to identify anatomies, pathologies, implants, surgical tools, and the like that may be included within the surgical video. A computergenerated 3D model of any (or all) of the identified items may be included in a constructed 3D view provided to the surgeon as surgical guidance.

[0009] The 3D view may be manipulated (rotated, any included joints virtually moved through a range of motion) to advantageously provide the surgeon access to views of the surgical area that may otherwise be occluded or hidden from the view of a surgical camera. [0010] Any of the methods described herein may be used to generate a 3D surgical view. Any of the methods may include receiving or obtaining a patient’s radiological data, segmenting one or more anatomical structures from the patient’s radiological data, and generating one or more three-dimensional (3D) models based on the one or more segmented anatomical structures. Furthermore, any of the methods may include receiving a live surgical video stream, identifying an anatomical structure within the live surgical video stream, and generating a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

[0011] Any of the methods described herein may also include identifying a surgical tool in the live surgical video stream, and superimposing a digital representation of the surgical tool onto the reconstructed 3D model. In some examples, identifying the surgical tool may include executing a neural network trained to recognize the surgical tool within a video stream.

[0012] Any of the methods described herein may further include identifying an implant in the live surgical video stream, and superimposing digital representation of the implant onto the reconstructed 3D model. In some examples, identifying the implant may include executing a neural network trained to recognize the implant within a video stream.

[0013] Any of the methods described herein may include identifying an anatomical structure that includes a pathology, and highlighting the anatomical structure. Any of the methods described herein can include animating the 3D model to move in a realistic manner. [0014] Any of the methods described herein may include displaying the 3D model to replace the live surgical video stream. In some examples, the displayed 3D model is a frame- by-frame replacement of the live surgical video stream.

[0015] In any of the methods described herein, the patient’s radiological data may include magnetic resonance imagery (MRI) data, computed tomography (CT) data, or a combination thereof.

[0016] In any of the methods described herein, identifying the anatomical structure can include executing a neural network trained to recognize anatomical structures included within a video stream.

[0017] In any of the methods described herein, identifying the anatomical structure within the live surgical video stream comprises estimating a camera angle based on the identified anatomical structure. In some examples, generating the reconstructed 3D model of the identified anatomical structure can be based at least in part on the estimated camera angle. [0018] Any of the devices described herein may include one or more processors, and a memory configured to store instructions that, when executed by the one or more processors cause the device to receive or obtain a patient’s radiological data, segment one or more anatomical structures from the patient’s radiological data, generate one or more three- dimensional (3D) models based on the segmented anatomical structures, receive a live surgical video stream, identify an anatomical structure within the live surgical video stream, and generate a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

[0019] Any of the non-transitory computer-readable storage mediums described herein may include instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising receiving or obtaining a patient’s radiological data, segmenting one or more anatomical structures from the patient’s radiological data, generating one or more three-dimensional (3D) models based on the segmented anatomical structures, receiving a live surgical video stream, identifying an anatomical structure within the live surgical video stream, and generating a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

[0020] Any of the methods described herein can generate an automatically annotated surgical video. Any of the methods may include receiving or obtaining a surgical video stream, identifying a region of interest within the surgical video stream, and generating an annotated view based on the identified region of interest. [0021] In any of the methods described herein, generating the annotated view can include rendering a bounding box around the region of interest. In any of the methods described herein, identifying the region of interest can include executing a neural network trained to recognize an anatomy within a video stream.

[0022] In any of the methods described herein, identifying the region of interest can include executing a neural network trained to recognize a pathology within a video stream. In any of the methods described herein, identifying the region of interest may include executing a neural network trained to recognize a surgical tool within a video stream.

[0023] In any of the methods described herein, generating the annotated view may include rendering labels near the region of interest. Any of the methods described herein may include identifying an anatomical region, and identifying specific anatomical structures within the anatomical region, by executing a neural network trained to detect specific anatomical features within the anatomical region.

[0024] Any of the devices described herein may include one or more processors, and a memory configured to store instructions that, when executed by the one or more processors cause the device to receive or obtain a surgical video stream, identify a region of interest within the surgical video stream, and generate an annotated view based on the identified region of interest.

[0025] Any of the non-transitory computer-readable storage mediums described herein may include instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising receiving or obtaining a surgical video stream, identifying a region of interest within the surgical video stream, and generating an annotated view based on the identified region of interest.

[0026] Any of the methods described herein may provide surgical guidance. Any of the methods may include associating a surgical instrument with a first fiducial marker, associating a surgical region with a second fiducial marker, determining a location of the surgical instrument with respect to the surgical region, and displaying surgical guidance based on the location of the surgical instrument and the location of the surgical region.

[0027] In any of the methods described herein, determining the location of the surgical instrument with respect the surgical region may include determining a location of the first fiducial marker and determining a location of the second fiducial marker.

[0028] In any of the methods described herein, determining the location of the surgical instrument with respect to the surgical region can include viewing the first fiducial marker with a fixed camera. [0029] In any of the methods described herein, associating a surgical instrument with the first fiducial marker can include affixing the first fiducial marker to the surgical instrument. In some examples, in the methods described herein determining the location of the surgical instrument with respect to the surgical region can include viewing the second fiducial marker with a camera mechanically coupled to the surgical instrument.

[0030] In any of the methods described herein, associating a surgical region with a second fiducial marker may include affixing the second fiducial marker onto a reference surface adjacent to a patient.

[0031] In any of the methods described herein, determining the location of the surgical instrument with respect to the surgical region may include viewing the second fiducial marker with a fixed camera. In some cases, in any of the methods described herein displaying surgical guidance can include displaying a three-dimensional (3D) model of a patient’s anatomy.

[0032] In any of the methods described herein, displaying surgical guidance includes displaying a 3D model of the surgical instrument. In some cases, in any of the methods described herein displaying the surgical guidance may include displaying an implant associated with a surgery.

[0033] In any of the methods described herein, determining the location of the surgical instrument with respect to the surgical region may include viewing the first fiducial marker and the second fiducial marker with a plurality of cameras. Furthermore, in any of the methods described herein displaying surgical guidance may include displaying a planned placement of implants and anchor points.

[0034] In any of the methods described herein, displaying surgical guidance can include separately displaying a video stream from a camera associated with the surgical instrument. [0035] In any of the methods described herein, the surgical instrument can include an orthoscopic camera.

[0036] Any of the devices described herein may include one or more processors and a memory configured to store instructions that, when executed by the one or more processors cause the device to associate a surgical instrument with a first fiducial marker, associate a surgical region with a second fiducial marker, determine a location of the surgical instrument with respect to the surgical region, and display surgical guidance based on the location of the surgical instrument and the location of the surgical region.

[0037] Any of the non-transitory computer-readable storage mediums described herein may include instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising associating a surgical instrument with a first fiducial marker, associating a surgical region with a second fiducial marker, determining a location of the surgical instrument with respect to the surgical region, and displaying surgical guidance based on the location of the surgical instrument and the location of the surgical region.

[0038] All of the methods and apparatuses described herein, in any combination, are herein contemplated and can be used to achieve the benefits as described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

[0039] A better understanding of the features and advantages of the methods and apparatuses described herein will be obtained by reference to the following detailed description that sets forth illustrative embodiments, and the accompanying drawings of which:

[0040] FIG. l is a simplified block diagram of an example interoperative surgical assistance system.

[0041] FIG. 2 shows a block diagram of analysis modules.

[0042] FIG. 3 is a flowchart showing an example method for generating an annotated view.

[0043] FIG. 4 is a flowchart showing an example method for generating a three- dimensional view.

[0044] FIG. 5 is a flowchart showing an example method for generating activity summaries.

[0045] FIG. 6 shows an example three-dimensional surgical assistance system.

[0046] FIG. 7 is a flowchart showing an example method for generating three- dimensional surgical guidance.

[0047] FIG. 8 shows a block diagram of a device that may be one example of the intraoperative surgical assistance system of FIG. 1 and/or the three-dimensional surgical assistance system of FIG. 6.

[0048] FIGS. 9A-9C illustrate examples of reconstructed 3D views showing surgical tools as described herein.

DETAILED DESCRIPTION

[0049] Described herein are systems and methods for providing surgical guidance to a surgeon performing a surgical procedure. The surgical guidance may include a navigable three-dimensional (3D) model of the patient’s anatomy and, in some cases, the patient’s pathology. The 3D model may be based on pre-operative radiological images including magnetic resonance images and/or computed tomography scan data. In some examples, the 3D model. In some examples, the 3D model may be constructed based on a real-time surgical video, such that elements of the real-time surgical video are replaced with equivalent 3D models. In some implementations, the 3D models may be displayed concurrently with the real-time surgical video.

[0050] In some examples, the surgical guidance may be enhanced by systems and methods for determining a location of a surgical camera with respect to an operating region. A surgical assistance system can simultaneously determine the location of the surgical camera and the location of the operating (surgical) region. The surgical assistance system can determine when the surgical camera is within the operating region and determine a 3D model representative of the view seen by the surgical camera.

[0051] FIG. l is a simplified block diagram of an example interoperative surgical assistance system 100. The surgical assistance system 100 may receive or obtain a real-time surgical video 110 and patient data 120 that may include magnetic resonance imagery (MRI) data, computed tomography (CT) data, a patient’s pre-operative plan, and the like. The surgical assistance system 100 may use the surgical video 110 and the patient data 120 to generate surgical assistance data. In some examples, the surgical assistance data may include an annotated view 135, a three-dimensional (3D) view 145, an information view 175, and a database 170.

[0052] In some examples, the surgical assistance data may be displayed on one or more displays to provide a surgeon (or other technician) guidance regarding an ongoing surgical procedure. The guidance may include a 3D rendering of any feasible anatomy of the patient. In some cases, the guidance may include a 3D rendering of any tools that may be used during a procedure or inserted into the patient and moved relative to the patient’s anatomy.

[0053] In some examples, the real-time surgical video 110 may be a real-time video stream from a surgical camera, such as an orthoscopic or arthroscopic surgical camera. In some examples, other cameras may be used to generate the real-time surgical video 110. The real-time video stream may be processed concurrently through any feasible number of processing pathways.

[0054] In one example, a first processing pathway may use the real-time surgical video 110 to generate the annotated view 135. In some cases, the real-time surgical video 110 may be received by an anatomy, tool, and pathology processing block 130. The anatomy, tool, and pathology processing block 130 can process a video stream in order to recognize any feasible anatomies, implants, surgical tools and/or pathologies that may be included in the video stream. In some examples, the anatomy , tool, and pathology processing block 130 may include a processor that executes one or more neural networks trained to recognize anatomies, implants, surgical tools, and/or pathologies.

[0055] In some examples, the anatomy, tool, and pathology processing block 130 can generate labels associated with any recognized anatomies, implants, tools, and pathologies. The anatomy, tool, and pathology processing block 130 can render the generated labels over or near the associated recognized objects on the video stream. In another example, the anatomy, tool, and pathology processing block 130 can generate and render bounding boxes that can surround or encompass any recognized objects. In this manner, the anatomy, tool, and pathology processing block 130 can generate the annotated view 135.

[0056] In another example, a second processing pathway may use the real-time surgical video 110 to generate the 3D view 145. The 3D view 145 may be generated by a 3D scene construction block 140. In addition to the real-time surgical video 110, the 3D scene construction block 140 can receive or obtain the patient data 120 and recognition data directly or indirectly (via a metadata stream) from the anatomy, tool, and pathology processing block 130.

[0057] In some examples, the anatomy, tool, and pathology processing block 130 may output recognition data into a message bus. The message bus, which may also be referred to as the metadata stream, can be a low-latency message bus that conveys data (video stream data, as well as any other feasible data from any feasible processing block) to other processing blocks or to the database 170.

[0058] Using the patient data 120, the 3D scene construction block 140 can generate a 3D model of the patient’s anatomy. In some examples, the 3D scene construction block 140 can segment the patient’s anatomy into separate parts (bones, tendons, muscle and the like). The segmentation may be guided or assisted by recognition data from the anatomy, tool, and pathology processing block 130. In some examples, the segmentation may be based on one or more neural network (artificial intelligence) algorithms. In some cases, the 3D model may be computed prior to a surgical procedure using the patient data 120.

[0059] In some examples, the 3D scene construction block 140 can receive the real-time surgical video 110 and/or video stream data from the metadata stream and match the position and orientation of a rendered 3D model to match and track the video stream (and/or metadata stream) data. In some cases, the 3D scene construction block 140 can estimate a camera position based on the real-time surgical video 110 and/or the video stream data from the metadata stream. In some examples, the 3D scene construction block 140 can determine a best fit between anatomical structures on the metadata stream and structures included within the patient data 120. [0060] The 3D scene construction block 140 can render digital representations of tools and implants detected in a video stream onto the 3D model. In some examples, the 3D scene construction block 140 can determine a relative position of a surgical tool against an anatomical structure. The determined relative position may be used to position the surgical tool within the rendering of the patient’s anatomy.

[0061] In some examples, the 3D scene construction block 140 can detect and output an instantaneous position of a recognized surgical tool to the database 170 through the metadata stream. In some cases, the surgical tools may be specified or described in a configuration file. When outputting the instantaneous position of a surgical tool, the 3D scene construction block 140 can output a digital representation of the surgical tool in the 3D view 145. Operations of the 3D scene construction block 140 are described in more detail with respect to FIG. 2.

[0062] In some examples, the 3D scene construction block 140 may respond to a user input (via a keyboard, mouse, trackball, or the like). The user input may be used to interact with the 3D model. For example, the user can change a view of the 3D model by spinning or moving the 3D model with respect to the viewer’s point of view. In some other examples, the user can interact with the model by virtually moving an anatomical joint through a predicted range of motion. The predicted range of motion may be based on an a priori knowledge of an anatomical model.

[0063] The 3D view 145 can be a frame-by-frame rendered 3D view of the patient’s anatomy, surgical tools, and pathologies that may be included within a video stream, such as the real-time surgical video 110. Thus, the 3D view 145 generated by the 3D scene construction block 140 may have a real-time, frame-by-frame correspondence to the real-time surgical video 110 or any other feasible video stream.

[0064] In another example, a third processing pathway may use the real-time surgical video 110 to generate scene descriptions and summarize surgical activities with a language and vision model 150 and a scene summary large language model 160. The scene descriptors and surgical summaries may be shown in an information view 175 or may be stored in the database 170. In some implementations, the third processing pathway may operate at a slower rate (slower frame rate) than the real-time surgical video 110.

[0065] The language and vision model 150 may be implemented as a neural network that can be executed by one or more processors. The neural network can be trained using images as inputs. The images can be images from, or related to, any feasible surgery. The neural network may be trained to output a scene description in human-readable text. The training may be analogous to training a neural network for a large language model, however in this example, an input phrase or sentence is replaced with an input video image, scene, and/or frame. In this manner, the language and vision model 150 may be trained to provide human- readable phrases that describe the related video image, scene, etc. Thus, the output of the language and vision model 150 may be human-readable scene descriptors 155. The scene descriptors 155 may be output to the metadata stream and/or stored in the database 170. Thus, the surgical assistance system 100 may segment the real-time surgical video based at least in part on visual/language-based neural networks.

[0066] In some cases, the language and vision model 150 can advantageously enable a user to ignore or de-prioritize less important information. For example, the language and vision model 150 may not describe unimportant background information in the video stream. In this manner, the language and vision model 150 can naturally draw the user’s attention to objects or activities of interest.

[0067] The scene summarization large language model (LLM) 160 may include a neural network trained to summarize surgical scenes that are included within the real-time surgical video 110. In some examples, the scene summary LLM 160 may receive or obtain the scene descriptors 155 either from the metadata stream, the database 170, or in some cases directly from the language and vision model 150.

[0068] Since each scene descriptor 155 can describe the contents of each video frame, the scene summary LLM 160 can examine multiple scene descriptors 155 and generate a human- readable activity summary 165 regarding surgical actions, recognized anatomies, implants, or pathologies, or the like that have occurred within or during one or more video scenes. In some examples, the scene summary LLM 160 can operate on a “rolling window” of scene descriptors 155 to generate the activity summary 165. In some examples, the scene summary LLM 160 may be trained with the scene descriptors 155. The activity summary 165 may be output to the metadata stream, stored in the database 170, and/or shown in the information view 175.

[0069] In some examples, the activity summary 165 may be used to summarize surgical procedures and simplify future surgical procedures and/or workflows. In some other examples, the activity summary 165 may be used to provide training or education for device representatives, doctors, support staff, families, patients, or the like. In some cases, a chat bot can be implemented to respond to inquiries regarding the activity summary 165.

The scene summary LLM 160 can advantageously provide context to visual data. The output of the scene summary LLM 160 can be a natural language description and/or interpretation of scenes and actions. [0070] In some examples, the arrangement of the language and vision model 150 and the scene summary LLM 160 may be hierarchical. That is, the language and vision model 150 may operate on a lower level to provide the scene descriptors 155 which may be a first level of abstraction. The scene summary LLM 160 may generate the activity summary 165 from the scene descriptors 155, where the activity summary 165 is a higher level of abstraction than the scene descriptors 155.

[0071] A fourth processing pathway can include minimal (or no) processing. In this manner the real-time surgical video 110 may be displayed as an original view 115. In some examples, the original view 115 may be shown in the operating room through a dedicated display.

[0072] The surgical assistance system 100 can display the original view 115, the annotated view 135, the 3D view 145, and the information view 175 together or separately, on a single display (using picture-in-picture) or across multiple displays. In some implementations, the surgeon may rely primarily on the 3D view 145 which include solely or mostly a reconstructed 3D view. That is, the 3D view 145 may exclude any real-time video elements, and not rely on blending of any other video streams with the reconstructed 3D view. The surgical assistance system 100 can be used to provide real-time surgical guidance through the annotated view 135 and/or the 3D view 145. Alternatively, or in addition, the surgical assistance system 100 can provide a human-readable text summary that describes any feasible operation or procedure included in the real-time surgical video 110.

[0073] FIG. 2 shows a block diagram of analysis modules 200. In some examples, the analysis modules 200 can include different modules that can analyze a live or recorded surgical video, such as a surgical video 202 (in some examples, the surgical video 202 may include the real-time surgical video 110 of FIG. 1 and be referred to as a video stream). Operations associated with the analysis modules 200 may be included wholly or partially within the anatomy, tool, and pathology processing block 130. Thus, operations of the analysis modules 200 may be used to generate the annotated view 135 or any related metadata that may be passed to the metadata stream. The surgical video 202 may also include metadata 204 (in some cases different from metadata generated by the analysis modules 200). The metadata 204 may include any data that is associated with the surgical video 202. Examples of metadata 204 can include timestamps, metadata regarding the surgery, patient demographics and the like. In some examples, the timestamps may be used by other modules to identify events, procedures, or surgical tools that are noted or recognized within the surgical video 202. A timestamp may be a unique identifier that may be associated with a particular frame of the surgical video 202. [0074] The surgical video 202 may be processed by an artificial intelligence pipeline 206. The artificial intelligence (Al) pipeline 206 may include one or more neural networks that may be executed by one or more processors to perform one or more of the operations described herein. In some cases, execution of a neural network may include execution of a neural network that has been trained to perform a task or recognize an image, as described herein.

[0075] The Al pipeline 206 can include an input video preprocessing module 210 to preprocess the surgical video 202. In some examples, the input video preprocessing module 210 can center crop the surgical video 202. For example, the input video preprocessing module 210 can remove black space surrounding one or more objects centrally located within a field of view (FOV) of the surgical video 202. The FOV of a video may refer to any areas that may be within a central portion of the surgical video 202. In some examples, removing black space may include removing black space around the central portion of the view with removing any non-important image content.

[0076] In some examples, the input video preprocessing module 210 can resize the surgical video 202. For example, the input video preprocessing module 210 can scale the surgical video 202 to a size of approximately 256x256 pixels. In other implementations, other sizes are possible. In some variations, the resized surgical video 202 may be reshaped to any size as may be required for any subsequent processing steps.

[0077] In some examples, the input video preprocessing module 210 can remove color information from the surgical video 202. For example, some subsequent processing steps or modules may include a neural network trained using black and white images. Thus, color information within the surgical video 202 may impede subsequent processing. The input video preprocessing module 210 may change or modify color information in the surgical video 202. In some examples, the input video preprocessing module 210 can convert colored image information of the surgical video 202 (RGB, YUV, or other color spaces) to a grayscale color space. In some examples, the input video preprocessing module 210 can remove color information after resizing. In some other examples, the input video preprocessing module 210 can remove color information prior to any resizing operations. [0078] Al pipeline 206 may also include an inside/outside detection module 212 to determine whether contents of the surgical video 202 are inside or outside a body. Such a determination may be useful for other modules, particularly those based on recognizing operations or tools. Recognizing a tool or anatomy outside of the body may not be useful. In some examples, a processor or processing node may execute an artificial intelligence program or neural network to determine or detect when the surgical video 202 shows images from inside a body.

[0079] The Al pipeline 206 may include primary Al modules 240 that can provide some or all of the preliminary analysis of the surgical video 202. In some examples, the primary Al modules 240 can include an FOV quality module 241, a monocular depth estimation module 242, a surgical procedure recognition module 243, a surgical action recognition module 244, a view recognition module 245, an implant recognition module 246, a tool recognition module 247, a pathology recognition module 248, and an anatomy recognition module 249. Outputs of any of the primary Al modules 240 can be referred to as intermediate assessment data. The intermediate assessment data may be used by other modules to perform further, deeper analysis on the surgical video 202. In general, the outputs from any of the primary Al modules 240 may determine any notable clinical aspects. The clinical aspects may be identified with text labels, masks (video masks), and or timestamps associated with particular video frames of the surgical video 202.

[0080] The FOV quality module 241 can provide a quality assessment of the surgical video 202. In some examples, the FOV quality module 241 may output one or more text labels and associated video timestamps (video frame identifiers) associated with image quality. For example, the FOV quality module 241 may output blurry, bloody, and/or excessive debris labels along with the video timestamps of the video associated with the labels. Image quality may affect the ability of other modules to process or review the surgical video 202. In some embodiments, the FOV quality module 241 may include one or more neural networks trained to identify various image quality characteristics, including blurry images, blood within an image, and excessive debris within an image.

[0081] The monocular depth estimation module 242 may provide or generate a heatmap of an anatomical structure. The heatmap can indicate a perceived depth or distance from the camera. In some examples, more distant regions are depicted with darker tones and more proximal regions are depicted with brighter tones. Thus, the output of the monocular depth estimation module 242 may be tonal image. In some embodiments, the monocular depth estimation module 242 may include one or more neural networks trained to determine perceived depth.

[0082] The surgical procedure recognition module 243 may be used to determine whether a surgical procedure has been recognized within the surgical video 202. In some examples, the surgical procedure recognition module 243 may output a text label the indicates which particular surgical procedure has been recognized as well a timestamp indicating when the surgical procedure appears in the surgical video 202. In some examples, the surgical procedure recognition module 243 may include one or more neural networks that have been trained to recognize any feasible surgical procedure and determine any associated video timestamps.

[0083] The surgical action recognition module 244 may be used to determine whether any surgical procedure has been performed. Thus, while the surgical procedure recognition module 243 can recognize that a surgical procedure has been identified, the surgical action recognition module 244 can determine whether the surgical procedure has been performed and/or completed. The surgical action recognition module 244 may output text labels and video timestamps that indicate whether any surgical action has been recognized as well as the video frame that includes the recognized surgical action. In some examples, the surgical action recognition module 244 may include one or more neural networks that have been trained to recognize the performance and/or completion of any feasible surgical procedure as well as determine the associated video timestamps.

[0084] The view recognition module 245 may be used to indicate a recognized view in the surgical video 202. In some examples, a surgery can be considered to be a sequence of actions where the surgeon treats a pathology through the sequence of actions. These actions or activities may be bookended by distinct scenes, especially at the end of a procedure. These distinct scenes form visual landmarks to aid in the evaluation and/or understanding of a surgical procedure. For example, when a surgeon completes a ‘site preparation,’ surgical best practices indicate that the region ‘looks’ a given way. The view recognition module 245 can recognize various scenes and output an associated text label. The view recognition module 245 can also output video timestamps indicating where in the surgical video the recognized view appears. The view recognition module 245 may include one or more neural networks trained to recognize various scenes and also output corresponding text labels and timestamps. [0085] The implant recognition module 246 can indicate whether the surgical video 202 includes any feasible recognized implant. Implants can include any hard or soft implant, replacement bone structure or the like. The implant recognition module 246 can output a text label that includes the name of the recognized implant. The implant recognition module 246 may include one or more neural networks trained to recognize various implants and also output the corresponding text labels and associated timestamps.

[0086] The tool recognition module 247 can indicate whether the surgical video 202 includes any feasible surgical tool. Surgical tools, scalpels, retractors, probes, and the like may be used in a variety of surgical procedures. In some examples, the tool recognition module 247 may include one or more neural networks trained to recognize any feasible surgical tool and also output a corresponding text label and a mask that outlines the recognized surgical tool. The tool recognition module 247 may also output any timestamps associated with the recognition of the surgical tool.

[0087] The pathology recognition module 248 can indicate whether the surgical video 202 includes any recognized pathology. A pathology can include any feasible disease or other physical malady. In some examples, the pathology recognition module 248 may include one or more neural networks that have been trained to recognize any feasible pathology and output a corresponding text label and mask that outlines or highlights the recognized pathology. The pathology recognition module 248 can also output a video timestamp associated with the recognized pathology.

[0088] The anatomy recognition module 249 can indicate whether the surgical video 202 includes any recognized anatomy. The recognized anatomy can include any feasible anatomy. In some examples, the anatomy recognition module 249 may include one or more neural networks that have been trained to recognize any feasible anatomy and output a corresponding text label and a mask that outlines or highlights the recognized anatomy. The anatomy recognition module 249 may also output a video timestamp associated with the recognized anatomy.

[0089] Outputs of some or all of the primary Al modules 240 may be stored in the datastore 250. The datastore 250 can be any feasible memory providing permanent or semipermanent data storage.

[0090] In addition, outputs from some or all of the primary Al modules 240 may be used as inputs for further Al processing. Thus, some or all of the outputs of any of the primary Al modules 240 may be intermediate data for further processing. Further Al processing is described below in conjunction with FIGS. 3-7.

[0091] FIG. 3 is a flowchart showing an example method 300 for generating an annotated view. In some examples, the method 300 can generate the annotated view 135 of FIG. 1. In some implementations, operations of the method 300 may be included in the anatomy, tool, and pathology processing block 130 of FIG. 1. Some examples may perform the operations described herein with additional operations, fewer operations, operations in a different order, operations in parallel, and some operations differently. The method 300 is described below with respect to the block diagram of FIG. 1, however, the method 300 may be performed by any suitable system or device.

[0092] The method 300 begins in block 302 where a surgical video stream is obtained or received. For example, a real-time surgical video 110 may be received. In some examples, the surgical video stream may be from a surgical camera (such as but not limited to, an arthroscopic, laparoscopic, and/or endoscopic camera). In some other examples, the surgical video stream can be any feasible live video stream from any feasible source, including a network ("cloud-based”) source.

[0093] Next, in block 304 anatomy, tool, and pathology recognition is performed. In some examples, the anatomy, tool (surgical tool), and pathology recognition may be performed by the anatomy, tool, and pathology processing block 130 of FIG. 1. Although anatomy, tools, and pathologies are mentioned here, any region of interested may be recognized or identified. For example, the anatomy, tool, and pathology processing block can also recognize or identify implants. In some other examples, anatomy, tool, and pathology recognition may be performed by any feasible processor executing one or more neural networks as described in conjunction with FIG. 2. In some cases, any recognized anatomy, tool, and/or pathology data may be provided to the metadata stream for use by other processing blocks or units. In some cases, the recognition may be through a hierarchical arrangement of algorithms. For example, a first recognition algorithm may broadly recognize an anatomical region, and a second algorithm may then be used to recognize a specific anatomical structure. Similarly, for hierarchal recognition of a pathology, pathology recognition models may be invoked on the anatomical structures segmented by recognition modules, thereby recognizing pathology specific to an anatomical structure.

[0094] Next, in block 306 an annotated view is generated. In some examples, the annotated view 135 may be generated by the anatomy, tool, and pathology processing block 130. In some implementations, the recognized anatomy, tool, and pathology data (from block 304) may be used to generate labels that are rendered next to or over the associated item. That is, any identified items (identified in block 304) may be annotated or identified in the annotated view 135. In some other implementations, the recognized anatomy, tool, and pathology data is used to determine (or render) a highlight, bounding box, or a label for any recognized item, particularly when the highlight, bounding box, or the label is shown near or over the surgical video stream. In general, some or all of the recognized anatomy, tool, and pathology information may be output to a metadata stream.

[0095] In some examples, the metadata stream may be implemented as a low-latency message bus which is also output to a database optimized for handling time-series data. Any feasible module or processing block can “listen” to the metadata stream to recieve any data output to the metadata stream by any other module or processing block.

[0096] In some cases, the recognition data may be output in a variety of formats. In some cases, the recognized structures are isolated while preserving the scale and orientation. In some cases, regions of interest are identified in the form of bounding boxes, in other cases, regions of interest are identified with labels classifying the objects. These bounding boxes and/or labels may be displayed on the annotated view.

[0097] FIG. 4 is a flowchart showing an example method 400 for generating a 3D view. In some examples, the method 400 can generate the 3D view 145 of FIG. 1. In some implementations, operations of the method 400 may be included in the 3D scene construction block 140 of FIG. 1. The method 400 is described below with respect to the block diagram of FIG. 1, however, the method 400 may be performed by any suitable system or device.

[0098] The method 400 begins in block 402 where a patient’s MRI and/or CT data is received or obtained. The MRI and CT data may have been collected as part of a preoperative procedure (or part of any radiological studies) to assess the patient and formulate a diagnosis and treatment. In some examples, the MRI and CT data may be stored in a database such as the database 170. In other examples, the MRI and CT data may be received through any feasible network.

[0099] Next, in block 404 the MRI and/or CT data is segmented. In some examples, the MRI and CT data may be segmented by an artificial intelligence (neural network) process or procedure. The segmented objects can include any feasible anatomical parts including individual joints, bones, tendons, muscles, appliances, and the like.

[0100] Next, in block 406 a 3D model is generated based on the segmented MRI and/or CT data. A three-dimensional model may be generated (rendered) since the segmentation procedure of block 404 is applied to each slice of the radiological image. Any segmented output from block 404 can inherit any 3D spatial information associated with the MRI and CT data. In this manner, any individual 3D models may be generated based on MRI and/or CT data.

[0101] Next, in block 408 a surgical video stream is received or obtained. In some examples, the surgical video stream can be the real-time surgical video stream 110 of FIG. 1. The surgical stream may be provided by any feasible surgical camera, or in some cases may be streamed through any feasible wired or wireless network (including the Internet or any cellular network).

[0102] Next, in block 410 recognized anatomy, tool, and/or pathology data is received or obtained. In some examples, the received anatomy, tool, and pathology data may be provided by the anatomy, tool, and pathology processing block 130. In some cases, the anatomy, tool, and pathology data may be received or obtained from the metadata stream. The recognized anatomy, tool, and/or pathology data indicates whether the video stream contains or includes any known (trained for) anatomies, implants, tools, or pathologies. For example, the anatomy, tool, and pathology processing block 130 may include one or more neural networks that have been trained to recognize (identify) various anatomies, implants, surgical tools, and pathologies within video streams.

[0103] Next, in block 412 a reconstructed 3D surgical view is generated. The 3D surgical view may be based on the received surgical video stream and the anatomy, tool, and pathology data. That is, the 3D surgical view may include the elements that are shown in the received surgical video stream, but replaced with a computer-generated representation (e.g., replaced with reconstructed 3D models of the elements in the surgical video stream). For example, if the received surgical video included a knee joint, associated tendons, and a surgical tool, then the reconstructed 3D surgical view can include 3D representations of the knee joint, associated tendons, and surgical tool. The 3D surgical view may be displayed on a video monitor. In some examples, the received anatomy, tool, and pathology data may be used to determine which elements or objects are included in the reconstructed 3D surgical view. In some other examples, the relative position of a recognized tool may be used to align the tool within the reconstructed 3D surgical view. In still other examples, any pathology data may be highlighted in the 3D surgical view.

[0104] In some examples, the reconstructed 3D surgical view can be generated based on an estimated camera position with respect to the received/obtained surgical video stream. In some cases, the estimated camera position may be from the point of view of a surgical camera associated with/providing the video stream. In some other cases, the estimated camera position may be a virtual camera position. For example, the surgical assistance system 100 may receive inputs (keyboard, mouse, or other inputs) from a user to rotate or otherwise move the reconstructed 3D surgical view. In rotating the reconstructed 3D surgical view, a position of a virtual camera (a camera pose) may shift or move in order to properly render the view. In some cases, rotating the reconstructed 3D surgical view can allow the surgeon to better visualize various anatomical landmarks or parts of the anatomy that may otherwise be hidden.

[0105] In some cases, the user may manipulate an anatomical joint shown in the reconstructed 3D surgical view. For example, the user can provide inputs to move a reconstructed 3D joint through a range of motion. The range of motion may be determined by a priori knowledge of the anatomical joint.

[0106] Some or all of the data associated with the reconstructed 3D surgical view may be output into the metadata stream. For example, reconstructed 3D elements (portions of segmented anatomy, recognized surgical tools, anatomical landmarks, etc.) can be output to the metadata stream. The output reconstructed 3D elements can correspond to any repositioning of the surgical view (rotation, manipulation, or other motion) performed by the surgeon, medical technician, or other operator. In some cases, the data output to the metadata stream may be specified in a configuration file.

[0107] In some examples, the reconstructed 3D surgical view can include digital landmarks. In some cases, the digital landmarks may be included with the received CT and/or MRI data. The digital landmarks can be transferred to the reconstructed 3D surgical view. In some other cases, a digital landmark can be determined from interaction with a surgical tool and an object included within a real-time surgical video. Although digital landmarks are mentioned here with particularity, any feasible pre-operative annotations may be transferred from the MRI/CT data to the reconstructed 3D surgical view or 3D model. In some examples, additional information associated with the digital landmarks or pathologies may be determined and/or displayed. For example, the surgical assistance system 100 can determine a distance between any two objects (digital landmarks, or the like) and display the determined distance on a display. In some cases, the scale (distances) of the MRI/CT data may be well defined. Thus, since the reconstructed 3D surgical view is based on the MRI/CT data, the distances between any two points in the reconstructed 3D surgical view may also be known. In some examples, the surgeon (or other user) can interactively measure points and determine distance by selecting start and end points on the reconstructed

[0108] In some implementations, measurements of various pathologies and anatomical structures may automatically be measured and reported on a display. Any of the measurements may be output to the metadata stream and stored in a database.

[0109] FIG. 5 is a flowchart showing an example method 500 for generating activity summaries. In some examples, the method 500 can generate the screen descriptors 155 and/or the activity summary 165 of FIG. 1. The method 500 is described below with respect to the block diagram of FIG. 1, however, the method 500 may be performed by any suitable system or device.

[0110] The method 500 begins in block 502 where a surgical video stream is received or obtained. In some examples, the surgical video stream can be the real-time surgical video stream 110 of FIG. 1. The surgical stream may be provided by any feasible surgical camera, or in some cases may be streamed through any feasible wired or wireless network (including the Internet or any cellular network).

[OHl] Next, in block 504 scene descriptors are generated based on a language and vision model. In some examples, the scene descriptors may be the scene descriptors 155 generated by the language and vision model 150. The language and vision model 150 may be implemented as a neural network that can be executed by one or more processors as described above with respect to FIG. 1. That is, block 504 may be implemented as a neural network trained similar to a large language model, however, in this example, inputs are a video image instead of words, phrases, or sentences.

[0112] Next, in block 506 the generated scene descriptors 155 may be output to the metadata stream. In some examples, the scene descriptors 155 may be stored in the database 170. The scene descriptors 155 may include human-readable phrases, or the like that describe one or more elements of a scene or frame of the real-time surgical video. In some cases, the scene descriptors may include timestamps that may be associated with one or more video frames that are included in the real-time surgical video. In some examples, the scene descriptors 155 may be stored in the database 170.

[0113] Next, in block 508 an activity summary is generated based on the previously generated scene descriptors 155. In some examples, the activity summary 165 may be generated by a large language model neural network trained to identify activities such as surgical activities, entry into an operating region, and the like, from the human-readable phrases that are included in the scene descriptors 155. Any feasible surgery or surgical activity may be summarized. For instance, a sequence of scene descriptions could indicate the arrival of a specific tool into the field of view, its activation as indicated by the appearance of the involved anatomical structure and the withdrawal of the tool from the field of view. These sequences of frames describe a surgical activity, such as debridement. The system may summarize these scene descriptions in terms of the surgical activity. Similarly, a sequence of these summarizations are abstracted into a higher-level activity, ex., site preparation. The activity summary 165 may be stored in the database 170. In some other examples, the activity summary 165 may be shown on a display.

[0114] In some examples, the activity summary 165 may not be provided or updated at the surgical video frame rate. For example, the activity summary 165 may be provided or updated at a rate slower than the surgical video frame rate. The slower rate of the activity summary 165 may be due to changes in anatomy position or surgical procedures occurring over the course of many video frames, in some cases over several seconds or minutes. In some cases, a rolling window of screen descriptors may be used to determine the activity summary.

[0115] In some examples, the relationship between generating the screen descriptors and the activity summary can be hierarchical. For example, the screen descriptor generation may be operating with lower-level inputs and the screen descriptors themselves may be used as inputs to for higher level activity summary generation.

[0116] In some examples, the activity summary 165 from many procedures may be reviewed with respect to surgical tools that have been recognized. For example, a typical level of bloodiness may be associated with a typical or normal use of the surgical tool. If within one particular activity summary 165 the level of bloodiness exceeds a typical level of bloodiness, then the activity summary 165 may also include a note indicating that the tool is producing more bleeding than normal and suggesting that the surgeon may need assistance using the tool. Advantageously, the activity summary 165 does not necessarily correspond to text labels, image borders, or parameters. Instead, the activity summary 165 may be based on the natural language output of the language and vision model 150.

[0117] FIG. 6 shows an example 3D surgical assistance system 600. A 3D surgical assistance system 600 can provide a surgeon or other medical technician real-time guidance regarding an ongoing surgical operation. In some examples, the surgical guidance may include a reconstructed 3D surgical view as described with respect to FIG. 4. That is, the surgical guidance may include any recognized anatomies, implants, tools, or pathologies that are associated with a real-time video stream, such as a video stream from a surgical camera. In some cases, the displayed 3D surgical view may be displayed with respect to an orientation of an external device, tool, surgical camera, or the like.

[0118] The 3D surgical assistance system 600 may include a 3D surgical assistance processor 610 coupled to a fixed video camera 630, a display 611, a non-fixed surgical camera 640, and a data store 620. The 3D surgical assistance processor 610 may determine or generate 3D models and images of anatomies, implants, tools, and/or pathologies. For example, the 3D surgical assistance processor 610 may generate a 3D view of a knee joint of a patient 660. (Although the knee joint is described here as an example, the 3D surgical assistance processor 610 can generate a reconstructed 3D surgical view of any feasible anatomy of the patient 660.) The generated 3D model may be shown on the display 611. In some examples, the reconstructed 3D view may be based on the patient’s previous MRI and CT data as well as real-time video stream data from the non-fixed surgical camera 640 as described with respect to FIG. 4. FIGS. 9A-9C illustrate examples of reconstructed 3D views showing surgical tools that may be used as described herein. Although described herein as a processor, some implementations of the 3D surgical assistance processor 610 may include more than just a processing device. In some examples, the 3D surgical assistance processor 610 may include one or more processors, memories, input/output interfaces, and the like. In some other examples, the 3D surgical assistance processor 610 may include state machines, programmable logic devices, microcontrollers, or any other feasible device or devices capable of performing operations of the 3D surgical assistance processor 610.

[0119] For example, in FIG. 9A, the imaging system includes the orientation of the external fiducial marker 901 and the tool 903. The tools position in external shape is shown, including relative to the anatomy to be operated on (e.g., the panned tunnel 905). The image also shows a plurality of sections (bottom left, middle and right) of the patient tissue, shown segmented. As the tool is moved relative to the tissue the orientation outside of the body may be tracked, as shown in FIG. 9B and 9C. In FIG. 9B, the tool 903 is shown not aligned to the target (e.g., planned tunnel) 905 and the tool may be marked (e.g., by changing the color) to show when it not FIG. 9B) and is (FIG. 9C) aligned.

[0120] In addition, fiducial marker or labels may be disposed on the non-fixed surgical camera 640 as well as an operating table 650 supporting the patient 660. For example, a first fiducial marker 670 may be affixed to the non-fixed surgical camera 640 and a second fiducial marker 671 may be disposed on a flat (or relatively flat) surface near the patient 660, and preferably near the anatomy (surgical region) that is the subject of an ongoing surgical procedure. In some other examples, the second fiducial marker 671 need not be placed on the operating table 650, but may be placed on any feasible surface, preferably near the patient 660. In still other examples, the second fiducial marker 671 need not be placed on a flat surface. The 3D surgical assistance system 610 can track the relative positions of the nonfixed surgical camera 640 with respect to the patient 660 through the fixed camera 630 viewing the first and second fiducial markers 670 and 671, respectively. The surgical assistance processor 610 can use external coordinates of the first and second fiducial markers 670 and 671 to determine positions of the non-fixed surgical camera 640 and an operating region, respectively. Furthermore, the surgical assistance processor 610 can determine a physical orientation of an anatomical structure (associated with a fiducial marker) with respect to any feasible surgical tool (associated with another, separate fiducial marker).

[0121] FIG. 7 is a flowchart showing an example method 700 for generating 3D surgical guidance. The method 700 is described with respect to the 3D surgical assistance system 600 of FIG. 6, however, the method 700 may be performed by any suitable system or device. [0122] The method 700 begins in block 702 where fiducial markers are affixed. For example, a first fiducial marker may be affixed to a surgical tool, such as but not limited to a surgical cameras, drill guides and the like. A second fiducial marker may be affixed to a reference surface associated with the patient. In some other examples, more than two fiducial markers may be used. In some cases, the additional fiducial markers may provide an increase in accuracy with respect to determining tool and anatomy positions.

[0123] The fiducial markers in block 702 may be the first and second fiducial markers 670 and 671 of FIG. 6. In practice, the fiducial markers can be any distinctive discernable marking. In some examples, the fiducial markers may be labels. In some other examples, the fiducial markers may be any feasible machine-readable markers such as aruco markers. In general, the size and shape characteristics of the fiducial markers may be predetermined and provided to the surgical assistance processor 610.

[0124] Next, in block 704 one or more cameras view the fiducial markers. In some examples, the fixed camera 630 may view the fiducial markers, although any camera with a field of view of sufficient to capture the fiducial markers affixed in block 702 can be used. For example, in block 704, the fixed camera 630 can view both the fiducial marker of the surgical instrument and the fiducial marker associated with the patient. The location and/or position of the fixed camera 630 (with respect to local surroundings, operating room, and the like) may be provided to the surgical assistance processor 610.

[0125] In some implementations, the fixed camera 630 may view a first fiducial marker and a second camera (such as the non-fixed surgical camera 640) can view a second fiducial marker. For example, the fixed camera 630 can view the first fiducial marker 670 affixed to the non-fixed surgical camera 640 and the non-fixed surgical camera 640 can view the second fiducial marker 671. Viewing the second fiducial marker 671 with the non-fixed surgical camera 640 may advantageously provide the surgical assistance processor 610 position and orientation information associated with the non-fixed surgical camera 640.

[0126] Next, in block 706, the surgical assistance processor 610 associates a first fiducial marker with a surgical instrument. The first fiducial marker may be one of the fiducial markers affixed in block 702. The surgical instrument can be any feasible surgical instrument including a surgical camera, or any feasible surgical tool. In some cases, the surgical assistance processor 610 may also ascertain and/or establish a location that is associated with the first fiducial marker.

[0127] Next, in block 708 the surgical assistance processor 610 associates a second fiducial marker with a surgical region. As described above, the second fiducial marker may be affixed to a reference surface. The reference surface can be any surface that is associated with a surgical region. In some cases, the reference surface can be near, adjacent, or next to the surgical region. A surgical region can be any region of the patient that may be involved or associated with any feasible surgical procedure. The surgical assistance processor 610 may also ascertain and/or establish a location associated with the second fiducial marker.

[0128] Next, in block 710 the surgical assistance processor 610 determines the location of the surgical instrument with respect to the surgical region. In some examples, the surgical assistance processor 610 can determine the location of the surgical instrument and the location of the surgical region. Then, the surgical assistance processor 610 can determine if and how the surgical instrument interacts with the surgical region. In some examples, the surgical assistance processor 610 determines the location of surgical instruments and surgical regions by determining the location of the associated fiducial markers.

[0129] In some examples, the surgical assistance processor 610 can determine the location of the surgical instrument with the fixed camera. For example, the fixed camera 630 can view the first fiducial marker 670. The fiducial markers, in general, can be a predetermined shape and size, such that when the fiducial marker is viewed through a camera, the distance to, and alignment of the fiducial marker can be determined. In some cases, the surgical assistance processor 610 can determine a “best fit” between a computed geometry (e.g., a reconstructed 3D surgical view) and the observed geometry of the first fiducial marker.

[0130] The surgical assistance processor 610 may be programmed to understand dimensions and layout of the operating room. For example, details regarding the length and position of walls, operating table, and the like may be determined and provided to the surgical assistance processor 610. In addition, the position of the fixed camera 630 with respect to the operating room, as well as camera characteristics such as resolution, zoom capability, and the like can be provided to the surgical assistance processor 610. Thus, using the operating room data and the characteristics of the fixed camera 630, the surgical assistance processor 610 can determine the location of the surgical instrument based at least in part on the view of the first fiducial marker 670 through the fixed camera 630.

[0131] The surgical assistance processor 610 can also determine the location of the surgical region. In some examples, the fixed camera 630 can view the second fiducial marker 671. Then, using the predetermined dimensions and layout of the operating room, as well as information regarding the location of the second fiducial marker 671 with respect to the surgical region, the surgical assistance processor 610 can determine the location of the surgical region. In some cases, the surgical assistance processor 610 can determine a “best fit” between a computed geometry (e.g., a reconstructed 3D surgical view of the patient’s anatomy) and the observed geometry of the second fiducial marker 671.

[0132] In some examples, the location of the surgical instrument may be determined based on the second fiducial marker 671. For example, when the surgical instrument includes a non-fixed camera, the non-fixed camera may be moved such that the second fiducial marker 671 can be viewed by the non-fixed camera. Since fiducial markers can have a predetermined shape and size, therefore, the view of the second fiducial marker through the non-fixed camera may provide information regarding the position/location of the surgical instrument. [0133] Because the surgical assistance processor 610 has determined the location of the surgical instrument and the location of the surgical region, the surgical assistance processor 610 can then determine the location of the surgical instrument with respect to the surgical region. In some examples, the surgical assistance processor 610 can determine relative positions and interactions of the surgical instrument with respect to the surgical region. [0134] Next, in block 712, the surgical assistance processor 610 can display surgical guidance. In some cases, the surgical guidance can include an associated reconstructed 3D surgical view as described above with respect to FIG. 4, as well as relative positions of a surgical region and surgical instruments as described in blocks 702, 704, 706, 708, and 710. The surgical guidance may be displayed on the display 611 which may be visible to the surgeon or any other medical technician in the operating room. In some other examples, the surgical guidance may be transmitted through a network for viewing at locations separated from the operating room. In some cases, the surgical guidance can include 3D models of any feasible surgical instruments, anatomies, pathologies, implants, anchors, anchor points, and the like.

[0135] In some cases, more than two fiducial markers may be used. Additional fiducial markers may increase a resolution or sensitivity of the surgical assistance processor 610 in determining the location of the surgical instrument or the location of the surgical region. For example, as the surgeon manipulates the surgical instrument, the view of the first fiducial marker by the fixed camera 630 may be impaired or occluded. Thus, some additional fiducial markers placed on different areas of the surgical instrument may help ensure a more continuous visibility independent of surgical instrument movement. Continuous visibility may increase resolution and/or sensitivity of the surgical assistance processor 610.

[0136] In a similar manner, multiple fixed cameras may be used to increase resolution or sensitivity. For example, if the surgeon or other medical technician impairs, blocks, or occludes the view of the fixed camera 630 (more particularly, limiting the sighting of one or more fiducial markers), then the surgical assistance processor 610 may have difficulty in resolving the positions of the surgical instrument and the surgical region. In those cases, the accuracy of the resolved positions may be reduced when vision of a single fixed camera is reduced or impeded.

[0137] In some examples, the reconstructed 3D surgical view displayed in the context of the method 700 may be based, at least in part, on pre-operative MRI or CT data. In this manner, the surgical guidance displayed in block 712 can show anatomical structures even when the structures are only partially visible with an inserted surgical camera (surgical tool). [0138] As described above, the example shown in FIG. 6 illustrates external tool fixation. The methods and apparatuses described herein may be used with a printed codes and/or pattern applied to the tool (e.g., printed sticker) that may be affixed to the tool, this code/pattem may be used as an identifying marker outside of the body, prior to insertion, to orient and/or normalize the position of the tool in space, relative to the body. For example, external tool fixation may provide an indicator of the relative distance and dimensions of the tool outside of the body, which may be used by the system to scale in relation to the tool inside of the body. For example, the printed sticker with the pattern may be imaged outside of the body by the same imaging platform. The system or method may identify the printed sticker pattern on the tool prior to insertion and the shape and/or orientation can be used to determine a plane, magnification, etc. of the imaging system, which may be used to orient the tool in the surgical space. When the tools and camera are inserted into the body, the orientation and scaling relative may be used as a reference. For example, inserting the tool and scope into the body (e.g., joint) after having the camera ‘see’ the tool (e.g., sticker) on the table outside of the body may provide a differential signal, e.g., an outside image, along with a relative position of the scope/camera with respect to the plane of the sticker (e.g., image). Within the body the captured images may be mapped/scaled to the original coordinates identified outside of the body.

[0139] In some cases, there may be two sets of fiducial markers (reference images, e.g., stickers) seen by the scope, e.g. on the tool and on a surface outside of the body (e.g., table, bed, patient, etc.). The camera may visualize these reference images, computing a relative position of the camera and the reference surface (e.g., table). Thus, the system may determine an alignment of the camera relative to the external reference, and this external reference may be used where the camera is inserted into the body, providing position/orientation in real space. This technique may use an external camera and an internal camera (scope) that is inserted into the body. When inserting the scope into the body, the scope may initially see/image the external sticker (e.g., the table, bed, etc.) and may give an instant computation of the relative orientation.

[0140] The markers (e.g., on the sticker) may be printed in a unique shape and size so that the external camera can immediately compute a distance and alignment from the marker(s). The shape may be continuously sensed and processed.

[0141] FIG. 8 shows a block diagram of a device 800 that may be one example of the intraoperative surgical assistance system 100 of FIG. 1 and/or the 3D surgical assistance system 600 of FIG. 6. The device 800 may include a communication interface 820, a processor 830, and a memory 840.

[0142] The communication interface 820, which may be coupled to a network (such as network 812) and to the processor 830, may transmit signals to and receive signals from other wired or wireless devices, including remote (e.g., cloud-based) storage devices, cameras, processors, compute nodes, processing nodes, computers, mobile devices (e.g., cellular phones, tablet computers and the like) and/or displays. For example, the communication interface 820 may include wired (e.g., serial, ethernet, or the like) and/or wireless (Bluetooth, Wi-Fi, cellular, or the like) transceivers that may communicate with any other feasible device through any feasible network. In some examples, the communication interface 820 may receive video stream data from a camera 810. In some implementations, the camera 810 can be a surgical camera, and the received video stream can be real-time surgical data. In some other examples, the communication interface 820 can receive a patient’s radiological data including MRI and CT data. In still other examples, the communications interface 820 can transmit display data from the device 800 to a display 811.

[0143] The processor 830, which is also coupled to the communications interface 820, and the memory 840, may be any one or more suitable processors capable of executing scripts or instructions of one or more software programs stored in the device 800 (such as within memory 840).

[0144] The memory 840 may include a data store 841 that may be used to locally store patient radiological data. For example, the data store 841 may include a patient’s MRI and/or CT scan data. In some examples, the data store 841 may include reconstructed 3D data. For example, the data store 841 may include 3D models of anatomy, any feasible surgical tool, and possible patient pathologies. In other examples, the data store 841 may store generated scene descriptors and activity summaries.

[0145] The memory 840 may also include a non-transitory computer-readable storage medium (e.g., one or more nonvolatile memory elements, such as EPROM, EEPROM, Flash memory, a hard drive, etc.) that may store the following software modules:

• an anatomy, tool, and pathology module 842 to recognize anatomies, implants, surgical tools and/or pathologies with a video stream or video data;

• a 3D scene construction module 843 to generate one or more 3D models and/or scenes;

• language and vision model module 844 to generate human-readable phrases that describe any feasible video image, scene, and the like;

• a scene summary LLM module 845 to generate scene summaries;

• a location determination module 846 to determine the location of surgical tools, surgical regions, and the like; and

• communication control module 847 to control communications through the communications interface 820. [0146] Each software module includes program instructions that, when executed by the processor 830, may cause the device 800 to perform the corresponding function(s). Thus, the non-transitory computer-readable storage medium of memory 840 may include instructions for performing all or a portion of the operations described herein.

[0147] The processor 830 may execute the anatomy, tool, and pathology module 842 to determine and/or recognize any feasible anatomies, implants, surgical tools, and/or pathologies that may be included in a video stream. In some examples, execution of the anatomy, tool, and pathology module 842 may execute one or more neural networks that have been trained to recognize and/or identify various anatomies, implants, surgical tools, and/or pathologies. For example, execution of the anatomy, tool, and pathology module 842 may cause the execution of one or more of the operations or neural networks described with respect to FIG. 1 to identify anatomies, implants, surgical tools, and/or pathologies.

[0148] In some cases, a label may be placed next to the recognized anatomy, tool, or pathology in the source video stream. In some other examples, a bounding box may surround or highlight the recognized anatomy, tool, or pathology. In still other examples, the recognized anatomy, tool, and pathology data may be provided to the metadata stream. Execution of the anatomy, tool, and pathology module 842 may generate an annotated view, such as the annotated view 115 of FIG. 1.

[0149] The processor 830 may execute the 3D scene construction module 843 to generate a 3D model of one or more objects that have been detected or recognized in a video stream. For example, execution of the 3D scene construction module 843 may generate a 3D model of a recognized anatomy , tool, or pathology. In some cases, the 3D scene construction module 843 may receive data indicating a recognized anatomy, tool, or pathology directly or indirectly from the anatomy, tool, and pathology module 842. Furthermore, the 3D scene construction module 843 can receive a patient’s MRI and/or CT data with which to construct associated 3D models.

[0150] Execution of the 3D scene construction module 843 may generate a 3D image for each recognized anatomy , tool, or pathology. In some aspects, the generated 3D images may be composited together to compose a rendered (computer-based) scene including representative objects included in the real-time video stream. The generated 3D images may be displayed in the 3D view 145.

[0151] The processor 830 can execute the language and vision model module 844 to generate descriptive, human-readable phrases that describe objects and activities that have been recognized within a video stream, including any feasible real-time video stream. In some examples, execution of the language and vision model module 844 may cause the execution of one or more neural networks trained to output a scene description in human- readable text (scene descriptors) based on an input of one or more frames of video as described with respect to FIG. 5. In some implementations, execution of the language and vision model module 844 may generate the human-readable scene descriptors 155. The human-readable scene descriptors 155 may be output to the metadata stream or stored in a database.

[0152] The processor 830 may execute the scene summary LLM module 845 to generate a human-readable activity summary based at least in part on the human-readable scene descriptors generated by the language and vision model module 844. In some examples, execution of the scene summary LLM module 845 may cause the execution of one or more large language model neural networks to operate on scene descriptors from multiple scenes (video frames) to determine or generate one or more activity summaries. The neural networks may be trained with the scene descriptors from the language and vision model module 844. In some examples, execution of the scene summary LLM module 845 can operate on a “rolling window” of scene descriptors to generate an activity summary.

[0153] The processor 830 may execute the location determination module 846 to determine the location of one or more objects visible by a camera, such as the camera 810. Execution of the location determination module 846 may process video data from one or more cameras (including camera 810 and/or a surgical camera, not shown) to locate one or more fiducial markers. In some examples, the location determination module 846 can determine the location of objects (surgical cameras, and the like) and regions (surgical regions and the like) based on view the one or more fiducial markers. Execution of the location determination module 846 may cause the processor 830 to perform one or more operations described with respect to FIGS. 6 and 7.

[0154] The processor 830 may execute the communication control module 847 to communicate with any other feasible devices. For example, execution of the communication control module 847 may enable the device 800 to communicate through the communications interface 820 via networks 812 including cellular networks conforming to any of the LTE standards promulgated by the 3^rd Generation Partnership Project (3GPP) working group, WiFi networks conforming to any of the IEEE 802.11 standards, Bluetooth protocols set forth by the Bluetooth Special Interest Group (SIG), Ethernet protocols, or the like. In some embodiments, execution of the communication control module 847 may enable the device 800 to communicate with one or more cameras 810, displays 811, or input devices such as keyboards, mice, touch pads, touch screens, or the like (not shown). In some other embodiments, execution of the communication control module 847 may implement encryption and/or decryption procedures.

Examples

[0155] In general, the methods and apparatuses described herein may use three or more video streams. In some examples two or more of them may be linked. In some examples one or more of them have a lower frame rate. In particular, the video stream used for natural language interpretation may have a lower famer rate, as used by the language/vision module. These methods and apparatuses may also include patient imaging date, e.g., MR.I/CT data. [0156] For example, the natural language module(s) may use 1-2 frames to describing the scene using human-understandable language (natural language). This language may be expressed by the module(s) in human-readable text that may be output by the one or more (e.g., language) modules. In addition, these methods and apparatuses may use scene descriptions and then scene summarizations, including determining what is occurring in the scene and describing, in natural language, the scene in order to develop an understanding of what the procedure is being performed, e.g., based on an understanding of the scene. These methods and apparatuses may include or access a database.

[0157] In general, the different data streams, e.g., shown in FIG. 1 schematically, do not need to be and may not be merged. For example, the text annotations may not be merged. In some examples the methods and apparatuses described herein may keep the video streams separate and distinct. For example, in some cases multiple streams (e.g., four distinct streams) may be displayed; e.g., the display may be presented as a picture-in-picture and/or may be toggled.

[0158] In some examples, the use of scene description by natural language may be use to describe the actions taken by the physician (e.g., surgeon) during a procedure; the scene description may be performed in real time and/or after the procedure and may be shown during the procedure, particularly to parties other than the surgeon, such as technician, nurses, etc. and/or may be entered into the patient or hospital records. In some cases, these method and apparatuses may use a scene activity database. Thus, the scene description using natural language describe herein may be used to provide an assistant (nurse, technician, etc.) context and information about what may be coming next during the procedure, e.g., tools needed, etc. in order to improve the flow of the procedure and assist a surgeon. Thus, the use of scene summarization may provide natural language interpretation of the ongoing procedure.

[0159] Any of these methods and apparatuses may include a database of videos (e.g., instruction videos, procedure videos, tool demos, etc.). These methods and apparatuses may be used or useful to suggest one or more tools and/or to provide a video of how a tool may be deployed, including in the context of the procedure being performed or to be performed.

Thus, in general the scene descriptions described herein may infer activities using natural language based on the scene. In general, a scene is distinct from a frame. Scene description is not necessarily a comparison between different frames, but is an interpretation, typically in natural language of the action occurring in the context of the images.

[0160] The use of a scene description, e.g., using video-to-natural language, as described herein, may be part of any system, or may be implemented independently of the other data streams described herein, e.g., as an independent tool, apparatus or method. The methods and apparatuses described herein may be a natural language scene description including description and/or inference of a range that a tool is moving. The scene description may provide a history and context of the content of a video stream using natural language.

[0161] In general, the language/vision (and scene summarization) modules may determine the context of the scenes and express them in natural language. This allows the scene summarization to provide a relatively high level of abstraction and summarization describing not just the immediate characteristics of the scene (e.g. location, presence of particular tools, identification of landmarks, etc.) but commenting on the context of these characteristics, including indicating the procedure type, possible complications or expectations, risks, etc. described in words.

[0162] As mentioned, the use of the language and scene summarization modules (e.g., the LLM) may be performed (and/or in some examples, output) in parallel with the other video data streams, including the 3D stream, the unmodified/original stream, the Al stream, etc. As mentioned, the LLM data may be provided to, and/or may use information from, a metadata stratum, as shown in FIG. 1. Alternatively, the LLM may be separated from the other streams. The frame rate for the LLM may be different, and in particular, may be slower than the other streams, including slower than video rate.

[0163] In general, the various streams, including but not limited to the LLM stream, may be unblended, so that they are not overlaid onto each other. In some cases, the primary view may remain unblended.

[0164] The LLM may provide a language-based (rather than pixel-based) segmentation of the video images/scenes. For example, the LLM may identify components of the scene in natural language, and describes these ‘segment’ descriptions in the summary. This may provide a more robust tracking, e.g., across images (scenes). For example, the natural language segmentation (in sentences) may describe a region of the tissue (e.g., lesion) from when it appears, it’s relative location (on a wall of the lumen, etc.), a diagnostic characteristic (discolored, inflamed, swollen, bleeding, etc.), changes during the scene(s), etc. Similarly, the use of natural language (sentences) for the description of tools (identifying the type of tool, the portion of the tool visible, etc.) may provide significantly more context and information than simple visual segmentation.

[0165] The LLM modules may be trained machine learning (e.g., deep learning) agents. These machine learning agents may be trained using a language-based, rather than tag-based approach. The approach may be iterative. For example, an initial approach may provide frame recognition, e.g., describe a scene initially identifying structures and the relative appearance of the structures (e.g., a first natural language module may describe the structures as being in the top left, describe the appearance of a tool, the rem oval/di sappearance of the tool, etc.); based on this language, the language module(s) may infer the action being performed (based on the language-inferred actions describing what is going on). In some cases, the module(s), such as the scene summarization modules, may be trained purely on the model langue. Images may be used to get the frame description. This description may then be integrated into a natural language narrative, e.g., summarizing the action(s) being performed. [0166] Although the use of the LLM stream is described herein in the medical/surgical context, it should be understood that these techniques may be applied more broadly, e.g., for narration of scenes not limited to the medical context.

[0167] It should be appreciated that all combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein and may be used to achieve the benefits described herein.

[0168] The process parameters and sequence of steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed. The various example methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or include additional steps in addition to those disclosed.

[0169] Any of the methods (including user interfaces) described herein may be implemented as software, hardware or firmware, and may be described as a non-transitory computer-readable storage medium storing a set of instructions capable of being executed by a processor (e.g., computer, tablet, smartphone, etc.), that when executed by the processor causes the processor to control perform any of the steps, including but not limited to: displaying, communicating with the user, analyzing, modifying parameters (including timing, frequency, intensity, etc.), determining, alerting, or the like. For example, any of the methods described herein may be performed, at least in part, by an apparatus including one or more processors having a memory storing a non-transitory computer-readable storage medium storing a set of instructions for the processes(s) of the method.

[0170] While various embodiments have been described and/or illustrated herein in the context of fully functional computing systems, one or more of these example embodiments may be distributed as a program product in a variety of forms, regardless of the particular type of computer-readable media used to actually carry out the distribution. The embodiments disclosed herein may also be implemented using software modules that perform certain tasks. These software modules may include script, batch, or other executable files that may be stored on a computer-readable storage medium or in a computing system. In some embodiments, these software modules may configure a computing system to perform one or more of the example embodiments disclosed herein.

[0171] As described herein, the computing devices and systems described and/or illustrated herein broadly represent any type or form of computing device or system capable of executing computer-readable instructions, such as those contained within the modules described herein. In their most basic configuration, these computing device(s) may each comprise at least one memory device and at least one physical processor.

[0172] The term “memory” or “memory device,” as used herein, generally represents any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions. In one example, a memory device may store, load, and/or maintain one or more of the modules described herein. Examples of memory devices comprise, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, or any other suitable storage memory.

[0173] In addition, the term “processor” or “physical processor,” as used herein, generally refers to any type or form of hardware-implemented processing unit capable of interpreting and/or executing computer-readable instructions. In one example, a physical processor may access and/or modify one or more modules stored in the above-described memory device. Examples of physical processors comprise, without limitation, microprocessors, microcontrollers, Central Processing Units (CPUs), Field-Programmable Gate Arrays (FPGAs) that implement softcore processors, Application-Specific Integrated Circuits (ASICs), portions of one or more of the same, variations or combinations of one or more of the same, or any other suitable physical processor. [0174] Although illustrated as separate elements, the method steps described and/or illustrated herein may represent portions of a single application. In addition, in some embodiments one or more of these steps may represent or correspond to one or more software applications or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks, such as the method step.

[0175] In addition, one or more of the devices described herein may transform data, physical devices, and/or representations of physical devices from one form to another. Additionally or alternatively, one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form of computing device to another form of computing device by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.

[0176] The term “computer-readable medium,” as used herein, generally refers to any form of device, carrier, or medium capable of storing or carrying computer-readable instructions. Examples of computer-readable media comprise, without limitation, transmission-type media, such as carrier waves, and non-transitory-type media, such as magnetic-storage media (e.g., hard disk drives, tape drives, and floppy disks), optical-storage media (e.g., Compact Disks (CDs), Digital Video Disks (DVDs), and BLU-RAY disks), electronic-storage media (e.g., solid-state drives and flash media), and other distribution systems.

[0177] A person of ordinary skill in the art will recognize that any process or method disclosed herein can be modified in many ways. The process parameters and sequence of the steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed.

[0178] The various exemplary methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or comprise additional steps in addition to those disclosed. Further, a step of any method as disclosed herein can be combined with any one or more steps of any other method as disclosed herein.

[0179] The processor as described herein can be configured to perform one or more steps of any method disclosed herein. Alternatively or in combination, the processor can be configured to combine one or more steps of one or more methods as disclosed herein.

[0180] When a feature or element is herein referred to as being "on" another feature or element, it can be directly on the other feature or element, or intervening features and/or elements may also be present. In contrast, when a feature or element is referred to as being "directly on" another feature or element, there are no intervening features or elements present. It will also be understood that, when a feature or element is referred to as being "connected", "attached" or "coupled" to another feature or element, it can be directly connected, attached or coupled to the other feature or element or intervening features or elements may be present. In contrast, when a feature or element is referred to as being "directly connected", "directly attached" or "directly coupled" to another feature or element, there are no intervening features or elements present. Although described or shown with respect to one embodiment, the features and elements so described or shown can apply to other embodiments. It will also be appreciated by those of skill in the art that references to a structure or feature that is disposed "adjacent" another feature may have portions that overlap or underlie the adjacent feature. [0181] Terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. For example, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items and may be abbreviated as "/".

[0182] Spatially relative terms, such as "under", "below", "lower", "over", "upper" and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as "under" or "beneath" other elements or features would then be oriented "over" the other elements or features. Thus, the exemplary term "under" can encompass both an orientation of over and under. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. Similarly, the terms "upwardly", "downwardly", "vertical", "horizontal" and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.

[0183] Although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element. Thus, a first feature/element discussed below could be termed a second feature/element, and similarly, a second feature/element discussed below could be termed a first feature/element without departing from the teachings of the present invention.

[0184] Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising” means various components can be co-jointly employed in the methods and articles (e.g., compositions and apparatuses including device and methods). For example, the term “comprising” will be understood to imply the inclusion of any stated elements or steps but not the exclusion of any other elements or steps.

[0185] In general, any of the apparatuses and methods described herein should be understood to be inclusive, but all or a sub-set of the components and/or steps may alternatively be exclusive, and may be expressed as “consisting of’ or alternatively “consisting essentially of’ the various components, steps, sub-components or sub-steps. [0186] As used herein in the specification and claims, including as used in the examples and unless otherwise expressly specified, all numbers may be read as if prefaced by the word "about" or “approximately,” even if the term does not expressly appear. The phrase “about” or “approximately” may be used when describing magnitude and/or position to indicate that the value and/or position described is within a reasonable expected range of values and/or positions. For example, a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc. Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value " 10" is disclosed, then "about 10" is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein. It is also understood that when a value is disclosed that "less than or equal to" the value, "greater than or equal to the value" and possible ranges between values are also disclosed, as appropriately understood by the skilled artisan. For example, if the value "X" is disclosed the "less than or equal to X" as well as "greater than or equal to X" (e.g., where X is a numerical value) is also disclosed. It is also understood that the throughout the application, data is provided in a number of different formats, and that this data, represents endpoints and starting points, and ranges for any combination of the data points. For example, if a particular data point “10” and a particular data point “15” are disclosed, it is understood that greater than, greater than or equal to, less than, less than or equal to, and equal to 10 and 15 are considered disclosed as well as between 10 and 15. It is also understood that each unit between two particular units are also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed.

[0187] Although various illustrative embodiments are described above, any of a number of changes may be made to various embodiments without departing from the scope of the invention as described by the claims. For example, the order in which various described method steps are performed may often be changed in alternative embodiments, and in other alternative embodiments one or more method steps may be skipped altogether. Optional features of various device and system embodiments may be included in some embodiments and not in others. Therefore, the foregoing description is provided primarily for exemplary purposes and should not be interpreted to limit the scope of the invention as it is set forth in the claims.

[0188] The examples and illustrations included herein show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. As mentioned, other embodiments may be utilized and derived there from, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Such embodiments of the inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed. Thus, although specific embodiments have been illustrated and described herein, any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

Claims

CLAIMS What is claimed is:

1. A method for generating a three-dimensional surgical view comprising: receiving or obtaining a patient’s radiological data; segmenting one or more anatomical structures from the patient’s radiological data; generating one or more three-dimensional (3D) models based on the one or more segmented anatomical structures; receiving a live surgical video stream; identifying an anatomical structure within the live surgical video stream; and generating a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

2. The method of claim 1, further comprising: identifying a surgical tool in the live surgical video stream; and superimposing a digital representation of the surgical tool onto the reconstructed 3D model.

3. The method of claim 2, wherein identifying the surgical tool includes executing a neural network trained to recognize the surgical tool within a video stream.

4. The method of claim 1, further comprising: identifying an implant in the live surgical video stream; and superimposing digital representation of the implant onto the reconstructed 3D model.

5. The method of claim 4, wherein identifying the implant includes executing a neural network trained to recognize the implant within a video stream.

6. The method of claim 1, further comprising: identifying an anatomical structure that includes a pathology; and highlighting the anatomical structure.

7. The method of claim 1, further comprising animating the 3D model to move in a realistic manner.

8. The method of claim 1, further comprising displaying the 3D model to replace the live surgical video stream.

9. The method of claim 8, wherein the displayed 3D model is a frame-by-frame replacement of the live surgical video stream.

10. The method of claim 1, wherein the patient’s radiological data includes magnetic resonance imagery (MRI) data, computed tomography (CT) data, or a combination thereof.

11. The method of claim 1, wherein identifying the anatomical structure includes executing a neural network trained to recognize anatomical structures included within a video stream.

12. The method of claim 1, wherein identifying the anatomical structure within the live surgical video stream comprises estimating a camera angle based on the identified anatomical structure.

13. The method of claim 12, wherein generating the reconstructed 3D model of the identified anatomical structure is based at least in part on the estimated camera angle.

14. A device comprising: one or more processors; and a memory configured to store instructions that, when executed by the one or more processors cause the device to: receive or obtain a patient’s radiological data; segment one or more anatomical structures from the patient’s radiological data; generate one or more three-dimensional (3D) models based on the segmented anatomical structures; receive a live surgical video stream; identify an anatomical structure within the live surgical video stream; and generate a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

15. A non-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising: receiving or obtaining a patient’s radiological data; segmenting one or more anatomical structures from the patient’s radiological data; generating one or more three-dimensional (3D) models based on the segmented anatomical structures; receiving a live surgical video stream; identifying an anatomical structure within the live surgical video stream; and generating a reconstructed 3D model of the identified anatomical structure based at least in part on the one or more 3D models based on the one or more segmented anatomical structures.

16. A method for generating an automatically annotated surgical video, the method comprising: receiving or obtaining a surgical video stream; identifying a region of interest within the surgical video stream; and generating an annotated view based on the identified region of interest.

17. The method of claim 16, wherein generating the annotated view includes rendering a bounding box around the region of interest.

18. The method of claim 16, wherein identifying the region of interest includes executing a neural network trained to recognize an anatomy within a video stream.

19. The method of claim 16, wherein identifying the region of interest includes executing a neural network trained to recognize a pathology within a video stream.

20. The method of claim 16, wherein identifying the region of interest includes executing a neural network trained to recognize a surgical tool within a video stream.

21. The method of claim 16, wherein generating the annotated view includes rendering labels near the region of interest.

22. The method of claim 16, wherein identifying the region of interest includes: identifying an anatomical region; and identifying specific anatomical structures within the anatomical region, by executing a neural network trained to detect specific anatomical features within the anatomical region.

23. A device comprising: one or more processors; and a memory configured to store instructions that, when executed by the one or more processors cause the device to: receive or obtain a surgical video stream; identify a region of interest within the surgical video stream; and generate an annotated view based on the identified region of interest.

24. A non-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising: receiving or obtaining a surgical video stream; identifying a region of interest within the surgical video stream; and generating an annotated view based on the identified region of interest.

25. A method for providing surgical guidance comprising: associating a surgical instrument with a first fiducial marker; associating a surgical region with a second fiducial marker; determining a location of the surgical instrument with respect to the surgical region; and displaying surgical guidance based on the location of the surgical instrument and the location of the surgical region.

26. The method of claim 25, wherein determining the location of the surgical instrument with respect the surgical region includes determining a location of the first fiducial marker and determining a location of the second fiducial marker.

27. The method of claim 25, wherein determining the location of the surgical instrument with respect to the surgical region includes viewing the first fiducial marker with a fixed camera.

28. The method of claim 25, wherein associating a surgical instrument with the first fiducial marker includes affixing the first fiducial marker to the surgical instrument.

29. The method of claim 25, wherein determining the location of the surgical instrument with respect to the surgical region includes viewing the second fiducial marker with a camera mechanically coupled to the surgical instrument.

30. The method of claim 25, wherein associating a surgical region with a second fiducial marker includes affixing the second fiducial marker onto a reference surface adjacent to a patient.

31. The method of claim 25, wherein determining the location of the surgical instrument with respect to the surgical region includes viewing the second fiducial marker with a fixed camera.

32. The method of claim 25, wherein displaying surgical guidance includes displaying a three-dimensional (3D) model of a patient’s anatomy.

33. The method of claim 25, wherein displaying surgical guidance includes displaying a 3D model of the surgical instrument.

34. The method of claim 25, wherein displaying the surgical guidance includes displaying an implant associated with a surgery.

35. The method of claim 25, wherein determining the location of the surgical instrument with respect to the surgical region includes viewing the first fiducial marker and the second fiducial marker with a plurality of cameras.

36. The method of claim 25, wherein displaying surgical guidance includes displaying a planned placement of implants and anchor points.

37. The method of claim 25, wherein displaying surgical guidance includes separately displaying a video stream from a camera associated with the surgical instrument.

38. The method of claim 25, wherein the surgical instrument includes an orthoscopic camera.

39. A device comprising: one or more processors; and a memory configured to store instructions that, when executed by the one or more processors cause the device to: associate a surgical instrument with a first fiducial marker; associate a surgical region with a second fiducial marker; determine a location of the surgical instrument with respect to the surgical region; and display surgical guidance based on the location of the surgical instrument and the location of the surgical region.

40. Anon-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising: associating a surgical instrument with a first fiducial marker; associating a surgical region with a second fiducial marker; determining a location of the surgical instrument with respect to the surgical region; and displaying surgical guidance based on the location of the surgical instrument and the location of the surgical region.

41. A method, the method comprising: receiving a stream of video images of a surgical procedure at a rate that is less than video rate; determining, using a first trained machine learning agent, a scene description from the stream of video images, wherein the scene description comprises one or more natural language sentences; determining, using a second trained machine learning agent, a narrative description for the surgical procedure using a plurality of scene descriptions; and outputting the narrative description.

42. The method of claim 41, further comprising segmenting, prior to determining the scene description, the video images to identify one or more of: surgical tools and anatomical features.

43. The method of claim 41, wherein receiving the stream of video images comprises receiving real-time surgical video images.

44. The method of claim 41, wherein determining the scene descriptions comprise generating one or more natural language sentences per image of the video stream images.

45. The method of claim 41, further comprising generating a database of scene descriptions, indexed by time.

46. The method of claim 45, wherein determining the narrative description for the surgical procedure comprises accessing the data of scene descriptions by the second trained machine learning agent.

47. The method of claim 41, wherein determining the narrative description for the surgical procedure comprises one or more natural language paragraphs including an indication of the surgical procedure being performed, wherein the surgical procedure is determined by the second trained machine learning agent based on the scene descriptions.

48. The method of claim 41, wherein outputting comprises outputting in near-real time.

49. The method of claim 41, wherein outputting comprises outputting to a nurse or technician assisting in the surgical procedure.

50. The method of claim 41, wherein outputting comprises outputting as part of an information view.

51. A device comprising: one or more processors; and a memory configured to store instructions that, when executed by the one or more processors cause the device to: receiving a stream of video images of a surgical procedure at a rate that is less than video rate; determining, using a first trained machine learning agent, a scene description from the stream of video images, wherein the scene description comprises one or more natural language sentences; determining, using a second trained machine learning agent, a narrative description for the surgical procedure using a plurality of scene descriptions; and outputting the narrative description.

52. Anon-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising: receiving a stream of video images of a surgical procedure at a rate that is less than video rate; determining, using a first trained machine learning agent, a scene description from the stream of video images, wherein the scene description comprises one or more natural language sentences; determining, using a second trained machine learning agent, a narrative description for the surgical procedure using a plurality of scene descriptions; and outputting the narrative description.