
WO2025137227A1 - Dilator proximal end blocker in a surgical visualization - Google Patents

Dilator proximal end blocker in a surgical visualization

Info

Publication number
WO2025137227A1
Authority
WO
WIPO (PCT)
Prior art keywords
dilator
proximal end
image data
region
video stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/060958
Other languages
French (fr)
Inventor
George Charles Polchin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Surgery Systems Inc
Original Assignee
Digital Surgery Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Surgery Systems Inc
Publication of WO2025137227A1

Classifications

    • G06T 7/11 (Image analysis; Segmentation; Region-based segmentation)
    • G02B 21/0012 (Microscopes specially adapted for specific applications; Surgical microscopes)
    • G02B 21/36 (Microscopes arranged for photographic or projection purposes or digital imaging or video purposes, including associated control and data processing arrangements)
    • G06N 20/20 (Machine learning; Ensemble learning)
    • G06N 3/045 (Neural networks; Combinations of networks)
    • G06N 3/08 (Neural networks; Learning methods)
    • G06V 10/12 (Image acquisition; Details of acquisition arrangements; Constructional details thereof)
    • G06V 20/40 (Scenes; Scene-specific elements in video content)
    • G06T 2207/20081 (Special algorithmic details; Training; Learning)
    • G06T 2207/20084 (Special algorithmic details; Artificial neural networks [ANN])
    • G06V 2201/031 (Recognition of patterns in medical or anatomical images of internal organs)

Definitions

  • Surgery can be a visually taxing job, as it typically requires a surgeon to focus for long periods on an anatomy of a patient with minimal distractions.
  • Surgical tools around a surgical site of a patient can be a common source of distraction that may impede or otherwise affect the surgeon’s performance during surgery.
  • One such surgical tool, a dilator, is used to expand an opening or passage into a patient’s inner anatomy (e.g., through the skin or other exterior tissue).
  • the proximal end of a dilator such as those used in certain spine and cranial neurosurgery procedures for operating “down the tube” can be visually distracting to the surgeon.
  • a system that uses deep learning (e.g., neural network) to detect and track a proximal end of a dilator used in surgical procedures.
  • the system may include, e.g., a digital surgical microscope (DSM) camera; optionally additional cameras having a wider field of view than that of the DSM camera; one or more processors; and memory.
  • the memory may store instructions that, when executed by the processors, may cause the system to receive, in real-time, via the DSM camera or the optional additional cameras, image data from an incoming surgical video stream of a field of view of the DSM camera.
  • the field of view may show a proximal end of a dilator.
  • the system may generate, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the proximal end of the dilator. Based on the location of the bounding box, the system may determine the region of the image data corresponding to the proximal end of the dilator.
  • a blocking overlay or at least one pixel modification algorithm or technique may be applied to the region of the image data corresponding to the proximal end of the dilator.
  • An updated image data may be generated in real-time or near real-time for an outgoing surgical video stream.
  • the outgoing surgical video stream may show the blocking overlay applied to the region of the image data corresponding to the proximal end of the dilator.
  • the system may further comprise a visualization system, which may display the outgoing surgical video stream in real-time.
  • the blocking overlay may be further applied to other non-relevant regions of the image data, such as those that represent regions outward from the proximal end of the dilator.
  • the instructions may also cause the system to track, in real-time, the proximal end of the dilator in the incoming surgical video stream; and cause, in real-time, the blocking overlay to follow the proximal end of the dilator in the outgoing surgical video stream.
  • the blocking overlay can follow the surgical tool by receiving new image data from the incoming surgical video stream of a new field of view of the DSM camera (which may also show the proximal end of the dilator); recognizing, based on a landmarking of the region of the field of view corresponding to the proximal end of the dilator, a new region of the new image data corresponding to the proximal end of the dilator; and causing the blocking overlay to shift to the new region of the new field of view corresponding to the proximal end of the dilator.
  • a method for detecting a proximal end of a dilator from image data, and mitigating the tool’s visually distracting effects.
  • the method may be performed by a processor associated with a computing device.
  • the method may include: receiving, in real-time by a computing device having a processor, image data from a surgical video stream of a field of view of a DSM camera, wherein the field of view shows a view of a proximal end of a dilator and a patient anatomy; generating, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the view of the proximal end of the dilator; determining, based on the location of the bounding box, the region of the image data corresponding to the view of the proximal end of the dilator; applying a blocking overlay or a pixel modification algorithm or technique to the region of the image data corresponding to the view of the proximal end of the dilator; and generating, in real-time, an updated image data for an outgoing surgical video stream, wherein the outgoing surgical video stream shows the patient anatomy and the blocking overlay applied to the region of the image data corresponding to the view of the proximal end of the dilator.
  • a non-transitory computer-readable medium for use on a computer system.
  • the non-transitory computer-readable medium may contain computer-executable programming instructions that may cause processors to perform one or more steps or methods described herein.
  • FIG. 1 shows a transformation of an image of a video stream of the digital surgical microscope (DSM) camera where a proximal end of a dilator is detected and its visually distracting effect is mitigated, according to an example embodiment of the present disclosure.
  • FIG. 2 shows a flow diagram of a method of training a deep learning model for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • FIG. 3 shows a diagram of an example deep learning model used for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • FIG. 4 shows a flow diagram of a method of applying a deep learning model for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
  • FIG. 5 shows a diagram of a surgical environment for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • FIG. 6 shows a block diagram of an example system used for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
  • the dilator can be visually distracting to a surgeon, and there is a desire and need to minimize the visually distracting effects.
  • the present disclosure relates in general to a system and method for detecting and tracking a proximal end of a dilator from image data, and mitigating the tool’s visually distracting effects.
  • Various embodiments of the present disclosure may involve training a deep learning (DL) model (e.g., a neural network).
  • the training may include labeling a large data set of reference image data (e.g., a plurality of frames from one or more surgical video streams) with a reference target (e.g., location of a proximal end of a dilator).
  • the trained deep learning model (e.g., the trained neural network) may be stored for use in detecting a proximal end of a dilator from a live surgical video stream in real-time.
  • image data from an incoming surgical video stream may be processed for application of the trained deep learning model.
  • the incoming surgical video may show a proximal end of a dilator (e.g., dilating a patient anatomy for surgery), which the trained deep learning model may identify via the image data.
  • a computing device may generate a bounding box around the detected region of the proximal end of the dilator in the image data.
  • a blocking overlay or selected pixel value modification algorithm(s) or technique(s) may be applied to the region (e.g., pixels of the image data) corresponding to the detected proximal end of the dilator, so that the pixels detected to be part of the offensive or distracting proximal end region are each processed to be less distracting.
  • An outgoing surgical video stream may show a field of view of a digital surgical microscope (DSM) camera where the region corresponding to the proximal end of the dilator is blocked, thus mitigating the visually distracting effects of the proximal end of the dilator.
  • other areas (e.g., regions of the image data that are outward from the circular proximal end of the dilator) may also or alternatively be blocked (e.g., via the blocking overlay).
  • knowledge of various features of, or landmarks placed in, the region of the image data corresponding to the proximal end of the dilator may be used to track the proximal end of the dilator over time through a surgical video stream.
  • the tracking may allow a faster and more efficient way to detect, and subsequently mitigate the visually distracting effects of, the proximal end of the dilator in the surgical video stream.
  • a machine learning model (e.g., a convolutional neural network or other deep learning model) may be trained to detect the inner walls of the dilator. Additionally, or alternatively, a machine learning model (e.g., a convolutional neural network or other deep learning model) may be trained to detect surgical site features, landmarks, or scenes present or typically present near the distal end of the dilator. For example, one or more large data sets of reference image data may include, among its reference targets, the inner walls of the dilator and/or surgical site features, landmarks, or scenes present or typically present near the distal site of the dilator. One or more of the above described trained machine learning models may then be applied to the live surgical video stream.
  • FIG. 1 shows a transformation of an image of a video stream of the DSM camera where a proximal end of a dilator is detected and its visually distracting effect is mitigated, according to an example embodiment of the present disclosure.
  • the transformation may occur by receiving image 102 from an incoming surgical video stream.
  • the image 102 may be received (e.g., as image data) by a computing device having one or more processors. As shown in FIG. 1, image 102 may show a proximal end 104 of a dilator.
  • the dilator may be a surgical tool used to dilate a patient anatomy of interest 106, and the proximal end 104 of the dilator may be a circular region of the dilator surrounding the patient anatomy 106 of interest to a surgeon.
  • the brightness, color, and/or contrast of the proximal end 104 of the dilator may be visually distracting to the surgeon.
  • One or more systems and methods presented herein may be used by a computing device to place or generate a bounding box 108 around the proximal end 104 of the dilator, as shown in image 110.
  • a deep learning model may be trained to identify, from a large dataset of reference image data, a bounding box 108 encapsulating a region of interest (e.g., the Cartesian bounds of the proximal end of the dilator).
  • the trained deep learning model may be used to identify the Cartesian bounds of the proximal end of the dilator (as shown by the bounding box 108 in image 110) in real-time or near real-time in the incoming surgical video stream provided via the DSM.
  • the general shape of the proximal end of the dilator within the bounding box 108 may be known, for example, as a simple circumscribed annulus bounded on all four Cartesian directions (e.g., +X, -X, +Y, -Y) by the bounding box 108.
  • the knowledge of the general shape of the proximal end of the dilator within a given bounding box may allow the computing device to insert a blocking overlay or apply a selected pixel modification algorithm/technique over the region of the image corresponding to the proximal end of the dilator.
  • the blocking overlay is shown as a solid black ring overlaying the region of the image corresponding to the proximal end of the dilator, thereby removing most of the visual distraction associated with the proximal end of the dilator.
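  • As an illustrative aid (not part of the original disclosure), a minimal sketch of such a blocking overlay is shown below; it assumes OpenCV/NumPy, an axis-aligned bounding box, and an arbitrary wall-width fraction for the annulus:

```python
import numpy as np
import cv2


def block_dilator_proximal_end(frame_bgr, bbox, wall_fraction=0.25):
    """Overlay a solid black ring over the dilator proximal end.

    frame_bgr: HxWx3 uint8 frame from the incoming surgical video stream.
    bbox: (x, y, w, h) bounding box around the proximal end, in pixels.
    wall_fraction: assumed ratio of annulus wall width to the outer radius.
    """
    x, y, w, h = bbox
    center = (x + w // 2, y + h // 2)
    outer_axes = (w // 2, h // 2)  # ellipse inscribed in the bounding box
    inner_axes = (int(outer_axes[0] * (1.0 - wall_fraction)),
                  int(outer_axes[1] * (1.0 - wall_fraction)))

    ring = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    cv2.ellipse(ring, center, outer_axes, 0, 0, 360, 255, thickness=-1)
    cv2.ellipse(ring, center, inner_axes, 0, 0, 360, 0, thickness=-1)  # keep anatomy visible

    blocked = frame_bgr.copy()
    blocked[ring > 0] = 0  # solid black blocking overlay on the annulus only
    return blocked
```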
  • image data from cameras or microscopes are in pixel units, and to translate these into real-world units like millimeters, a conversion process may be needed, often based on factors such as each camera’s magnification or resolution.
  • the computing device of the present disclosure may use a control algorithm that directly relies on the onscreen positions of the proximal end of the dilator as feedback data to guide or control some aspects of the system. This may eliminate the need to perform a conversion between pixel units (as seen in the digital image) and real-world millimeter units. By avoiding a pixel-to-millimeter conversion, the process may be streamlined, reduce computational overhead, and simplify calibration requirements.
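  • The following is a hypothetical sketch (not the patented control law) of such pixel-space feedback: a simple proportional step that nudges the camera so the detected dilator center approaches a desired onscreen position. The gain value and the `send_camera_velocity` interface are illustrative assumptions:

```python
def onscreen_centering_step(detected_center_px, target_center_px, gain=0.002):
    """Return a (pan, tilt) command computed from a pixel-space error only."""
    err_x = target_center_px[0] - detected_center_px[0]
    err_y = target_center_px[1] - detected_center_px[1]
    # The command stays in normalized units driven directly by pixel error;
    # no pixel-to-millimeter conversion is performed anywhere.
    return gain * err_x, gain * err_y


# Hypothetical use inside a tracking loop:
# pan, tilt = onscreen_centering_step(dilator_center_px, (width // 2, height // 2))
# send_camera_velocity(pan, tilt)  # assumed robot/camera interface, not a real API
```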
  • Cartesian coordinates of the proximal end of the dilator may be determined in an image space or a robot space.
  • An image space may generally refer to the 2D or 3D space of the image captured by a surgical microscope or camera. Coordinates determined in this space may be based on the pixels of the image (e.g., x and y may be pixel indices, and z may represent intensity or depth in a 3D setup). The image space may be tied to the visual representation seen on the screen.
  • a robot space may generally refer to the coordinate system of the robot or surgical device. Coordinates determined in this space may define the real-world physical positions of the tool or system in relation to the robot, typically in millimeters. The robot space may be tied to the actual mechanical operations and movement of the robot arms or devices. Different devices and systems (e.g., cameras, robots) operate in their own coordinate systems, so understanding the space may clarify the context of those coordinates.
  • the Cartesian coordinates of the proximal end of the dilator may be determined in an image space.
  • a heuristic or calibration process may be used by the computing device to determine (a priori or in real-time) a pixels-to-millimeter conversion over the whole optical operating range of the system to provide the robot commands to move the camera such that the dilator proximal end may be moved onscreen to desired positions.
  • the computing device may be configured to convert stereoscopic onscreen locations fully into a robot space.
  • the same or another machine learning model may be trained to detect the inner walls of the dilator.
  • the large data set of reference image data may include, among its reference targets, the inner walls of the dilator.
  • additional visual distraction may be removed by optionally blocking the remainder of the areas of less interest.
  • these areas may comprise areas that are outwards from the region of the image corresponding to the proximal end of the dilator.
  • knowledge of the image’s region of interest (ROI) to the surgeon may allow an image processor to use only such an ROI and ignore other parts of the image, such as the bright area corresponding to the proximal end 104 of the dilator as well as the region outwards of the dilator, thereby providing better image optimization for the region of interest (e.g., the region interior of the dilator, such as the patient anatomy of interest 106).
  • the precise location of surgical tools may change in the field of view of the DSM camera captured through a surgical video stream.
  • the trained deep learning model may be used to detect a new location for the proximal end of the dilator (e.g., via a bounding box).
  • the blocking overlay or a selected pixel modification algorithm may be reapplied at the new location, e.g., to “follow” the path of the proximal end of the dilator.
  • the region of the image corresponding to the proximal end of the dilator may be landmarked or various features associated with the region may be remembered (e.g., stored) for future recognition.
  • the computing device may thus use those landmarks or features to bypass subsequent application of the deep learning model when tracking the proximal end of the dilator over the course of the surgical video stream.
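  • Purely as an illustration of how landmark-based tracking could bypass the deep learning model between detections, the sketch below relocates a remembered template patch with normalized cross-correlation; the patch and search-window sizes are arbitrary assumptions:

```python
import numpy as np
import cv2


def track_landmark(prev_gray, next_gray, landmark_xy, patch=32, search=96):
    """Relocate a remembered landmark patch in the next frame.

    prev_gray / next_gray: consecutive grayscale frames (uint8).
    landmark_xy: (x, y) pixel location of the landmark in prev_gray.
    """
    x, y = landmark_xy
    template = prev_gray[y - patch:y + patch, x - patch:x + patch]
    window = next_gray[y - search:y + search, x - search:x + search]
    scores = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, best = cv2.minMaxLoc(scores)
    # Convert the best-match (top-left) location back to full-frame coordinates.
    return x - search + best[0] + patch, y - search + best[1] + patch
```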
  • the landmarks, features, or surgical scenes within the ring of the dilator may be detected via another or the same machine learning model (e.g., deep learning model).
  • a segmentation deep learning model (e.g., the SEGMENT ANYTHING MODEL) may be used to segment regions of interest such as, or including, the relevant surgical site near the distal end of the dilator.
  • the output of one or more of the above described machine learning models may be used to identify or assist in identifying the relevant surgical site near the distal end of the dilator.
  • two or more of the above described machine learning models may be combined into one machine learning model.
  • inferencing may be run separately on each such singular or compound model. The results of such models may be combined using various heuristics such as combinatorial processing.
  • one machine learning model (a first machine learning model) may be trained (e.g., using reference datasets) and applied to a live surgical video stream to detect the bounding box and center of mass of a dilator proximal end (e.g., using the methods described herein)
  • another machine learning model (second machine learning model) may be trained (e.g., using reference datasets) and applied to a live surgical video stream to detect the average width of the annulus of the dilator proximal end.
  • a combinatorial process may combine the resulting information (e.g., the location and/or dimensions of the bounding box, the center of mass of the dilator proximal end, the average width of the annulus, etc.) to generate (e.g., overlay on the live surgical video stream) an annulus of a thickness approximately matching that of the dilator proximal end, centered at the center of mass of the dilator proximal end.
  • the generated annulus may thus mimic the onscreen look of the dilator proximal end and may optionally be used to block said dilator proximal end from the user’s view.
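  • A hedged sketch of this combinatorial step follows; it assumes the bounding box, center of mass, and average annulus width are already available from the two models, and it synthesizes a circular annulus mask (the real onscreen shape may be elliptical or otherwise distorted):

```python
import numpy as np
import cv2


def synthesize_annulus_mask(frame_shape, bbox, center_of_mass, annulus_width_px):
    """Build a mask mimicking the onscreen annulus of the dilator proximal end.

    bbox: (x, y, w, h) from the first model; center_of_mass: (cx, cy) in pixels;
    annulus_width_px: average annulus width reported by the second model.
    """
    mask = np.zeros(frame_shape[:2], dtype=np.uint8)
    outer_radius = int(min(bbox[2], bbox[3]) / 2)             # from the box extent
    inner_radius = max(outer_radius - int(annulus_width_px), 0)
    cx, cy = int(center_of_mass[0]), int(center_of_mass[1])
    cv2.circle(mask, (cx, cy), outer_radius, 255, thickness=-1)
    cv2.circle(mask, (cx, cy), inner_radius, 0, thickness=-1)
    return mask  # 255 where the synthetic annulus lies; usable as a blocking overlay
```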
  • Such strategic use and disuse of the trained deep learning model may deliver improvements to the computing device and surgical visualization system by requiring less processing power, and by providing faster and more efficient detection and tracking of the proximal end of the dilator. Also or alternatively, other forms of detecting and tracking the proximal end of the dilator may be used, such as machine vision and a navigated dilator.
  • FIG. 2 shows a flow diagram of a method 200 of training a deep learning model (e.g., a neural network) for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • One or more steps of method 200 may be performed by a computing device having one or more processors.
  • the computing device may be communicatively linked to, or may be a part of, a surgical visualization system.
  • the surgical visualization system may be equipped to record video streams via a DSM camera (e.g., incoming surgical video streams) and output the video streams (e.g., outgoing surgical video streams) via a display.
  • the computing device performing one or more steps of method 200 may be separate from the surgical visualization system, and/or may be distinct from the computing device performing one or more steps of method 400 shown in FIG. 4, to be described herein.
  • method 200 may be used to train a deep learning model (e.g., a neural network) that can be stored (e.g., via cloud and/or an electronic storage medium), for use by a computing device performing method 400.
  • Method 200 may begin with receiving reference image data for a plurality of reference images showing a reference image target (block 202).
  • the descriptor “reference” may be used before image data, images, and/or feature vectors used in the training of the deep learning model, in order to distinguish from image data, images, features, and/or feature vectors used in the application of the trained deep learning model to a live surgical video stream in real-time or near real-time.
  • the computing device may generate, for each of the plurality of reference image data, a reference input feature vector based on a set of relevant features from the image data (block 204).
  • the set of relevant features may involve features or characteristics of the reference image data that may be predictive of the outcome of a reference target.
  • relevant features may include the color of an individual pixel of the image data, a contrast of an individual pixel with respect to neighboring pixels, a brightness of an individual pixel, a degree of similarity with a neighboring pixel, a curvature, etc.
  • the relevant features may comprise visualization pipe parameters.
  • the visualization pipe parameters may include but are not limited to, for example, sensor settings, exposure time, image processing parameters including brightness, contrast, high dynamic range (HDR) and other luma mapping, HDR and other tone mapping, HDR and other blending.
  • a reference input feature vector based on the set of relevant features may include values for each relevant feature of the set of relevant features.
  • the computing device may also receive, for each of the plurality of image data, labeled image data identifying a bounding box around the reference target (block 206).
  • the label may comprise one or more Cartesian coordinates (X, Y) indicating vertices of the bounding box in relation to the image.
  • the bounding box may represent the Cartesian bounds of the reference target (e.g., the furthest that a reference target is vertically and horizontally in an image).
  • the reference target may be used to “teach” or train the deep learning model (e.g., the neural network) to be able to recognize a proximal end of a dilator from a live surgical video stream (e.g., the actual target).
  • the computing device may determine, for each labeled reference image data, a reference output vector based on the bounding box (block 208).
  • the reference output vector may indicate the label of the specific reference image data, e.g., one or more Cartesian coordinates (X, Y) indicating vertices of the bounding box in relation to the image.
  • the reference output vector may indicate a location of the bounding box in relation to the image, thereby indicating the location of the reference target that the bounding box encapsulates.
  • the reference output vector may additionally indicate other objects of interest, e.g., locations, dimensions, and/or measurements of the inner walls of the dilator; locations, dimensions, and/or measurements of features, landmarks, and/or objects associated or typically associated with a surgical site; etc.
  • the computing device may train the neural network so that it can be used to detect and track a proximal end of a dilator from a live surgical video stream (block 214).
  • a neural network may be trained, for example, if it yields an optimized set of weights for use in the application of the trained neural network.
  • the optimized set of weights may be determined through iterative feedforward and backpropagation processes.
  • FIG. 3 provides a more detailed description of an example embodiment of the neural network used for the training in method 200, and used for the application of the trained neural network in method 400 (e.g., in FIG. 4).
  • the trained neural network may then be saved (e.g., to cloud and/or an electronic storage medium) for use in identifying and tracking a proximal end of a dilator from a live surgical video stream (block 216).
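  • For illustration only, a minimal PyTorch-style training sketch consistent with the described feedforward/backpropagation procedure is shown below; the feature-vector size, network shape, and stand-in tensors are assumptions, not values from the disclosure:

```python
import torch
from torch import nn

NUM_FEATURES = 16   # assumed length of the relevant-feature vector
NUM_OUTPUTS = 4     # e.g., (x_min, y_min, x_max, y_max) of the bounding box

model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_OUTPUTS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-ins for the labeled reference data (blocks 202-208); shapes only.
reference_inputs = torch.randn(256, NUM_FEATURES)
reference_outputs = torch.rand(256, NUM_OUTPUTS)

for epoch in range(100):                      # iterative feedforward + backpropagation
    predictions = model(reference_inputs)
    loss = loss_fn(predictions, reference_outputs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Save the optimized weights for later use on a live surgical video stream (block 216).
torch.save(model.state_dict(), "dilator_bbox_net.pt")
```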
  • FIG. 3 shows a diagram of an example deep learning model (e.g., neural network) used for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • the example deep learning model is a neural network comprising an input layer 302, one or more hidden layers 304, and an output layer 306. Each of these layers may comprise a plurality of nodes 310.
  • the input layer may be used to associate an input feature vector (e.g., the previously described reference input feature vector or an input feature vector as will be described herein in relation to FIG. 4).
  • the number of nodes of the input layer may correspond to the number of relevant features in the set of features used to construct the input feature vector.
  • the value of each feature may be entered in for each node of the input layer.
  • the output layer may be used to associate an output vector (e.g., the previously described reference output vector).
  • the weights 308 may be initialized and may be subsequently adjusted through iterative processes (e.g., backpropagation).
  • the training of the neural network may culminate in an optimized set of weights that allow the neural network to be able to closely predict the actual values or functions of the output layer (e.g., the location of the bounding box) from the input layer, e.g., within a predetermined margin of error.
  • the neural network model may be enhanced or optimized, e.g., to reduce runtime, increase accuracy, minimize overfitting and underfitting, etc.
  • one or more features from the set of features may be eliminated if the weights stemming from a node in the input layer representing those features are insignificant.
  • a hidden layer may be swapped, eliminated, or used as an input layer or an output layer in one or more iterations.
  • other machine learning models (e.g., convolutional neural networks, regression, etc.) may also or alternatively be used.
  • One or more steps of method 400 may be performed by a computing device communicatively linked to, or a part of, the surgical visualization system capturing (e.g., via a DSM camera) a surgical video stream of a surgical site.
  • the computing device may process and/or modify the incoming surgical video stream (e.g., via one or more processors performing one or more steps of method 400 to image data of the incoming surgical video stream) and generate the outgoing surgical video stream in real-time or near real-time (e.g., via the one or more steps of method 400).
  • Method 400 may begin with receiving, in real-time, image data of a surgical video stream captured by a DSM camera (e.g., an incoming surgical video stream) (block 402).
  • each video stream may comprise a collection of frames (e.g., images), which the computing device may receive as image data
  • subsequent steps of method 400 may be described in relation to a given image data of the plurality of image data that the incoming video stream comprises.
  • one or more subsequent steps may be applied to more than one given image data of the plurality of image data, or to each image data of the plurality of image data.
  • the computing device may generate an input feature vector based on a set of relevant features from the image data (block 404).
  • the set of relevant features may involve features or characteristics of the image data that may be predictive of the location of the proximal end of the dilator within the image represented by the image data.
  • the relevant features may include the color of an individual pixel of the image data, a contrast of an individual pixel in relation to neighboring pixels, a brightness of an individual pixel, a degree of similarity with a neighboring pixel, a curvature, etc.
  • An input feature vector based on the set of relevant features may include values for each relevant feature of the set of relevant features.
  • the input feature vector may comprise a vector comprising a value indicating a degree of red, green or blue for the relevant feature of a color; a value representing a degree of contrast, a value representing a degree of brightness, a value representing the degree and/or an existence (e.g., a binary true or false indication) of similarity with a neighboring pixel in one or more directions, and a value indicating a degree of curvature.
  • the set of features may correspond with the set of features used to form the reference input feature vectors in block 204 of method 200 shown in FIG. 2.
  • the input feature vector may be applied to the trained neural network having the stored set of weights (e.g., from block 216 of method 200 shown in FIG. 2) to determine an output vector (block 406).
  • the input feature vector for a given image data may be aligned along the input layer of a neural network model, such as that shown in FIG. 3, and a set of weights optimized in block 214 of method 200 may be applied between the layers of the neural network.
  • the trained neural network may yield an output vector, e.g., via values provided on the nodes of the output layer of the neural network.
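  • The following sketch illustrates block 406 under the same assumptions as the training sketch above: a frame’s feature vector is aligned along the trained network’s input layer, and the predicted bounding-box coordinates are read off the output layer. The feature extraction shown is a simplified stand-in, not the disclosure’s exact feature definitions:

```python
import numpy as np
import torch
from torch import nn

NUM_FEATURES, NUM_OUTPUTS = 16, 4
model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_OUTPUTS),
)
model.load_state_dict(torch.load("dilator_bbox_net.pt"))
model.eval()


def extract_feature_vector(frame_gray):
    """Simplified stand-in for the per-frame relevant-feature extraction."""
    flat = frame_gray.astype("float32") / 255.0
    stats = [flat.mean(), flat.std(), flat.max(), flat.min()]
    hist, _ = np.histogram(flat, bins=NUM_FEATURES - len(stats), range=(0.0, 1.0))
    features = stats + list(hist / max(hist.sum(), 1))
    return torch.tensor(features, dtype=torch.float32)


with torch.no_grad():
    frame_gray = np.zeros((480, 640), dtype=np.uint8)  # stand-in for a live frame
    bbox_pred = model(extract_feature_vector(frame_gray).unsqueeze(0))[0]
```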
  • the pixels of the given image interior of the inner ring of the ring-like proximal end of the dilator (e.g., an area of the image associated with the patient anatomy of interest) are not considered part of the area covered by the proximal end of the dilator.
  • FIG. 5 shows a diagram of a surgical environment for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
  • the surgical environment may include a patient 502 that may undergo surgery on a surgical site 106 on the body of the patient 502.
  • the surgery may be performed by a surgeon 504, who may utilize surgical tools, such as a dilator 508 to perform the surgery.
  • the surgeon 504 may rely on an apparatus comprising a DSM 514.
  • the DSM 514 may include one or more cameras 516 for viewing the surgical site 106 and receiving image and/or video data.
  • the DSM 514 may be supported by a surgical cart 506 via one or more robotic arms 512 to move and/or orient the DSM 514 and/or its associated one or more cameras 516 (e.g., to move alongside the surgical site and/or to move based on a surgeon’s commands).
  • the surgeon 504 may further rely on a display 510 to view image and/or video data captured and/or processed by the DSM camera 516 based on methods and processes described herein.
  • One or more aspects of the DSM 514, cameras 516, robotic arms 512, and display 510 may be controlled by a surgical visualization and navigation system 501 (referred to herein as “system” 501).
  • the system 501 may control the aforementioned components based on received image and/or video data from DSM camera 516.
  • the system 501 is configured to receive and process image and/or video data received by the DSM camera 516 as it identifies and tracks the dilator proximal end, and adjust the image and/or video output by the display 510 such that the proximal end of the dilator is blocked or otherwise rendered less distracting for the surgeon.
  • FIG. 6 and the related description describe various example components and functionalities of system 501 in more detail.
  • the system 501 may control the aforementioned components via user interface 518 and/or based on a command (e.g., voice command) of the surgeon 504.
  • FIG. 6 shows a block diagram of an example system (e.g., surgical visualization and navigation system 501) used for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
  • the example system (e.g., surgical visualization and/or navigation system 501) may include, but is not limited to, the one or more components shown in FIG. 6.
  • the system may be used to detect and track a dilator (e.g., proximal end 104 of the dilator 508) and/or a surgical site (e.g., a patient anatomy 106 enclosed by the proximal end 104 of the dilator 508), according to methods described herein.
  • the system 501 and the dilator 508 may be controlled and accessed by the surgeon 504 during a surgical procedure.
  • the processor 602 may comprise any one or more types of digital circuit configured to perform operations on a data stream, including functions described in the present disclosure.
  • the memory 608 may comprise any type of long term, short term, volatile, nonvolatile, or other memory and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.
  • the memory may store instructions that, when executed by the processor 602, can cause the system 501 to perform one or more methods discussed herein.
  • the robot control system 604 may comprise a controller or microcontroller configured to control one or more robotic arm actuators and/or end effectors 606 to move robotic arms 512 shown in FIG. 5.
  • the robotic arms 512 may cause movement and/or adjust the DSM 514 and/or DSM camera 516.
  • Movement of the digital surgical microscope 514 may allow one or more cameras 516 associated with the DSM 514 to better capture a field of view associated with the surgical site 106 and/or track the dilator during the surgical session.
  • the robot control system 604 may be configured to adjust one or more mechanical aspects of the DSM 514.
  • the display 510 may output image data and/or video data stream generated by the DSM 514 and rendered by the processor 602 to have the proximal end of the dilator (and other undesired areas of the field of view) blocked, according to the methods discussed herein.
  • the localizer 612 may comprise hardware and/or software configured to detect a pose (e.g., a position and/or orientation) of the DSM, the system, and/or the dilator in a specified viewing space.
  • the localizer 612 may supply this information to the system 501 responsive to requests for such information in periodic, real-time, or quasi-real-time fashion or at a constant rate.
  • the localizer 612 may also be equipped with a camera to capture a field of view of the surgical site.
  • the display 510 showing image data captured by the DSM 514 may also show (e.g., as an overlay or mask) a field of view of the localizer 612.
  • the DSM 514 may be rendered to be compatible with (e.g., by being rendered trackable by) the localizer (e.g., via a DSM navigation target).
  • the database 618 may comprise an electronic repository or data store for storing training datasets for training machine learning models described herein.
  • the training datasets may comprise for example large datasets of image data 620 received, for example, from a plurality of reference surgery sessions.
  • image data 620 may be rendered with one or more label(s) 622, for example, the locations or dimensions of a bounding box for a proximal end of a dilator; a center of mass for the dilator; the locations, dimensions or measurements of an inner surface of the dilator; or features or landmarks associated with a surgical site typically enclosed by the dilator.
  • the machine learning module 624 may comprise a software, program, and/or computer executable instructions that cause the system 501 to train machine learning models (e.g., using training data stored in database 618 or received externally) and/or apply one or more trained machine learning models 626, as described herein.
  • the calibration module 628 may comprise a software, program, and/or computer executable instructions that cause the system 501 to calibrate the DSM 514 (e.g., via predefined calibration routines 630).
  • the ground truth data capture module 632 may comprise hardware, software, a program, and/or computer executable instructions that cause the system 501 to capture locational data from the cameras 516 at scale to determine and/or train machine learning models to determine spatial orientation of one or more components of the system 501.

DSM Tracking The Dilator Proximal End
  • the detected shape and/or location of the dilator proximal end 104 may enable the system 501 to drive the robotic arm actuators and/or effectors 606 (e.g., via the robot control system 604) to position and/or orientate (e.g., pose) a stereoscopic camera 516 of the DSM 514 to capture the proximal end 104 of the dilator in an optimal manner.
  • the DSM camera 516 may be posed such that the stereoscopic optical axis of the DSM camera 516 is parallel to and centered at the center of the proximal end 104 of the dilator.
  • the field of view of the DSM camera 516 may focus on the relevant surgical site 106 near the distal end of the dilator.
  • the field of view may continue to focus on and track the relevant surgical site 106 in quasi real time as the surgical procedure progresses, in a hands-free automated manner, thereby freeing the surgeon 504 of this onerous task.
  • the onscreen view of the dilator proximal end 104 may be nominally circular when viewed with an ideal monoscopic camera 516.
  • the optical axis of the DSM camera 516 may be parallel to and coincident with the central axis of the dilator.
  • the onscreen view of the dilator proximal end 104 may become elliptical.
  • the onscreen shape of the detected dilator proximal end 104 can thus be fit to an ellipse.
  • the eccentricity of the ellipse found in each camera eye image may be used as a metric for the viewing angle for each camera eye. This metric is in some embodiments derived from a mathematical analysis of the system giving said viewing angle as a function of onscreen ellipse eccentricity as well as ellipse position.
  • the metric for determining viewing angle as a function of onscreen ellipse eccentricity and position may be taken as a direct output of a trained deep learning model 626.
  • the deep learning model may be trained with a ground truth input of image data 620 of the dilator proximal end in each camera with corresponding viewing angle. Such data may be generated using standard camera calibration routines 630 as well as one or more instrumented dilator tubes.
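  • As a non-authoritative sketch of the geometric variant of this metric, the snippet below fits an ellipse (OpenCV) to the detected proximal-end contour and derives its eccentricity; the mapping from eccentricity and position to an actual viewing angle is not reproduced here:

```python
import numpy as np
import cv2


def ellipse_eccentricity(detection_mask):
    """Fit an ellipse to the largest contour of a binary proximal-end mask."""
    contours, _ = cv2.findContours(detection_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)        # needs >= 5 points to fit
    (cx, cy), (axis1, axis2), _angle = cv2.fitEllipse(contour)
    a = max(axis1, axis2) / 2.0                         # semi-major axis
    b = min(axis1, axis2) / 2.0                         # semi-minor axis
    eccentricity = float(np.sqrt(1.0 - (b / a) ** 2))   # 0.0 for an on-axis (circular) view
    return (cx, cy), eccentricity
```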
  • the cameras 516 may be calibrated and then a “hand to eye” transformation can be determined for each camera 516 relative to the robot on which the DSM 514 is mounted (e.g., using OpenCV’s cv::calibrateHandEye).
  • the transformation from the robot end effector 606 to the camera eye of the camera 516 can be determined, for example, using cv::calibrateRobotWorldHandEye().
  • the optical axis of the camera 516 may thus be determined relative to the robot end effector 606.
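  • The snippet below sketches this hand-eye step using OpenCV’s calibrateHandEye; in practice the pose lists would come from recorded robot end-effector poses and calibration-target detections, whereas random stand-ins are used here only to show the call:

```python
import numpy as np
import cv2


def random_rotation():
    # Stand-in rotations; real data would come from recorded poses.
    return cv2.Rodrigues(np.random.uniform(-0.5, 0.5, size=3))[0]


n_poses = 10
R_gripper2base = [random_rotation() for _ in range(n_poses)]   # robot end-effector poses
t_gripper2base = [np.random.rand(3, 1) for _ in range(n_poses)]
R_target2cam = [random_rotation() for _ in range(n_poses)]     # calibration target in camera
t_target2cam = [np.random.rand(3, 1) for _ in range(n_poses)]

R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
    R_gripper2base, t_gripper2base, R_target2cam, t_target2cam,
    method=cv2.CALIB_HAND_EYE_TSAI)
# R_cam2gripper / t_cam2gripper relate a camera eye to the robot end effector,
# from which that eye's optical axis can be expressed in robot coordinates.
```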
  • the dilator tube may be instrumented such that the angle of its central axis relative to the robot base coordinate system can be rendered measurable and reported to the ground truth data capture module 632.
  • a geometric average of said viewing angles may be used as a goal for the robot control system 604.
  • the robot control system 604 may be controlled to attempt to drive said geometric average toward zero degrees or equivalent angular measure.
  • the geometric average of the centroids of the ellipse determined in each of the “eyes” of the stereoscopic camera 516 may be driven to be the center of the screen.
  • the optimal pose of the optical axis of each individual “eye” of the stereoscopic camera 516 relative to the central axis of the dilator tube might not actually be parallel and coincident to said central axis.
  • the stereoscopic axis may be the geometric average of the two individual camera eye optical axes.
  • the optimal pose of the stereoscopic optical axis relative to the central axis of the dilator tube may be parallel to and coincident with said central axis.
  • the trained deep learning model 626 uses computer vision and/or deep learning techniques to detect a shape of the dilator proximal end 104.
  • the system 501 determines a difference between a known/expected shape of the dilator proximal end 104 and the detected shape of the dilator proximal end 104.
  • the system 501 uses this difference to drive the robotic arm actuators and/or effectors 606 (e.g., via the robot control system 604) to position and/or orientate (e.g., pose) a stereoscopic camera 516 of the DSM 514, such that the camera optical axis always looks down the dilator tube.
  • each such camera is designed and posed to enable a view of the volume near the surgical site expected to contain the dilator proximal end 104.
  • Such design and posing can include fixed zoom and/or focus and/or variable zoom and/or focus.
  • each such camera is configured to view the dilator proximal end 104 under all normal operating conditions of the DSM 514. Where the range of operating conditions make it difficult or impossible for a single camera design to view the dilator proximal end 104 under all normal operating conditions of the DSM 514, one or more additional cameras are added, which taken together cover the range.
  • other sensors and/or cameras mounted on the DSM 514 may have a wider field of view than the DSM camera 516.
  • This wider field of view may include the scene in addition to the dilator proximal end 104.
  • This other camera or set of cameras removes the need for the digital surgical microscope view to contain the dilator proximal end 104.
  • the pose of each such camera relative to the DSM camera 516 is determined using computer vision techniques of multi-view geometry camera calibration.
  • the pose of each such camera and the DSM camera 516 relative to the system 501 or other posing system is found using computer vision “eye-to-hand” and/or “hand-to-eye” calibration techniques such as OpenCV’s cv::calibrateHandEye transformation.
  • the poses of said additional cameras are determined relative to each other and to each individual eye of the stereoscopic DSM camera 516 using standard camera calibration routines 630 such as are described in “Multiple View Geometry in Computer Vision” and/or described and/or implemented in and/or enabled by software packages (e.g., 3DF ZEPHYR from 3DFLOW and OPENCV).
  • one or more calibration routines 630 may be performed at each such zoom and focus, or at a containing or nearly- containing subset of such optical settings.
  • interpolation and/or extrapolation may be used to determine the calibration settings at the zoom and focus currently in use.
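  • A minimal sketch of such interpolation is shown below, assuming a single calibrated quantity (an effective focal length in pixels) tabulated at a few zoom settings; the numbers are placeholders, not real DSM calibration data:

```python
import numpy as np

# Placeholder calibration table: effective focal length (pixels) measured by
# the calibration routines 630 at a few discrete zoom settings.
calibrated_zoom = np.array([1.0, 2.0, 4.0, 8.0])
calibrated_focal_px = np.array([1800.0, 3500.0, 7200.0, 14500.0])


def focal_length_at(zoom):
    """Linearly interpolate (clamping at the ends) between calibrated settings."""
    return float(np.interp(zoom, calibrated_zoom, calibrated_focal_px))


# e.g., focal_length_at(3.0) falls between the 2.0x and 4.0x calibrations.
```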
  • a typically different set of deep learning models may be trained because the view of the dilator proximal end 104 as obtained by said external camera(s) may be typically different from the view obtained by the individual eyes (cameras) of the stereoscopic DSM camera 516.
  • the surgical site region of interest is typically inside a dilator tube.
  • a deep learning model is trained to segment pixels of the dilator proximal end 104 (typically a distorted annulus shape) separate from the surgical site that is visible within said shape.
  • This segmented surgical region of interest is then used to drive visualization pipeline processing parameters including sensor settings, exposure time, image processing parameters including brightness, contrast, high dynamic range (HDR) and other luma mapping, HDR and other tone mapping, and/or HDR and other blending.
  • the microscope light is typically reflected strongly back into the microscope by the flat surface of the dilator proximal end 104, resulting in saturated pixels recorded by the DSM camera 516.
  • when such saturated pixels are included, the calculation is skewed too low, thereby resulting in poor image quality of the region of interest. Excluding such pixels and using only pixels in the region of interest to calculate visualization pipeline processing parameters results in improved image quality for the region of interest.
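  • The sketch below illustrates ROI-only computation of one such parameter (a simple exposure correction), assuming a segmentation mask for the region of interest is available; the target brightness and gain are arbitrary:

```python
import numpy as np


def exposure_correction(frame_gray, roi_mask, target_mean=110.0, gain=0.01):
    """Compute a multiplicative exposure correction from ROI pixels only.

    frame_gray: HxW uint8 frame from the DSM camera.
    roi_mask: HxW mask (nonzero inside the segmented surgical-site region of
              interest), so saturated dilator proximal-end pixels are excluded.
    """
    roi_pixels = frame_gray[roi_mask > 0]
    if roi_pixels.size == 0:
        return 1.0                     # no ROI segmented; leave exposure unchanged
    mean_roi = float(roi_pixels.mean())
    # Because only ROI pixels are used, bright reflections from the dilator
    # proximal end cannot skew this statistic and darken the region of interest.
    return 1.0 + gain * (target_mean - mean_roi)
```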
  • each of the systems, structures, methods and procedures described herein may be implemented using one or more computer programs or components.
  • These programs and components may be provided as a series of computer instructions on any conventional computer-readable medium, including random access memory (“RAM”), read only memory (“ROM”), flash memory, magnetic or optical disks, optical memory, or other storage media, and combinations and derivatives thereof.
  • the instructions may be configured to be executed by a processor, which when executing the series of computer instructions performs or facilitates the performance of all or part of the disclosed methods and procedures.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Optics & Photonics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Surgery (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Computing systems and methods are disclosed for detecting and tracking a proximal end of a dilator from a surgical video stream, and mitigating the tool's visually distracting effects. In one example system, a memory may store instructions that, when executed by one or more processors, may cause the system to receive, in real-time, image data from an incoming surgical video stream of a field of view of the digital surgical microscope camera. The field of view may show a proximal end of a dilator. The system may generate, using the image data and a trained neural network, a bounding box around a region corresponding to the proximal end of the dilator. A pixel modification algorithm may be applied to the region of the image data corresponding to the proximal end of the dilator. An updated image data may be generated in real-time for an outgoing surgical video stream.

Description

DILATOR PROXIMAL END BLOCKER IN A SURGICAL VISUALIZATION
BACKGROUND
[0001] Surgery can be a visually taxing job, as it typically requires a surgeon to focus for long periods on an anatomy of a patient with minimal distractions. Surgical tools around a surgical site of a patient can be a common source of distraction that may impede or otherwise affect the surgeon’s performance during surgery. One such surgical tool, a dilator, is used to expand an opening or passage into a patient’s inner anatomy (e.g., through the skin or other exterior tissue). For example, the proximal end of a dilator such as those used in certain spine and cranial neurosurgery procedures for operating “down the tube” can be visually distracting to the surgeon.
[0002] There is thus a desire and need to assist in minimizing the distraction caused by the dilator. Various embodiments of the present disclosure address one or more shortcomings described above.
SUMMARY
[0003] The present disclosure provides new and innovative systems and methods for detecting and tracking a proximal end of a dilator from image data, and mitigating the tool’s visually distracting effects. In an example, a system is disclosed that uses deep learning (e.g., neural network) to detect and track a proximal end of a dilator used in surgical procedures. The system may include, e.g., a digital surgical microscope (DSM) camera; optionally additional cameras having a wider field of view than that of the DSM camera; one or more processors; and memory. The memory may store instructions that, when executed by the processors, may cause the system to receive, in real-time, via the DSM camera or the optional additional cameras, image data from an incoming surgical video stream of a field of view of the DSM camera. The field of view may show a proximal end of a dilator. The system may generate, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the proximal end of the dilator. Based on the location of the bounding box, the system may determine the region of the image data corresponding to the proximal end of the dilator. A blocking overlay or at least one pixel modification algorithm or technique may be applied to the region of the image data corresponding to the proximal end of the dilator. An updated image data may be generated in real-time or near real-time for an outgoing surgical video stream. Thus, the outgoing surgical video stream may show the blocking overlay applied to the region of the image data corresponding to the proximal end of the dilator. In some aspects, the system may further comprise a visualization system, which may display the outgoing surgical video stream in real-time. In some embodiments, the blocking overlay may be further applied to other non-relevant regions of the image data, such as those that represent regions outward from the proximal end of the dilator.
[0004] The instructions may also cause the system to track, in real-time, the proximal end of the dilator in the incoming surgical video stream; and cause, in real-time, the blocking overlay to follow the proximal end of the dilator in the outgoing surgical video stream. For example, the blocking overlay can follow the surgical tool by receiving new image data from the incoming surgical video stream of a new field of view of the DSM camera (which may also show the proximal end of the dilator); recognizing, based on a landmarking of the region of the field of view corresponding to the proximal end of the dilator, a new region of the new image data corresponding to the proximal end of the dilator; and causing the blocking overlay to shift to the new region of the new field of view corresponding to the proximal end of the dilator.
[0005] In an example, a method is disclosed for detecting a proximal end of a dilator from image data, and mitigating the tool’s visually distracting effects. The method may be performed by a processor associated with a computing device. The method may include: receiving, in realtime by a computing device having a processor, image data from a surgical video stream of a field of view of a DSM camera, wherein the field of view shows a view of a proximal end of a dilator and a patient anatomy; generating, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the view of the proximal end of the dilator; determining, based on the location of the bounding box, the region of the image data corresponding to the view of the proximal end of the dilator; applying a blocking overlay or a pixel modification algorithm or technique to the region of the image data corresponding to the view of the proximal end of the dilator; and generating, in real-time, an updated image data for an outgoing surgical video stream, wherein the outgoing surgical video stream shows the patient anatomy and the blocking overlay applied to the region of the image data corresponding to the view of the proximal end of the dilator.
[0006] In an example, a non-transitory computer-readable medium for use on a computer system is disclosed. The non-transitory computer-readable medium may contain computer-executable programming instructions that may cause processors to perform one or more steps or methods described herein.
BRIEF DESCRIPTION OF THE FIGURES
[0007] FIG. 1 shows a transformation of an image of a video stream of the digital surgical microscope (DSM) camera where a proximal end of a dilator is detected and its visually distracting effect is mitigated, according to an example embodiment of the present disclosure.
[0008] FIG. 2 shows a flow diagram of a method of training a deep learning model for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
[0009] FIG. 3 shows a diagram of an example deep learning model used for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
[0010] FIG. 4 shows a flow diagram of a method of applying a deep learning model for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
[0011] FIG. 5 shows a diagram of a surgical environment for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure.
[0012] FIG. 6 shows a block diagram of an example system used for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
DETAILED DESCRIPTION
[0013] As previously discussed, the dilator can be visually distracting to a surgeon, and there is a desire and need to minimize the visually distracting effects. The present disclosure relates in general to a system and method for detecting and tracking a proximal end of a dilator from image data, and mitigating the tool’s visually distracting effects.
[0014] Various embodiments of the present disclosure may involve training a deep learning (DL) model (e.g., a neural network). The training may include labeling a large data set of reference image data (e.g., a plurality of frames from one or more surgical video streams) with a reference target (e.g., location of a proximal end of a dilator). The trained deep learning model (e.g., the trained neural network) may be stored for use in detecting a proximal end of a dilator from a live surgical video stream in real-time. During a surgery session, image data from an incoming surgical video stream may be processed for application of the trained deep learning model. The incoming surgical video may show a proximal end of a dilator (e.g., dilating a patient anatomy for surgery), which the trained deep learning model may identify via the image data. A computing device may generate a bounding box around the detected region of the proximal end of the dilator in the image data. Furthermore, a blocking overlay or selected pixel value modification algorithm(s) or technique(s) may be applied to the region (e.g., pixels of the image data) corresponding to the detected proximal end of the dilator to at least process the pixels that are detected to be part of the offensive or distracting proximal end region, and process each of these pixels respectively to be less distracting. An outgoing surgical video stream (e.g., as displayed to the user) may show a field of view of a digital surgical microscope (DSM) camera where the region corresponding to the proximal end of the dilator is blocked, thus mitigating the visually distracting effects of the proximal end of the dilator. In some aspects, other areas (e.g., regions of the image data that are outward from the circular proximal end of the dilator) may also or alternatively be blocked (e.g., via the blocking overlay). In some aspects, knowledge of various features of, or landmarks placed in, the region of the image data corresponding to the proximal end of the dilator may be used to track the proximal end of the dilator over time through a surgical video stream. The tracking may allow a faster and more efficient way to detect, and subsequently mitigate the visually distracting effects of, the proximal end of the dilator in the surgical video stream.
[0015] In some embodiments, a machine learning model (e.g., a convolutional neural network or other deep learning model) may be trained to detect the inner walls of the dilator. Additionally, or alternatively, a machine learning model (e.g., a convolutional neural network or other deep learning model) may be trained to detect surgical site features, landmarks, or scenes present or typically present near the distal end of the dilator. For example, one or more large data sets of reference image data may include, among their reference targets, the inner walls of the dilator and/or surgical site features, landmarks, or scenes present or typically present near the distal site of the dilator. One or more of the above described trained machine learning models may then be applied to the live surgical video stream. For example, the output of one or more of the above described trained machine learning models may be used to identify and/or assist in identifying the relevant surgical site near the distal end of the dilator used in the live surgical video stream.
[0016] FIG. 1 shows a transformation of an image of a video stream of the DSM camera where a proximal end of a dilator is detected and its visually distracting effect is mitigated, according to an example embodiment of the present disclosure. The transformation may occur by receiving image 102 from an incoming surgical video stream. The image 102 may be received (e.g., as image data) by a computing device having one or more processors. As shown in FIG. 1, image 102 may show a proximal end 104 of a dilator. As previously discussed, the dilator may be a surgical tool used to dilate a patient anatomy of interest 106, and the proximal end 104 of the dilator may be a circular region of the dilator surrounding the patient anatomy 106 of interest to a surgeon. The brightness, color, and/or contrast of the proximal end 104 of the dilator may be visually distracting to the surgeon.
[0017] One or more systems and methods presented herein may be used by a computing device to place or generate a bounding box 108 around the proximal end 104 of the dilator, as shown in image 110. For example, a deep learning model may be trained to identify, from a large dataset of reference image data, a bounding box 108 encapsulating a region of interest (e.g., the Cartesian bounds of the proximal end of the dilator). The trained deep learning model may be used to identify the Cartesian bounds of the proximal end of the dilator (as shown by the bounding box 108 in image 110) in real-time or near real-time in the incoming surgical video stream provided via the DSM. In some aspects, the general shape of the proximal end of the dilator within the bounding box 108 may be known, for example, as a simple circumscribed annulus bounded on all four Cartesian directions (e.g., +X, -X, +Y, -Y) by the bounding box 108. The knowledge of the general shape of the proximal end of the dilator within a given bounding box may allow the computing device to insert a blocking overlay or apply a selected pixel modification algorithm/technique over the region of the image corresponding to the proximal end of the dilator. For example, as shown in image 112, the blocking overlay is shown as a solid black ring overlaying the region of the image corresponding to the proximal end of the dilator, thereby removing most of the visual distraction associated with the proximal end of the dilator.
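By way of illustration only, the following sketch shows one way the annulus region and the solid-ring overlay of image 112 could be derived from a detected bounding box. It assumes an HxWx3 image array, that the proximal end is circumscribed by the box, and an inner-to-outer diameter ratio that is a hypothetical, tool-specific constant; NumPy is used for the mask arithmetic.

```python
import numpy as np

def block_dilator_annulus(frame, bbox, inner_outer_ratio=0.75, fill=(0, 0, 0)):
    """Paint a solid ring over the dilator proximal end (illustrative sketch).

    frame: HxWx3 uint8 image from the incoming stream.
    bbox:  (x0, y0, x1, y1) Cartesian bounds of the proximal end from the detector.
    inner_outer_ratio: assumed ratio of inner to outer diameter of the annulus
                       (a hypothetical, tool-specific constant).
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # annulus center
    r_outer = min(x1 - x0, y1 - y0) / 2.0       # circumscribed by the box
    r_inner = r_outer * inner_outer_ratio       # leaves the anatomy of interest visible

    h, w = frame.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    dist2 = (xs - cx) ** 2 + (ys - cy) ** 2
    ring = (dist2 <= r_outer ** 2) & (dist2 >= r_inner ** 2)

    out = frame.copy()
    out[ring] = fill                            # solid black ring, as in image 112
    return out
```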
[0018] Typically, image data from cameras or microscopes are in pixel units, and to translate these into real-world units like millimeters, a conversion process may be needed, often based on factors such as each camera’s magnification or resolution. In some implementations, the computing device of the present disclosure may use a control algorithm that directly relies on the onscreen positions of the proximal end of the dilator as feedback data to guide or control some aspects of the system. This may eliminate the need to perform a conversion between pixel units (as seen in the digital image) and real-world millimeter units. By avoiding a pixel-to-millimeter conversion, the process may be streamlined, reduce computational overhead, and simplify calibration requirements.
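As a non-authoritative sketch of such a control algorithm, the proportional step below operates entirely in pixel units; the gain value and the (dx, dy) command convention are illustrative placeholders rather than the actual robot interface.

```python
def centering_command(bbox, frame_shape, gain=0.002):
    """Proportional control step computed directly in pixel units (sketch).

    Returns a normalized (dx, dy) command that nudges the camera so the detected
    proximal end drifts toward the screen center; no pixel-to-millimeter
    conversion is performed.  Gain and sign convention are assumptions.
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    h, w = frame_shape[:2]
    err_x = cx - w / 2.0          # onscreen error, in pixels
    err_y = cy - h / 2.0
    return -gain * err_x, -gain * err_y
```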
[0019] Further, the Cartesian coordinates of the proximal end of the dilator may be determined in an image space or a robot space. An image space may generally refer to the 2D or 3D space of the image captured by a surgical microscope or camera. Coordinates determined in this space may be based on the pixels of the image (e.g., x and y may be pixel indices, and z may represent intensity or depth in a 3D setup). The image space may be tied to the visual representation seen on the screen. On the other hand, a robot space may generally refer to the coordinate system of the robot or surgical device. Coordinates determined in this space may define the real-world physical positions of the tool or system in relation to the robot, typically in millimeters. The robot space may be tied to the actual mechanical operations and movement of the robot arms or devices. Different devices and systems (e.g., cameras, robots) operate in their own coordinate systems, so understanding the space may clarify the context of those coordinates.
[0020] In one preferred embodiment, the Cartesian coordinates of the proximal end of the dilator may be determined in an image space. A heuristic or calibration process may be used by the computing device to determine (a priori or in real-time) a pixel-to-millimeter conversion over the whole optical operating range of the system to provide the robot commands to move the camera such that the dilator proximal end may be moved onscreen to desired positions. In an additional embodiment, the computing device may be configured to convert stereoscopic onscreen locations fully into a robot space.
[0021] In some embodiments, the same or another machine learning model (e.g., a convolutional neural network or other deep learning model) may be trained to detect the inner walls of the dilator. For example, the large data set of reference image data may include, among its reference targets, the inner walls of the dilator.
[0022] In some embodiments, additional visual distraction may be removed by optionally blocking the remainder of the areas of less interest. For example, as shown in image 114, these areas may comprise areas that are outwards from the region of the image corresponding to the proximal end of the dilator. Also or alternatively, knowledge of the image’s region of interest (ROI) to the surgeon may allow an image processor to use only such an ROI and ignore other parts of the image, such as the bright area corresponding to the proximal end 104 of the dilator as well as the region outwards of the dilator, thereby providing better image optimization for the region of interest (e.g., the region interior of the dilator, such as the patient anatomy of interest 106).
[0023] As a surgery session progresses, it is to be appreciated that the precise location of surgical tools, such as the dilator, may change in the field of view of the DSM camera captured through a surgical video stream. In some aspects, as the dilator moves along the field of view of the DSM camera, the trained deep learning model may be used to detect a new location for the proximal end of the dilator (e.g., via a bounding box). The blocking overlay or a selected pixel modification algorithm may be reapplied at the new location, e.g., to “follow” the path of the proximal end of the dilator. In some embodiments, the region of the image corresponding to the proximal end of the dilator, once detected via the trained deep learning model, may be landmarked or various features associated with the region may be remembered (e.g., stored) for future recognition. The computing device may thus use those landmarks or features to bypass subsequent application of the deep learning model when tracking the proximal end of the dilator over the course of the surgical video stream.
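One possible way to “remember” the detected region and bypass the deep learning model between detections is template matching; the sketch below uses OpenCV’s normalized cross-correlation, with an illustrative confidence threshold below which the deep learning model would be rerun. The class and threshold are assumptions for illustration, not the disclosed tracking method.

```python
import cv2

class ProximalEndTracker:
    """Cache the detected region as a template and track it between full
    detector passes (illustrative sketch; threshold is an assumed value)."""

    def __init__(self, frame, bbox):
        x0, y0, x1, y1 = bbox                  # integer pixel bounds from the detector
        self.template = frame[y0:y1, x0:x1].copy()
        self.size = (x1 - x0, y1 - y0)

    def track(self, new_frame, min_score=0.80):
        res = cv2.matchTemplate(new_frame, self.template, cv2.TM_CCOEFF_NORMED)
        _, score, _, top_left = cv2.minMaxLoc(res)
        if score < min_score:
            return None                        # low confidence: rerun the deep learning model
        x, y = top_left
        w, h = self.size
        return (x, y, x + w, y + h)            # shifted bounding box for the overlay
```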
[0024] In some embodiments, the landmarks, features, or surgical scenes within the ring of the dilator may be detected via another or the same machine learning model (e.g., deep learning model). For example, in some embodiments a segmentation deep learning model (e.g., the SEGMENT ANYTHING MODEL) may be used to identify various regions of interest, such as, or including, the relevant surgical site near the distal end of the dilator.
[0025] The output of one or more of the above described machine learning models may be used to identify or assist in identifying the relevant surgical site near the distal end of the dilator.
[0026] In some embodiments, two or more of the above described machine learning models (e.g., convolutional neural network or other deep learning models) may be combined into one machine learning model. In embodiments where there are multiple deep learning models (e.g., singular or compounded from multiple models as described), inferencing may be run separately on each such singular or compound models. The results of such models may be combined using various heuristics such as combinatorial processing.
[0027] For example, while one machine learning model (a first machine learning model) may be trained (e.g., using reference datasets) and applied to a live surgical video stream to detect the bounding box and center of mass of a dilator proximal end (e.g., using the methods described herein), another machine learning model (a second machine learning model) may be trained (e.g., using reference datasets) and applied to a live surgical video stream to detect the average width of the annulus of the dilator proximal end. A combinatorial process may combine the resulting information (e.g., the location and/or dimensions of the bounding box, the center of mass of the dilator proximal end, the average width of the annulus, etc.) to generate (e.g., overlay on the live surgical video stream) an annulus of a thickness approximately matching that of the dilator proximal end centered at the center of mass of the dilator proximal end. The generated annulus may thus mimic the onscreen look of the dilator proximal end and may optionally be used to block said dilator proximal end from the user’s view.
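A minimal sketch of such a combinatorial step is shown below; it assumes the first model supplies the bounding box (and, implicitly, the center of mass) and the second supplies the average annulus width, and the illustrative function simply converts these into inner and outer radii about the center of mass.

```python
def synthesize_blocking_annulus(bbox, center_of_mass, annulus_width):
    """Combine outputs of two detectors into an annulus description
    (center, inner radius, outer radius).  Illustrative only."""
    x0, y0, x1, y1 = bbox
    r_outer = min(x1 - x0, y1 - y0) / 2.0       # from the first model's bounding box
    r_inner = max(r_outer - annulus_width, 0.0) # from the second model's average width
    cx, cy = center_of_mass
    return (cx, cy, r_inner, r_outer)
```

The resulting tuple could then drive the same ring-painting step sketched earlier to block the dilator proximal end from view.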
[0028] Such strategic use and disuse of the trained deep learning model may deliver improvements to the computing device and surgical visualization system by requiring less processing power, and by providing faster and more efficient detection and tracking of the proximal end of the dilator. Also or alternatively, other forms of detecting and tracking the proximal end of the dilator may be used, such as machine vision and a navigated dilator.
[0029] FIG. 2 shows a flow diagram of a method 200 of training a deep learning model (e.g., a neural network) for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure. One or more steps of method 200 may be performed by a computing device having one or more processors. In some aspects, the computing device may be communicatively linked to, or may be a part of, a surgical visualization system. The surgical visualization system may be equipped to record video streams via a DSM camera (e.g., incoming surgical video streams) and output the video streams (e.g., outgoing surgical video streams) via a display. In other aspects, the computing device performing one or more steps of method 200 may be separate from the surgical visualization system, and/or may be distinct from the computing device performing one or more steps of method 400 shown in FIG. 4, to be described herein. For example, method 200 may be used to train a deep learning model (e.g., a neural network) that can be stored (e.g., via cloud and/or an electronic storage medium), for use by a computing device performing method 400.
[0030] Method 200 may begin with receiving reference image data for a plurality of reference images showing a reference image target (block 202). For ease and simplicity, the descriptor “reference” may be used before image data, images, and/or feature vectors used in the training of the deep learning model, in order to distinguish from image data, images, features, and/or feature vectors used in the application of the trained deep learning model to a live surgical video stream in real-time or near real-time.
[0031] The computing device may generate, for each of the plurality of reference image data, a reference input feature vector based on a set of relevant features from the image data (block 204). The set of relevant features may involve features or characteristics of the reference image data that may be predictive of the outcome of a reference target. For example, for image data based on images of a surgical video stream, in which the target may be a location of a proximal end of a dilator, relevant features may include the color of an individual pixel of the image data, a contrast of an individual pixel with respect to neighboring pixels, a brightness of an individual pixel, a degree of similarity with a neighboring pixel, a curvature, etc. In some embodiments, the relevant features may comprise visualization pipe parameters. The visualization pipe parameters may include but are not limited to, for example, sensor settings, exposure time, image processing parameters including brightness, contrast, high dynamic range (HDR) and other luma mapping, HDR and other tone mapping, HDR and other blending. A reference input feature vector based on the set of relevant features may include values for each relevant feature of the set of relevant features. For example, the reference input feature vector may comprise a vector comprising a value indicating a degree of red, green or blue for the relevant feature of a color; a value representing a degree of contrast, a value representing a degree of brightness, a value representing the degree and/or an existence (e.g., a binary true or false indication) of similarity with a neighboring pixel in one or more directions, and a value indicating a degree of curvature. The set of features, and corresponding feature vectors, may increase or decrease based on an assessment of relevance (e.g., predictability) in relation to an outcome of the reference target (e.g., location of a proximal end of a dilator).
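Purely as an illustration of how such a feature vector might be assembled, the sketch below computes color, brightness, local contrast, and a single neighbor-similarity value for one pixel of an RGB frame; the feature choice, window size, and normalization are assumptions, not the disclosed feature set.

```python
import numpy as np

def pixel_feature_vector(frame, x, y, win=3):
    """Illustrative per-pixel feature vector: RGB color, brightness,
    local contrast, and similarity to the right-hand neighbor."""
    h, w = frame.shape[:2]
    r, g, b = frame[y, x].astype(float)
    brightness = (r + g + b) / 3.0

    y0, y1 = max(0, y - win), min(h, y + win + 1)
    x0, x1 = max(0, x - win), min(w, x + win + 1)
    patch = frame[y0:y1, x0:x1].astype(float).mean(axis=2)
    contrast = patch.max() - patch.min()          # simple local contrast measure

    nx = min(x + 1, w - 1)
    similarity = 1.0 - abs(brightness - frame[y, nx].astype(float).mean()) / 255.0

    return np.array([r, g, b, brightness, contrast, similarity])
```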
[0032] The computing device may also receive, for each of the plurality of image data, labeled image data identifying a bounding box around the reference target (block 206). The label may comprise one or more Cartesian coordinates (X, Y) indicating vertices of the bounding box in relation to the image. The bounding box may represent the Cartesian bounds of the reference target (e.g., the furthest that a reference target is vertically and horizontally in an image). The reference target may be used to “teach” or train the deep learning model (e.g., the neural network) to be able to recognize a proximal end of a dilator from a live surgical video stream (e.g., the actual target). While a reference target may comprise proximal ends of dilators shown in past or non-live frames of surgical video streams, the reference target need not be a surgical tool. For example, a reference target may include, e.g., a letter (e.g., an O); a number (e.g., “0”); an animal; and/or a shape (e.g., a ring or a circle). As the deep learning model “learns” to recognize such reference targets from the plurality of reference image data, the trained deep learning model (e.g., the trained neural network) can then apply that learning to detect the proximal end of a dilator from a live surgical video stream, as will be described in relation to FIG. 4. In some aspects, the reference image data may be labeled manually. For example, a user may input, via the computing device, a bounding box around the reference target, thus indicating the location of the reference target.
[0033] The computing device may determine, for each labeled reference image data, a reference output vector based on the bounding box (block 208). The reference output vector may indicate the label of the specific reference image data, e.g., one or more Cartesian coordinates (X, Y) indicating vertices of the bounding box in relation to the image. Moreover, the reference output vector may indicate a location of the bounding box in relation to the image, thereby indicating the location of the reference target that the bounding box encapsulates. In some aspects, the reference output vector may additionally indicate other objects of interest, e.g., locations, dimensions, and/or measurements of the inner walls of the dilator; locations, dimensions, and/or measurements of features, landmarks, and/or objects associated or typically associated with a surgical site; etc.
[0034] The computing device may then associate the reference input feature vectors and the reference output vector with a neural network (block 210). For example, the computing device may associate the reference input feature vectors with an input layer of the neural network, and may associate the reference output vector with an output layer of the neural network. Furthermore, weights in the neural network may be initialized (block 212). In some aspects, biases may be initialized and provided for any layer of the neural network.
[0035] The computing device may train the neural network to be able to use to detect and track a proximal end of a dilator from a live surgical video stream (block 214). A neural network may be trained, for example, if it yields an optimized set of weights for use in the application of the trained neural network. The optimized set of weights may be determined through iterative feedforward and backpropagation processes. As will be described herein, FIG. 3 provides a more detailed description of an example embodiment of the neural network used for the training in method 200, and used for the application of the trained neural network in method 400 (e.g., in FIG. 4). The trained neural network may then be saved (e.g., to cloud and/or an electronic storage medium) for use in identifying and tracking a proximal end of a dilator from a live surgical video stream (block 216).
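A compact sketch of the iterative feedforward and backpropagation training described above is given below using PyTorch; the layer sizes, learning rate, loss function, and epoch count are illustrative and not values contemplated by the disclosure.

```python
import torch
import torch.nn as nn

def train_bbox_regressor(features, boxes, epochs=100, lr=1e-3):
    """Train a small multilayer perceptron that maps reference input feature
    vectors (N x F float tensor) to reference output vectors (N x 4 float
    tensor of bounding-box vertices).  Illustrative sketch only."""
    model = nn.Sequential(
        nn.Linear(features.shape[1], 64), nn.ReLU(),   # input layer -> hidden layer
        nn.Linear(64, 64), nn.ReLU(),                  # hidden layer
        nn.Linear(64, 4),                              # output layer: bounding-box vertices
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), boxes)         # feedforward pass
        loss.backward()                                # backpropagation
        opt.step()                                     # weight update
    return model
```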
[0036] FIG. 3 shows a diagram of an example deep learning model (e.g., neural network) used for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure. The example deep learning model is a neural network comprising an input layer 302, one or more hidden layers 304, and an output layer 306. Each of these layers may comprise a plurality of nodes 310. The input layer may be used to associate an input feature vector (e.g., the previously described reference input feature vector or an input feature vector as will be described herein in relation to FIG. 4). For example, nodes of the input layer may correspond to the number of relevant features in the set of features used to construct the input feature vector. The value of each feature may be entered in for each node of the input layer. The output layer may be used to associate an output vector (e.g., the previously described reference output vector).
[0037] Each node of the output layer may correspond to a set of features by which the output feature vector may be constructed. For example, each node of an output feature vector representing the location of a bounding box, and therefore a location of a proximal end of a dilator, may represent one or more Cartesian coordinates, an X value, or a Y value, to indicate the vertices of the bounding box. The nodes between the input layer and the output layer may represent a point in the training process where an activation function may be performed, or a weight may be reassessed. The activation function may depend on a node from one layer to a node in the next layer, and may rely on a weight 308 assigned between each pair of nodes. The weights 308 may be initialized and may be subsequently adjusted through iterative processes (e.g., backpropagation). The training of the neural network may culminate in an optimized set of weights that allow the neural network to be able to closely predict the actual values or functions of the output layer (e.g., the location of the bounding box) from the input layer, e.g., within a predetermined margin of error.
[0038] In some aspects, the neural network model may be enhanced or optimized, e.g., to reduce runtime, increase accuracy, minimize overfitting and underfitting, etc. For example, one or more features from the set of features may be eliminated if the weights stemming from a node in the input layer representing those features are insignificant. Furthermore, a hidden layer may be swapped, eliminated, or used as an input layer or an output layer in one or more iterations. In some aspects, other machine learning models (e.g., convolutional neural networks, regression, etc.) may be used instead of, or in addition to, the neural network model.
[0039] FIG. 4 shows a flow diagram of a method of applying a trained deep learning model (e.g., a trained neural network) for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure.
[0040] One or more steps of method 400 may be performed by a computing device communicatively linked to, or a part of, the surgical visualization system capturing (e.g., via a DSM camera) a surgical video stream of a surgical site. The computing device may process and/or modify the incoming surgical video stream (e.g., via one or more processors performing one or more steps of method 400 on image data of the incoming surgical video stream) and generate the outgoing surgical video stream in real-time or near real-time (e.g., via the one or more steps of method 400).
[0041] Method 400 may begin with receiving, in real-time, image data of a surgical video stream captured by a DSM camera (e.g., an incoming surgical video stream) (block 402). As each video stream may comprise a collection of frames (e.g., images), which the computing device may receive as image data, subsequent steps of method 400 may be described in relation to given image data of a plurality of image data that the incoming video stream comprises. In some aspects, one or more subsequent steps may be applied to more than one given image data of the plurality of image data, or to each image data of the plurality of image data.
[0042] For given image data of the plurality of image data, the computing device may generate an input feature vector based on a set of relevant features from the image data (block 404). The set of relevant features may involve features or characteristics of the image data that may be predictive of the location of the proximal end of the dilator within the image represented by the image data. For example, the relevant features may include the color of an individual pixel of the image data, a contrast of an individual pixel in relation to neighboring pixels, a brightness of an individual pixel, a degree of similarity with a neighboring pixel, a curvature, etc. An input feature vector based on the set of relevant features may include values for each relevant feature of the set of relevant features. For example, the input feature vector may comprise a vector comprising a value indicating a degree of red, green or blue for the relevant feature of a color; a value representing a degree of contrast, a value representing a degree of brightness, a value representing the degree and/or an existence (e.g., a binary true or false indication) of similarity with a neighboring pixel in one or more directions, and a value indicating a degree of curvature. In some aspects, the set of features may correspond with the set of features used to form the reference input feature vectors in block 204 of method 200 shown in FIG. 2.
[0043] In some embodiments, the computing device may generate an input feature vector by identifying and relying on visualization pipe parameters. The visualization pipe parameters may include but are not limited to, for example, sensor settings, exposure time, image processing parameters including brightness, contrast, high dynamic range (HDR) and other luma mapping, HDR and other tone mapping, HDR and other blending. The region of interest for which the visualization pipe parameters are obtained may typically be associated with the interior of the tube. Therefore, a deep learning model (e.g., as applied in block 406) may enable detection of the dilator proximal end as well as the width of its annulus.
[0044] The input feature vector may be applied to the trained neural network having the stored set of weights (e.g., from block 216 of method 200 shown in FIG. 2) to determine an output vector (block 406). For example, the input feature vector for a given image data may be aligned along the input layer of a neural network model, such as that shown in FIG. 3, and a set of weights optimized in block 214 of method 200 may be applied between the layers of the neural network. Using the applied input feature vector and the optimized set of weights (e.g., from method 200), the trained neural network may yield an output vector, e.g., via values provided on the nodes of the output layer of the neural network. The output vector may provide metrics for a bounding box on the image represented by the image data, and thereby indicate a location for the proximal end of the dilator. For example, the output vector may comprise one or more Cartesian coordinates (X, Y) indicating vertices of the bounding box in relation to the image represented by the image data. In some embodiments, the input feature vector may also or may alternatively be applied to other trained machine learning models, for example machine learning models trained to detect the inner walls of the dilator, and/or machine learning models trained to detect surgical scenes of the type present near the distal end of the dilator.
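For illustration, the inference step might look like the following sketch, which runs a stored model on one input feature vector and reads the bounding-box vertices from the output vector; the (x0, y0, x1, y1) ordering is an assumption made for this example.

```python
import torch

def detect_proximal_end(model, input_features):
    """Apply the trained network to one input feature vector and interpret
    the output vector as bounding-box vertices (illustrative sketch)."""
    model.eval()
    with torch.no_grad():
        out = model(torch.as_tensor(input_features, dtype=torch.float32))
    x0, y0, x1, y1 = out.tolist()                 # output vector -> Cartesian bounds
    return int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1))
```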
[0045] Based on the output vector, the computing device may identify the location of the bounding box (block 408). For example, the Cartesian coordinates indicating vertices of the bounding box in relation to the image (e.g., the horizontal and vertical extent of the underlying proximal end of the dilator) provided by the output vector may be used by the computing device to “draw” the bounding box. In some embodiments, the output vector may additionally indicate other objects of interest, e.g., locations, dimensions, and/or measurements of the inner walls of the dilator; locations, dimensions, and/or measurements of features, landmarks, and/or objects associated or typically associated with a surgical site; etc.
[0046] At block 410, the computing device may determine, based on the location of the bounding box, a location of (e.g., the area covered by) the proximal end of the dilator. In some aspects, the proximal end of the dilator may have a recognizable circular, ring-like shape, with known dimensions. In such aspects, the computing device may automatically determine the location of (e.g., pixels of the image data covered by) the proximal end of the dilator based on its known dimensions and the location of the bounding box. For example, if the proximal end of the dilator comprises two concentric rings, the bounding box may indicate the length of the larger diameter, and the midpoint of the bounding box may indicate a tangent of the outermost ring. The known dimension of the smaller ring may be used to identify the extent (e.g., area) covered by the proximal end of the dilator. For example, the given image data may map the image it represents by way of pixels, and the pixels covered by the ring-like shape of the proximal end of the dilator may be identified.
[0047] The computing device may then apply a blocking overlay or selected pixel value modification algorithm(s) or technique(s) to the area covered by the proximal end of the dilator (block 412). The blocking overlay may comprise a layer on top of, or a modification of the pixels of, the area covered by the proximal end of the dilator. For example, as previously shown via FIG. 1, the blocking overlay may comprise a ring of a dull and insignificant color and brightness (e.g., a solid black ring as shown in image 112) to mitigate the previously visually distracting color and brightness of the proximal end of the dilator (see, e.g., proximal end 104 of image 102). As is to be appreciated, the pixels of the given image interior of the inner ring of the ring-like proximal end of the dilator (e.g., an area of the image associated with the patient anatomy of interest) are not considered as part of the area covered by the proximal end of the dilator.
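As an example of a pixel value modification technique that could be used instead of a solid overlay, the sketch below attenuates the masked pixels so the proximal end remains faintly visible; the mask could be the annulus mask derived earlier, and the attenuation factor is an illustrative choice.

```python
def dim_region(frame, mask, attenuation=0.15):
    """Attenuate the pixels covered by the proximal end of the dilator rather
    than painting them black (illustrative alternative to a solid overlay).

    frame: HxWx3 uint8 image; mask: HxW boolean array of covered pixels.
    """
    out = frame.copy()
    out[mask] = (out[mask].astype(float) * attenuation).astype(frame.dtype)
    return out
```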
[0048] The computing device may then generate, in real-time, an updated image data (block 414). The updated image data may correspond to an image where the blocking overlay has been applied to the area covered by the proximal end of the dilator (e.g., from block 412). The updated image data may be made a part of the outgoing surgical video stream. For example, the given image data received in block 402 may be replaced by the updated image data, such that the surgical video stream output via a visualization system (e.g., display) shows a blocking overlay covering the proximal end of the dilator.
[0049] In some aspects, for subsequent frames of the incoming surgical video stream, one or more steps of method 400 may be repeated, e.g., for new locations of the proximal end of the dilator for each new frame associated with the incoming surgical video stream. Also or alternatively, the computing device may landmark (e.g., place one or more pins) or “remember” features (e.g., via an identification of nearby objects in the image) of the image data associated with a determined location of the proximal end of the dilator. Such landmarks or remembered features may allow the computing device to bypass one or more steps of method 400 as appropriate, thereby reducing time in image processing, and increasing the accuracy of the presentation.
Example Surgical Environment
[0050] FIG. 5 shows a diagram of a surgical environment for detecting a proximal end of a dilator from image data, according to an example embodiment of the present disclosure. As shown in FIG. 5, the surgical environment may include a patient 502 that may undergo surgery on a surgical site 106 on the body of the patient 502. The surgery may be performed by a surgeon 504, who may utilize surgical tools, such as a dilator 508, to perform the surgery. The surgeon 504 may rely on an apparatus comprising a DSM 514. The DSM 514 may include one or more cameras 516 for viewing the surgical site 106 and receiving image and/or video data. The DSM 514 may be supported by a surgical cart 506 via one or more robotic arms 512 to move and/or orient the DSM 514 and/or its associated one or more cameras 516 (e.g., to move alongside the surgical site and/or to move based on a surgeon’s commands). The surgeon 504 may further rely on a display 510 to view image and/or video data captured and/or processed by the DSM camera 516 based on methods and processes described herein. One or more aspects of the DSM 514, cameras 516, robotic arms 512, and display 510 may be controlled by a surgical visualization and navigation system 501 (referred to herein as “system” 501). In some aspects, the system 501 may control the aforementioned components based on received image and/or video data from DSM camera 516. For example, the system 501 may be configured to receive and process image and/or video data received by the DSM camera 516 as it identifies and tracks the dilator proximal end, and adjust the image and/or video output by the display 510 such that the proximal end of the dilator is blocked or otherwise rendered less distracting for the surgeon. FIG. 6 and the related description describe various example components and functionalities of system 501 in more detail. Also or alternatively, the system 501 may control the aforementioned components via user interface 518 and/or based on a command (e.g., voice command) of the surgeon 504.
Example System For Detecting, Blocking And Tracking A Proximal End of A Dilator
[0051] FIG. 6 shows a block diagram of an example system (e.g., surgical visualization and navigation system 501) used for detecting and tracking a proximal end of a dilator from image data and mitigating visually distracting effects of the proximal end of the dilator, according to an example embodiment of the present disclosure. The example system (e.g., surgical visualization and/or navigation system 501) may include, but is not limited to, the one or more components shown in FIG. 6, such as one or more processors 602, a robot control system 604, a memory device (memory 608), the display 510, a localizer 612, the digital surgical microscope (DSM) 514, a database 618, a machine learning module 624, and a calibration module 628. The system may be used to detect and track a dilator (e.g., proximal end 104 of the dilator 508) and/or a surgical site (e.g., a patient anatomy 106 enclosed by the proximal end 104 of the dilator 508), according to methods described herein. The system 501 and the dilator 508 may be controlled and accessed by the surgeon 504 during a surgical procedure.
[0052] The processor 602 may comprise any one or more types of digital circuit configured to perform operations on a data stream, including functions described in the present disclosure. The memory 608 may comprise any type of long term, short term, volatile, nonvolatile, or other memory and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored. The memory may store instructions that, when executed by the processor 602, can cause the system 501 to perform one or more methods discussed herein. The robot control system 604 may comprise a controller or microcontroller configured to control one or more robotic arm actuators and/or end effectors 606 to move robotic arms 512 shown in FIG. 5. The robotic arms 512 may cause movement and/or adjust the DSM 514 and/or DSM camera 516. Movement of the digital surgical microscope 514 may allow one or more cameras 516 associated with the DSM 514 to better capture a field of view associated with the surgical site 106 and/or track the dilator during the surgical session. Also or alternatively, the robot control system 604 may be configured to adjust one or more mechanical aspects of the DSM 514. The display 510 may output image data and/or video data stream generated by the DSM 514 and rendered by the processor 602 to have the proximal end of the dilator (and other undesired areas of the field of view) blocked, according to the methods discussed herein. The localizer 612 may comprise hardware and/or software configured to detect a pose (e.g., a position and/or orientation) of the DSM, the system, and/or the dilator in a specified viewing space.
[0053] The localizer 612 may supply this information to the system 501 responsive to requests for such information in periodic, real-time, or quasi-real-time fashion or at a constant rate. In some aspects, the localizer 612 may also be equipped with a camera to capture a field of view of the surgical site. Furthermore, the display 510 showing image data captured by the DSM 514 may also show (e.g., as an overlay or mask) a field of view of the localizer 612. In some embodiments, the DSM 514 may be rendered to be compatible with (e.g., by being rendered trackable by) the localizer (e.g., via a DSM navigation target).
[0054] The database 618 may comprise an electronic repository or data store for storing training datasets for training machine learning models described herein. The training datasets may comprise for example large datasets of image data 620 received, for example, from a plurality of reference surgery sessions. Such image data 620 may be rendered with one or more label(s) 622, for example, the locations or dimensions of a bounding box for a proximal end of a dilator; a center of mass for the dilator; the locations, dimensions or measurements of an inner surface of the dilator; or features or landmarks associated with a surgical site typically enclosed by the dilator.
[0055] The machine learning module 624 may comprise a software, program, and/or computer executable instructions that cause the system 501 to train machine learning models (e.g., using training data stored in database 618 or received externally) and/or apply one or more trained machine learning models 626, as described herein. The calibration module 628 may comprise a software, program, and/or computer executable instructions that cause the system 501 to calibrate the DSM 514 (e.g., via predefined calibration routines 630). The ground truth data capture module 632 may comprise a hardware, software, program, and/or computer executable instructions that cause the system 501 to capture locational data from the cameras 516 at scale to determine and/or train machine learning models to determine spatial orientation of one or more components of the system 501.
DSM Tracking The Dilator Proximal End
[0056] In some embodiments, the detected shape and/or location of the dilator proximal end 104 may enable the system 501 to drive the robotic arm actuators and/or effectors 606 (e.g., via the robot control system 604) to position and/or orientate (e.g., pose) a stereoscopic camera 516 of the DSM 514 to capture the proximal end 104 of the dilator in an optimal manner. For example, the DSM camera 516 may be posed such that the stereoscopic optical axis of the DSM camera 516 is parallel to and centered at the center of the proximal end 104 of the dilator. In this manner, the field of view of the DSM camera 516 may focus on the relevant surgical site 106 near the distal end of the dilator. The field of view may continue to focus on and track the relevant surgical site 106 in quasi real time as the surgical procedure progresses, in a hands-free automated manner, thereby freeing the surgeon 504 of this onerous task. The onscreen view of the dilator proximal end 104 may be nominally circular when viewed with an ideal monoscopic camera 516. In some embodiments, for example in an ideal case, the optical axis of the DSM camera 516 may be parallel to and coincident with the central axis of the dilator.
[0057] In some embodiments, however, if the optical axis of the DSM camera 516 is nonparallel but close to parallel, the onscreen view of the dilator proximal end 104 may become elliptical. In such embodiments, the onscreen shape of the detected dilator proximal end 104 can thus be fit to an ellipse. The eccentricity of the ellipse found in each camera eye image may be used as a metric for the viewing angle for each camera eye. This metric is in some embodiments derived from a mathematical analysis of the system giving said viewing angle as a function of onscreen ellipse eccentricity as well as ellipse position. In some embodiments, the metric for determining viewing angle as a function may be taken as a direct output of a trained deep learning model 626. For example, the deep learning model may be trained with a ground truth input of image data 620 of the dilator proximal end in each camera with corresponding viewing angle. Such data may be generated using standard camera calibration routines 630 as well as one or more instrumented dilator tubes. The cameras 516 may be calibrated and then a “hand to eye” transformation can be determined for each camera 516 relative to the robot on which the DSM 514 is mounted (e.g., using OpenCV’s cv::calibrateHandEye). Furthermore, the transformation from the robot end effector 606 to the camera eye of the camera 516 can be determined, for example, using cv::calibrateRobotWorldHandEye().
[0058] The optical axis of the camera 516 may thus be determined relative to the robot end effector 606. Further, the dilator tube may be instrumented such that the angle of its central axis relative to the robot base coordinate system can be rendered measurable and reported to the ground truth data capture module 632. A geometric average of said viewing angles may be used as a goal for the robot control system 604. The robot control system 604 may be controlled to attempt to drive said geometric average toward zero degrees or equivalent angular measure. Furthermore, the geometric average of the centroids of the ellipse determined in each of the “eyes” of the stereoscopic camera 516 may be driven to be the center of the screen.
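One conventional way to obtain the ellipse fit and its eccentricity per camera eye is OpenCV's fitEllipse applied to the outer contour of the detected proximal end, as in the sketch below (OpenCV 4.x is assumed, the contour-selection heuristic is simplified, and a contour of at least five points is required by fitEllipse).

```python
import cv2
import numpy as np

def ellipse_eccentricity(annulus_mask):
    """Fit an ellipse to the outer contour of the detected proximal end in one
    camera eye and return (center, eccentricity).  Illustrative sketch."""
    contours, _ = cv2.findContours(annulus_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    outer = max(contours, key=cv2.contourArea)            # largest contour = outer edge
    (cx, cy), (d1, d2), _angle = cv2.fitEllipse(outer)    # axis lengths of fitted ellipse
    a, b = max(d1, d2) / 2.0, min(d1, d2) / 2.0
    ecc = np.sqrt(1.0 - (b / a) ** 2)                     # 0 when the view is circular
    return (cx, cy), ecc
```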
[0059] It should be noted that due to the stereoscopic nature of the DSM 514, the optimal pose of the optical axis of each individual “eye” of the stereoscopic camera 516 relative to the central axis of the dilator tube might not actually be parallel and coincident to said central axis. In one or more embodiments, the stereoscopic axis may be the geometric average of the two individual camera eye optical axes. To maximize the viewable portion, in a stereoscopic sense, of the relevant surgical site 106 near the distal end of the dilator 104, the optimal pose of the stereoscopic optical axis relative to the central axis of the dilator tube may be parallel to and coincident with said central axis.
[0060] Note that this typically means that the optical axis of each of the two individual camera eyes of the camera 516 is neither parallel to nor coincident with the central axis of the dilator tube. Thus, a portion of the relevant surgical site 106 near the distal end of the dilator is blocked in each eye (and a different portion per eye), leading to user eye confusion, binocular rivalry and potential strain and fatigue. The features of the present disclosure minimize such portions and thus minimize user strain and fatigue due to this problem.
[0061] In some embodiments, the trained deep learning model 626 uses computer vision and/or deep learning techniques to detect a shape of the dilator proximal end 104. The system 501 then determines a difference between a known/expected shape of the dilator proximal end 104 and the detected shape of the dilator proximal end 104. The system 501 uses this difference to drive the robotic arm actuators and/or effectors 606 (e.g., via the robot control system 604) to position and/or orientate (e.g., pose) a stereoscopic camera 516 of the DSM 514, such that the camera optical axis always looks down the dilator tube.
[0062] As sometimes the dilator proximal end 104 may not be in view of the stereoscopic DSM camera 516, additional cameras may be utilized by the system 501 to aid in detection of the proximal end 104 relative to the stereoscopic DSM camera 516, for example on the underside of the DSM 514, and/or in the handles. Each such camera is designed and posed to enable a view of the volume near the surgical site expected to contain the dilator proximal end 104. Such design and posing can include fixed zoom and/or focus and/or variable zoom and/or focus. In the general case, each such camera is configured to view the dilator proximal end 104 under all normal operating conditions of the DSM 514. Where the range of operating conditions make it difficult or impossible for a single camera design to view the dilator proximal end 104 under all normal operating conditions of the DSM 514, one or more additional cameras are added, which taken together cover the range.
[0063] For instance, other sensors and/or cameras mounted on the DSM 514 may have a wider field of view than the DSM camera 516. This wider field of view may include the scene in addition to the dilator proximal end 104. This other camera or set of cameras removes the need for the digital surgical microscope view to contain the dilator proximal end 104. The pose of each such camera relative to the DSM camera 516 is determined using computer vision techniques of multi-view geometry camera calibration. The pose of either or both of each such camera and the DSM camera 516 relative to the system 501 or other posing system is found using computer vision "eye-to-hand" and/or "hand-to-eye" calibration techniques such as OpenCV’s cv::calibrateHandEye transformation.
[0064] The poses of said additional cameras are determined relative to each other and to each individual eye of the stereoscopic DSM camera 516 using standard camera calibration routines 630 such as are described in “Multiple View Geometry in Computer Vision” and/or described and/or implemented in and/or enabled by software packages (e.g., 3DF ZEPHYR from 3DFLOW and OPENCV). In the case of variable zoom and/or focus, one or more calibration routines 630 may be performed at each such zoom and focus, or at a containing or nearly-containing subset of such optical settings. Furthermore, interpolation and/or extrapolation may be used to determine the calibration settings at the zoom and focus currently in use. In such embodiments, where external and/or additional cameras are used, a typically different set of deep learning models may be trained because the view of the dilator proximal end 104 as obtained by said external camera(s) may be typically different from the view obtained by the individual eyes (cameras) of the stereoscopic DSM camera 516.
[0065] The surgical site region of interest is typically inside a dilator tube. Thus, in addition to a deep learning model that detects the dilator proximal end 104, a deep learning model is trained to segment pixels of the dilator proximal end 104 (typically a distorted annulus shape) separate from the surgical site that is visible within said shape.
[0066] This segmented surgical region of interest is then used to drive visualization pipeline processing parameters including sensor settings, exposure time, image processing parameters including brightness, contrast, high dynamic range (HDR) and other luma mapping, HDR and other tone mapping, and/or HDR and other blending. For example, the microscope light is typically reflected strongly back into the microscope by the flat surface of the dilator proximal end 104, resulting in saturated pixels recorded by the DSM camera 516. When such pixels are included in the calculation of a reasonable exposure time, the calculation is skewed too low, thereby resulting in poor image quality of the region of interest. Excluding such pixels and using only pixels in the region of interest to calculate visualization pipeline processing parameters results in improved image quality for the region of interest.
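As a simple illustration of this exclusion, the sketch below computes a mean-luminance exposure metric only over the segmented region of interest while discarding saturated pixels; the saturation threshold is an assumed value for 8-bit data.

```python
import numpy as np

def roi_exposure_metric(frame_gray, roi_mask, saturation_level=250):
    """Mean luminance over the segmented surgical region of interest, ignoring
    saturated pixels reflected off the dilator face (illustrative sketch).

    frame_gray: HxW uint8 luminance image; roi_mask: HxW boolean ROI mask.
    """
    valid = roi_mask & (frame_gray < saturation_level)
    if not np.any(valid):
        return None                    # nothing usable; keep the previous exposure setting
    return float(frame_gray[valid].mean())
```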
[0067] It will be appreciated that each of the systems, structures, methods and procedures described herein may be implemented using one or more computer programs or components. These programs and components may be provided as a series of computer instructions on any conventional computer-readable medium, including random access memory (“RAM”), read only memory (“ROM”), flash memory, magnetic or optical disks, optical memory, or other storage media, and combinations and derivatives thereof. The instructions may be configured to be executed by a processor, which when executing the series of computer instructions performs or facilitates the performance of all or part of the disclosed methods and procedures.
[0068] It should be understood that various changes and modifications to the example embodiments described herein will be apparent to those skilled in the art. Such changes and modifications can be made without departing from the spirit and scope of the present subject matter and without diminishing its intended advantages. It is therefore intended that such changes and modifications be covered by the appended claims. Moreover, consistent with current U.S. law, it should be appreciated that 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, paragraph 6 is not intended to be invoked unless the terms “means” or “step” are explicitly recited in the claims. Accordingly, the claims are not meant to be limited to the corresponding structure, material, or actions described in the specification or equivalents thereof.

Claims

What Is Claimed Is:
1. A system that uses deep learning to detect and track a proximal end of a dilator used in surgical procedures, the system comprising: a digital surgical microscope (DSM) camera; optionally one or more auxiliary cameras; one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the system to: receive, in real-time via the DSM camera or the one or more auxiliary cameras, image data from an incoming surgical video stream of a field of view of the DSM camera or the one or more auxiliary cameras, wherein the field of view shows a proximal end of a dilator; generate, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the proximal end of the dilator; determine, based on the location of the bounding box, the region of the image data corresponding to the proximal end of the dilator; apply at least one pixel modification algorithm to the region of the image data corresponding to the proximal end of the dilator; and generate, in real-time, an updated image data for an outgoing surgical video stream, wherein the outgoing surgical video stream shows the at least one pixel modification algorithm applied to the region of the image data corresponding to the proximal end of the dilator.
2. The system of claim 1, wherein the instructions, when executed, further cause the system to: track, in real-time, the proximal end of the dilator in the incoming surgical video stream; and modify, in real-time, detected pixels corresponding to the proximal end of the dilator in the outgoing surgical video stream in accordance with the at least one pixel modification algorithm.
3. The system of claim 1, wherein the instructions, when executed, further cause the system to: determine, via a trained machine learning model, an average width of an annulus of the proximal end of the dilator, wherein the region of the image data corresponding to the proximal end of the dilator includes the annulus of the proximal end of the dilator.
4. The system of claim 2, wherein the instructions, when executed, further cause the system to: receive new image data from the incoming surgical video stream of a new field of view of the DSM camera or the one or more auxiliary cameras, wherein at least one of the new fields of view shows the proximal end of the dilator; recognize, based on a landmarking of the region of the field of view corresponding to the proximal end of the dilator, a new region of the new image data corresponding to the proximal end of the dilator; and apply the at least one pixel modification algorithm to the new region of the new field of view corresponding to the proximal end of the dilator.
5. The system of claim 4, further comprising a robot control system configured to move the DSM camera, wherein the instructions, when executed, further cause the system to: detect, via the DSM camera or the one or more auxiliary cameras, a change in a pose of the proximal end of the dilator; and cause, via the robot control system, a corresponding movement of the DSM camera to track the proximal end of the dilator.
6. The system of claim 1, wherein the instructions, when executed, further cause the system to, prior to the generating the bounding box: generate, using the image data, an input feature vector based on a set of features from the image data; apply, to the trained neural network, the input feature vector to generate an output vector; and identify, based on the output vector, the location of the bounding box.
7. The system of claim 6, wherein the output vector comprises one or more Cartesian coordinates indicating vertices of the bounding box.
8. The system of claim 7, wherein the one or more Cartesian coordinates are determined in an image space of the system.
9. The system of claim 1, wherein the instructions, when executed, further cause the system to apply a control algorithm that uses onscreen positions of the proximal end of the dilator as feedback data to the system without performing a pixel-to-millimeter conversion of the image data.
10. The system of claim 1, wherein the instructions, when executed, further cause the system to: receive a plurality of reference image data corresponding to a plurality of reference images, wherein each reference image shows a reference target; generate, for each of the plurality of reference image data, a reference input feature vector based on a set of features shared among the plurality of reference image data; receive, for each of the plurality of reference image data, an indication of the region of a reference bounding box surrounding the reference target; associate, for each of the plurality of reference image data, the reference input feature vector with a reference output vector indicating a region of the reference bounding box; and train, using the associated reference input feature vectors and reference output vectors, a neural network model to generate the trained neural network by at least: inputting the reference input feature vectors in an input layer of a multilayer perceptron comprising a plurality of nodes associated with the input layer, an output layer, and a plurality of hidden layers, inputting the reference output vectors in the output layer of the multilayer perceptron, and initializing weights corresponding to distances between the plurality of nodes.
11. A computer-implemented method, comprising: receiving, in real-time by a computing device having a processor, image data from an incoming surgical video stream of a field of view of a digital surgical microscope (DSM) camera or one or more auxiliary cameras, wherein the field of view shows a view of a proximal end of a dilator and a patient anatomy; generating, using the image data and a trained neural network, a bounding box around a region of the image data corresponding to the view of the proximal end of the dilator; determining, based on a location of the bounding box, the region of the image data corresponding to the view of the proximal end of the dilator; applying at least one pixel modification algorithm to the region of the image data corresponding to the view of the proximal end of the dilator; and generating, in real-time, updated image data for an outgoing surgical video stream, wherein the outgoing surgical video stream shows the patient anatomy and the at least one pixel modification algorithm applied to the region of the image data corresponding to the view of the proximal end of the dilator.
12. The computer-implemented method of claim 11, further comprising: tracking, in real-time, the proximal end of the dilator in the incoming surgical video stream; and modifying, in real-time, detected pixels corresponding to the proximal end of the dilator in the outgoing surgical video stream in accordance with the at least one pixel modification algorithm.
13. The computer-implemented method of claim 11, further comprising: determining, via a trained machine learning model, an average width of an annulus of the proximal end of the dilator, wherein the region of the image data corresponding to the proximal end of the dilator includes the annulus of the proximal end of the dilator.
14. The computer-implemented method of claim 12, further comprising: receiving new image data from the incoming surgical video stream of a new field of view of the DSM camera or the one or more auxiliary cameras, wherein the new field of view shows the proximal end of the dilator; recognizing, based on a landmarking of the region of the field of view corresponding to the proximal end of the dilator, a new region of the new image data corresponding to the proximal end of the dilator; and applying the at least one pixel modification algorithm to the new region of the new image data corresponding to the proximal end of the dilator.
15. The computer-implemented method of claim 14, further comprising: moving the DSM camera via a robot control system; detecting, via the DSM camera or the one or more auxiliary cameras, a change in a pose of the proximal end of the dilator; and causing, via the robot control system, a corresponding movement of the DSM camera to track the proximal end of the dilator.
16. The computer-implemented method of claim 14, further comprising: generating, using the image data, an input feature vector based on a set of features from the image data; applying, to the trained neural network, the input feature vector to generate an output vector; and identifying, based on the output vector, the location of the bounding box.
17. The computer-implemented method of claim 16, wherein the output vector comprises one or more Cartesian coordinates indicating vertices of the bounding box.
18. The computer-implemented method of claim 17, wherein the one or more Cartesian coordinates are determined in an image space.
19. The computer-implemented method of claim 11, further comprising applying a control algorithm that uses onscreen positions of the proximal end of the dilator as feedback data to the computing device without performing a pixel-to-millimeter conversion of the image data.
20. The computer-implemented method of claim 11, further comprising: receiving a plurality of reference image data corresponding to a plurality of reference images, wherein each reference image shows a reference target; generating, for each of the plurality of reference image data, a reference input feature vector based on a set of features shared among the plurality of reference image data; receiving, for each of the plurality of reference image data, an indication of the region of a reference bounding box surrounding the reference target; associating, for each of the plurality of reference image data, the reference input feature vector with a reference output vector indicating a region of the reference bounding box; and training, using the associated reference input feature vectors and reference output vectors, a neural network model to generate the trained neural network by at least: inputting the reference input feature vectors in an input layer of a multilayer perceptron comprising a plurality of nodes associated with the input layer, an output layer, and a plurality of hidden layers, inputting the reference output vectors in the output layer of the multilayer perceptron, and initializing weights corresponding to distances between the plurality of nodes.
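The sketches below are illustrative only and are not the claimed implementation; they show, under stated assumptions, one way several of the recited operations could be expressed in code. This first sketch relates to claims 6-8 and 16-18, which recite an output vector whose Cartesian coordinates indicate bounding-box vertices in image space. The assumption here is that the trained network emits four values normalized to [0, 1] in the order (x_min, y_min, x_max, y_max); neither the normalization nor the ordering comes from the claims.

```python
import numpy as np

def bounding_box_from_output(output_vec: np.ndarray, frame_shape: tuple) -> tuple:
    """Interpret a 4-element output vector as normalized (x_min, y_min, x_max, y_max)
    vertices in image space and scale them to pixel coordinates of the frame."""
    h, w = frame_shape[:2]
    x_min, y_min, x_max, y_max = np.clip(output_vec[:4], 0.0, 1.0)
    return (int(x_min * w), int(y_min * h), int(x_max * w), int(y_max * h))
```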
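Claims 1-2 and 11-12 recite applying a pixel modification algorithm to the region corresponding to the dilator's proximal end while leaving the rest of the frame (the patient anatomy) unchanged. The claims do not name a particular modification; the sketch below assumes a Gaussian blur of the bounding-box region and frames delivered as NumPy arrays.

```python
import cv2
import numpy as np

def apply_pixel_modification(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Return a copy of the incoming frame with the region inside `box` obscured.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates; pixels outside
    the box pass through to the outgoing frame unchanged.
    """
    x_min, y_min, x_max, y_max = box
    out = frame.copy()
    region = out[y_min:y_max, x_min:x_max]
    if region.size:  # guard against a degenerate or off-screen box
        out[y_min:y_max, x_min:x_max] = cv2.GaussianBlur(region, (31, 31), 0)
    return out
```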
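Claims 3 and 13 recite determining, via a trained machine learning model, an average width of the annulus at the dilator's proximal end. The claims do not say how the width is computed; the sketch below assumes the model produces a binary annulus mask and shows only a simple geometric post-processing step, taking the width as the difference between the radii of the outer boundary and the inner hole.

```python
import cv2
import numpy as np

def average_annulus_width(annulus_mask: np.ndarray) -> float:
    """Estimate the average width (in pixels) of an annulus from a binary mask,
    e.g., a mask predicted for the dilator's proximal end."""
    contours, hierarchy = cv2.findContours(
        annulus_mask.astype(np.uint8), cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE
    )
    if hierarchy is None:
        return 0.0
    # With RETR_CCOMP, outer contours have no parent; inner holes have a parent.
    outers = [c for i, c in enumerate(contours) if hierarchy[0][i][3] == -1]
    holes = [c for i, c in enumerate(contours) if hierarchy[0][i][3] != -1]
    if not outers or not holes:
        return 0.0  # no complete annulus (outer ring plus hole) found
    (_, _), r_outer = cv2.minEnclosingCircle(max(outers, key=cv2.contourArea))
    (_, _), r_inner = cv2.minEnclosingCircle(max(holes, key=cv2.contourArea))
    return float(r_outer - r_inner)
```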
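Claims 5, 9, 15, and 19 describe using onscreen positions of the dilator's proximal end as feedback, without a pixel-to-millimeter conversion, to drive a corresponding movement of the DSM camera via the robot control system. One plausible form is a proportional controller that operates entirely in pixel coordinates; the gain, command convention, and robot interface below are assumptions.

```python
def centering_command(box: tuple, frame_shape: tuple, gain: float = 0.002) -> tuple:
    """Return a (pan, tilt) velocity command from the pixel-space error between
    the bounding-box center and the image center; feedback stays in pixels."""
    x_min, y_min, x_max, y_max = box
    h, w = frame_shape[:2]
    err_x = (x_min + x_max) / 2.0 - w / 2.0  # horizontal error, pixels
    err_y = (y_min + y_max) / 2.0 - h / 2.0  # vertical error, pixels
    return gain * err_x, gain * err_y        # dimensionless commands for the robot control system
```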
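Claims 10 and 20 recite training a multilayer perceptron on reference input feature vectors paired with reference bounding-box output vectors. The feature dimensionality, layer sizes, loss, and optimizer in this sketch are assumptions; only the overall pattern of associating reference inputs with reference outputs and training on those pairs comes from the claims.

```python
import torch
from torch import nn

def train_bbox_mlp(features: torch.Tensor, boxes: torch.Tensor, epochs: int = 200) -> nn.Module:
    """features: (N, D) reference input feature vectors.
    boxes: (N, 4) reference output vectors (normalized bounding-box vertices)."""
    model = nn.Sequential(                    # input layer -> hidden layers -> output layer
        nn.Linear(features.shape[1], 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 4), nn.Sigmoid(),       # four outputs: normalized box vertices
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), boxes)
        loss.backward()
        optimizer.step()
    return model
```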
PCT/US2024/060958 2023-12-20 2024-12-19 Dilator proximal end blocker in a surgical visualization Pending WO2025137227A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363612600P 2023-12-20 2023-12-20
US63/612,600 2023-12-20

Publications (1)

Publication Number Publication Date
WO2025137227A1 (en)  2025-06-26

Family

ID=96138056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/060958 Pending WO2025137227A1 (en) 2023-12-20 2024-12-19 Dilator proximal end blocker in a surgical visualization

Country Status (1)

Country Link
WO (1) WO2025137227A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180008222A1 (en) * 2015-02-27 2018-01-11 Siemens Healthcare Gmbh Robust calcification tracking in fluoroscopic imaging
US20180092700A1 (en) * 2015-03-17 2018-04-05 Intuitive Surgical Operations, Inc. Systems and Methods for Rendering Onscreen Identification of Instruments in a Teleoperational Medical System
US20180228553A1 (en) * 2017-02-15 2018-08-16 Yanhui BAI Sensored surgical tool and surgical intraoperative tracking and imaging system incorporating same
US20200237452A1 (en) * 2018-08-13 2020-07-30 Theator inc. Timeline overlay on surgical video
US20220160443A1 (en) * 2019-08-13 2022-05-26 Duluth Medical Technologies Inc. Robotic surgical methods and apparatuses
WO2022216810A2 (en) * 2021-04-06 2022-10-13 True Digital Surgery System, method, and apparatus for tracking a tool via a digital surgical microscope

Similar Documents

Publication Publication Date Title
CN113538522B (en) An instrument visual tracking method for laparoscopic minimally invasive surgery
CN113910219B (en) Exercise arm system and control method
CN112634256B (en) Circle detection and fitting method and device, electronic equipment and storage medium
Probst et al. Automatic tool landmark detection for stereo vision in robot-assisted retinal surgery
JP5814938B2 (en) Calibration-free visual servo using real-time speed optimization
Sznitman et al. Data-driven visual tracking in retinal microsurgery
Rieke et al. Real-time localization of articulated surgical instruments in retinal microsurgery
US11633235B2 (en) Hybrid hardware and computer vision-based tracking system and method
US20120296202A1 (en) Method and System for Registration of Ultrasound and Physiological Models to X-ray Fluoroscopic Images
JP2025098000A (en) Eye-tracking device and method
CN108985210A (en) A kind of Eye-controlling focus method and system based on human eye geometrical characteristic
WO2024186811A1 (en) Machine learning for object tracking
US20240185432A1 (en) System, method, and apparatus for tracking a tool via a digital surgical microscope
JP2013516264A5 (en)
Wolf et al. 3D tracking of laparoscopic instruments using statistical and geometric modeling
Hussain et al. Augmented reality for inner ear procedures: visualization of the cochlear central axis in microscopic videos
CN114730454A (en) Scene awareness system and method
CN118576140B (en) Dynamic positioning method and device for bronchoscope
US20240390070A1 (en) System and method for computer-assisted surgery
Speidel et al. Recognition of risk situations based on endoscopic instrument tracking and knowledge based situation modeling
Huang et al. Surgical tool segmentation with pose-informed morphological polar transform of endoscopic images
US20250173873A1 (en) Systems and methods for image navigation using on-demand deep learning based segmentation
WO2025137227A1 (en) Dilator proximal end blocker in a surgical visualization
Cabras et al. Comparison of methods for estimating the position of actuated instruments in flexible endoscopic surgery
Huang et al. Enhanced u-net tool segmentation using hybrid coordinate representations of endoscopic images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24908890

Country of ref document: EP

Kind code of ref document: A1