WO2023154358A1 - Processing volumetric videos using a 4D convolutional neural network and multiple lower-dimensional convolutions - Google Patents
Processing volumetric videos using a 4D convolutional neural network and multiple lower-dimensional convolutions
- Publication number
- WO2023154358A1 (PCT/US2023/012643; US2023012643W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- recited
- neural network
- convolutional neural
- dimensional
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/40—ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
Definitions
- the present invention relates generally to imaging systems, and more particularly to processing volumetric videos using a four-dimensional (4D) convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video.
- Imaging systems produce representations of an object’s form, especially a visual representation.
- biomedical imaging systems enable the visualization of the body's internal organs and their diseases, such as via volumetric videos.
- imaging data has traditionally taken the form of two-dimensional (2D) images, three-dimensional (3D) volumes, and videos (2D plus a temporal dimension (T)).
- advances in imaging hardware have created the opportunity for many biomedical imaging systems to now collect volumetric videos (3D+T).
- a convolutional neural network is a class of artificial neural network, most commonly applied to analyze visual imagery.
- a convolutional neural network is also known as a shift invariant or space invariant artificial neural network based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation equivariant responses known as feature maps.
- Convolutional neural networks are regularized versions of multilayer perceptrons (fully connected networks), in which each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularizing, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skip connections, dropout, etc.).
- Convolutional neural networks take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters.
- Furthermore, convolutional neural networks use relatively little pre-processing compared to other image classification algorithms, since the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered.
- The initial application of convolutional neural networks to two-dimensional (2D) biomedical images and videos has been largely successful. However, for biomedical imaging modalities that generate volumetric images or volumetric videos in four dimensions, convolutional neural networks designed for natural images cannot be directly applied without ignoring at least one of the spatial-temporal dimensions.
- a computer-implemented method for processing volumetric videos comprises receiving a volumetric video with three spatial dimensions and a fourth dimension.
- the method further comprises performing convolutional operations over the three spatial dimensions and the fourth dimension of the volumetric video using a four-dimensional convolutional neural network to perform one of the following processes: segmentation, denoising, deconvolution or image domain transfer, where the convolutional operations comprise a set of two-dimensional and/or three-dimensional convolutions that upon aggregation create features informed by the three spatial dimensions and the fourth dimension.
- Other forms of the embodiment of the computer-implemented method described above are in a system and in a computer program product.
- Figure 1 illustrates a communication system for practicing the principles of the present invention in accordance with an embodiment of the present disclosure
- Figure 2 is a diagram of the software components used by the volumetric video processing system to process volumetric videos to perform a process, such as segmentation, denoising, deconvolution or image domain transfer, in accordance with an embodiment of the present invention
- Figure 3 illustrates an embodiment of the present invention of the hardware configuration of the volumetric video processing system which is representative of a hardware environment for practicing the present invention
- Figure 4 is a flowchart of a method for training a convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer on volumetric videos in accordance with an embodiment of the present invention
- Figure 5 is a flowchart of a method for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video in accordance with an embodiment of the present invention
- the embodiments of the present invention provide a means for effectively processing volumetric videos that overcomes the computational expense and improves accuracy, such as segmentation accuracy in dense biomedical imaging data, by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video.
- "pseudo-4D convolutions" are used as a replacement for the standard 4D convolution to overcome the above structural problems and improve accuracy, such as segmentation accuracy in dense biomedical imaging data.
- "Pseudo-4D convolutions," as used herein, refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- By utilizing such pseudo-4D convolutions, any dimensional associations a voxel (a voxel represents a value on a regular grid, such as in three-dimensional space) may have in space or time are not discarded. By regarding a voxel in its full dimensionality, various processes, such as segmentation, are significantly improved, with a reduction in false positives in low-signal, high-noise imaging conditions.
- the present invention comprises a computer-implemented method, system and computer program product for processing volumetric videos.
- volumetric videos, which may consist of biomedical image data, are received.
- such volumetric videos may be generated by a non-ionizing or an ionizing biomedical imaging system.
- Convolutional operations are then performed over three spatial dimensions and a fourth dimension of the received volumetric video using a 4D convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer.
- the fourth dimension corresponds to a temporal dimension.
- such convolutional operations correspond to “pseudo-4D convolutions,” which refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- Such a set of lower-dimensional convolutions includes a set of two-dimensional and/or three-dimensional convolutions that upon aggregation create features informed by the three spatial dimensions and the fourth dimension (e.g., temporal dimension).
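- As an illustration of this idea, the following is a minimal sketch (in PyTorch) of one such factorization: a 3D convolution over the spatial axes followed by a 1D convolution over the temporal axis, each applied by folding the unused axes into the batch. The (N, C, T, D, H, W) tensor layout and all names are assumptions for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class PseudoConv4dA(nn.Module):
    """Illustrative factorized 4D convolution: 3D (spatial) then 1D (temporal)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x):
        # x: (N, C, T, D, H, W) -- a volumetric video with temporal axis T.
        n, c, t, d, h, w = x.shape
        # Fold time into the batch and convolve over the spatial axes (D, H, W).
        y = self.spatial(x.transpose(1, 2).reshape(n * t, c, d, h, w))
        y = y.reshape(n, t, c, d, h, w).transpose(1, 2)
        # Fold space into the batch and convolve over the temporal axis T.
        z = y.permute(0, 3, 4, 5, 1, 2).reshape(-1, c, t)
        z = self.temporal(z)
        return z.reshape(n, d, h, w, c, t).permute(0, 4, 5, 1, 2, 3)

video = torch.randn(1, 8, 6, 10, 12, 14)   # made-up sizes
assert PseudoConv4dA(8)(video).shape == video.shape
```

  Stacking such factorized layers yields features informed by all four axes at a fraction of the parameter cost of a dense 4D kernel.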
- Figure 1 illustrates an embodiment of the present invention of a communication system 100 for practicing the principles of the present invention.
- Communication system 100 includes a volumetric video processing system 101 configured to receive volumetric videos 102 which are processed by volumetric video processing system 101, such as performing segmentation, denoising, deconvolution or image domain transfer on the received volumetric data, to produce processed volumetric videos 103.
- volumetric videos 102 consist of biomedical image data, such as data generated by a non-ionizing or an ionizing biomedical imaging system.
- volumetric video processing system 101 performs “pseudo-4D convolutions” on the received volumetric videos 102 using a four-dimensional convolutional neural network to perform a process, such as segmentation, denoising, deconvolution or image domain transfer, which is outputted as the processed volumetric videos 103.
- “Volumetric videos,” as used herein, refer to video recordings that capture four dimensions of an image, namely, the three spatial axes and a temporal component.
- Segmentation refers to the process of dividing an image into regions with similar properties, such as gray level, color, texture, brightness and contrast. The role of segmentation is to subdivide the objects in an image. In the case of medical image segmentation, the aim is to study the anatomical structure.
- “Denoising,” as used herein, refers to removing noise or distortions from an image.
- Convolution refers to a mathematical operation on two functions that produces a third function expressing how the shape of one function is modified by the other.
- the term “convolution,” as used herein, also refers to both the result function and to the process of computing it. In one embodiment, it is defined as the integral of the product of the two functions after one is reversed and shifted. The integral is evaluated for all values of the shift, producing the convolution function.
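- Written out, this is the standard definition:

  $$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$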
- Deconvolution refers to the operation inverse to convolution. In one embodiment, deconvolution is utilized for improving the contrast and resolution of images.
- Image domain transfer refers to transferring an image from a source domain to a target domain.
- the image may be transferred from a source domain of being blurred to a target domain of being clear.
- the image may be transferred from a source domain of being low-resolution to a target domain of being high-resolution.
- the image may be transferred from a source domain of being an image to a target domain of being a painting.
- the image may be transferred from a source domain of being noisy to a target domain of being clean.
- a style may be applied to an inputted image thereby transforming the input image to an output image that corresponds to the input image with the applied style.
- such processes may be performed by volumetric video processing system 101 using a 4D convolutional neural network by applying 4D convolutions on the inputted volumetric video 102, after training the 4D convolutional neural network to perform such operations as discussed in further detail below.
- a description of the software components of volumetric video processing system 101 used for processing volumetric videos is provided below in connection with Figure 2.
- a description of the hardware configuration of volumetric video processing system 101 is provided further below in connection with Figure 3.
- Figure 2 is a diagram of the software components used by volumetric video processing system 101 ( Figure 1) to process volumetric videos to perform a process, such as segmentation, denoising, deconvolution or image domain transfer, in accordance with an embodiment of the present invention.
- volumetric video processing system 101 includes an artificial neural network engine 201.
- artificial neural network engine 201 is configured to build an artificial intelligence model using a machine learning algorithm (e.g., supervised learning) based on sample data consisting of volumetric videos and segmented volumetric videos, denoised volumetric videos, deconvoluted volumetric videos or image domain transferred volumetric videos.
- such sample data is provided by an expert and is referred to herein as the "training data."
- training data is used by the machine learning algorithm to make predictions or decisions as to the segmentation, denoising, deconvolution or image domain transfer to be performed on the volumetric videos.
- the algorithm iteratively makes predictions on the training data as to the appropriate segmentation, denoising, deconvolution or image domain transfer to be performed on the volumetric videos.
- supervised learning algorithms include neural networks, such as a convolutional neural network.
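- As a brief sketch of this supervised setup (all shapes, sizes and names here are hypothetical, chosen only to make the iterative loop concrete):

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 2, kernel_size=3, padding=1)   # stand-in for the full network
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

videos = torch.randn(4, 1, 16, 64, 64)              # (N, C, D, H, W) training volumes
masks = torch.randint(0, 2, (4, 16, 64, 64))        # expert-provided per-voxel labels

for epoch in range(10):                             # iterative predictions on the data
    logits = model(videos)                          # (N, classes, D, H, W) scores
    loss = loss_fn(logits, masks)                   # penalize disagreement with expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```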
- a convolutional neural network is a class of artificial neural network, most commonly applied to analyze visual imagery.
- a convolutional neural network is also known as a shift invariant or space invariant artificial neural network based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation equivariant responses known as feature maps.
- the convolutional neural network is a regularized version of multilayer perceptrons.
- a “perceptron,” as used herein, refers to an algorithm for supervised learning of binary classifiers.
- a binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
- the perceptron is a type of linear classifier that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
- the convolutional neural network uses “pseudo-4D convolutions” to perform operations, such as segmentation, denoising, deconvolution or image domain transfer, on the received volumetric videos.
- "Pseudo-4D convolutions," as used herein, refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- In one embodiment, such pseudo-4D convolutions are performed by convolutional operator 202 of volumetric video processing system 101.
- convolutional operator 202 generates spatial feature maps for the volumetric video input, such as by performing a 3D convolution on the volumetric video input, and also generates a spatial-temporal representation of the volumetric video input by examining each of the three motion planes (one spatial axis plus the temporal aspect of the data), such as by performing, in a sequential manner, a 2D convolution over one of the three spatial axes (e.g., "z," representing the depth) and the temporal aspect of the data.
- Pseudo-4D convolutions may be performed by convolutional operator 202 using various software tools that provide convolutional primitives, such as standard deep learning frameworks (e.g., PyTorch or TensorFlow).
- A further description of these and other functions is provided below in connection with the discussion of the method for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video.
- Figure 3 illustrates an embodiment of the present invention of the hardware configuration of volumetric video processing system 101 (Figure 1), which is representative of a hardware environment for practicing the present disclosure.
- Volumetric video processing system 101 has a processor 301 connected to various other components by system bus 302.
- An operating system 303 runs on processor 301 and provides control and coordinates the functions of the various components of Figure 3.
- An application 304 in accordance with the principles of the present disclosure runs in conjunction with operating system 303 and provides calls to operating system 303 where the calls implement the various functions or services to be performed by application 304.
- Application 304 may include, for example, artificial neural network engine 201 ( Figure 2) and convolutional operator 202 ( Figure 2).
- application 304 may include, for example, a program for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video as discussed further below in connection with Figures 4-6, 7A-7B and 8A-8C.
- ROM 305 is connected to system bus 302 and includes a basic input/output system (“BIOS”) that controls certain basic functions of volumetric video processing system 101.
- random access memory (RAM) 306 and disk adapter 307 are also connected to system bus 302. It should be noted that software components including operating system 303 and application 304 may be loaded into RAM 306, which may be volumetric video processing system's 101 main memory for execution.
- Disk adapter 307 may be an integrated drive electronics (“IDE”) adapter that communicates with a disk unit 308, e.g., disk drive.
- volumetric video processing system 101 may further include a communications adapter 309 connected to bus 302. Communications adapter 309 interconnects bus 302 with an outside network to communicate with other devices.
- application 304 of volumetric video processing system 101 includes the software components of artificial neural network engine 201 and convolutional operator 202. In one embodiment, such components may be implemented in hardware, where such hardware components would be connected to bus 302.
- volumetric video processing system 101 is a particular machine that is the result of implementing specific, non-generic computer functions.
- the functionality of such software components (e.g., artificial neural network engine 201 and convolutional operator 202) of volumetric video processing system 101, including the functionality for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video, may be embodied in an application specific integrated circuit.
- the present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand- alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- the flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the blocks may occur out of the order noted in the Figures.
- two blocks shown in succession may, in fact, be accomplished as one step, executed concurrently, substantially concurrently, in a partially or wholly temporally overlapping manner, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- a voxel represents a value on a regular grid, such as in three-dimensional space.
- because standard regularization techniques may be insufficient in creating a robust classifier, 4D convolutional neural networks may be susceptible to over-fitting high-dimensional patterns, thereby not achieving the desired accuracy.
- Consequently, there is not currently a means for effectively processing volumetric videos using a replacement for standard 4D convolution to overcome the above structural problems (computational expense) and improve accuracy, such as segmentation accuracy in dense biomedical imaging data.
- embodiments of the present invention provide a means for effectively processing volumetric videos that overcomes the computational expense and improves accuracy, such as segmentation accuracy in dense biomedical imaging data, by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video as discussed below in connection with Figures 4-6, 7A-7B and 8A-8C.
- Figure 4 is a flowchart of a method for training a convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer on volumetric videos.
- Figure 5 is a flowchart of a method for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video.
- Figure 6 is a flowchart of a method for forming a representation of the volumetric video using pseudo-4D convolutions.
- Figure 7A illustrates a 3D convolution performed over the spatial dimensions and a 1D convolution performed over time.
- Figure 7B illustrates a 3D convolution performed over the spatial dimensions and a sequence of 2D convolutions performed over each of the motion planes.
- Figure 8A illustrates an architecture of a 4D convolutional neural network being an encoder-decoder architecture.
- Figure 8B illustrates a visualization of a computational block for the P4D-A convolutional neural network.
- Figure 8C illustrates a visualization of a computational block for the P4D-B convolutional neural network.
- FIG. 4 is a flowchart of a method 400 for training a convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer on volumetric videos in accordance with an embodiment of the present invention.
- artificial neural network engine 201 of volumetric video processing system 101 receives training data, such as from an expert, to train a convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer on volumetric videos.
- the training data corresponds to sample data consisting of volumetric videos and segmented volumetric videos, denoised volumetric videos, deconvoluted volumetric videos or image domain transferred volumetric videos.
- artificial neural network engine 201 of volumetric video processing system 101 trains a 4D convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer on volumetric videos using the training data.
- artificial neural network engine 201 utilizes a machine learning algorithm (e.g., supervised learning algorithm) to make predictions or decisions as to the segmentation, denoising, deconvolution or image domain transfer to be performed on the volumetric videos using the received training data.
- the algorithm iteratively makes predictions on the training data as to the appropriate segmentation, denoising, deconvolution or image domain transfer to be performed on the volumetric videos.
- supervised learning algorithms include neural networks, such as a 4D convolutional neural network.
- a convolutional neural network is a class of artificial neural network, most commonly applied to analyze visual imagery.
- a convolutional neural network is also known as a shift invariant or space invariant artificial neural network based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation equivariant responses known as feature maps.
- the convolutional neural network is a regularized version of multilayer perceptrons.
- a “perceptron,” as used herein, refers to an algorithm for supervised learning of binary classifiers.
- a binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
- the perceptron is a type of linear classifier that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
- the convolutional neural network utilized by the present invention corresponds to a 4D convolutional neural network.
- a 4D convolutional neural network refers to a convolutional neural network that processes a 4D input, such as a 4D volumetric video.
- the convolutional neural network uses "pseudo-4D convolutions" to perform operations, such as segmentation, denoising, deconvolution or image domain transfer, on the received volumetric videos.
- "Pseudo-4D convolutions," as used herein, refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- volumetric videos may then be processed by such a 4D convolutional neural network using pseudo-4D convolutions as discussed below in connection with Figure 5.
- FIG. 5 is a flowchart of a method 500 for processing volumetric videos using a 4D convolutional neural network by utilizing multiple lower-dimensional convolutions along all dimensions of the volumetric video in accordance with an embodiment of the present invention.
- volumetric video processing system 101 receives volumetric videos 102.
- such volumetric videos consist of biomedical image data, such as data generated by a non-ionizing or an ionizing biomedical imaging system.
- in step 502, for each of the received volumetric videos 102, convolutional operator 202 of volumetric video processing system 101 performs convolutional operations over three spatial dimensions and a fourth dimension of the received volumetric video using a 4D convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer.
- the fourth dimension corresponds to a temporal dimension.
- such convolutional operations correspond to “pseudo-4D convolutions,” which as discussed above, refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- Such a set of lower-dimensional convolutions includes a set of two-dimensional and/or three-dimensional convolutions that upon aggregation create features informed by the three spatial dimensions and the fourth dimension (e.g., temporal dimension).
- Such pseudo-4D convolutions form a representation of the volumetric video as discussed below in connection with Figure 6.
- Figure 6 is a flowchart of a method 600 for forming a representation of the volumetric video using pseudo-4D convolutions in accordance with an embodiment of the present invention.
- in step 601, convolutional operator 202 of volumetric video processing system 101 generates spatial feature maps of the volumetric video by the 4D convolutional neural network.
- in step 602, convolutional operator 202 of volumetric video processing system 101 generates a spatial-temporal representation by the 4D convolutional neural network examining each of the three motion planes of the volumetric video.
- each of these three motion planes includes a spatial axis (e.g., X, Y or Z), which may represent a length, a height or a depth, along with a fourth dimension, such as a temporal dimension, of the volumetric video.
- in step 603, convolutional operator 202 of volumetric video processing system 101 creates a representation of the volumetric video by combining the spatial feature maps and the spatial-temporal representation as discussed further below in connection with Figure 8C.
- A further discussion regarding methods 400, 500 and 600 utilizing "pseudo-4D convolutions" to replace dense 4D convolutions is provided below.
- pseudo-4D (P4D) convolutions are a separable kernel approach to 4D processing.
- Figures 7A and 7B illustrate two possible P4D convolutions that are utilized by the present invention.
- Figure 7A illustrates a 3D convolution performed over the spatial dimensions (identified as “S” 701) and a 1D convolution performed over time (identified as “T” 702) in accordance with an embodiment of the present invention.
- Figure 7B illustrates a 3D convolution performed over the spatial dimensions (identified as “S” 703) and a sequence of 2D convolutions performed over each of the motion planes (identified as “M” 704) in accordance with an embodiment of the present invention.
- such 2D convolutions are performed in the X, Y and Z spatial axes (e.g., height, width, depth) along the temporal dimension of the motion planes of the volumetric video, with a kernel size of T (temporal) × H (height) × W (width) × D (depth), represented by the labels 3 × 3 × 1 × 1 (705), 3 × 1 × 3 × 1 (706) and 3 × 1 × 1 × 3 (707).
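- One such motion-plane convolution can be sketched by folding the two untouched spatial axes into the batch and applying an ordinary 2D convolution over time and the remaining axis. The helper below is illustrative only; the layout and names are assumptions:

```python
import torch
import torch.nn as nn

def motion_plane_conv(x, conv2d, axis):
    """Apply `conv2d` over (time, one spatial axis) of a (N, C, T, D, H, W)
    tensor, folding the other two spatial axes into the batch.
    `axis` selects D (0), H (1) or W (2)."""
    c = x.shape[1]
    spatial = [3, 4, 5]
    keep = spatial.pop(axis)               # the spatial dim convolved with time
    order = [0] + spatial + [1, 2, keep]   # e.g., (N, D, W, C, T, H) for axis=1
    inv = [0] * 6
    for i, p in enumerate(order):
        inv[p] = i                         # permutation that undoes `order`
    y = x.permute(*order)
    sizes = y.shape
    y = conv2d(y.reshape(-1, c, sizes[-2], sizes[-1]))
    return y.reshape(*sizes).permute(*inv)

# Example: a 3 (T) x 3 (H) motion-plane kernel, i.e., the 3 x 3 x 1 x 1 label.
conv_th = nn.Conv2d(8, 8, kernel_size=3, padding=1)
video = torch.randn(2, 8, 6, 10, 12, 14)   # (N, C, T, D, H, W), made-up sizes
assert motion_plane_conv(video, conv_th, axis=1).shape == video.shape
```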
- the operations shown in Figures 7A and 7B differ in how the spatial and temporal processing are decoupled as well as the complexity of temporal processing as discussed below.
- nonlinear functions are inserted among the 2D and 3D convolutions.
- while Figures 7A and 7B illustrate the fourth dimension as being the temporal dimension, it is noted that the present invention is not to be limited in such a manner and that the fourth dimension may correspond to other dimensions.
- a person of ordinary skill in the art would be capable of applying the principles of the present invention to such implementations. Further, embodiments applying the principles of the present invention to such implementations would fall within the scope of the present invention.
- P4D-A implements a straightforward low-rank approximation to a 4D kernel by applying a 3D (xyz) convolution followed by a 1D (t) convolution.
- the P4D-B convolution is a parallel process that additively updates spatial features with a low dimensional encoding of the 4D spatial-temporal neighborhood.
- the P4D-B convolution encodes high-dimensional features via low-dimensional projections.
- the sequential application of 2D convolutions increases the size of the temporal receptive field without an undue increase in parametrization.
- the P4D convolutions possess unique properties unavailable to the standard 4D convolution. First, both P4D convolutions are smaller in parameter count than the standard 4D convolution.
- a P4D-A convolutional neural network is one-third the size of a comparable 4D convolutional neural network (CNN); a P4D-B convolutional neural network is two-thirds the size.
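- As a check on these ratios, consider a kernel of length k = 3 in every dimension and count the weights per kernel (channel dimensions omitted; one spatial 3D kernel plus one temporal or motion path per block, consistent with the block structures described below):

  $$\underbrace{k^4}_{\text{4D}} = 81, \qquad \underbrace{k^3 + k}_{\text{P4D-A}} = 30 \approx \tfrac{1}{3} \cdot 81, \qquad \underbrace{k^3 + 3k^2}_{\text{P4D-B}} = 54 = \tfrac{2}{3} \cdot 81.$$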
- the distribution of parameters within a P4D convolution can be weighted toward a specific dimension to leverage prior knowledge of the nature of the data. Further details on preactivated, residual P4D convolutions are provided below.
- the P4D-A approach separates spatial and temporal processing into two sequential operations, 701, 702.
- the spatial operation (W_s) 701 uses a 3D convolution of size three (3 × 3 × 3), with the temporal operation (W_t) 702 being a 1D convolution of length three.
- a P4D-A convolution, used in a residual fashion, is defined in equation (1):

  $$y = x + W_t \ast F_2\big(W_s \ast F_1(x)\big) \tag{1}$$

  where function $F_i(\cdot)$ represents the application of a parametric rectified linear unit and group normalization, and $\ast$ represents an appropriately padded convolution with convolutional weights $W_s$ 701 or $W_t$ 702.
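- A direct rendering of equation (1) as a preactivated, residual block might look as follows (a sketch only; the (N, C, T, D, H, W) layout and the group count are assumptions):

```python
import torch.nn as nn

class P4DABlock(nn.Module):
    """Sketch of equation (1): y = x + W_t * F2(W_s * F1(x)),
    with each F_i realized as GroupNorm followed by PReLU."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.f1 = nn.Sequential(nn.GroupNorm(groups, channels), nn.PReLU(channels))
        self.w_s = nn.Conv3d(channels, channels, 3, padding=1)   # 3x3x3 spatial
        self.f2 = nn.Sequential(nn.GroupNorm(groups, channels), nn.PReLU(channels))
        self.w_t = nn.Conv1d(channels, channels, 3, padding=1)   # length-3 temporal

    def forward(self, x):                       # x: (N, C, T, D, H, W)
        n, c, t, d, h, w = x.shape
        y = self.f1(x)                          # F1: normalization + nonlinearity
        y = self.w_s(y.transpose(1, 2).reshape(n * t, c, d, h, w))
        y = y.reshape(n, t, c, d, h, w).transpose(1, 2)
        y = self.f2(y)                          # F2: normalization + nonlinearity
        y = y.permute(0, 3, 4, 5, 1, 2).reshape(-1, c, t)
        y = self.w_t(y).reshape(n, d, h, w, c, t).permute(0, 4, 5, 1, 2, 3)
        return x + y                            # residual update of equation (1)
```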
- each of these convolutions is preceded by a nonlinear function and normalization; $F_M(\cdot)$ represents these steps in equation (2):

  $$y = x + W_s \ast F_M(x) + W_{zt} \ast F_M\big(W_{yt} \ast F_M\big(W_{xt} \ast F_M(x)\big)\big) \tag{2}$$

  where $W_{xt}$, $W_{yt}$ and $W_{zt}$ denote the sequential 2D motion-plane convolutions and $W_s$ denotes the parallel spatial 3D convolution.
- the 4D convolutional neural network utilizes an encoder-decoder architecture 800 as shown in Figure 8A in accordance with an embodiment of the present invention to evaluate the effectiveness of the different 4D kernels at the task of semantic segmentation.
- the first layer generates 16 features 801 where such features 801 are doubled in encoder layers (downsampling) and halved within decoder layers (upsampling).
- as shown in Figure 8A, architecture 800 includes linear layers 802, computational blocks 803, strided convolutions 804, transposed convolutions 805, and skip connections 806.
- Figures 8B and 8C define layer-wise operations as discussed further below.
- the 4D convolutional neural network is trained to minimize focal loss and utilizes the Adam optimizer (a replacement optimization algorithm for stochastic gradient descent for training deep learning models, such as convolutional neural networks) with a learning rate of 2 × 10⁻⁴ and a weight decay of 5 × 10⁻⁵, with dropout applied as shown in Figures 8B and 8C.
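- A minimal setup sketch matching these stated hyperparameters (the model stand-in and the focusing parameter gamma are hypothetical; the source does not specify them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # stand-in for the 4D CNN

# Adam with the stated learning rate and weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=5e-5)

def focal_loss(logits, target, gamma=2.0):
    """Minimal binary focal loss: down-weights easy, well-classified voxels."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = torch.where(target > 0.5, p, 1 - p)        # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()
```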
- Figure 8B illustrates a visualization of computational block 803 for P4D-A convolutional neural network in accordance with an embodiment of the present invention.
- computational block 803 includes a 3D convolution 807 performed over the spatial dimensions of the volumetric video as well as a 1D convolution 808 performed over the temporal dimension of the volumetric video, along with learning the parameters that control the shape and leakiness of the activation via the Parametric Rectified Linear Unit ("PReLU") and performing group normalization ("GroupNorm"), as shown in elements 809A-809B.
- Figure 8C illustrates a visualization of computational block 803 for P4D-B convolutional neural network in accordance with an embodiment of the present invention.
- computational block 803 includes a series of 2D convolutions 811A-811C performed in the X, Y and Z spatial axes (e.g., height, width, depth) along the temporal dimension of the motion planes of the volumetric video. Furthermore, the Parametric Rectified Linear Unit ("PReLU") and group normalization ("GroupNorm") are applied prior to such convolutions, as shown in elements 812A-812C. Additionally, as shown in Figure 8C, a 3D convolution 813 is performed over the spatial dimensions, with the PReLU and GroupNorm applied prior to such a convolution, as shown in element 812D.
- a nonlinear function is inserted and normalization is performed preceding each of the convolutional operations 811A-811C and 813.
- the process discussed above is repeated twice, followed by a dropout 814 that randomly selects neurons to be ignored.
- the output of linear layer 802 is combined with the output of dropout 814.
- nonlinear functions are inserted among the 2D and 3D convolutions.
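- Combining the pieces above, one possible rendering of this computational block is sketched below (it reuses the motion_plane_conv helper from the earlier sketch, and the layout and group count remain assumptions):

```python
import torch.nn as nn

class P4DBBlock(nn.Module):
    """Sketch of a P4D-B block: sequential motion-plane 2D convolutions in
    parallel with a spatial 3D convolution, each preceded by GroupNorm +
    PReLU, combined additively as in equation (2)."""
    def __init__(self, c, groups=4):
        super().__init__()
        self.pre = nn.ModuleList(
            [nn.Sequential(nn.GroupNorm(groups, c), nn.PReLU(c)) for _ in range(4)])
        self.motion = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=1) for _ in range(3)])  # (t,d), (t,h), (t,w)
        self.spatial = nn.Conv3d(c, c, 3, padding=1)

    def forward(self, x):                       # x: (N, C, T, D, H, W)
        m = x
        for axis in range(3):                   # sequential motion-plane path
            m = motion_plane_conv(self.pre[axis](m), self.motion[axis], axis)
        n, c, t, d, h, w = x.shape
        s = self.pre[3](x).transpose(1, 2).reshape(n * t, c, d, h, w)
        s = self.spatial(s).reshape(n, t, c, d, h, w).transpose(1, 2)
        return x + s + m                        # additive spatial-feature update
```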
- Cardiac cine MRI is the gold standard for evaluating cardiac function as it provides insight into tissue-level dynamics within a cross-section or volume. Evaluating cardiac function requires segmenting the endocardium and epicardium in the right and/or left ventricle. In contrast to the left ventricle, analysis of the right ventricle (RV) is complicated due to its crescent-shaped appearance and longitudinal, rather than radial, elongation.
- the first challenge potentially affects 3D processing and stems from how the collection of slices over depth (z) occurs.
- the collection of slices over (z) is done by imaging (t,x,y) volumes in sequence.
- Figure 9 is a table of the cine MRI segmentation results utilizing the principles of the present disclosure.
- the table of Figure 9 illustrates the segmentation quality for a 4D CNN and the P4D CNNs alongside the results for fully autonomous methods previously submitted to the Right Ventricle Segmentation Challenge (RVSC).
- the table of Figure 9 presents three fully automatic methods by Zuluaga et al., “Multi-Atlas Propagation Whole Heart Segmentation from MRI and CTA Using a Local Normalised Correlation Coefficient Criterion," Springer, 2013, pp.174-181, Tran et al., “A Fully Convolutional Neural Network for Cardiac Segmentation in Short-Axis MRI,” arXiv e-prints, 2016, pp.1-21, and Chuck-Hou Yee, “Heart Disease Diagnosis with Deep Learning,” September 12, 2017, pp.1-13. Outside of these machine methods, the table includes the estimated accuracy of a trained clinician to serve as the gold standard.
- the segmentation results, reported as the Dice score, are displayed as an average (standard deviation), and parameter counts are rounded to the nearest million. These results are from the RVSC test set.
- Example raw data, ground truth, 4D results and P4D-B results are shown in Figure 10 in accordance with an embodiment of the present disclosure.
- the rows are example data slices that serve as the input to the CNNs, including data with an overlay of the ground truth, data with an overlay of the results from the 4D CNN and data with an overlay of the results from the P4D-B CNN.
- the scale bar in the first raw image covers 50 mm.
- the results illustrate the impact of higher dimensional processing as the CNNs improve over previously examined methods by a noticeable margin.
- a gated recurrence module 1101 of volumetric video processing system 101, shown in Figure 11 in accordance with an embodiment of the present disclosure, performs gated recurrence for the adaptive alignment of features ("GRAAF"), which provides motion correction without explicit frame warping or optical flow.
- application 304 of volumetric video processing system 101 includes the software component of gated recurrence module 1101.
- such a component may be implemented in hardware, where such hardware components would be connected to bus 302.
- the functions discussed herein performed by such a component are not generic computer functions.
- volumetric video processing system 101 is a particular machine that is the result of implementing specific, non-generic computer functions.
- the functionality of such a software component (e.g., gated recurrence module 1101) of volumetric video processing system 101 may be embodied in an application specific integrated circuit.
- the GRAAF neural network of the present disclosure corresponds to an encoder-decoder neural network, which operates in a feed-forward fashion. This formulation allows for the denoising of an unseen frame x[t] with only the hidden states associated with the past denoised frame x[t-1] while avoiding identity collapse.
- gated recurrence module 1101 is a modified gated recurrent unit via the introduction of an adaptive alignment module first formulated for video super-resolution.
- gated recurrence module 1101 implements a frame-by-frame implicit alignment, in which frame x[t] is denoised using the hidden states of x[t-1].
- the modified equations generate a reference volume, G(f_t), informed by both local and global information.
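- Since equations (8)-(11) are not reproduced here, the following is only a generic convolutional gated-recurrent-unit update, offered to illustrate how hidden states from the previously denoised frame can gate the processing of the current frame; it is not the GRAAF formulation itself:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Standard convolutional GRU update over per-frame feature maps."""
    def __init__(self, c):
        super().__init__()
        self.gates = nn.Conv2d(2 * c, 2 * c, 3, padding=1)   # update and reset gates
        self.cand = nn.Conv2d(2 * c, c, 3, padding=1)        # candidate hidden state

    def forward(self, x_t, h_prev):
        # x_t: features of frame x[t]; h_prev: hidden state from frame x[t-1].
        z, r = torch.sigmoid(self.gates(torch.cat([x_t, h_prev], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x_t, r * h_prev], 1)))
        return (1 - z) * h_prev + z * h_tilde                # gated blend
```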
- Equations (8)-(11) detail the operations of gated recurrence module 1101 and are visualized in Figure 12.
- Figure 12 is a control chart for equations (8)-(11) in accordance with an embodiment of the present disclosure.
- adaptive alignment 1201 is performed using gated recurrence module 1101.
- the extracted central frame serves as the target.
- the neural network of the present disclosure takes in two streams, x_1 and x_2, which are copies of one another, where x_2 is flipped with respect to its temporal dimension.
- these two representations are stored in the batch of a tensor such that they do not interact until the final stage, where the outputs from the decoder, (y[t-n], ..., y[t-1], y[t+1], ..., y[t+n]), are utilized to estimate the value in the central frame.
- Table 1 presents the denoising metrics. Referring to Table 1, higher numbers mean a more efficacious denoising, as PSNR (peak signal-to-noise ratio) is calculated according to

  $$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$

  where MSE is the mean square error and MAX_I is equal to $2^{\text{bit\_depth}}$. Furthermore, Table 1 includes the metrics of the structural similarity index measure (SSIM) as well as the spatially averaged temporal correlations (Mean Cross Correlation) for the top k voxels, where top k concerns choosing a set of k voxels that have the highest summed value over time, which directly correlates to neural activity.
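- For reference, a direct implementation of this metric (the bit depth default is a hypothetical choice):

```python
import torch

def psnr(x, ref, bit_depth=16):
    """PSNR in dB with MAX_I = 2**bit_depth, per the formula above."""
    mse = torch.mean((x.float() - ref.float()) ** 2)
    max_i = float(2 ** bit_depth)
    return 10.0 * torch.log10(max_i ** 2 / mse)
```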
- embodiments of the present invention provide a means for effectively processing volumetric videos that overcomes the computational expense and improves accuracy, such as segmentation accuracy in dense biomedical imaging data, by utilizing pseudo-4D convolutions of the volumetric video (multiple lower-dimensional convolutions along all dimensions of the volumetric video).
- pseudo-4D convolutions of the present invention permit the integration of information from all four dimensions present in volumetric video. Such processing leads to segmentation, denoising, deconvolution or image domain transfer results that are superior to those achieved with prior techniques.
- Pseudo-4D convolutions improve upon the 4D convolution in several ways.
- imaging systems produce representations of an object’s form, especially a visual representation.
- biomedical imaging systems enable the visualization of the body's internal organs and their diseases, such as via volumetric videos. In the past, such volumetric videos simply captured a three-dimensional space, such as a location or performance.
- volumetric videos capture four dimensions, namely, the three spatial axes and a temporal component.
- Attempts have recently been made to process volumetric videos using a 4D convolution.
- 4D convolution presents a problem of computational expense: given a kernel of length k in each dimension, a 4D convolution makes use of k⁴ parameters (e.g., 81 weights per kernel for k = 3), leading to poor memory/parameter scaling.
- a voxel represents a value on a regular grid, such as in three-dimensional space, and standard regularization techniques may be insufficient in creating a robust classifier for such high-dimensional data.
- Embodiments of the present disclosure improve such technology by receiving volumetric videos, which may consist of biomedical image data.
- such volumetric videos may be generated by a non-ionizing or an ionizing biomedical imaging system.
- Convolutional operations are then performed over three spatial dimensions and a fourth dimension of the received volumetric video using a 4D convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer.
- the fourth dimension corresponds to a temporal dimension.
- such convolutional operations correspond to “pseudo-4D convolutions,” which refer to a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion.
- Such a set of lower-dimensional convolutions includes a set of two-dimensional and/or three-dimensional convolutions that upon aggregation create features informed by the three spatial dimensions and the fourth dimension (e.g., temporal dimension).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Radiology & Medical Imaging (AREA)
- Medical Informatics (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Primary Health Care (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
A computer-implemented method, system and computer program product for processing volumetric videos. Volumetric videos, which may consist of biomedical image data, are received. Such volumetric videos may be generated by a non-ionizing or an ionizing biomedical imaging system. Convolutional operations are then performed over three spatial dimensions and a fourth dimension of a received volumetric video using a 4D convolutional neural network to perform segmentation, denoising, deconvolution or image domain transfer, where the fourth dimension may correspond to a temporal dimension. Furthermore, such convolutional operations involve a set of lower-dimensional convolutions that, upon aggregation, create features informed by all spatial-temporal axes while avoiding the burden of parameter explosion. Such a set of lower-dimensional convolutions may comprise one-dimensional, two-dimensional and/or three-dimensional convolutions that, upon aggregation, create features informed by the three spatial dimensions and the fourth dimension (e.g., the temporal dimension).
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263308128P | 2022-02-09 | 2022-02-09 | |
| US63/308,128 | 2022-02-09 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023154358A1 (fr) | 2023-08-17 |
Family
ID=87564959
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/012643 (WO2023154358A1, Ceased) | Processing volumetric videos using a 4D convolutional neural network and multiple lower-dimensional convolutions | 2022-02-09 | 2023-02-08 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023154358A1 (fr) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160198168A1 (en) * | 2012-05-14 | 2016-07-07 | Luca Rossato | Encoding and decoding based on blending of sequences of samples along time |
| US9968257B1 (en) * | 2017-07-06 | 2018-05-15 | Halsa Labs, LLC | Volumetric quantification of cardiovascular structures from medical imaging |
| US20200167930A1 (en) * | 2017-06-16 | 2020-05-28 | Ucl Business Ltd | A System and Computer-Implemented Method for Segmenting an Image |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23753413 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 23753413 Country of ref document: EP Kind code of ref document: A1 |