
WO2025049074A1 - Robotic grasping using efficient vision transformer - Google Patents

Robotic grasping using efficient vision transformer

Info

Publication number
WO2025049074A1
Authority
WO
WIPO (PCT)
Prior art keywords
grasp
map
grasping
image frame
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/041621
Other languages
French (fr)
Inventor
Kyle COELHO
Brian ZHU
Ines UGALDE DIAZ
Husnu Melih ERDOGAN
Eugen SOLOWJOW
Paul Andreas BATSII
Christopher SCHÜTTE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Siemens Corp
Original Assignee
Siemens AG
Siemens Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG, Siemens Corp filed Critical Siemens AG
Publication of WO2025049074A1 publication Critical patent/WO2025049074A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1612Programme controls characterised by the hand, wrist, grip control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1615Programme controls characterised by special kind of manipulator, e.g. planar, scara, gantry, cantilever, space, closed chain, passive/active joints and tendon driven manipulators
    • B25J9/162Mobile manipulator, movable base with manipulator arm mounted on it
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39484Locate, reach and grasp, visual guided grasping
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39536Planning of hand motion, grasping
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39542Plan grasp points, grip matrix and initial grasp force
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39543Recognize object and plan hand shapes in grasping movements
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/40Robotics, robotics mapping to robotics vision
    • G05B2219/40532Ann for vision processing
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/40Robotics, robotics mapping to robotics vision
    • G05B2219/40564Recognize shape, contour of object, extract position and orientation
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/40Robotics, robotics mapping to robotics vision
    • G05B2219/40629Manipulation planning, consider manipulation task, path, grasping

Definitions

  • the present disclosure relates generally to the field of robotics in industrial automation, and more specifically to methods and systems for learning and executing robotic grasps utilizing neural networks.
  • Robotic bin picking of unknown objects is a crucial component in advancing automation in warehouses and manufacturing lines. Recent advances in computer vision and deep learning have enabled this technology to be deployed at large scales for picking objects of all geometries and in any arrangement.
  • Camera systems, such as RGB-D cameras, may collect both color pictures and depth maps or point-clouds of bins with objects in random configurations.
  • the camera input may then be transferred to deep neural networks (“grasping neural networks”) that have been trained to compute optimal grasping locations or “pick points” based on said input.
  • the success of a robotic bin picking solution may be measured in customer-focused KPIs that include the number of operator interventions and successful picks per hour. It is desirable for the number of operator resets to be minimized and picks per hour to be maximized to allow for continuous running of the deployed system and to maximize the throughput of goods. To enable this, however, it may be necessary to ensure that a grasping neural network not only delivers a valid pick point, but that the pick point is also centered regardless of geometry and is located on an object that is totally un-occluded (not blocked by any other object).
  • a first aspect of this disclosure provides a computer-implemented method for executing robotic grasps.
  • the method comprises acquiring, via a camera, an image of a scene including one or more objects, the image defined by a depth image frame and a color image frame.
  • the method further comprises processing the acquired image using a trained grasping neural network.
  • the processing comprises passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames.
  • the processing further comprises fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map.
  • the processing further comprises feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map.
  • the processing further comprises feeding an output of the vision transformer to a decoder to construct a grasp map of the scene.
  • the method further comprises estimating an optimal grasping location based on the grasp map.
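  • For illustration, the following is a minimal sketch of the claimed inference flow, assuming a hypothetical camera object with a capture() method and a grasping_net module implementing the encoder, fusion, transformer and decoder stages described in this disclosure; the names and tensor shapes are illustrative and not taken from the patent.

```python
import torch

def run_grasp_inference(camera, grasping_net: torch.nn.Module):
    """Return a pixel-wise grasp map for the current pick scene (illustrative sketch)."""
    depth, color = camera.capture()  # aligned H x W depth frame and H x W x 3 color frame
    depth_t = torch.as_tensor(depth, dtype=torch.float32)[None, None]             # (1, 1, H, W)
    color_t = torch.as_tensor(color, dtype=torch.float32).permute(2, 0, 1)[None]  # (1, 3, H, W)
    with torch.no_grad():
        # Internally: per-modality convolutional encoders -> feature fusion ->
        # patch-wise vision transformer -> decoder producing the grasp map.
        grasp_map = grasping_net(depth_t, color_t)  # (1, 1, H, W) grasp confidences
    return grasp_map[0, 0].cpu().numpy()
```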
  • a second aspect of this disclosure provides a computer-implemented method for learning robotic grasping.
  • the method comprises inputting images from a training dataset to a grasping neural network comprising an encoder-decoder architecture, each image comprising a color image frame and a corresponding depth image frame depicting a scene including one or more objects placed in random configurations.
  • the method further comprises, for each image, passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames.
  • the method further comprises, for said image, fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map.
  • the method further comprises, for said image, feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map.
  • the method further comprises, for said image, feeding an output of the vision transformer to a first decoder to construct a grasp map of the scene.
  • the method comprises training the grasping neural network based on a grasping loss of the first decoder, the grasping loss being computed based on ground truth grasp maps for the images in the training dataset.
  • FIG. 2 schematically illustrates an encoder-decoder neural network architecture incorporating a vision transformer for robotic grasping according to one or more embodiments.
  • FIG. 3 schematically illustrates a further development to the neural network architecture shown in FIG. 1 to include a segmentation decoder in addition to a grasp decoder, according to one or more embodiments.
  • FIG. 4 illustrates a comparison between traditional compute-heavy attention and lightweight attention computation in a vision transformer according to disclosed embodiments.
  • FIG. 5, FIG. 6 and FIG. 7 illustrate outputs of a grasping neural network according to disclosed embodiments under different example scenarios.
  • FIG. 8 illustrates a computing environment within which embodiments of this disclosure may be implemented.
  • Various technologies are described herein that are directed to robotic grasping of objects in industrial applications.
  • the objects may be placed in a bin, or otherwise disposed, such as on a table.
  • the proposed methodology is described in the context of a robotic bin picking application.
  • the term “bin”, as used herein, refers to a container or other structure (e.g., a tray, tote, pallet, carton, etc.) capable of receiving physical objects.
  • a robotic bin picking application typically involves controlling a robot having a robotic arm with end effector to grasp objects individually from a pile of objects disposed in a bin.
  • the objects may be of the same or assorted types, and may, for example, be disposed in random configurations or poses in the bin.
  • an optimal grasping location (also referred to herein as “pick point”) may be computed from a camera image of the pick scene that includes the bin with the objects using a grasping neural network.
  • Grasping neural networks are typically convolutional, such that, given an input image (typically including a depth frame), the network can output a grasp map that assigns to each pixel of the input image some type of grasp score indicative of a confidence of grasp.
  • an optimal grasp location or pick point may be computed, typically by applying an ‘argmax’ operator on the grasp map, i.e., selecting the pixel with the highest grasp confidence.
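  • As a concrete example of this step, a minimal sketch of the argmax-based pick point selection is shown below (the function name is illustrative).

```python
import numpy as np

def pick_point_from_grasp_map(grasp_map: np.ndarray) -> tuple[int, int]:
    """Apply the 'argmax' operator: return (row, col) of the pixel with the
    highest grasp confidence value in the grasp map."""
    row, col = np.unravel_index(np.argmax(grasp_map), grasp_map.shape)
    return int(row), int(col)
```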
  • the proposed methodology provides an improved deep learning-based grasping pipeline that can provide robust, centered and un-occluded pick points on unknown objects in random configurations and that can be run efficiently on low compute platforms, such as Edge devices.
  • This objective is achieved by redesigning the grasping neural network architecture to truly understand the semantic and geometric properties of a pick scene and all the objects in it.
  • the pipeline uses a depth image frame and a color image frame as input to the grasping neural network.
  • the grasping neural network has an encoder-decoder architecture and utilizes an efficient vision transformer with self-attention to learn the complex relationships between objects in a scene and the full extent of an object’s geometry.
  • the proposed methodology is based on an inventive mechanism to fuse the features extracted from the input depth and color image frames to enable the maximum amount of information sharing between these two modalities to consistently produce desired results, especially in case one of the input modalities is sub-optimal. This is important because there are complementary as well as non-complementary features in the depth and color image frames. For example, textures on objects in a color image will not be present in a depth image, while the edges in a color image should match the same edges in the depth image.
  • the input to a vision transformer is defined by dividing an input image into a number of non-overlapping patches, which are fed as a sequence of input tokens to the transformer.
  • an image of 8x8 pixels can be broken up into 16 patches of 2x2 pixels.
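  • A minimal sketch of such patch splitting is shown below; it reproduces the 8x8-pixel example (sixteen 2x2 patches), with each patch flattened into one input token. The function name is illustrative.

```python
import torch

def to_patches(x: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, H, W) tensor into non-overlapping p x p patches,
    returned as a (num_patches, C*p*p) token sequence."""
    c, h, w = x.shape
    patches = x.unfold(1, p, p).unfold(2, p, p)  # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)     # (H/p, W/p, C, p, p)
    return patches.reshape(-1, c * p * p)        # one flattened token per patch

tokens = to_patches(torch.rand(1, 8, 8), p=2)    # 8x8 image -> 16 tokens, as in the example
assert tokens.shape == (16, 4)
```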
  • the input token sequence then goes through a self-attention process at a transformer attention layer.
  • This approach can pose some challenges, especially in the context of autonomous robotic bin picking, including, among others: computational complexity, and the fact that, unlike words in a sentence, an image can start to lose semantic meaning once it is cut up into patches.
  • the input depth and color image frames are first down sampled by convolutional blocks of an encoder to learn a set of features (feature maps) that are fused to produce a fused feature map.
  • the fused feature map is spatially divided into patches and passed on to a vision transformer.
  • the proposed methodology can enable the vision transformer to understand pick scenes with randomly oriented objects better with each patch being encoded with information about every other patch in the fused feature map.
  • the step of prior down sampling can also reduce computational complexity of the transformer, which may be particularly beneficial during inference time.
  • the architecture of the proposed methodology may be understood as following a wishbone or Y-shaped structure where each input (i.e., the depth image and the color image) first goes through its own set of processing with the convolutional blocks and then gets processed together with the transformer attention.
  • the reasoning behind the methodology may be as follows: first, each input is used to learn its own set of low-level features such as edges and contours, after which the addition of the feature maps extracted from these inputs will boost similar features while negating dissimilar ones, and finally the vision transformer will jointly attend to the best representation that both the inputs provide.
  • the transformer output is fed to a decoder to construct a grasp map of the pick scene, based on which an optimal grasping location is estimated.
  • the methodology can provide an end-to-end solution because an output is directly produced with the desired result.
  • the grasping neural network can produce highly centered picks on objects that are totally un-occluded for any unknown object. This is a testament to the heightened awareness that the grasping neural network has for geometries and spatial relationships between objects. Moreover, due to the transformer’s attention and the inventive fusion mechanism, the grasping neural network can compensate for when one of the input modalities is of poor quality, leading to much better generalization and robustness when the trained grasping neural network is deployed. Furthermore, the architecture can provide for high computational efficiency, allowing it to be run on CPU-only edge devices, which greatly expands its range of applications. Still further, the
  • aspects of the proposed methodology may be embodied as software executable by a processor.
  • aspects of the disclosed methodology may be suitably integrated into commercial artificial intelligence (AI)-based automation software products, such as SIMATIC Robot Pick AI™ developed by Siemens AG, among others.
  • FIG. 1 illustrates an autonomous system 100 configured for performing robotic bin picking according to one or more embodiments.
  • the term “system” refers to the autonomous system 100.
  • the system 100 may be implemented in a factory setting. In contrast to conventional automation, autonomy gives each asset on the factory floor the decision-making and self-controlling abilities to act independently in the event of local issues.
  • the system 100 may comprise one or more robots, such as the robot 102, which may be controlled by a computing system 104 to execute one or more industrial tasks within a physical environment such as a shopfloor. Examples of industrial tasks include assembly, transport, or the like.
  • the computing system 104 may comprise an industrial PC, or any other computing device, such as a desktop or a laptop, or an embedded system, among others.
  • the computing system 104 can include one or more processors configured to process information and/or control various operations associated with the robot 102.
  • the processor(s) may include one or more CPUs, GPUs, microprocessors, or any hardware devices suitable for executing instructions stored on a memory comprising a machine-readable medium.
  • the one or more processors may be configured to execute an application program, such as an engineering tool, for operating the robot 102.
  • the application program may be designed to operate the robot 102 to perform a task in a skill-based programming environment.
  • the skills are derived for higher-level abstract behaviors centered on how the physical environment is to be modified by the programmed physical device, such as the robot 102.
  • Illustrative examples of skills include a skill to grasp or pick up an object, a skill to place an object, a skill to open a door, a skill to detect an object, and so on.
  • the application program may generate controller code that defines a task at a high level, for example, using skill functions as described above, which may be communicated to a robot controller 108. From the high-level controller code, the robot controller 108 may generate low-level control signals for one or more motors for controlling the movement of the robot 102, such as angular position of the robot arms, swivel angle of the robot base, and so on, to execute the specified task.
  • the controller code generated by the application program may be communicated to intermediate control equipment, such as programmable logic controllers (PLC), which may then generate low-level control commands for the robot 102 to be controlled. Additionally, the application program may be configured to directly integrate sensor data from the physical environment in which the robot 102 operates.
  • the computing system 104 may comprise a network interface to facilitate transfer of live data between the application program and various sensors, such as camera 122.
  • the robot 102 can include a robotic arm or manipulator 110 and a base 112 configured to support the robotic manipulator 110.
  • the base 112 can include wheels 114 or can otherwise be configured to move within the physical environment 106.
  • the robot 102 can further include an end effector 116 attached to the robotic manipulator 110.
  • the end effector 116 may include a gripper configured to grasp (hold) and pick up an object 118. Examples of end effectors include vacuum-based grippers or suction cups (as shown), antipodal grippers such as fingers or claws, and magnetic grippers, among others.
  • the robotic manipulator 110 can be configured to move so as to change the position of the end effector 116, to enable picking and moving objects 118 within the physical environment.
  • a robotic bin picking task may involve picking objects 118 in a singulated manner from the bin 120 containing the objects 118 using the end effector 116.
  • the objects 118 may be disposed in random configurations (or poses) within the bin 120.
  • the objects 118 can be of assorted types or of the same type.
  • the system 100 may include sensors or cameras that enable the robot 102 to perceive the physical environment. As shown, these sensors may include (among others) a camera 122 for capturing an image of the pick scene that includes, in this case, the bin 120 containing the objects 118.
  • the camera 122 may include, for instance, an RGB-D camera, or a point cloud sensor, among others.
  • the camera 122 may be positioned to capture an image with a top-down view of the bin 120.
  • the camera image may be provided as an input to a computing system, such as the computing system 104, for computing a pick point for the end effector 116 to execute a grasp.
  • the pick point may be defined by coordinates of the end effector 116 in a 3D reference frame 130, as well as a direction of approach, which is typically defined by a normal to the object surface on which the pick point lies.
  • the pick point may be computed using a grasping neural network according to the proposed methodology.
  • the computed pick point may be outputted to a controller, such as the robot controller 108, to control the end effector 116 to pick an object 118.
  • the pick point may be output as high-level controller code to the controller, which may therefrom generate low-level commands to control movement of the end effector 116.
  • FIG. 2 illustrates an architecture of a grasping neural network (hereinafter “GNN”) 200 according to one or more embodiments.
  • the various components of the GNN such as encoders 206, 208 and 210, decoder 218 including sub-components thereof, may be implemented by a computing system in various ways as hardware and programming.
  • the programming for the components may take the form of processor-executable instructions stored on non-transitory machine-readable storage mediums and the hardware may include processors to execute those instructions.
  • the programs may run on the computing system 104 shown in FIG. 1.
  • the processing capability may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems or cloud/network elements.
  • a training dataset is utilized that comprises images where each image depicts a pick scene including one or multiple objects disposed in random configurations in a bin or on a table.
  • the training dataset may include real-world camera images of a pick scene, or synthetic camera images rendered by a simulation engine using 3D representations (e.g., CAD models) of the objects, bin/table and other physical structures present in the pick scene, or a combination of real-world and synthetic images.
  • a non-limiting example of a commercially available 3D simulation engine suitable for the present application is Process Simulate™ developed by Siemens Industry Software Inc.
  • the synthetic images of the training dataset may be rendered by randomizing a set of environmental and/or sensor-based parameters.
  • parameters that can be randomized include one or more of: number and object classes of objects present in the pick scene, position and texture of the objects in the bin, texture of background including the bin, settings of simulation camera, lighting conditions of the scene, etc.
  • the training may comprise initially training the GNN 200 using a large number of synthetic images generated using randomized parameters and subsequently fine-tuning it using a smaller number of real-world images.
  • Each image in the training dataset comprises a color image frame and a depth image frame.
  • a color image frame includes a two-dimensional representation of image pixels, where each pixel includes intensity values for a number of color components.
  • An example of a color frame is an RGB color frame, which is an image frame including pixel intensity information in red, green and blue color channels.
  • a depth image frame or depth map includes a two-dimensional representation of image pixels that contains, for each pixel, a depth value. The depth values correspond to the distance of the surfaces of scene objects from a camera viewpoint. The color image frame and the corresponding depth image frame of each image are aligned pixel-wise.
  • the images in the training dataset may be acquired (in a physical and/or simulation environment) via an RGB-D camera, which may be configured to acquire an image with red-green-blue (RGB) color and depth (D) channels.
  • images in the training dataset can also be acquired (in a physical and/or simulation environment) using a point cloud sensor.
  • a point cloud may include a set of points in a 3D coordinate system that represent a 3D surface or multiple 3D surfaces, where each point position is defined by its Cartesian coordinates in a 3D reference frame, and further by intensity values of color components (e.g., red, green and blue).
  • a point cloud can thus include a colorized 3D representation of all surfaces in the respective scene.
  • the point cloud can be converted into color (RGB) and depth frames by applying a sequence of transforms based on the camera intrinsic parameters.
  • Camera intrinsic parameters are parameters that allow a mapping between pixel coordinates in the 2D image frame and the 3D reference frame.
  • the camera intrinsic parameters include the coordinates of the principal point or optical center, and the focal length along orthogonal axes.
  • the depth frames 202 of the training images may be provided as input to a first encoder 206 and the corresponding color frames 204 may be provided as input to a second encoder 208.
  • the first encoder 206 can include one or more convolutional blocks configured to down sample each depth image frame 202 to learn a first set of features.
  • the first encoder 206 may include a series of convolutional blocks 206a, 206b, 206c that successively down sample the input to extract a feature map at each respective convolutional block 206a, 206b, 206c.
  • the second encoder 208 can likewise include one or more convolutional blocks configured to down sample each color image frame 204 to learn a second set of features.
  • the second encoder 208 may correspondingly include a series of convolutional blocks 208a, 208b, 208c that successively down sample the input to extract a feature map at each respective convolutional block 208a, 208b, 208c.
  • a “convolutional block” is a building block used in a convolutional neural network (CNN) for image recognition. It may be made up of one or more convolutional layers, which are used to extract features from the input image. The convolutional layers are typically followed by one or more pooling layers, which are used to reduce the spatial dimensions of the feature maps while retaining the most important information.
  • each of the convolutional blocks 206a, 206b, 206c, 208a, 208b, 208c may include an inverted residual bottleneck (IRB) module for extracting the feature maps.
  • An inverted residual bottleneck (or inverted residual block) is a type of residual block used for image models that uses an inverted structure for efficiency reasons and is therefore lightweight and suitable for deployment on low compute platforms. Briefly described, an IRB includes a narrow-wide-narrow architecture.
  • an IRB can follow an approach in which the input is first widened with a 1x1 convolution, then processed with a 3x3 depthwise convolution (which greatly reduces the number of parameters), and finally passed through a 1x1 convolution to reduce the number of channels so that input and output can be added.
  • the convolutional stem 206a, 208a (i.e., the first convolutional block that the input is passed through) may be designed differently from the other convolutional blocks in that it includes an extra convolutional layer in addition to an IRB module, and this extra layer performs the down sampling instead of the IRB module.
  • This new tweak can instantly force the network to start learning the most relevant features right from the input itself, as it is already being forced to compress information.
  • a leaky ReLU nonlinearity may be used as the activation function. This can help to prevent gradients from dying out when training the GNN 200.
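  • A minimal PyTorch sketch of such an inverted residual bottleneck with a leaky ReLU activation is given below; the channel count, expansion ratio and normalization layers are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class InvertedResidualBlock(nn.Module):
    """Narrow-wide-narrow inverted residual bottleneck (illustrative sketch)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # 1x1 widen
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1, inplace=True),                         # leaky ReLU activation
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # 1x1 narrow back
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual add: input and output shapes match
```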
  • the disclosed GNN 200 utilizes a fusion module defined by a third encoder 210.
  • the third encoder 210 may also include a series of convolution blocks 210a, 210b, 210c.
  • Feature maps extracted at the corresponding convolutional blocks of the first encoder 206 and the second encoder 208, such as feature maps extracted at convolutional blocks 206a and 208a, feature maps extracted at convolutional blocks 206b and 208b, and feature maps extracted at convolutional blocks 206c and 208c, may be fused via respective convolutional blocks 210a, 210b, 210c of the third encoder 210.
  • the features extracted at the last convolutional blocks of the encoders 206, 208, 210 may be suitably aggregated by an aggregator 210 to produce a fused feature map 213.
  • the aggregator 210 may include one or more computational modules, such as a concatenation and addition module and an ASPP (Atrous Spatial Pyramid Pooling) module, among others.
  • the shown encoder architecture is illustrative, noting that the number of convolutional layers for each encoder 206, 208, 210 and the computational modules employed is a matter of design choice for one skilled in the art.
  • the number of convolutional blocks in series may be determined such that a desired pixel size is achieved for the fused feature map 213 to be processed by the vision transformer 214.
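  • The sketch below illustrates the fusion idea for one pair of corresponding feature maps; a simple element-wise addition followed by a convolution is shown, whereas the disclosed aggregator may additionally use concatenation and ASPP modules. All layer choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    """Fuse same-resolution depth and color feature maps (illustrative sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, depth_feat: torch.Tensor, color_feat: torch.Tensor) -> torch.Tensor:
        # Addition boosts features present in both modalities (e.g., shared edges)
        # and attenuates modality-specific noise, per the reasoning described above.
        return self.mix(depth_feat + color_feat)
```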
  • Transformers are deep learning models that are capable of training on large sets of data through the now well-known concept of self-attention, to detect subtle ways that elements (“tokens”) in a sequence relate to each other. These architectures have helped power the recent revolution in generative AI, such as ChatGPT, due to their ability to model long term relationships through the attention mechanism. Attention, for a text sentence, allows each word (represented as a “token”) to understand how closely related it is to every other word in the sentence through dot products between their weight matrices. Even if the text sentence is very long, the model can compute this set of affinities for each word due to its highly parallel nature.
  • a “token” refers to a vector representation of a patch.
  • the input tokens may be produced by transforming each patch into a vector by an embedding process (typically at an initial layer of the transformer).
  • the fused feature map 213 is spatially divided into a number of non-overlapping patches, that define a sequence of input tokens for the vision transformer 214.
  • the vision transformer 214 may include one or more transformer blocks that can compute patch-based attention to encode each of the patches based on information from other patches in the fused feature map 213.
  • Patch-based attention may be computed in a number of ways, as illustrated in FIG. 4.
  • the block (A) illustrates a first approach of attention computation by computing an attention matrix in a traditional manner
  • the block (B) illustrates a second approach for light-weight attention computation according to one or more embodiments of this disclosure.
  • here, n denotes the number of input tokens (i.e., the number of patches).
  • each input token may be mapped to a query token 1, 2, 3 using a first weight matrix, a key token 1’, 2’, 3’ using a second weight matrix, and to a value token (not shown) using a third weight matrix.
  • An attention matrix a is then computed by computing an inner product (dot product) of each query token 1, 2, 3 with each key token 1’, 2’, 3’.
  • the second approach may be employed.
  • the concept of “separable attention” (described in the publication: Sachin Mehta and Mohammad Rastegari. Separable Self-attention for Mobile Vision Transformers. In Transactions on Machine Learning Research, 2023) may be suitably modified and adapted to the present problem, such that instead of computing the attention score for each input token with respect to all n tokens, a context score is computed for each token into a single latent token L.
  • each input token 1, 2, 3 may be mapped respectively to a scalar context score c_s^(1,L), c_s^(2,L), c_s^(3,L) using a linear layer L defined by a set of weights that are learned during the training process.
  • To compute the context scores, an inner product of each input token 1, 2, 3 with the linear layer L may first be computed, resulting in an n-dimensional vector. This inner product operation computes an affinity between L and the input token.
  • a SoftMax operation may then be applied to this n-dimensional vector to produce the context scores c_s^(1,L), c_s^(2,L), c_s^(3,L).
  • the key tokens 1’, 2’, 3’ are also generated using a learnable weight matrix and may then be scaled by the respective context scores c_s^(1,L), c_s^(2,L), c_s^(3,L).
  • the scaled key tokens may then undergo a weighted summation to compute a context vector c_v, which encodes information from all the input tokens and, at the same time, is much cheaper to compute.
  • the encoding for each patch may then be computed by multiplying the context vector c_v with the matrix representing the value tokens of the patches. This approach can directly reduce the complexity of computing the patch-based attention from O(n²) to O(n), which can bring a dramatic speedup in inference time while still maintaining the benefits of learning complex relationships within an image.
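  • A compact PyTorch sketch of this linear-complexity attention, adapted from the separable self-attention formulation cited above, is given below; layer names and the ReLU on the value branch are illustrative choices rather than the patent’s exact implementation.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """O(n) attention over n patch tokens via context scores and a context vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_context = nn.Linear(dim, 1)   # linear layer L: one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, n, dim)
        scores = torch.softmax(self.to_context(tokens), dim=1)  # (B, n, 1) context scores c_s
        keys = self.to_key(tokens)                               # (B, n, dim) key tokens
        context = (scores * keys).sum(dim=1, keepdim=True)       # (B, 1, dim) context vector c_v
        values = torch.relu(self.to_value(tokens))               # (B, n, dim) value tokens
        return self.out(context * values)                        # broadcast c_v over all tokens
```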
  • the vision transformer 214 may include a series of transformer blocks 214a, 214b, 214c, each including an attention layer for computing patch-based attention for encoding the patches, to generate the transformer output 216 of the vision transformer 214.
  • the transformer output 216 is fed to a decoder 218 to construct a grasp map 220.
  • the input to the decoder 218 may additionally include a dimension of the gripper or end effector of the robot being used for the task.
  • the dimension of the gripper may be defined by a pixel dimension computed based on the actual gripper dimension (e.g., expressed in a unit of distance measurement) and the depth image frame 202.
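  • One plausible way to perform this conversion is a pinhole-model projection of the physical gripper dimension to pixels at the working depth, as sketched below; the patent itself only states that the pixel dimension is computed from the actual gripper dimension and the depth image frame.

```python
def gripper_width_in_pixels(width_m: float, depth_m: float, focal_px: float) -> int:
    """Project a physical gripper width (metres) to pixels at the given depth
    (pinhole camera model; an illustrative assumption)."""
    return max(1, round(focal_px * width_m / depth_m))

# e.g. a 40 mm suction cup seen at 0.8 m with a 600 px focal length spans about 30 px:
# gripper_width_in_pixels(0.040, 0.8, 600.0) -> 30
```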
  • the decoder 218 may include a number of convolutional layers (not explicitly shown) configured to up-sample the transformer output 216 and construct a grasp map 220 depicting pixel-wise grasp confidence values identifying grasping points for objects in the scene depicted in the image.
  • the decoder 218 can be, for example, a U-Net style decoder that includes skip connections from the fusion encoder 210 to the decoder 218 to better aggregate information.
  • the GNN 200 may be trained in a supervised learning process by backpropagating a grasping loss of the decoder 218.
  • the grasping loss may be computed based on ground truth grasp maps for the images in the training dataset.
  • the ground truth grasp maps may include pixel labels for each pixel of the input image indicative of a grasp confidence value.
  • the ground truth pixel labels may be assigned based on hand labeling of the images (e.g., by a user locating one or more pick points in the image via a GUI), or may be automatically generated in case of synthetic images.
  • the grasping loss may be computed as a cross-entropy loss of the decoder 218, using the ground truth grasp maps.
  • the training may include processing batches of images from the training dataset to compute a grasping loss and repeatedly updating parameters (e.g., weights and biases) of the GNN 200 until the grasping loss is minimized, for example, based on a method of gradient descent.
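  • A minimal sketch of one such training iteration is shown below; the data loader, optimizer and the binary cross-entropy variant of the grasping loss are illustrative assumptions consistent with the cross-entropy loss described above.

```python
import torch.nn as nn

def train_epoch(gnn, loader, optimizer):
    """One pass over the training dataset (illustrative sketch).
    `loader` yields aligned depth frames, color frames and ground truth grasp maps."""
    loss_fn = nn.BCEWithLogitsLoss()        # pixel-wise cross-entropy-style grasping loss
    for depth, color, gt_grasp_map in loader:
        optimizer.zero_grad()
        pred_grasp_map = gnn(depth, color)  # (B, 1, H, W) logits
        loss = loss_fn(pred_grasp_map, gt_grasp_map)
        loss.backward()                     # backpropagate the grasping loss
        optimizer.step()                    # gradient-descent parameter update
```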
  • A variant of the above-described architecture is illustrated in FIG. 3. Like elements identified by like reference numerals will not be described again. As shown, a difference between the present architecture and the one described in FIG. 2 is that the GNN 300 includes two decoding heads instead of one, i.e., a second decoder 302 in addition to the first decoder 218. In this case, the transformer output 216 is branched off and fed as an input to both of the decoders 218 and 302. In some embodiments, the input to the first decoder 218 may additionally include a dimension of the gripper or end effector of the robot being used for the task, as described above.
  • the second decoder 302 may also include a number of convolutional layers (not explicitly shown) configured to up-sample the transformer output 216 and produce as output a segmentation map 304 depicting pixel-wise segmentation of objects in the scene represented in the image.
  • the shown decoder architecture, which is exemplary, includes a U-Net style decoder with skip connections from the fusion encoder 210 to both the decoders 218 and 302 (illustrated by dashed lines) to better aggregate information.
  • the GNN 300 may be trained in a supervised learning process using a loss function defined by a combination of the grasping loss of the first decoder 218 and a segmentation loss of the second decoder 302.
  • the grasping loss of the first decoder 218 may be computed based on ground truth grasp maps for the images in the training dataset, as described above.
  • the segmentation loss of the second decoder 302 may be computed based on ground truth segmentation maps for the images in the training dataset.
  • the ground truth segmentation maps for the second decoder 302 may include pixel labels for each pixel of the input image indicative of whether the pixel belongs to an object in the scene.
  • the ground truth pixel labels may be assigned based on hand labeling of the images (e.g., by a user localizing objects in the scene via a GUI), or may be automatically generated in case of synthetic images.
  • the ground truth segmentation maps for the second decoder 302 may be generated by computing, for each image, a binary segmentation mask which segments all objects in the scene from a background and further defines boundaries separating objects that touch each other.
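  • For the two-headed GNN 300, the combined training objective could be formed as sketched below; the specific loss functions and the weighting factor are assumptions, as the patent only specifies a combination of the grasping loss and the segmentation loss.

```python
import torch.nn.functional as F

def combined_loss(grasp_pred, grasp_gt, seg_pred, seg_gt, seg_weight: float = 1.0):
    """Grasping loss of the first decoder plus weighted segmentation loss of the
    second decoder (illustrative sketch)."""
    grasp_loss = F.binary_cross_entropy_with_logits(grasp_pred, grasp_gt)
    seg_loss = F.binary_cross_entropy_with_logits(seg_pred, seg_gt)  # binary object mask
    return grasp_loss + seg_weight * seg_loss
```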
  • the camera 122 captures an image of a pick scene including the bin 120 containing objects 118.
  • the camera image is processed by the trained GNN, which may be deployed on the computing system 104.
  • depth and color frames of the camera image may be passed separately through convolutional blocks 206a-c, 208a-c of the trained GNN for extracting feature maps via down sampling of the respective image frames.
  • the feature maps extracted from the depth and color frames may be fused to produce a fused feature map, e.g., via fusion encoder 210.
  • the fused feature map may be spatially divided into patches and fed as input to the vision transformer 214 of the trained GNN, which can encode the patches based on information from other patches in the fused feature map.
  • the output of the vision transformer 214 may then be fed to the decoder 218 of the trained GNN to construct a grasp map. An optimal grasping location or pick point is estimated utilizing the grasp map.
  • the pick point may be computed by applying an argmax operator on the grasp map output, i.e., determining the pixel with the highest grasp confidence value in the grasp map.
  • the computed pick point may be transformed from the 2D pixel frame of the grasp map to the 3D real-world reference frame 130 based on the depth frame and the camera intrinsic parameters of the camera 122, using a sequence of known transforms.
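  • The camera-frame part of this transform is the standard pinhole deprojection sketched below; converting further into the robot reference frame 130 would additionally use the camera extrinsics. Parameter names are illustrative.

```python
import numpy as np

def pixel_to_camera_frame(u: int, v: int, depth_map: np.ndarray,
                          fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Deproject a grasp-map pixel (u = column, v = row) into camera-frame X, Y, Z
    using the depth frame and the camera intrinsic parameters."""
    z = float(depth_map[v, u])
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```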
  • the estimated pick point, which may include X, Y, Z coordinates and a direction of approach, is outputted to the robot controller 108 to control the robotic end effector 116 to execute the grasp.
  • FIG. 5 demonstrates the robustness of the GNN output in an extremely difficult scenario with tightly packed objects of the same height placed in a bin.
  • the image 502 depicts the input depth image frame
  • the image 504 depicts the input RGB image frame
  • the image 506 depicts the grasp map output of the GNN
  • the image 508 depicts the ground truth.
  • the grasp map 506 has been post-processed to highlight pixels with grasp confidence values higher than a threshold.
  • the GNN has acquired the exceptional ability to produce centered grasping points on each of the objects even though the depth image frame 502 does not give away much in terms of boundary information amongst the very tightly packed objects.
  • the final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 506.
  • FIG. 6 demonstrates the robustness of the GNN output in a tricky scenario where there is a translucent fluid over an object in the bin.
  • the image 602 depicts the input depth image frame
  • the image 604 depicts the input RGB image frame
  • the image 606 depicts the grasp map output of the GNN
  • the image 608 depicts the segmentation map.
  • the grasp map 606 has been post-processed to highlight pixels with grasp confidence values higher than a threshold.
  • the GNN still managed to provide the best (centered) pick point even though the depth image frame 602 is corrupted due to the translucent liquid.
  • the final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 606.
  • FIG. 7 demonstrates the robustness of the GNN output in a popout prone scenario.
  • the image 702 depicts the input depth image frame
  • the image 704 depicts the input RGB image frame
  • the image 706 depicts the grasp map output of the GNN
  • the image 708 depicts the segmentation map.
  • the grasp map 706 has been post-processed to highlight pixels with grasp confidence values higher than a threshold.
  • the GNN clearly produced very low confidence grasps for objects that it determined were occluded or not oriented correctly, and gave the best position on the most visible object that is easy to grasp.
  • the final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 706.
  • the architecture can allow for the ability to deal with ambiguity in the model’s grasp predictions.
  • the highest grasp confidence value in the grasp map may fail to reach a threshold. If this happens, the GNN may be unable to produce a well-defined pick point on an object in the grasp map.
  • the estimation of the pick point may be informed by a combination of the grasp map output and the segmentation map output.
  • the segmentation map output, which has proven to be very reliable, can help guide the choice since the bounds and area of the object are known from it.
  • the final choice of the pick point may be determined, for example, based on a user input.
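  • One illustrative fallback policy along these lines is sketched below: if no grasp pixel clears the confidence threshold, the search is restricted to pixels that the segmentation map assigns to objects. This policy is an assumption for illustration; as noted above, the final choice may also be based on a user input.

```python
import numpy as np

def pick_point_with_fallback(grasp_map: np.ndarray, seg_map: np.ndarray, thresh: float = 0.5):
    """Return (row, col) of the best grasp pixel, falling back to the segmentation
    mask when no pixel reaches the confidence threshold (illustrative sketch)."""
    if grasp_map.max() >= thresh:
        return np.unravel_index(np.argmax(grasp_map), grasp_map.shape)
    masked = np.where(seg_map > 0, grasp_map, -np.inf)  # only consider pixels on objects
    return np.unravel_index(np.argmax(masked), masked.shape)
```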
  • the performance of the proposed GNN architecture on low compute platforms has also been demonstrated.
  • the runtime performance of the GNN 300 turns out to be about 500 ms on average, using the OpenVINO toolkit to port and run it on the TM MFP, which possesses only a CPU without any dedicated GPU acceleration.
  • most state-of-the-art vision transformers would not even fit on the memory of this hardware platform.
  • this hardware platform currently has comparable or lower compute capabilities than a regular laptop CPU, which underscores just how much efficiency the proposed GNN architecture was able to achieve for the level of performance it delivers.
  • FIG. 8 illustrates an exemplary computing environment comprising a computing system 802, within which aspects of the present disclosure may be implemented.
  • the computing system 802 may be embodied, for example and without limitation, as an industrial PC with a Linux operating system, for executing real-time control of a physical device, such as a robot.
  • the computing system 802 may include a communication mechanism such as a system bus 804 or other communication mechanism for communicating information within the computing system 802.
  • the computing system 802 further includes one or more processors 806 coupled with the system bus 804 for processing the information.
  • the processors 806 may include one or more central processing units (CPUs), graphical processing units (GPUs), AI accelerators, or any other processor known in the art.
  • the computing system 802 also includes a system memory 808 coupled to the system bus 804 for storing information and instructions to be executed by processors 806.
  • the system memory 808 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 810 and/or random access memory (RAM) 812.
  • the system memory RAM 812 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM).
  • the system memory ROM 810 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM).
  • system memory 808 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 806.
  • a basic input/output system 814 (BIOS) containing the basic routines that help to transfer information between elements within computing system 802, such as during start-up, may be stored in system memory ROM 810.
  • System memory RAM 812 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 806.
  • System memory 808 may additionally include, for example, operating system 816, application programs 818, other program modules 820 and program data 822.
  • the computing system 802 also includes a disk controller 824 coupled to the system bus 804 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 826 and a removable media drive 828 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive).
  • the storage devices may be added to the computing system 802 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
  • the computing system 802 may also include a display controller 830 coupled to the system bus 804 to control a display 832, such as a cathode ray tube (CRT) or liquid crystal display (LCD), among others, for displaying information to a computer user.
  • the computing system 802 includes a user input interface 834 and one or more input devices, such as a keyboard 836 and a pointing device 838, for interacting with a computer user and providing information to the one or more processors 806.
  • the pointing device 838 for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the one or more processors 806 and for controlling cursor movement on the display 832.
  • the display 832 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 838.
  • the computing system 802 also includes an I/O adapter 846 coupled to the system bus 804 to connect the computing system 802 to a controllable physical device, such as a robot.
  • the I/O adapter 846 is connected to robot controller 848.
  • the robot controller 848 may include one or more motors for controlling linear and/or angular positions of various parts (e.g., arm, base, etc.) of a robot.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Orthopedic Medicine & Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Fuzzy Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method for executing robotic grasps includes acquiring an image of a scene with one or more objects and processing the acquired image using a trained grasping neural network. Depth and color frames of the image are passed separately through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames. The feature maps extracted from the depth and color frames are fused to produce a fused feature map. The fused feature map is spatially divided into patches and fed as input to a vision transformer to encode the patches based on information from other patches in the fused feature map. The output of the vision transformer is fed to a decoder to construct a grasp map of the scene. An optimal grasping location is estimated from the grasp map.

Description

ROBOTIC GRASPING USING EFFICIENT VISION TRANSFORMER
TECHNICAL FIELD
[0001] The present disclosure relates generally to the field of robotics in industrial automation, and more specifically to methods and systems for learning and executing robotic grasps utilizing neural networks.
BACKGROUND
[0002] Robotic bin picking of unknown objects is a crucial component in advancing automation in warehouses and manufacturing lines. Recent advances in computer vision and deep learning have enabled this technology to be deployed at large scales for picking objects of all geometries and in any arrangement. Camera systems, such as RGB-D cameras, may collect both color pictures and depth maps or point-clouds of bins with objects in random configurations. The camera input may then be transferred to deep neural networks (“grasping neural networks”) that have been trained to compute optimal grasping locations or “pick points” based on said input.
[0003] The success of a robotic bin picking solution may be measured in customer-focused KPIs, such as the number of operator interventions and the number of successful picks per hour. It is desirable for the number of operator resets to be minimized and for picks per hour to be maximized, to allow continuous running of the deployed system and to maximize the throughput of goods. To enable this, however, it may be necessary to ensure that a grasping neural network not only delivers a valid pick point, but that the pick point is also centered regardless of geometry and is located on an object that is totally un-occluded (not blocked by any other object). This may not be an easy task because of potential issues such as ‘double picks’ (when a pick point is on the boundary between two objects, leading to over-picking) and ‘popouts’ (when a pick point is on an object that is occluded and might fling another object out of the bin when being picked at high speeds). Furthermore, there can be a near infinite variation in object configurations that must be handled, along with the fact that a grasping neural network is often deployed at the edge on resource-constrained devices. Lastly, there is a chance that an input image frame could be of poor quality. All these factors must be considered when designing an end-to-end grasping neural network whose role is to output the best possible pick point given the extremely high variability in object type and arrangement.
SUMMARY
[0004] Aspects of this disclosure provide methods, systems, and computer program products that address and overcome one or more of the above-described technical challenges. The present disclosure is directed to an end-to-end deep learning-based grasping pipeline that can provide robust, centered and un-occluded pick points on unknown objects in random configurations and that can be run efficiently on low compute platforms, such as Edge devices.
[0005] A first aspect of this disclosure provides a computer-implemented method for executing robotic grasps. The method comprises acquiring, via a camera, an image of a scene including one or more objects, the image defined by a depth image frame and a color image frame. The method further comprises processing the acquired image using a trained grasping neural network. The processing comprises passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames. The processing further comprises fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map. The processing further comprises feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map. The processing further comprises feeding an output of the vision transformer to a decoder to construct a grasp map of the scene. The method further comprises estimating an optimal grasping location based on the grasp map.
[0006] A second aspect of this disclosure provides a computer-implemented method for learning robotic grasping. The method comprises inputting images from a training dataset to a grasping neural network comprising an encoder-decoder architecture, each image comprising a color image frame and a corresponding depth image frame depicting a scene including one or more objects placed in random configurations. The method further comprises, for each image, passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames. The method further comprises, for said image, fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map. The method further comprises, for said image, feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map. The method further comprises, for said image, feeding an output of the vision transformer to a first decoder to construct a grasp map of the scene. The method comprises training the grasping neural network based on a grasping loss of the first decoder, the grasping loss being computed based on ground truth grasp maps for the images in the training dataset.
[0007] Further aspects of this disclosure provide systems and computer program products embodying the described methods.
[0008] Additional technical features and benefits may be realized through the techniques of the present disclosure. Embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The foregoing and other aspects of the present disclosure are best understood from the following detailed description when read in connection with the accompanying drawings. To easily identify the discussion of any element or act, the most significant digit or digits in a reference number refer to the figure number in which the element or act is first introduced.
[0010] FIG. 1 schematically illustrates an autonomous system configured for executing a robotic grasp for a bin picking application, according to one or more embodiments.
[0011] FIG. 2 schematically illustrates an encoder-decoder neural network architecture incorporating a vision transformer for robotic grasping according to one or more embodiments.
[0012] FIG. 3 schematically illustrates a further development of the neural network architecture shown in FIG. 2 to include a segmentation decoder in addition to a grasp decoder, according to one or more embodiments.
[0013] FIG. 4 illustrates a comparison between traditional compute-heavy attention and light-weight attention computation in a vision transformer according to disclosed embodiments.
[0014] FIG. 5, FIG. 6 and FIG. 7 illustrate outputs of a grasping neural network according to disclosed embodiments under different example scenarios.
[0015] FIG. 8 illustrates a computing environment within which embodiments of this disclosure may be implemented.
DETAILED DESCRIPTION
[0016] Various technologies are described herein that are directed to robotic grasping of objects in industrial applications. The objects may be placed in a bin, or otherwise disposed, such as on a table. Without loss of generality, the proposed methodology is described in the context of a robotic bin picking application. The term “bin”, as used herein, refers to a container or other structure (e.g., a tray, tote, pallet, carton, etc.) capable of receiving physical objects. A robotic bin picking application typically involves controlling a robot having a robotic arm with end effector to grasp objects individually from a pile of objects disposed in a bin. The objects may be of the same or assorted types, and may, for example, be disposed in random configurations or poses in the bin.
[0017] To execute a robotic grasp, an optimal grasping location (also referred to herein as a “pick point”) may be computed from a camera image of the pick scene that includes the bin with the objects, using a grasping neural network. Grasping neural networks are typically convolutional, such that given an input image (typically including a depth frame), the network can output a grasp map that assigns to each pixel of the input image a grasp score indicative of a confidence of grasp. Based on the grasp map, an optimal grasp location or pick point may be computed, typically by applying an ‘argmax’ operator on the grasp map, i.e., selecting the pixel with the highest grasp confidence.
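By way of illustration only, the following Python (NumPy) sketch shows the ‘argmax’ selection of a pick point from a grasp map as described above; the array shape and variable names are assumptions made for this example, not part of the disclosed embodiments.

    import numpy as np

    def select_pick_point(grasp_map: np.ndarray):
        # Return the (row, col) pixel with the highest grasp confidence,
        # together with that confidence value.
        row, col = np.unravel_index(np.argmax(grasp_map), grasp_map.shape)
        return int(row), int(col), float(grasp_map[row, col])

    # Usage with a hypothetical 480x640 grasp map:
    grasp_map = np.random.rand(480, 640)
    r, c, confidence = select_pick_point(grasp_map)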
[0018] The proposed methodology provides an improved deep learning-based grasping pipeline that can provide robust, centered and un-occluded pick points on unknown objects in random configurations and that can be run efficiently on low compute platforms, such as Edge devices. This objective is achieved by redesigning the grasping neural network architecture to truly understand the semantic and geometric properties of a pick scene and all the objects in it. The pipeline uses a depth image frame and a color image frame as input to the grasping neural network. The grasping neural network has an encoder-decoder architecture and utilizes an efficient vision transformer with self-attention to learn the complex relationships between objects in a scene and the full extent of an object’s geometry.
[0019] The proposed methodology is based on an inventive mechanism to fuse the features extracted from the input depth and color image frames to enable the maximum amount of information sharing between these two modalities and to consistently produce desired results, especially in case one of the input modalities is sub-optimal. This is important because there are complementary as well as non-complementary features in the depth and color image frames. For example, textures on objects in a color image will not be present in a depth image, while the edges in a color image should match the same edges in the depth image.
[0020] Traditionally, the input to a vision transformer is defined by dividing an input image into a number of non-overlapping patches, which are fed as a sequence of input tokens to the transformer. To illustrate, an image of 8x8 pixels can be broken up into 16 patches of 2x2 pixels. The input token sequence then goes through a self-attention process at a transformer attention layer. This approach can pose some challenges, especially in the context of autonomous robotic bin picking, including, among others, computational complexity and the fact that, unlike words in a sentence, an image can start to lose semantic meaning once it is cut up into patches.
[0021] According to the proposed methodology, the input depth and color image frames are first down sampled by convolutional blocks of an encoder to learn a set of features (feature maps) that are fused to produce a fused feature map. The fused feature map is spatially divided into patches and passed on to a vision transformer. By first down sampling the depth and color image frames, it may be possible to aggregate a lot of features from the images, enabling the transformer attention layer to better understand spatial relationships between the patches in the fused feature map, and thus better understand object boundaries, long objects, etc., using less training data. In general, the proposed methodology can enable the vision transformer to understand pick scenes with randomly oriented objects better with each patch being encoded with information about every other patch in the fused feature map. Furthermore, the step of prior down sampling can also reduce computational complexity of the transformer, which may be particularly beneficial during inference time.
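As a rough illustration of how a down sampled, fused feature map might be divided into non-overlapping patches (tokens) for the transformer, the following PyTorch sketch uses an unfold operation; the channel count, spatial size and patch size are hypothetical and not the specific dimensions of the disclosed embodiments.

    import torch
    import torch.nn.functional as F

    def patchify(fmap: torch.Tensor, patch: int = 2) -> torch.Tensor:
        # Split a fused feature map of shape (B, C, H, W) into non-overlapping
        # patches and flatten each patch into a token of length C*patch*patch.
        b, c, h, w = fmap.shape
        assert h % patch == 0 and w % patch == 0
        tokens = F.unfold(fmap, kernel_size=patch, stride=patch)  # (B, C*p*p, n)
        return tokens.transpose(1, 2)                             # (B, n, C*p*p)

    fused = torch.randn(1, 96, 32, 32)   # hypothetical fused feature map
    tokens = patchify(fused)             # (1, 256, 384): 256 patch tokens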
[0022] The architecture of the proposed methodology may be understood as following a wishbone or Y-shaped structure, where each input (i.e., the depth image and the color image) first goes through its own set of processing with the convolutional blocks, and the two are then processed together with the transformer attention. The reasoning behind the methodology may be as follows: first, each input is used to learn its own set of low-level features such as edges and contours; then, the addition of the feature maps extracted from these inputs boosts similar features while negating dissimilar ones; and finally, the vision transformer jointly attends to the best representation that both inputs provide. The transformer output is fed to a decoder to construct a grasp map of the pick scene, based on which an optimal grasping location is estimated. The methodology can provide an end-to-end solution because an output is directly produced with the desired result.
[0023] The grasping neural network according to the proposed methodology can produce highly centered picks on totally un-occluded objects, for any unknown object. This is a testament to the heightened awareness that the grasping neural network has of geometries and spatial relationships between objects. Moreover, due to the transformer’s attention and the inventive fusion mechanism, the grasping neural network can compensate when one of the input modalities is of poor quality, leading to much better generalization and robustness when the trained grasping neural network is deployed. Furthermore, the architecture can provide high computational efficiency, allowing it to be run on CPU-only edge devices, which greatly expands its range of applications. Still further, the
[0024] Aspects of the proposed methodology may be embodied as software executable by a processor. In some embodiments, aspects of the disclosed methodology may be suitably integrated into commercial artificial intelligence (AI)-based automation software products, such as SIMATIC Robot Pick AI™ developed by Siemens AG, among others.
[0025] Turning now to the drawings, FIG. 1 illustrates an autonomous system 100 configured for performing robotic bin picking according to one or more embodiments. In the following description, unless otherwise specified, the term “system” refers to the autonomous system 100. The system 100 may be implemented in a factory setting. In contrast to conventional automation, autonomy gives each asset on the factory floor the decision-making and self-controlling abilities to act independently in the event of local issues. The system 100 may comprise one or more robots, such as the robot 102, which may be controlled by a computing system 104 to execute one or more industrial tasks within a physical environment such as a shopfloor. Examples of industrial tasks include assembly, transport, or the like.
[0026] The computing system 104 may comprise an industrial PC, or any other computing device, such as a desktop or a laptop, or an embedded system, among others. The computing system 104 can include one or more processors configured to process information and/or control various operations associated with the robot 102. The processor(s) may include one or more CPUs, GPUs, microprocessors, or any hardware devices suitable for executing instructions stored on a memory comprising a machine-readable medium. In particular, the one or more processors may be configured to execute an application program, such as an engineering tool, for operating the robot 102.
[0027] To realize autonomy of the system 100, in some embodiments, the application program may be designed to operate the robot 102 to perform a task in a skill-based programming environment. In contrast to conventional automation, where an engineer is usually involved in programming an entire task from start to finish, typically utilizing low-level code to generate individual commands, in an autonomous system as described herein, a physical device, such as the robot 102, is programmed at a higher level of abstraction using skills instead of individual commands. The skills are derived for higher-level abstract behaviors centered on how the physical environment is to be modified by the programmed physical device. Illustrative examples of skills include a skill to grasp or pick up an object, a skill to place an object, a skill to open a door, a skill to detect an object, and so on.
[0028] The application program may generate controller code that defines a task at a high level, for example, using skill functions as described above, which may be communicated to a robot controller 108. From the high-level controller code, the robot controller 108 may generate low-level control signals for one or more motors for controlling the movement of the robot 102, such as the angular position of the robot arms, the swivel angle of the robot base, and so on, to execute the specified task. In other embodiments, the controller code generated by the application program may be communicated to intermediate control equipment, such as programmable logic controllers (PLC), which may then generate low-level control commands for the robot 102 to be controlled. Additionally, the application program may be configured to directly integrate sensor data from the physical environment in which the robot 102 operates. To this end, the computing system 104 may comprise a network interface to facilitate transfer of live data between the application program and various sensors, such as camera 122.
[0029] The robot 102 can include a robotic arm or manipulator 110 and a base 112 configured to support the robotic manipulator 110. The base 112 can include wheels 114 or can otherwise be configured to move within the physical environment 106. The robot 102 can further include an end effector 116 attached to the robotic manipulator 110. The end effector 116 may include a gripper configured to grasp (hold) and pick up an object 118. Examples of end effectors include vacuum-based grippers or suction cups (as shown), antipodal grippers such as fingers or claws, and magnetic grippers, among others. The robotic manipulator 110 can be configured to move so as to change the position of the end effector 116, to enable picking and moving objects 118 within the physical environment.
[0030] A robotic bin picking task may involve picking objects 118 in a singulated manner from the bin 120 containing the objects 118, using the end effector 116. The objects 118 may be disposed in random configurations (or poses) within the bin 120. The objects 118 can be of assorted types or of the same type. To accomplish this task, the system 100 may include sensors or cameras that enable the robot 102 to perceive the physical environment. As shown, these sensors may include (among others) a camera 122 for capturing an image of the pick scene that includes, in this case, the bin 120 containing the objects 118. The camera 122 may include, for instance, an RGB-D camera or a point cloud sensor, among others. In some embodiments, the camera 122 may be positioned to capture an image with a top-down view of the bin 120. The camera image may be provided as an input to a computing system, such as the computing system 104, for computing a pick point for the end effector 116 to execute a grasp. The pick point may be defined by coordinates of the end effector 116 in a 3D reference frame 130, as well as a direction of approach, which is typically defined by a normal to the object surface on which the pick point lies. The pick point may be computed using a grasping neural network according to the proposed methodology. The computed pick point may be outputted to a controller, such as the robot controller 108, to control the end effector 116 to pick an object 118. For example, as described above, the pick point may be output as high-level controller code to the controller, which may therefrom generate low-level commands to control movement of the end effector 116.
[0031] FIG. 2 illustrates an architecture of a grasping neural network (hereinafter “GNN”) 200 according to one or more embodiments. The various components of the GNN, such as the encoders 206, 208 and 210 and the decoder 218, including sub-components thereof, may be implemented by a computing system in various ways as hardware and programming. The programming for the components may take the form of processor-executable instructions stored on non-transitory machine-readable storage media, and the hardware may include processors to execute those instructions. For example, the programs may run on the computing system 104 shown in FIG. 1. Furthermore, the processing capability may be distributed among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems or cloud/network elements.
[0032] To train the GNN 200, a training dataset is utilized that comprises images where each image depicts a pick scene including one or multiple objects disposed in random configurations in a bin or on a table. In various embodiments, the training dataset may include real-world camera images of a pick scene, or synthetic camera images rendered by a simulation engine using 3D representations (e.g., CAD models) of the objects, bin/table and other physical structures present in the pick scene, or a combination of real-world and synthetic images. A non-limiting example of a commercially available 3D simulation engine suitable for the present application is Process Simulate™ developed by Siemens Industry Software Inc. In some embodiments, the synthetic images of the training dataset may be rendered by randomizing a set of environmental and/or sensor-based parameters. Examples of parameters that can be randomized include one or more of: number and object classes of objects present in the pick scene, position and texture of the objects in the bin, texture of background including the bin, settings of simulation camera, lighting conditions of the scene, etc. In some embodiments, the training may comprise initially training the GNN 200 using a large number of synthetic images generated using randomized parameters and subsequently fine-tuning it using a smaller number of real-world images.
[0033] Each image in the training dataset comprises a color image frame and a depth image frame. A color image frame includes a two-dimensional representation of image pixels, where each pixel includes intensity values for a number of color components. An example of a color frame is an RGB color frame, which is an image frame including pixel intensity information in red, green and blue color channels. A depth image frame or depth map includes a two-dimensional representation of image pixels that contains, for each pixel, a depth value. The depth values correspond to the distance of the surfaces of scene objects from a camera viewpoint. The color image frame and the corresponding depth image frame of each image are aligned pixel-wise.
[0034] In some embodiments, the images in the training dataset may be acquired (in a physical and/or simulation environment) via an RGB-D camera, which may be configured to acquire an image with red-green-blue (RGB) color and depth (D) channels. Additionally, or alternatively, images in the training dataset can also be acquired (in a physical and/or simulation environment) using a point cloud sensor. A point cloud may include a set of points in a 3D coordinate system that represent a 3D surface or multiple 3D surfaces, where each point position is defined by its Cartesian coordinates in a 3D reference frame, and further by intensity values of color components (e.g., red, green and blue). A point cloud can thus include a colorized 3D representation of all surfaces in the respective scene. The point cloud can be converted into color (RGB) and depth frames by applying a sequence of transforms based on the camera intrinsic parameters. Camera intrinsic parameters are parameters that allow a mapping between pixel coordinates in the 2D image frame and the 3D reference frame. Typically, the camera intrinsic parameters include the coordinates of the principal point or optical center, and the focal length along orthogonal axes.
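The conversion from a colorized point cloud to aligned depth and color frames can be sketched as a standard pinhole projection, as below. This is a simplified illustration (no z-buffering or hole filling), and the function name, array layouts and intrinsic parameter names are assumptions, not the specific transforms of the disclosed embodiments.

    import numpy as np

    def project_point_cloud(points_xyz, colors, fx, fy, cx, cy, height, width):
        # Project 3D points (N, 3) with per-point RGB colors (N, 3) into an
        # aligned depth frame and color frame using pinhole camera intrinsics.
        depth = np.zeros((height, width), dtype=np.float32)
        color = np.zeros((height, width, 3), dtype=np.uint8)
        x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
        valid = z > 0
        u = np.round(fx * x[valid] / z[valid] + cx).astype(int)
        v = np.round(fy * y[valid] / z[valid] + cy).astype(int)
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        depth[v[inside], u[inside]] = z[valid][inside]
        color[v[inside], u[inside]] = colors[valid][inside]
        return depth, color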
[0035] Continuing with reference to FIG. 2, the depth frames 202 of the training images may be provided as input to a first encoder 206 and the corresponding color frames 204 may be provided as input to a second encoder 208. The first encoder 206 can include one or more convolutional blocks configured to down sample each depth image frame 202 to learn a first set of features. In some embodiments, as shown, the first encoder 206 may include a series of convolutional blocks 206a, 206b, 206c that successively down sample the input to extract a feature map at each respective convolutional block 206a, 206b, 206c. The second encoder 208 can likewise include one or more convolutional blocks configured to down sample each color image frame 204 to learn a second set of features. In some embodiments, as shown, the second encoder 208 may correspondingly include a series of convolutional blocks 208a, 208b, 208c that successively down sample the input to extract a feature map at each respective convolutional block 208a, 208b, 208c.
[0036] A “convolutional block” is a building block used in a convolutional neural network (CNN) for image recognition. It may be made up of one or more convolutional layers, which are used to extract features from the input image. The convolutional layers are typically followed by one or more pooling layers, which are used to reduce the spatial dimensions of the feature maps while retaining the most important information.
[0037] In some embodiments, each of the convolutional blocks 206a, 206b, 206c, 208a, 208b, 208c may include an inverted residual bottleneck (IRB) module for extracting the feature maps. An inverted residual bottleneck (or inverted residual block) is a type of residual block used for image models that uses an inverted structure for efficiency reasons and is therefore light-weight and suitable for deployment on low compute platforms. Briefly described, an IRB includes a narrow-wide-narrow architecture. For example, an IRB can follow an approach in which the input is first widened with a 1x1 convolution, then a 3x3 depthwise convolution is applied (which greatly reduces the number of parameters), and then a 1x1 convolution reduces the number of channels so that input and output can be added. For a detailed understanding of the concept of inverted residual blocks, the reader is referred to the publication: Sandler, M.; Howard, A.; Menglong Zhu; Zhmoginov, A.; Liang-Chieh Chen. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” (The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520).
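A minimal PyTorch sketch of an inverted residual bottleneck in this narrow-wide-narrow spirit is shown below; the expansion factor, normalization and activation choices are assumptions of this sketch (following the cited MobileNetV2 design), not the specific layers of the disclosed convolutional blocks.

    import torch
    import torch.nn as nn

    class InvertedResidual(nn.Module):
        # Narrow-wide-narrow block: 1x1 widen, 3x3 depthwise, 1x1 narrow,
        # with a residual connection when shapes allow it.
        def __init__(self, in_ch, out_ch, expand=4, stride=1):
            super().__init__()
            mid = in_ch * expand
            self.use_residual = (stride == 1 and in_ch == out_ch)
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid, 1, bias=False),           # 1x1: widen
                nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
                nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                          groups=mid, bias=False),              # 3x3 depthwise
                nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
                nn.Conv2d(mid, out_ch, 1, bias=False),          # 1x1: narrow
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            y = self.block(x)
            return x + y if self.use_residual else y

    x = torch.randn(1, 32, 56, 56)
    y = InvertedResidual(32, 32)(x)   # same shape, residual added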
[0038] Leveraging the use of IRBs, in some embodiments, the convolutional stems 206a, 208a (i.e., the first convolutional blocks that the depth and color image frames, respectively, are passed through) may be designed differently from the other convolutional blocks in that each includes an extra convolutional layer in addition to an IRB module, with this extra layer performing the down sampling instead of the IRB module. This tweak forces the network to start learning the most relevant features right from the input itself, as it is already being forced to compress information there.
[0039] Furthermore, in some embodiments, each time a feature map is down sampled at a convolutional block, a leaky ReLU nonlinearity may be used as the activation function. This can help to prevent gradients from dying out when training the GNN 200.
[0040] To fuse the features extracted from the depth image frame 202 and the color image frame 204, the disclosed GNN 200 utilizes a fusion module defined by a third encoder 210. As shown, the third encoder 210 may also include a series of convolutional blocks 210a, 210b, 210c. Feature maps extracted at the corresponding convolutional blocks of the first encoder 206 and the second encoder 208, such as feature maps extracted at convolutional blocks 206a and 208a, feature maps extracted at convolutional blocks 206b and 208b, and feature maps extracted at convolutional blocks 206c and 208c, may be fused via the respective convolutional blocks 210a, 210b, 210c of the third encoder 210. The features extracted at the last convolutional blocks of the encoders 206, 208, 210 may be suitably aggregated by an aggregator 210 to produce a fused feature map 213. For example, the aggregator 210 may include one or more computational modules, such as a concatenation and addition module and an ASPP (Atrous Spatial Pyramid Pooling) module, among others. The shown encoder architecture is illustrative, noting that the number of convolutional layers for each encoder 206, 208, 210 and the computational modules employed are a matter of design choice for one skilled in the art. For example, the number of convolutional blocks in series may be determined such that a desired pixel size is achieved for the fused feature map 213 to be processed by the vision transformer 214.
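The following sketch illustrates one possible fusion step in such a third encoder branch, where the depth and color feature maps of a stage are added (boosting similar features while negating dissimilar ones) and then refined by a convolution. The channel widths, the carry-over of a running fusion stream, and the layer choices are hypothetical, not the disclosed aggregator.

    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        # Fuse the per-stage depth and color feature maps by addition,
        # then refine the result with a small convolutional layer.
        def __init__(self, ch):
            super().__init__()
            self.refine = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(ch),
                nn.LeakyReLU(0.1, inplace=True),
            )

        def forward(self, f_depth, f_color, f_prev=None):
            fused = f_depth + f_color
            if f_prev is not None:      # carry the running fusion stream along
                fused = fused + f_prev
            return self.refine(fused)

    # Usage with hypothetical per-stage feature maps:
    f_d = torch.randn(1, 64, 56, 56)
    f_c = torch.randn(1, 64, 56, 56)
    fused_stage1 = FusionBlock(64)(f_d, f_c)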
[0041] Before describing the vision transformer 214, some context about transformers in general may be useful. Transformers are deep learning models that are capable of training on large sets of data through the now well-known concept of self-attention, to detect subtle ways that elements (“tokens”) in a sequence relate to each other. These architectures have helped power the recent revolution in generative AI, such as ChatGPT, due to their ability to model long-term relationships through the attention mechanism. Attention, for a text sentence, allows each word (represented as a “token”) to understand how closely related it is to every other word in the sentence through dot products between their weight matrices. Even if the text sentence is very long, the model can compute this set of affinities for each word due to its highly parallel nature. This kind of attention is called self-attention and is very commonplace in Natural Language Processing (NLP) deep learning models nowadays because it has led to huge performance gains. However, since the model compares a representation of each word with every other word, the complexity is O(n²·d), where n is the number of words and d is the dimension of the vector that represents each word (which is also usually large). Since sentences can be very long, this can be a big bottleneck. It also makes the computation quite inefficient, since the model has to allocate a large amount of memory for certain matrices.
[0042] Extending the above principle to a vision transformer, since there is no notion of a sequence of words in this case, an image is typically split up into patches to create a sequence of input tokens. In the context of a vision transformer, a “token” refers to a vector representation of a patch. The input tokens may be produced by transforming each patch into a vector by an embedding process (typically at an initial layer of the transformer). In the proposed methodology, instead of the input image, the fused feature map 213 is spatially divided into a number of non-overlapping patches, which define a sequence of input tokens for the vision transformer 214. The vision transformer 214 may include one or more transformer blocks that can compute patch-based attention to encode each of the patches based on information from other patches in the fused feature map 213.
[0043] Patch-based attention may be computed in a number of ways, as illustrated in FIG. 4. Here, the block (A) illustrates a first approach of attention computation by computing an attention matrix in a traditional manner, and the block (B) illustrates a second approach for light-weight attention computation according to one or more embodiments of this disclosure. For the purpose of illustration, in FIG. 4, the number of input tokens (i.e., n) is shown to be three.
[0044] Referring to block (A), in the first approach each input token may be mapped to a query token 1, 2, 3 using a first weight matrix, to a key token 1’, 2’, 3’ using a second weight matrix, and to a value token (not shown) using a third weight matrix. An attention matrix a is then computed by computing an inner product (dot product) of each query token 1, 2, 3 with each key token 1’, 2’, 3’. As shown, the total number of inner products is n², where n = 3 in this illustration. Understandably, for a high-resolution image, n is usually very large, making the computation resource heavy.
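For reference, a textbook dot-product self-attention over n tokens looks roughly as follows; the n x n score matrix is what drives the O(n²·d) cost discussed above. The token count, dimensions and weight initialization here are purely illustrative.

    import torch

    def dot_product_attention(x, w_q, w_k, w_v):
        # x: (n, d) input tokens; the (n, n) score matrix is the O(n^2) part.
        q, k, v = x @ w_q, x @ w_k, x @ w_v                        # each (n, d)
        scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)    # (n, n)
        attn = torch.softmax(scores, dim=-1)
        return attn @ v                                            # (n, d)

    n, d = 256, 64
    x = torch.randn(n, d)
    w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
    out = dot_product_attention(x, w_q, w_k, w_v)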
[0045] To address computational complexity, the second approach may be employed. In this approach, the concept of “separable attention” (described in the publication: Sachin Mehta and Mohammad Rastegari. Separable Self-attention for Mobile Vision Transformers. In Transactions on Machine Learning Research, 2023) may be suitably modified and adapted to the present problem, such that instead of computing an attention score for each input token with respect to all n tokens, a context score is computed for each token with respect to a single latent token L.
[0046] Referring to block (B), in the second approach, each input token 1, 2, 3 may be mapped respectively to a scalar context score cs^(1,L), cs^(2,L), cs^(3,L) using a linear layer L defined by a set of weights that are learned during the training process. To compute the context scores cs, first, an inner product of each input token 1, 2, 3 with the linear layer L may be computed, resulting in an n-dimensional vector. This inner product operation computes an affinity between L and the input token. A SoftMax operation may then be applied to this n-dimensional vector to produce the context scores cs^(1,L), cs^(2,L), cs^(3,L). The key tokens 1’, 2’, 3’ are also generated using a learnable weight matrix, and may then be scaled by the respective context scores cs^(1,L), cs^(2,L), cs^(3,L). The scaled key tokens may then undergo a weighted summation to compute a context vector cv, which encodes information from all the input tokens and, at the same time, is much cheaper to compute. The encoding for each patch may then be computed by multiplying the context vector cv with the matrix representing the value tokens of the patches. This approach can directly reduce the complexity of computing the patch-based attention from O(n²) to O(n), which can bring a dramatic speedup in inference time while still maintaining the benefits of learning complex relationships within an image.
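A minimal PyTorch sketch of this light-weight attention, adapted from the separable self-attention idea described above, is given below. The module name, the layer names and the use of a ReLU on the value branch are assumptions of this sketch rather than details of the disclosed embodiments.

    import torch
    import torch.nn as nn

    class SeparableAttention(nn.Module):
        # Each token gets a scalar context score from a single latent
        # projection; keys scaled by these scores are summed into one context
        # vector, which then modulates the value tokens. Cost is O(n).
        def __init__(self, d):
            super().__init__()
            self.to_context = nn.Linear(d, 1)   # latent token L
            self.to_key = nn.Linear(d, d)
            self.to_value = nn.Linear(d, d)
            self.out = nn.Linear(d, d)

        def forward(self, tokens):              # tokens: (B, n, d)
            cs = torch.softmax(self.to_context(tokens), dim=1)   # (B, n, 1)
            keys = self.to_key(tokens)                           # (B, n, d)
            context = (cs * keys).sum(dim=1, keepdim=True)       # (B, 1, d)
            values = torch.relu(self.to_value(tokens))           # (B, n, d)
            return self.out(values * context)                    # broadcast over n

    tokens = torch.randn(1, 256, 64)          # hypothetical patch tokens
    encoded = SeparableAttention(64)(tokens)  # (1, 256, 64)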
[0047] Turning again to FIG. 2, in some embodiments, the vision transformer 214 may include a series of transformer blocks 214a, 214b, 214c, each including an attention layer for computing patch-based attention for encoding the patches, to generate the transformer output 216 of the vision transformer 214.
[0048] The transformer output 216 is fed to a decoder 218 to construct a grasp map 220. Although not shown, in some embodiments, the input to the decoder 218 may additionally include a dimension of the gripper or end effector of the robot being used for the task. The dimension of the gripper may be defined by a pixel dimension computed based on the actual gripper dimension (e.g., expressed in a unit of distance measurement) and the depth image frame 202. The decoder 218 may include a number of convolutional layers (not explicitly shown) configured to up-sample the transformer output 216 and construct a grasp map 220 depicting pixel-wise grasp confidence values identifying grasping points for objects in the scene depicted in the image. The decoder 218 can be, for example, a U-Net style decoder that includes skipped connections from the fusion encoder 210 to the decoder 218 to better aggregate information.
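An illustrative U-Net style decoder head of this kind might be sketched as follows; the number of up-sampling stages, the channel widths and the single skipped connection are placeholders for illustration, not the actual decoder of the disclosed embodiments.

    import torch
    import torch.nn as nn

    class GraspDecoder(nn.Module):
        # Upsample the transformer output back toward image resolution,
        # merge one skip feature map, and emit a one-channel grasp map.
        def __init__(self, in_ch=256, skip_ch=64):
            super().__init__()
            self.up1 = nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(in_ch, 128, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))
            self.up2 = nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(128 + skip_ch, 64, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))
            self.head = nn.Conv2d(64, 1, 1)    # per-pixel grasp confidence logits

        def forward(self, x, skip):
            x = self.up1(x)
            x = torch.cat([x, skip], dim=1)    # skip must match x's resolution here
            x = self.up2(x)
            return self.head(x)

    x = torch.randn(1, 256, 32, 32)            # hypothetical transformer output
    skip = torch.randn(1, 64, 64, 64)          # hypothetical fusion-encoder skip
    grasp_logits = GraspDecoder()(x, skip)     # (1, 1, 128, 128)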
[0049] The GNN 200 may be trained in a supervised learning process by backpropagating a grasping loss of the decoder 218. The grasping loss may be computed based on ground truth grasp maps for the images in the training dataset. The ground truth grasp maps may include pixel labels for each pixel of the input image indicative of a grasp confidence value. In various embodiments, the ground truth pixel labels may be assigned based on hand labeling of the images (e.g., by a user locating one or more pick points in the image via a GUI), or may be automatically generated in case of synthetic images. In some embodiments, the grasping loss may be computed as a cross-entropy loss of the decoder 218, using the ground truth grasp maps. The training may include processing batches of images from the training dataset to compute a grasping loss and repeatedly updating parameters (e.g., weights and biases) of the GNN 200 until the grasping loss is minimized, for example, based on a method of gradient descent.
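A single supervised training step with such a grasping loss could look roughly like the sketch below, here using a pixel-wise binary cross-entropy against the ground truth grasp map; the model interface, batch layout and loss variant are assumptions of this sketch.

    import torch.nn.functional as F

    def training_step(gnn, batch, optimizer):
        # One supervised step: predict a grasp map from the depth and color
        # frames and backpropagate the grasping loss.
        depth, color, gt_grasp = batch            # gt_grasp: (B, 1, H, W) in [0, 1]
        pred_logits = gnn(depth, color)           # (B, 1, H, W)
        loss = F.binary_cross_entropy_with_logits(pred_logits, gt_grasp)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()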
[0050] A variant of the above-described architecture is illustrated in FIG. 3. Like elements that are identified by like reference numerals will not be described again. As shown, a difference between the present architecture and the one described in FIG. 2 is that the GNN 300 includes two decoding heads instead of one, i.e., a second decoder 302 in addition to the first decoder 218. In this case, the transformer output 216 is branched off and fed as an input to both of the decoders 218 and 302. In some embodiments, the input to the first decoder 218 may additionally include a dimension of the gripper or end effector of the robot being used for the task, as described above.
[0051] The second decoder 302 may also include a number of convolutional layers (not explicitly shown) configured to up-sample the transformer output 216 and produce as output a segmentation map 304 depicting pixel-wise segmentation of objects in the scene represented in the image. The shown decoder architecture, which is exemplary, includes a U-Net style decoder that includes skipped connections from the fusion encoder 210 to both the decoders 218 and 302 (illustrated by dashed lines) to better aggregate information.
[0052] The GNN 300 may be trained in a supervised learning process using a loss function defined by a combination of the grasping loss of the first decoder 218 and a segmentation loss of the second decoder 302. The grasping loss of the first decoder 218 may be computed based on ground truth grasp maps for the images in the training dataset, as described above. The segmentation loss of the second decoder 302 may be computed based on ground truth segmentation maps for the images in the training dataset. The ground truth segmentation maps for the second decoder 302 may include pixel labels for each pixel of the input image indicative of whether the pixel belongs to an object in the scene. In various embodiments, the ground truth pixel labels may be assigned based on hand labeling of the images (e.g., by a user localizing objects in the scene via a GUI), or may be automatically generated in case of synthetic images. In particular, the ground truth segmentation maps for the second decoder 302 may be generated by computing, for each image, a binary segmentation mask which segments all objects in the scene from a background and further defines boundaries separating objects that touch each other.
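The combined objective for the two decoding heads can be sketched as a weighted sum of the two losses, as below; the loss variants and the weighting factor are hypothetical choices for illustration only.

    import torch.nn.functional as F

    def combined_loss(grasp_logits, gt_grasp, seg_logits, gt_seg, seg_weight=0.5):
        # Grasping loss from the first decoder plus a weighted segmentation
        # loss from the second decoder.
        grasp_loss = F.binary_cross_entropy_with_logits(grasp_logits, gt_grasp)
        seg_loss = F.binary_cross_entropy_with_logits(seg_logits, gt_seg)
        return grasp_loss + seg_weight * seg_loss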
[0053] Logically, if the GNN 300 starts to learn boundaries between objects on a segmentation map output 304, it should learn to separate the grasping points or pick points on the grasp map output 220 too. Thereby, the grasp map output 220 of the first decoder 218 can be guided by the segmentation map output 304 of the second decoder 302 during the training, to further enable the GNN to understand the semantic and geometric properties of the scene and all the objects in it, and provide robust, centered and un-occluded pick points on unknown objects. An example of this approach is described in the International Patent Application No. PCT/US2024/035234, filed by the present Applicant, which is incorporated by reference herein in its entirety.
[0054] After training and validation, the GNN may be deployed on a real-time control system, such as the computing system 104 shown in FIG. 1. In some embodiments, the real-time control system may comprise an Edge device or other computing device(s) operating on a resource-constrained hardware platform. A non-limiting example of a hardware platform where the trained GNN can be suitably deployed is the TM MFP module in the SIMATIC TM line of controllers developed by Siemens AG.
[0055] Referring now to FIG. 1, an example of a robotic grasp execution utilizing the trained GNN is described. The camera 122 captures an image of a pick scene including the bin 120 containing objects 118. The camera image is processed by the trained GNN, which may be deployed on the computing system 104. Referring to the architecture shown in FIG. 2 and 3, depth and color frames of the camera image may be passed separately through convolutional blocks 206a-c, 208a-c of the trained GNN for extracting feature maps via down sampling of the respective image frames. The feature maps extracted from the depth and color frames may be fused to produce a fused feature map, e.g., via fusion encoder 210. The fused feature map may be spatially divided into patches and fed as input to the vision transformer 214 of the trained GNN, which can encode the patches based on information from other patches in the fused feature map. The output of the vision transformer 214 may then be fed to the decoder 218 of the trained GNN to construct a grasp map. An optimal grasping location or pick point is estimated utilizing the grasp map.
[0056] For example, in some embodiments, the pick point may be computed by applying an argmax operator on the grasp map output, i.e., determining the pixel with the highest grasp confidence value in the grasp map. The computed pick point may be transformed from the 2D pixel frame of the grasp map to the 3D real-world reference frame 130 based on the depth frame and the camera intrinsic parameters of the camera 122, using a sequence of known transforms. The estimated pick point, which may include X, Y, Z coordinates and a direction of approach, is outputted to the robot controller 108 to control the robotic end effector 116 to execute the grasp.
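The transform of the selected pixel into the 3D reference frame can be illustrated with the standard pinhole deprojection below; the function and variable names are assumptions, and the subsequent camera-to-robot transform mentioned above is omitted here.

    import numpy as np

    def deproject_pixel(u, v, depth_frame, fx, fy, cx, cy):
        # Convert a pick-point pixel (u, v) into 3D camera coordinates using
        # the aligned depth frame and the camera intrinsics.
        z = float(depth_frame[v, u])     # metric depth at the pick pixel
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.array([x, y, z])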
[0057] FIGS. 5-7 illustrate sample outputs of a trained GNN according to disclosed embodiments. As can be seen, the trained GNN has a remarkable ability to produce pick points that are always in the center of an object and that are always on the topmost (un-occluded) object.
[0058] FIG. 5 demonstrates the robustness of the GNN output in an extremely difficult scenario with tightly packed objects of the same height placed in a bin. Here, the image 502 depicts the input depth image frame, the image 504 depicts the input RGB image frame, the image 506 depicts the grasp map output of the GNN and the image 508 depicts the ground truth. The grasp map 506 has been post-processed to highlight pixels with grasp confidence values higher than a threshold. As seen from the grasp map 506, the GNN has acquired the exceptional ability to produce centered grasping points on each of the objects even though the depth image frame 502 does not give away much in terms of boundary information amongst the very tightly packed objects. The final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 506.
[0059] FIG. 6 demonstrates the robustness of the GNN output in a tricky scenario where there is a translucent fluid over an object in the bin. Here, the image 602 depicts the input depth image frame, the image 604 depicts the input RGB image frame, the image 606 depicts the grasp map output of the GNN and the image 608 depicts the segmentation map. The grasp map 606 has been post-processed to highlight pixels with grasp confidence values higher than a threshold. As seen from the grasp map 606, due to a combination of the fusion of features extracted from RGB and depth images and the vision transformer effect, the GNN still managed to provide the best (centered) pick point even though the depth image frame 602 is corrupted due to the translucent liquid. The final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 606.
[0060] FIG. 7 demonstrates the robustness of the GNN output in a popout-prone scenario. Here, the image 702 depicts the input depth image frame, the image 704 depicts the input RGB image frame, the image 706 depicts the grasp map output of the GNN and the image 708 depicts the segmentation map. The grasp map 706 has been post-processed to highlight pixels with grasp confidence values higher than a threshold. As seen from the grasp map 706, the GNN clearly produced very low confidence grasps for those objects that it determined were occluded or not oriented correctly, and gave the best position on the most visible object that is easy to grasp. The final pick point PF represents the pixel(s) with the highest grasp confidence value in the grasp map 706.
[0061] Furthermore, especially in case of the GNN 300, the architecture allows the model to deal with ambiguity in its grasp predictions. In some scenarios, the highest grasp confidence value in the grasp map may fail to reach a threshold. If this happens, the GNN may be unable to produce a well-defined pick point on an object in the grasp map. In some embodiments, if the highest grasp confidence value is below a threshold, the estimation of the pick point may be informed by a combination of the grasp map output and the segmentation map output. In this case, the segmentation map output, which has proven to be very reliable, can help guide the choice, since the bounds and area of the object are known from it. The final choice of the pick point may be determined, for example, based on a user input.
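One possible, purely illustrative way to combine the two outputs when no grasp confidence clears the threshold is sketched below: restrict the candidate pixels to the segmented object regions and take the best remaining grasp score. The threshold and the combination rule are assumptions, and, as noted above, the final choice may instead be left to a user.

    import numpy as np

    def pick_with_fallback(grasp_map, seg_map, threshold=0.5):
        # If no pixel clears the confidence threshold, fall back to the
        # segmentation output: keep only pixels marked as object, then take
        # the best remaining grasp score.
        if grasp_map.max() >= threshold:
            return np.unravel_index(np.argmax(grasp_map), grasp_map.shape)
        masked = np.where(seg_map > 0.5, grasp_map, -np.inf)
        return np.unravel_index(np.argmax(masked), masked.shape)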
[0062] The performance of the proposed GNN architecture on low compute platforms has also been demonstrated. For example, the runtime for the GNN 300 turns out to be about 500 ms on average, using the OpenVINO toolkit to port the model to and run it on the TM MFP, which only possesses a CPU without any dedicated GPU acceleration. For context, most state-of-the-art vision transformers would not even fit in the memory of this hardware platform. For further context, this hardware platform currently has comparable or lower compute capabilities than a regular laptop CPU, which underscores just how much efficiency the proposed GNN architecture was able to achieve for the level of performance it delivers.
[0063] FIG. 8 illustrates an exemplary computing environment comprising a computing system 802, within which aspects of the present disclosure may be implemented. The computing system 802 may be embodied, for example and without limitation, as an industrial PC with a Linux operating system, for executing real-time control of a physical device, such as a robot.
[0064] As shown in FIG. 8, the computing system 802 may include a communication mechanism such as a system bus 804 or other communication mechanism for communicating information within the computing system 802. The computing system 802 further includes one or more processors 806 coupled with the system bus 804 for processing the information. The processors 806 may include one or more central processing units (CPUs), graphical processing units (GPUs), AI accelerators, or any other processor known in the art.
[0065] The computing system 802 also includes a system memory 808 coupled to the system bus 804 for storing information and instructions to be executed by processors 806. The system memory 808 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 810 and/or random access memory (RAM) 812. The system memory RAM 812 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The system memory ROM 810 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 808 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 806. A basic input/output system 814 (BIOS) containing the basic routines that help to transfer information between elements within computing system 802, such as during start-up, may be stored in system memory ROM 810. System memory RAM 812 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 806. System memory 808 may additionally include, for example, operating system 816, application programs 818, other program modules 820 and program data 822.
[0066] The computing system 802 also includes a disk controller 824 coupled to the system bus 804 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 826 and a removable media drive 828 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive). The storage devices may be added to the computing system 802 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
[0067] The computing system 802 may also include a display controller 830 coupled to the system bus 804 to control a display 832, such as a cathode ray tube (CRT) or liquid crystal display (LCD), among others, for displaying information to a computer user. The computing system 802 includes a user input interface 834 and one or more input devices, such as a keyboard 836 and a pointing device 838, for interacting with a computer user and providing information to the one or more processors 806. The pointing device 838, for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the one or more processors 806 and for controlling cursor movement on the display 832. The display 832 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 838.
[0068] The computing system 802 also includes an I/O adapter 846 coupled to the system bus 804 to connect the computing system 802 to a controllable physical device, such as a robot. In the example shown in FIG. 8, the I/O adapter 846 is connected to robot controller 848. In some embodiments, the robot controller 848 may include one or more motors for controlling linear and/or angular positions of various parts (e.g., arm, base, etc.) of a robot.
[0069] The computing system 802 may perform a portion or all of the processing steps of embodiments of the disclosure in response to the one or more processors 806 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 808. Such instructions may be read into the system memory 808 from another computer readable storage medium, such as a magnetic hard disk 826 or a removable media drive 828. The magnetic hard disk 826 may contain one or more datastores and data files used by embodiments of the present disclosure. Datastore contents and data files may be encrypted to improve security. The processors 806 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 808. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
[0070] The computing system 802 may include at least one computer readable storage medium or memory for holding instructions programmed according to embodiments of the disclosure and for containing data structures, tables, records, or other data described herein. The term “computer readable storage medium” as used herein refers to any medium that participates in providing instructions to the one or more processors 806 for execution. A computer readable storage medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 826 or removable media drive 828. Non-limiting examples of volatile media include dynamic memory, such as system memory 808. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 804. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
[0071] The computing environment 800 may further include the computing system 802 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 844. Remote computing device 844 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computing system 802. When used in a networking environment, computing system 802 may include a modem 842 for establishing communications over a network 840, such as the Internet. Modem 842 may be connected to system bus 804 via network interface 845, or via another appropriate mechanism.
[0072] Network 840 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computing system 802 and other computers (e.g., remote computing device 844). The network 840 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 840.
[0073] The embodiments of the present disclosure may be implemented with any combination of hardware and software. In addition, the embodiments of the present disclosure may be included in an article of manufacture (e.g., one or more computer program products) having, for example, a non-transitory computer-readable storage medium. The computer readable storage medium has embodied therein, for instance, computer readable program instructions for providing and facilitating the mechanisms of the embodiments of the present disclosure. The article of manufacture can be included as part of a computer system or sold separately.
[0074] The computer readable storage medium can include a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
[0075] The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the disclosure to accomplish the same objectives. Although this disclosure has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the disclosure.

Claims

What is claimed is:
1. A computer-implemented method for executing robotic grasps, comprising:
   acquiring, via a camera, an image of a scene including one or more objects, the image defined by a depth image frame and a color image frame,
   processing the acquired image using a trained grasping neural network, the processing comprising:
      passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames,
      fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map,
      feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map, and
      feeding an output of the vision transformer to a decoder to construct a grasp map of the scene, and
   estimating an optimal grasping location based on the grasp map.
2. The method according to claim 1, wherein the vision transformer comprises a series of transformer blocks, each comprising an attention layer for computing patch-based attention for encoding the patches, to generate the output of the vision transformer.
3. The method according to any of claims 1 and 2, wherein encoding the patches by the vision transformer comprises computing patch-based attention by:
   mapping each patch, represented by an input token, to a scalar context score using a single latent token defined by a set of learned weights,
   scaling key tokens for each input token by the respective context scores, and
   summing the scaled key tokens to compute a context vector encoding information from all input tokens.
4. The method according to any of claims 1 to 3, wherein the depth image frame and the color image frame are down sampled respectively by a first and a second series of convolution blocks.
5. The method according to claim 4, wherein each of the convolution blocks for the depth image frame and the color image frame includes an inverted residual bottleneck (IRB) module.
6. The method according to claim 5, wherein the first and the second series of convolutional blocks respectively include a convolutional stem for the depth image and the color image, each convolutional stem including an IRB module and an additional convolutional layer that performs the down sampling instead of the IRB module.
7. The method according to any of claims 1 to 6, wherein the fused feature map is produced by fusing feature maps extracted at respective convolutional blocks for the depth and color image frames via a fusion module comprising a series of convolutional blocks.
8. The method according to any of claims 1 to 7, wherein the grasp map depicts pixel-wise grasp confidence values identifying grasping points for objects in the scene.
9. The method according to claim 8, wherein the optimal grasping location is estimated by determining a pixel with the highest grasp confidence value in the grasp map constructed by the decoder.
10. The method according to any of claims 1 to 9, comprising outputting the estimated optimal grasping location to a robot controller to control a robot to execute a grasp.
11. A computer-implemented method for learning robotic grasping, comprising:
   inputting images from a training dataset to a grasping neural network comprising an encoder-decoder architecture, each image comprising a color image frame and a corresponding depth image frame depicting a scene including one or more objects placed in random configurations,
   for each image:
      passing the depth image frame and the color image frame through convolutional blocks of an encoder for extracting feature maps via down sampling of the respective image frames,
      fusing the feature maps extracted from the depth image frame and the color image frame to produce a fused feature map,
      feeding the fused feature map, spatially divided into patches, as input to a vision transformer to encode the patches based on information from other patches in the fused feature map, and
      feeding an output of the vision transformer to a first decoder to construct a grasp map of the scene, and
   training the grasping neural network based on a grasping loss of the first decoder, the grasping loss being computed based on ground truth grasp maps for the images in the training dataset.
12. The method according to claim 11, wherein, for each image, the output of the vision transformer is also fed to a second decoder to construct a segmentation map,
   wherein the grasping neural network is trained using a loss function defined by a combination of the grasping loss of the first decoder and a segmentation loss of the second decoder, the segmentation loss being computed based on ground truth segmentation maps for the images in the training dataset,
   whereby a grasp map output of the first decoder is guided by a segmentation map output of the second decoder during the training.
13. A non-transitory computer-readable storage medium including instructions that, when processed by one or more processors, configure the one or more processors to perform the method according to any one of claims 1 to 12.
14. A system for executing a robotic grasp, comprising:
   a robot controllable by a robot controller,
   a camera configured to capture an image of a scene including one or more objects, and
   a computing system comprising:
      one or more processors, and
      memory storing instructions executable by the one or more processors to perform a method according to any of claims 1 to 10, to execute a grasp by the robot based on the captured image.
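
The sketches below are illustrative only and form no part of the claims. They model, under stated assumptions, how the claimed techniques might be realized in a PyTorch-style implementation. This first sketch follows the overall pipeline of claim 1: two convolutional branches down sample the color and depth frames, the resulting feature maps are fused (claim 7), the fused map is tokenized and encoded by a transformer, and a decoder reconstructs a pixel-wise grasp map from which the optimal grasping location is taken as the highest-confidence pixel (claims 8 and 9). All module names, layer sizes, and the use of a standard transformer encoder are assumptions of this sketch, not features recited in the claims.

# Illustrative sketch only (not part of the claims): a minimal RGB-D grasping
# network, assuming the encoder/fusion/transformer/decoder layout of claim 1.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, stride):
    # Plain strided convolution standing in for one encoder block
    # (stride=2 halves the spatial resolution, i.e. performs down sampling).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class GraspNetSketch(nn.Module):
    def __init__(self, dim=64, num_heads=4, depth=2):
        super().__init__()
        # Separate down-sampling branches for the color and depth frames.
        self.rgb_branch = nn.Sequential(conv_block(3, 32, 2), conv_block(32, dim, 2))
        self.depth_branch = nn.Sequential(conv_block(1, 32, 2), conv_block(32, dim, 2))
        # Fusion of the two feature maps (claim 7) by concatenation + convolution.
        self.fusion = conv_block(2 * dim, dim, 1)
        # Standard transformer encoder standing in for the vision transformer;
        # each spatial location of the fused map is treated as one patch token.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Decoder: upsample back to input resolution and emit a one-channel
        # pixel-wise grasp-confidence map (claim 8).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, rgb, depth):
        f = self.fusion(torch.cat([self.rgb_branch(rgb),
                                   self.depth_branch(depth)], dim=1))
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)          # (B, H*W, C) patch tokens
        tokens = self.transformer(tokens)              # encode with global context
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.decoder(f))          # grasp map in [0, 1]


if __name__ == "__main__":
    net = GraspNetSketch()
    rgb = torch.rand(1, 3, 224, 224)
    depth = torch.rand(1, 1, 224, 224)
    grasp_map = net(rgb, depth)                        # (1, 1, 224, 224)
    # Optimal grasping location = pixel with the highest confidence (claim 9).
    flat_idx = grasp_map[0, 0].argmax()
    v, u = divmod(flat_idx.item(), grasp_map.shape[-1])
    print("best grasp pixel (row, col):", (v, u))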
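
A minimal sketch of the patch-based attention recited in claim 3, in the spirit of separable self-attention: each input token is mapped to a scalar context score by a single learned latent token, the key tokens are scaled by those scores, and the scaled keys are summed into one context vector. The value and output projections that broadcast the context back onto per-token features are conventional assumptions added so the module has an input-shaped output; they are not taken from the claim.

# Illustrative sketch only: separable ("single latent token") attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableAttentionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_score = nn.Linear(dim, 1)   # the single latent token (learned weights)
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, N, dim) patch tokens
        # 1) map each input token to a scalar context score
        scores = F.softmax(self.to_score(x), dim=1)          # (B, N, 1)
        # 2) scale the key token of each input token by its context score
        scaled_keys = scores * self.to_key(x)                 # (B, N, dim)
        # 3) sum the scaled keys into one context vector encoding
        #    information from all input tokens
        context = scaled_keys.sum(dim=1, keepdim=True)        # (B, 1, dim)
        # Assumed follow-up step: broadcast the global context onto
        # per-token values and project to the output dimension.
        return self.out(F.relu(self.to_value(x)) * context)   # (B, N, dim)


if __name__ == "__main__":
    attn = SeparableAttentionSketch(dim=64)
    tokens = torch.rand(2, 196, 64)       # e.g. a 14x14 fused feature map
    print(attn(tokens).shape)             # torch.Size([2, 196, 64])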
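
A sketch of the inverted residual bottleneck (IRB) module named in claims 5 and 6, assumed here to follow the familiar MobileNetV2 pattern: a 1x1 point-wise expansion, a 3x3 depth-wise convolution that can perform the down sampling, a 1x1 linear projection, and a skip connection when the input and output shapes match. The expansion factor and the normalization/activation choices are assumptions of this sketch.

# Illustrative sketch only: an inverted residual bottleneck block.
import torch
import torch.nn as nn


class IRBSketch(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expansion=4):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 point-wise expansion
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depth-wise convolution (stride=2 would down sample,
            # matching the encoder blocks of claim 4)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back down (no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y


if __name__ == "__main__":
    irb = IRBSketch(32, 32, stride=1)
    print(irb(torch.rand(1, 32, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])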
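
A sketch of the two-headed training objective of claims 11 and 12: the grasping loss of the first (grasp-map) decoder is combined with the segmentation loss of the second (segmentation-map) decoder so that the segmentation output guides the grasp-map output during training. The specific loss functions (mean squared error for the grasp map, cross-entropy for the segmentation map) and the weighting factor are assumptions, not recited in the claims.

# Illustrative sketch only: combined grasping + segmentation loss.
import torch
import torch.nn.functional as F


def combined_loss(pred_grasp, gt_grasp, pred_seg, gt_seg, seg_weight=0.5):
    """pred_grasp/gt_grasp: (B, 1, H, W) grasp-confidence maps in [0, 1];
    pred_seg: (B, K, H, W) segmentation logits; gt_seg: (B, H, W) class ids.
    Loss choices and weighting are assumptions of this sketch."""
    grasp_loss = F.mse_loss(pred_grasp, gt_grasp)
    seg_loss = F.cross_entropy(pred_seg, gt_seg)
    return grasp_loss + seg_weight * seg_loss


if __name__ == "__main__":
    pg, gg = torch.rand(2, 1, 56, 56), torch.rand(2, 1, 56, 56)
    ps, gs = torch.randn(2, 3, 56, 56), torch.randint(0, 3, (2, 56, 56))
    print(combined_loss(pg, gg, ps, gs).item())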

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363579546P 2023-08-30 2023-08-30
US63/579,546 2023-08-30

Publications (1)

Publication Number Publication Date
WO2025049074A1 (en)

Family

ID=92543481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/041621 Pending WO2025049074A1 (en) 2023-08-30 2024-08-09 Robotic grasping using efficient vision transformer

Country Status (1)

Country Link
WO (1) WO2025049074A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113888631A (en) * 2021-08-31 2022-01-04 华南理工大学 Designated object grabbing method based on target cutting area
CN116486219A (en) * 2023-03-29 2023-07-25 重庆理工大学 A Grasping Detection Method Based on Transformer Region Estimation and Multi-level Feature Fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SACHIN MEHTA; MOHAMMAD RASTEGARI: "Separable Self-attention for Mobile Vision Transformers", TRANSACTIONS ON MACHINE LEARNING RESEARCH, 2023
SANDLER, M.; HOWARD, A.; MENGLONG ZHU; ZHMOGINOV, A.; LIANG-CHIEH CHEN: "MobileNetV2: Inverted Residuals and Linear Bottlenecks", THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, pages 4510 - 4520, XP033473361, DOI: 10.1109/CVPR.2018.00474
SONG YAOXIAN ET AL: "Deep Robotic Grasping Prediction with Hierarchical RGB-D Fusion", INTERNATIONAL JOURNAL OF CONTROL, AUTOMATION AND SYSTEMS, KOREAN INSTITUTE OF ELECTRICAL ENGINEERS, SEOUL, KR, vol. 20, no. 1, 1 January 2022 (2022-01-01), pages 243 - 254, XP037667923, ISSN: 1598-6446, [retrieved on 20220117], DOI: 10.1007/S12555-020-0197-Z *
WANG SHAOCHEN ET AL: "When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection", IEEE ROBOTICS AND AUTOMATION LETTERS, IEEE, vol. 7, no. 3, 28 June 2022 (2022-06-28), pages 8170 - 8177, XP011913804, DOI: 10.1109/LRA.2022.3187261 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120382489A (en) * 2025-04-28 2025-07-29 摩玛智能(北京)科技有限公司 Photovoltaic robot control method and device based on embodied vision-motion coupling

Similar Documents

Publication Publication Date Title
US12456287B2 (en) Synthetic dataset creation for object detection and classification with deep learning
Sayour et al. Autonomous robotic manipulation: real‐time, deep‐learning approach for grasping of unknown objects
WO2020180697A1 (en) Robotic manipulation using domain-invariant 3d representations predicted from 2.5d vision data
CN112017226B (en) 6D pose estimation method for industrial parts and computer readable storage medium
US12427674B2 (en) Task-oriented 3D reconstruction for autonomous robotic operations
US20240335941A1 (en) Robotic task planning
Huang et al. A case study of cyber-physical system design: Autonomous pick-and-place robot
US20240198530A1 (en) High-level sensor fusion and multi-criteria decision making for autonomous bin picking
Zhang et al. A robotic grasp detection method based on auto-annotated dataset in disordered manufacturing scenarios
Sharma et al. Deep convolutional neural network design approach for 3D object detection for robotic grasping
WO2025049074A1 (en) Robotic grasping using efficient vision transformer
US20240198515A1 (en) Transformation for covariate shift of grasp neural networks
Hodan et al. A Summary of the 4th International Workshop on Recovering 6D Object Pose
Caldera et al. Robotic grasp pose detection using deep learning
Lin et al. Inference of 6-DOF robot grasps using point cloud data
Ogas et al. Object grasping with a robot arm using a convolutional network
Bergamini et al. Deep learning-based method for vision-guided robotic grasping of unknown objects
Drögemüller et al. Automatic generation of realistic training data for learning parallel-jaw grasping from synthetic stereo images
WO2025085127A1 (en) System and method for learning and executing robotic grasps
US20250242498A1 (en) System and method for pick pose estimation for robotic picking with arbitrarily sized end effectors
Xu et al. RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking
Liu et al. Workpiece segmentation based on improved yolov5 and sam
US12427659B2 (en) Transporter network for determining robot actions
US20250262772A1 (en) Place conditioned pick for robotic pick and place operations
WO2025155278A1 (en) System and method for estimation of object planar dimensions for autonomous handling of objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24762173

Country of ref document: EP

Kind code of ref document: A1