
CN119168835A - A mechanical arm grasping prediction method, electronic device and storage medium - Google Patents

A mechanical arm grasping prediction method, electronic device and storage medium

Info

Publication number
CN119168835A
CN119168835A (application CN202411168309.1A)
Authority
CN
China
Prior art keywords
model
robot
grasping
mechanical arm
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411168309.1A
Other languages
Chinese (zh)
Inventor
王怀震
程瑶
蒋风洋
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202411168309.1A priority Critical patent/CN119168835A/en
Publication of CN119168835A publication Critical patent/CN119168835A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/0014 - Image feed-back for automatic industrial control, e.g. robot with camera
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 - Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 - Sensing devices
    • B25J19/021 - Optical sensing devices
    • B25J19/023 - Optical sensing devices including video camera means
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1694 - Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 - Vision controlled systems
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/151 - Transformation
    • G06F40/16 - Automatic learning of transformation rules, e.g. from examples
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data, the classifiers operating on different input data, e.g. multi-modal recognition
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a mechanical arm grasping prediction method, an electronic device and a storage medium, belonging to the technical field of mechanical arm vision. A mechanical arm grasping gesture correction platform is constructed to obtain RGB images of the end-effector pose; a multimodal large model is built with LLaMA-Adapter; visual features of the RGB images are extracted by the CLIP model, and a pre-trained LLaMA tokenizer parses the visual-feature prompts to obtain text information and image information describing the RGB image content; based on a continuous-thinking fine-tuned reasoning grasping strategy, the operation posture of the mechanical arm end effector corresponding to the processed text and image information is predicted; and operation posture data of the end effector are acquired in real time to verify the multimodal large model with an exponential moving average method. The method provides continuous strategy learning, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet practical use requirements.

Description

Mechanical arm grabbing prediction method, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of mechanical arm vision, and particularly relates to a mechanical arm grasping prediction method based on a self-correcting multimodal large model, an electronic device and a storage medium.
Background
In the prior art, mechanical arm vision grasping technology is widely used in many fields. In practical applications, however, the environment of the mechanical arm is often complex and changeable, with interference factors such as illumination changes, occlusion and reflections; these factors degrade image quality and therefore reduce the accuracy and reliability of visual feedback.
Rapid movement of the mechanical arm can also blur the image, which seriously affects the accuracy of visual feedback, especially during high-speed grasping, and increases the risk of grasp failure. Current mechanical arm vision grasping systems tend to be optimized for a particular task or object and lack sufficient flexibility and adaptability. As a result, the manipulation strategy cannot meet the performance requirements of the motion, which limits the use of the mechanical arm.
Disclosure of Invention
The invention provides a mechanical arm grasping prediction method that introduces continuous strategy learning, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet practical use requirements.
The method comprises the following steps:
S101, constructing a mechanical arm grasping gesture prediction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose;
S102, constructing a multimodal large model with LLaMA-Adapter;
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content;
S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text and image information, based on a continuous-thinking fine-tuned reasoning grasping strategy;
S105, acquiring the operation posture data of the mechanical arm end effector in real time and verifying the multimodal large model with an exponential moving average method.
It should be further noted that, in step S101, the SAPIEN dataset and the PartNet-Mobility dataset are used to build the mechanical arm grasping posture correction platform.
It should be further noted that, in step S101, the mechanical arm grasping posture correction platform is configured with the efficient VulkanRenderer.
It should be further noted that, in the method, the dataset loader provided with the SAPIEN dataset is used to load the object models in the dataset and the URDF files of the Franka Panda robot and its end effector.
It should be further noted that, in step S103, the CLIP model is configured with a text encoder and an image encoder;
the text encoder uses a Transformer as its training network, the image encoder uses a deep convolutional network as its training network, and the input of the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the image dimensions.
In step S103, the CLIP model performs pre-training on the text information and the image information to obtain an association relationship between the image content and the natural language description;
the pre-training comprises the following steps:
s1031, processing the text information and the image information into feature vectors;
S1032, constructing a relation matrix, wherein each element in the relation matrix is cosine similarity of each image feature vector and each text feature vector;
S1033, the loss function adopted in the contrastive learning method of the pre-training is:
L = -log( exp(q·k+/τ) / Σ_i exp(q·k_i/τ) )
where τ is the set temperature hyperparameter, q is the encoded query feature, k_i are the encoded samples, and k+ is the matching (positive) sample.
In the method, the grasping motion type is divided into rotation and translation, and the movement direction of the operation posture of the mechanical arm end effector is obtained from an affordance map model;
the affordance map A ∈ R^(H×W) is obtained from the displacement map D ∈ R^(H×W),
where D ∈ R^(H×W) stores the Euclidean distance between the 3D position of each pixel before and after the movement of the mechanical arm, and A gives the movability probability of each pixel.
It should be further noted that, in step S105, the strategy formula of the exponential moving average method is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ,
where τ is the time step, μ denotes the multimodal large model, and α = 0.99.
According to another embodiment of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the robotic arm grasp prediction method when executing the program.
According to yet another embodiment of the present application, there is also provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the robotic arm grasp prediction method.
From the above technical solutions, the invention has the following advantages:
With the mechanical arm grasping prediction method provided by the invention, the grasping scene can be understood more comprehensively through the visual features and text information of the RGB images, so the optimal operation posture of the mechanical arm end effector can be predicted more accurately. Compared with a single modality, multimodal fusion captures nuances in complex environments and improves the accuracy and robustness of the prediction.
The invention uses a pre-trained LLaMA tokenizer and the CLIP model, can process input data of different sources and formats, and enhances the generalization ability of the model, so it can also adapt to different working environments and task demands.
The mechanical arm grasping prediction method acquires operation posture data of the mechanical arm end effector in real time and verifies the multimodal large model with an exponential moving average method, so that online learning and self-optimization of the model can be realized. This ensures that the model maintains high prediction accuracy even after long periods of operation.
The method improves the efficiency and reliability of industrial automated production by automatically predicting and correcting the grasping posture of the mechanical arm. It integrates techniques from multiple fields such as deep learning, computer vision and natural language processing, improves prediction precision, enhances generalization ability, and realizes real-time self-correction.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a robotic arm grasp prediction method;
Fig. 2 is a schematic diagram of an electronic device.
Detailed Description
The following detailed description of the robotic arm grasp prediction method of the present application, for purposes of explanation and not limitation, sets forth specific details, such as particular system configurations, techniques, etc., in order to provide a thorough understanding of embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that references to "one or more" herein mean one, two, or more, and references to "a plurality" herein mean two or more. In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. The term "and/or" merely describes an association relation between the associated objects and means that three kinds of relations may exist; for example, A and/or B may mean that A exists alone, A and B exist together, or B exists alone.
The statements of "one embodiment" or "some embodiments" and the like, described in this disclosure, mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present disclosure. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the present application are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to Fig. 1, a flowchart of the mechanical arm grasping prediction method in an embodiment is shown, where the method includes:
S101, constructing a mechanical arm grasping gesture correction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose.
In some embodiments, the Franka Panda robot, known for its high precision and flexibility, is used as the experimental platform and is suitable for complex mechanical arm pose prediction tasks.
In the embodiment, a high-resolution camera is installed in a working area of the mechanical arm and used for capturing RGB images of the pose of the actuator. At the same time, necessary sensors (e.g., force sensors, position sensors, etc.) are installed to obtain more comprehensive robot arm operational status data. The camera and the sensor are connected with a computer or a data processing center, so that real-time data transmission and processing are ensured.
In some specific embodiments, the Franka Panda robot is selected as the experimental platform, and a high-resolution camera is installed in the working area of the mechanical arm to ensure that RGB images of the end-effector pose can be captured clearly.
The robot arm grabbing gesture correction platform is configured to receive and process data from the cameras and sensors. And a data interface is established, so that the camera and the sensor can transmit data to a data processing center in real time and stably.
In some embodiments, SAPIEN, a physics-based simulation environment oriented toward household-scene robotics, is used. The mechanical arm grasping gesture correction platform sets up the interaction environment with the SAPIEN dataset and the PartNet-Mobility dataset, loads the object models in the dataset and the URDF files of the Franka Panda robot and end effector with the dataset loader provided by SAPIEN, and uses the efficient rasterization-based renderer VulkanRenderer. A contact point on the movable part is selected at random, the opposite direction of its normal vector is used as the effector direction to interact with the target, and if the operation is carried out successfully it is recorded as a successful sample.
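To make this setup concrete, the following is a minimal, illustrative sketch of how such a simulation scene could be assembled with SAPIEN's Python API (SAPIEN 2.x style). The URDF paths, camera parameters and timestep are assumptions for illustration only, and class or method names can differ between SAPIEN versions; this is not the actual platform code of the invention.
```python
# Minimal sketch of the simulation setup described above (SAPIEN 2.x style API).
# Paths, camera parameters and timestep are illustrative assumptions.
import sapien.core as sapien

engine = sapien.Engine()                      # physics engine
renderer = sapien.VulkanRenderer()            # efficient rasterization-based renderer
engine.set_renderer(renderer)

scene = engine.create_scene()
scene.set_timestep(1 / 240.0)
scene.add_ground(altitude=0.0)

# Load the Franka Panda arm and a PartNet-Mobility object from their URDF files.
loader = scene.create_urdf_loader()
loader.fix_root_link = True
robot = loader.load("franka_panda/panda.urdf")              # assumed local path
target = loader.load("partnet_mobility/179/mobility.urdf")  # assumed local path

# Add an RGB camera that observes the end-effector pose
# (camera API names vary across SAPIEN versions).
camera = scene.add_camera(name="wrist_cam", width=448, height=448,
                          fovy=1.57, near=0.01, far=10.0)
scene.step()
scene.update_render()
camera.take_picture()
rgb = camera.get_color_rgba()                 # H x W x 4 float image of the scene
```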
Optionally, about 10,000 successful operation samples covering 20 object categories are recorded by offline sampling.
S102, constructing a multimodal large model with LLaMA-Adapter.
In some embodiments, the mechanical arm grasping gesture correction platform uses LLaMA-Adapter as the base model, which supports efficient fine-tuning and can process multimodal input consisting of text information and image information.
In this embodiment, a pre-trained LLaMA model is loaded and most of its parameters are frozen; only the adaptation layers in LLaMA-Adapter are fine-tuned, which reduces computation, avoids overfitting, and improves self-correction efficiency.
The model in this embodiment takes RGB images and text instructions as inputs. The visual encoder of the CLIP model (or a similar model) is used to extract visual features of the image, and the pre-trained LLaMA tokenizer is used to process the text instructions.
For the multimodal large model constructed with LLaMA-Adapter, LLaMA-Adapter is trained with a small amount of self-instruct data to adapt it to the mechanical arm pose prediction task.
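As a rough illustration of the freeze-most, tune-adapters-only idea described above, the following PyTorch sketch freezes every parameter whose name does not look like an adapter parameter. The keyword match on "adapter" and the optimizer settings are assumptions, not the actual LLaMA-Adapter implementation.
```python
# Schematic PyTorch sketch of adapter-only fine-tuning: freeze the pre-trained
# weights and update only the (assumed) adapter / adaption-prompt modules.
import torch

def freeze_all_but_adapters(model: torch.nn.Module, adapter_keyword: str = "adapter"):
    """Freeze every parameter whose name does not contain `adapter_keyword`."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Usage (model construction omitted; `model` stands for the multimodal LLaMA-Adapter):
# trainable = freeze_all_but_adapters(model)
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```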
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with the pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content.
In some embodiments, the RGB image captured by the camera is passed as input to the visual encoder of the CLIP model. The CLIP model encodes the image and extracts high-level visual feature vectors, which are passed to the multimodal large model and combined with the text prompts processed by the pre-trained LLaMA tokenizer.
In this way, within the multimodal large model, the visual features and text prompts are fused by a specific fusion mechanism to form a unified input representation.
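A minimal sketch of this step, using the Hugging Face transformers CLIP interface for feature extraction and a LLaMA tokenizer for the prompt, is shown below. The checkpoint names, file paths and the final comment about fusion are illustrative assumptions rather than the exact pipeline of the embodiment.
```python
# Sketch of step S103: extract CLIP visual features from the RGB frame and
# tokenize the text prompt. Model identifiers and paths are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, AutoTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint

image = Image.open("end_effector_view.png")   # RGB frame from the workspace camera
prompt = "Predict the contact point and gripper direction for grasping the drawer handle."

with torch.no_grad():
    pixel = processor(images=image, return_tensors="pt")
    visual_feat = clip.get_image_features(**pixel)        # [1, d_clip] visual feature

text_ids = tokenizer(prompt, return_tensors="pt").input_ids  # token ids for the LLM

# In the multimodal model, visual_feat would be projected into the LLaMA embedding
# space and fused with the embeddings of text_ids to form the unified input.
```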
S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text and image information, based on the continuous-thinking fine-tuned reasoning grasping strategy.
In some embodiments, in LLaMA-Adapter the visual features extracted by CLIP are fused with the text prompts and processed through the adaptive prompt layers inside the model.
The model in this embodiment predicts the optimal operation posture of the mechanical arm end effector from the fused input information. Optionally, this may include the position of the contact point and the direction and force of the gripper.
This embodiment also fine-tunes the model with continuous thinking (e.g., a fast-slow system that mimics the human thinking process) to optimize the prediction results. For example, in the event of a prediction error, the model can reflect on its output and generate a new prediction.
In this embodiment, the multimodal large model predicts the optimal operating pose of the mechanical arm end effector based on the fused input representation. The prediction result includes key parameters such as the contact-point position and the direction and force of the gripper.
This embodiment fine-tunes the prediction strategy by introducing a continuous-thinking mechanism (e.g., a fast-slow system model). When the predicted result deviates from the actual execution result, the model reflects on its output and generates a new prediction strategy, and continuous learning and optimization improve the model's adaptability to specific scenes and tasks.
S105, acquiring the operation posture data of the mechanical arm end effector in real time and verifying the multimodal large model with an exponential moving average method.
In some embodiments, the operating pose and operational data of the robotic end effector are acquired in real-time by sensors. And comparing the actual operation data with the model prediction result, and smoothly evaluating the prediction performance of the model by using an Exponential Moving Average (EMA). EMA can help the model better adapt to data fluctuations in the short term while preserving long term trends.
According to the embodiment, the multi-mode large model can be adjusted as necessary according to the verification result. If the prediction error is large, some of the parameters of the model may be retrained or fine-tuned. Meanwhile, continuous strategy learning is performed by using a successfully corrected sample, so that the adaptability of the model to specific scene configuration is improved.
In some specific embodiments, the operating pose and operational data of the mechanical arm end effector are acquired in real time via the sensors, and the collected data are preprocessed and cleaned to ensure their accuracy and integrity.
The prediction performance of the multimodal large model is then evaluated smoothly with an exponential moving average (EMA): the actual operation data are compared with the model prediction results, and metrics such as prediction error and accuracy are calculated.
This embodiment also performs feedback adjustment and continuous optimization. The multimodal large model can be adjusted and optimized as necessary according to the verification result, continuous strategy learning is carried out with successfully corrected samples to improve the adaptability of the model to the specific scene configuration, and the model is updated and iterated regularly to meet the use requirements of the mechanical arm.
Through the steps above, a mechanical arm posture self-correction platform based on the multimodal large model can be constructed; the platform predicts and corrects the operation posture of the mechanical arm in real time and improves the accuracy and stability of mechanical arm operation.
In one embodiment of the present invention, a non-limiting example of a specific implementation of the mechanical arm grasping prediction method is given below.
In this embodiment, the multimodal pre-training process uses LLaMA-Adapter to construct a multimodal large language model (MLLM), extracts visual features of RGB images through the CLIP model, and encodes text prompts with the pre-trained LLaMA tokenizer.
The CLIP model of this embodiment includes a text encoder that uses a Text Transformer as its training network and an image encoder that uses a deep convolutional network as its training network, where the input of the image encoder has the form [n, h, w, c], n being the batch size and h, w, c the image dimensions,
for example 224 × 224 × 3. The input of the text encoder has the form [n, l], where the batch size n is the same as that of the image encoder because the inputs are image-text pairs, and l is the sequence length. The similarity between the text vectors and the image vectors is then computed to predict whether they form a pair, and CLIP is pre-trained with a contrastive learning method: the CLIP model is pre-trained on a large number of picture-text pairs to obtain the association between image content and natural language descriptions.
The pre-training stage of this embodiment specifically comprises the following steps:
① The input text and the input image are processed into feature vectors by their respective encoders;
② A relationship matrix is constructed. Each element in the relationship matrix is the cosine similarity between an image feature vector and a text feature vector. The elements on the main diagonal are matched pairs (the image and text features correspond fully), and the elements elsewhere are unmatched.
③ The loss function adopted in the contrastive learning method of the pre-training is:
L = -log( exp(q·k+/τ) / Σ_i exp(q·k_i/τ) )
where τ is the set temperature hyperparameter, q is the encoded query feature, k_i are the encoded samples, and k+ is the matching (positive) sample.
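The following sketch shows, under the assumptions of in-batch negatives and a temperature of 0.07, how the relationship matrix of step S1032 and an InfoNCE-style loss of step S1033 can be computed in PyTorch; it illustrates the general technique rather than the embodiment's actual training code.
```python
# Sketch of S1031-S1033: build the cosine-similarity relationship matrix and
# apply an InfoNCE-style contrastive loss. The temperature value is an assumption.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals the cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Relationship matrix: entry (i, j) is the similarity of image i and text j.
    logits = image_feats @ text_feats.t() / tau

    # Matching image-text pairs sit on the main diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)     # -log exp(q.k+/tau) / sum exp(q.k_i/tau)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 image/text feature pairs with 512-dimensional embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```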
In some specific embodiments, a continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) is configured to achieve robust and interpretable end-effector pose prediction using the reasoning capability of the MLLM.
During the pre-collection of training data, this embodiment captures RGB images and their corresponding successful manipulation end-effector poses in the simulator. During reasoning, the focus is on predicting the 2D coordinates x, y of the contact pixel on the image, which are then converted to 3D coordinates using depth information. At the same time, the gripper's Z-axis direction is obtained from its upward and forward directions according to the geometric relationship.
The MLLM of this embodiment aligns the visual features with the embedding space of the large language model (LLM) through a projection layer, enabling LLaMA to perform multimodal understanding and generate corresponding answers. During training, only the injected adapters in the LLM are fine-tuned while the main pre-trained parameters are kept unchanged, so as to preserve the strong capabilities of the existing MLLM and enhance the model's ability in manipulation and failure correction.
The reasoning capability of the MLLM can thus be used to configure the continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) for robust and interpretable end-effector pose prediction. The grasping strategy based on continuous-thinking fine-tuned reasoning includes operation-category understanding of the mechanical arm end effector, prior force-field reasoning, and end-effector-centric pose prediction.
For the operation-category classification of the mechanical arm end effector, operation targets of different classes can be classified according to their geometric attributes by adopting a deep learning method based on target class identification (OCI).
The present embodiment also uses prior force-field reasoning. In this stage, the grasping motion type is classified into "rotation" and "translation" and the corresponding affordance maps are gathered, so that the model becomes aware of which object regions can be manipulated. The movable part of the object is first found and then moved along its axis, and the affordance map A ∈ R^(H×W) is obtained as follows:
D ∈ R^(H×W) stores the Euclidean distance between the 3D position of each pixel before and after the movement, and A is the movability probability of each pixel.
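A small NumPy sketch of this prior force-field step is given below; it assumes per-pixel 3D point maps before and after the part movement and uses min-max normalization to turn the displacement distances into a movability probability map, which is one plausible reading of the text rather than the exact formula of the embodiment.
```python
# Sketch of the prior force-field / affordance step: move the articulated part,
# measure how far each pixel's 3D point travels, and normalize that distance
# into a movability probability map. Min-max normalization is an assumption.
import numpy as np

def affordance_map(points_before: np.ndarray, points_after: np.ndarray) -> np.ndarray:
    """points_before, points_after: H x W x 3 per-pixel 3D positions."""
    d = np.linalg.norm(points_after - points_before, axis=-1)   # D in R^(H x W)
    d_min, d_max = d.min(), d.max()
    if d_max - d_min < 1e-8:                                    # nothing moved
        return np.zeros_like(d)
    return (d - d_min) / (d_max - d_min)                        # A, values in [0, 1]

# Example with random per-pixel point maps of size 224 x 224:
A = affordance_map(np.random.rand(224, 224, 3), np.random.rand(224, 224, 3))
```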
In one embodiment of the application, for end-effector-centric pose prediction, after the training data are collected, the RGB images and corresponding end-effector poses are recorded as model inputs and result rewards, and the 2D coordinates [x, y] of the end-effector contact point are predicted from the RGB image and the text prompt.
Using the depth frame provided by a depth camera, the depth value is converted into the Z coordinate in space through the camera intrinsics. At the same time, the Z-axis direction of the gripper is obtained from its upward and forward directions according to the geometric relationship, and an accurate pose interpretation of the initial contact is generated by reasoning.
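The back-projection from the predicted contact pixel and its depth value to a 3D point follows the standard pinhole camera model; the sketch below uses placeholder intrinsic parameters (fx, fy, cx, cy) that would in practice come from the depth camera's calibration.
```python
# Sketch of converting the predicted 2D contact pixel [x, y] plus its depth
# value into a 3D point using standard pinhole camera intrinsics.
# fx, fy, cx, cy below are placeholder values, not a real camera calibration.
import numpy as np

def pixel_to_3d(x: float, y: float, depth: float,
                fx: float = 600.0, fy: float = 600.0,
                cx: float = 224.0, cy: float = 224.0) -> np.ndarray:
    X = (x - cx) * depth / fx
    Y = (y - cy) * depth / fy
    Z = depth
    return np.array([X, Y, Z])      # contact point in the camera frame

# Example: predicted pixel (310, 205) with a measured depth of 0.62 m.
contact_cam = pixel_to_3d(310, 205, 0.62)
```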
In this embodiment, since the relative positions of the robot and the object may change during each operation, a continuous strategy learning method is configured so that the mechanical arm grasping gesture prediction platform has manipulation and failure-correction capabilities. This approach aims to enhance pose prediction without expert feedback prompts, and therefore uses an exponential moving average (EMA) to continually learn from new data and from successfully corrected samples. The strategy formula is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ,
where τ is the time step and μ denotes the model of the invention. The update weight may be set to α = 0.99. The effectiveness of the EMA scheme for sequential action learning is evaluated by performing repeated closed-loop correction and sequential policy learning sessions for each scene configuration.
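The sketch below illustrates one plausible reading of this EMA update applied to model weights, where μ_(τ-1) is the running (EMA) model and the right-hand μ_τ is interpreted as the freshly fine-tuned weights obtained after the latest successful correction; this interpretation and the function name are assumptions.
```python
# Sketch of the EMA policy update mu_tau = alpha * mu_(tau-1) + (1 - alpha) * mu_tau,
# reading the right-hand mu_tau as the newly fine-tuned weights after the latest
# successful correction (an interpretation, since the text overloads the symbol).
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, new_model: torch.nn.Module,
               alpha: float = 0.99) -> None:
    for ema_p, new_p in zip(ema_model.parameters(), new_model.parameters()):
        ema_p.mul_(alpha).add_(new_p, alpha=1.0 - alpha)

# After each closed-loop correction round:
# ema_update(ema_policy, fine_tuned_policy, alpha=0.99)
```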
Therefore, the invention uses the multimodal large model to simultaneously predict the operation posture of the mechanical arm end effector and to automatically recognize and correct failed operation actions.
The application further provides electronic equipment, which is used for realizing the steps of the mechanical arm grabbing prediction method.
The electronic device of the present embodiment is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the embodiments of the application described and/or claimed herein.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 500 includes, but is not limited to, a network module 502, an audio output unit 503, an input unit 504, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 501, and a power module 511.
The processor 501 may include one or more processing units; for example, the processor 501 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The input unit 504 is used for receiving an audio or video signal. The input unit 504 may include a graphics processing unit (GPU), which processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode, and a microphone.
The user input unit 507 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel and other input devices.
The present application also provides a storage medium storing a program product capable of implementing the robot arm gripping prediction method described in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
A storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A mechanical arm grasping prediction method, characterized in that the method comprises: S101, constructing a mechanical arm grasping gesture prediction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose; S102, constructing a multimodal large model with LLaMA-Adapter; S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content; S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text information and image information, based on a continuous-thinking fine-tuned reasoning grasping strategy; S105, acquiring the operation posture data of the mechanical arm end effector in real time, and verifying the multimodal large model based on an exponential moving average method.
2. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S101, the SAPIEN dataset and the PartNet-Mobility dataset are used to build the mechanical arm grasping posture correction platform.
3. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S101, the mechanical arm grasping posture correction platform is configured with the efficient VulkanRenderer.
4. The mechanical arm grasping prediction method according to claim 2, characterized in that, in the method, the dataset loader provided with the SAPIEN dataset is used to load the object models in the dataset and the URDF files of the Franka Panda robot and its end effector.
5. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S103, the CLIP model is configured with a text encoder and an image encoder; the text encoder is selected as the training network, the image encoder uses a deep convolutional network as the training network, and the input of the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the image dimensions.
6. The mechanical arm grasping prediction method according to claim 5, characterized in that, in step S103, the CLIP model pre-trains on the text information and the image information to obtain the association between image content and natural language descriptions; the pre-training comprises the following steps: S1031, processing the text information and the image information into feature vectors; S1032, constructing a relationship matrix, wherein each element in the relationship matrix is the cosine similarity between an image feature vector and a text feature vector; S1033, adopting a contrastive learning loss function in the pre-training, where τ is the set hyperparameter, q is the encoded feature, k is the encoded sample, and k+ is the matching (positive) sample.
7. The mechanical arm grasping prediction method according to claim 5, characterized in that, in the method, the grasping motion type is divided into rotation and translation, and the movement direction of the operation posture of the mechanical arm end effector is obtained based on an affordance map model; the affordance map A ∈ R^(H×W) is obtained from D ∈ R^(H×W), the Euclidean distance of the mechanical arm's position before and after the movement, and A is the movability probability of each pixel.
8. The mechanical arm grasping prediction method according to claim 5, characterized in that, in step S105, the strategy formula of the exponential moving average method is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ, where τ is the time step, μ denotes the multimodal large model, and α = 0.99.
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the mechanical arm grasping prediction method according to any one of claims 1 to 8 are implemented.
10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the mechanical arm grasping prediction method according to any one of claims 1 to 8 are implemented.
CN202411168309.1A 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium Pending CN119168835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411168309.1A CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411168309.1A CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119168835A true CN119168835A (en) 2024-12-20

Family

ID=93880853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411168309.1A Pending CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN119168835A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120580475A (en) * 2025-05-20 2025-09-02 北京智源人工智能研究院 Robot grasping posture prediction method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117773920A (en) * 2023-12-21 2024-03-29 浙江大学 A natural language driven robotic grasping method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117773920A (en) * 2023-12-21 2024-03-29 浙江大学 A natural language driven robotic grasping method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AI大道理: "CLIP: 万物分类 (视觉语言大模型)" [CLIP: classifying everything (vision-language large model)], Retrieved from the Internet <URL:https://blog.csdn.net/qq_42734492/article/details/134387789> *
JIAMING LIU ET AL.: "Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation", ARXIV, 27 May 2024 (2024-05-27), pages 1 - 18 *
XIAOQI LI ET AL.: "ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation", ARXIV, 24 December 2023 (2023-12-24) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120580475A (en) * 2025-05-20 2025-09-02 北京智源人工智能研究院 Robot grasping posture prediction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US20230311335A1 (en) Natural language control of a robot
US11325252B2 (en) Action prediction networks for robotic grasping
Simeonov et al. A long horizon planning framework for manipulating rigid pointcloud objects
Wang et al. Hierarchical policies for cluttered-scene grasping with latent plans
CN112512755B (en) Robotic manipulation using domain-invariant 3D representations predicted from 2.5D visual data
Wu et al. Pixel-attentive policy gradient for multi-fingered grasping in cluttered scenes
JP2023525676A (en) Training and/or utilizing machine learning models for use in natural language based robot control
CN115812180A (en) Robot-controlled offline learning using reward prediction model
Zhang et al. Modular deep q networks for sim-to-real transfer of visuo-motor policies
US20220402125A1 (en) System and method for determining a grasping hand model
Aslan et al. New CNN and hybrid CNN-LSTM models for learning object manipulation of humanoid robots from demonstration
Gao et al. An improved SAC-based deep reinforcement learning framework for collaborative pushing and grasping in underwater environments
CN118013838B (en) A Smart Flexible Assembly Method for 3C Products
CN119526422A (en) A method for interactive operation control of deformable objects based on a visual-touch-language-action multimodal model
CN114519813A (en) Mechanical arm target grabbing method and system
CN117772648B (en) Part sorting processing method, device, equipment and medium based on body intelligence
CN119168835A (en) A mechanical arm grasping prediction method, electronic device and storage medium
CN118664590A (en) Mechanical arm pushing and grabbing cooperative operation system based on language interaction and control method thereof
Park et al. Sim-to-real visual grasping via state representation learning based on combining pixel-level and feature-level domain adaptation
Peng et al. A pushing-grasping collaborative method based on deep Q-network algorithm in dual viewpoints
Tsai et al. Visually guided picking control of an omnidirectional mobile manipulator based on end-to-end multi-task imitation learning
CN112045680B (en) A cloth palletizing robot control system and control method based on behavior clone
US20240412063A1 (en) Demonstration-driven reinforcement learning
Luan et al. Dynamic hand gesture recognition for robot ARM teaching based on improved LRCN model
EP4643272A1 (en) Open-vocabulary robotic control using multi-modal language models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination