
CN119168835A - A mechanical arm grasping prediction method, electronic device and storage medium - Google Patents

A mechanical arm grasping prediction method, electronic device and storage medium

Info

Publication number
CN119168835A
CN119168835A (application CN202411168309.1A)
Authority
CN
China
Prior art keywords
model
robot
grasping
mechanical arm
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411168309.1A
Other languages
Chinese (zh)
Inventor
王怀震
程瑶
蒋风洋
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202411168309.1A priority Critical patent/CN119168835A/en
Publication of CN119168835A publication Critical patent/CN119168835A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/0014 - Image feed-back for automatic industrial control, e.g. robot with camera
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 - Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 - Sensing devices
    • B25J19/021 - Optical sensing devices
    • B25J19/023 - Optical sensing devices including video camera means
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1694 - Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 - Vision controlled systems
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/151 - Transformation
    • G06F40/16 - Automatic learning of transformation rules, e.g. from examples
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data, the classifiers operating on different input data, e.g. multi-modal recognition
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a mechanical arm grasping prediction method, an electronic device and a storage medium, belonging to the technical field of mechanical arm vision. A mechanical arm grasping gesture correction platform is constructed to obtain RGB images of the end-effector pose; a multimodal large model is built with LLaMA-Adapter; visual features of the RGB images are extracted by the CLIP model, and a pre-trained LLaMA tokenizer parses the visual-feature prompts to obtain text information and image information describing the RGB image content; based on a continuous-thinking fine-tuned reasoning grasping strategy, the operation posture of the mechanical arm end effector corresponding to the processed text and image information is predicted; and operation posture data of the end effector are acquired in real time to verify the multimodal large model with an exponential moving average method. The method provides continuous strategy learning, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet practical use requirements.

Description

Mechanical arm grabbing prediction method, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of mechanical arm vision, and particularly relates to a mechanical arm grasping prediction method based on a self-correcting multimodal large model, an electronic device and a storage medium.
Background
In the prior art, mechanical arm vision grasping technology is widely used in many fields. In practical applications, however, the environment of the mechanical arm is often complex and changeable, with interference factors such as illumination changes, occlusion and reflections; these factors degrade image quality and therefore reduce the accuracy and reliability of visual feedback.
Rapid movement of the mechanical arm can also blur the image, which seriously affects the accuracy of visual feedback, especially during high-speed grasping, and increases the risk of grasp failure. Current mechanical arm vision grasping systems tend to be optimized for a particular task or object and lack sufficient flexibility and adaptability. As a result, the manipulation strategy cannot meet the performance requirements of the motion, which limits the use of the mechanical arm.
Disclosure of Invention
The invention provides a mechanical arm grasping prediction method that introduces continuous strategy learning, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet practical use requirements.
The method comprises the following steps:
S101, constructing a mechanical arm grasping gesture prediction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose;
S102, constructing a multimodal large model with LLaMA-Adapter;
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content;
S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text and image information, based on a continuous-thinking fine-tuned reasoning grasping strategy;
S105, acquiring the operation posture data of the mechanical arm end effector in real time and verifying the multimodal large model with an exponential moving average method.
It should be further noted that, in step S101, the SAPIEN dataset and the PartNet-Mobility dataset are used to build the mechanical arm grasping posture correction platform.
It should be further noted that, in step S101, the mechanical arm grasping posture correction platform is configured with the efficient VulkanRenderer.
It should be further noted that, in the method, the dataset loader provided with the SAPIEN dataset is used to load the object models in the dataset and the URDF files of the Franka Panda robot and its end effector.
It should be further noted that, in step S103, the CLIP model is configured with a text encoder and an image encoder;
the text encoder uses a Transformer as its training network, the image encoder uses a deep convolutional network as its training network, and the input of the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the image dimensions.
In step S103, the CLIP model performs pre-training on the text information and the image information to obtain an association relationship between the image content and the natural language description;
the pre-training comprises the following steps:
s1031, processing the text information and the image information into feature vectors;
S1032, constructing a relation matrix, wherein each element in the relation matrix is cosine similarity of each image feature vector and each text feature vector;
S1033, the loss function adopted in the contrastive learning method of the pre-training is:
L = -log( exp(q·k+/τ) / Σ_i exp(q·k_i/τ) )
where τ is the set temperature hyperparameter, q is the encoded query feature, k_i are the encoded samples, and k+ is the matching (positive) sample.
In the method, the grasping motion type is divided into rotation and translation, and the movement direction of the operation posture of the mechanical arm end effector is obtained from an affordance map model;
the affordance map A ∈ R^(H×W) is obtained from the displacement map D ∈ R^(H×W),
where D ∈ R^(H×W) stores the Euclidean distance between the 3D position of each pixel before and after the movement of the mechanical arm, and A gives the movability probability of each pixel.
It should be further noted that, in step S105, the strategy formula of the exponential moving average method is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ,
where τ is the time step, μ denotes the multimodal large model, and α = 0.99.
According to another embodiment of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the robotic arm grasp prediction method when executing the program.
According to yet another embodiment of the present application, there is also provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the robotic arm grasp prediction method.
From the above technical solutions, the invention has the following advantages:
With the mechanical arm grasping prediction method provided by the invention, the grasping scene can be understood more comprehensively through the visual features and text information of the RGB images, so the optimal operation posture of the mechanical arm end effector can be predicted more accurately. Compared with a single modality, multimodal fusion captures nuances in complex environments and improves the accuracy and robustness of the prediction.
The invention uses a pre-trained LLaMA tokenizer and the CLIP model, can process input data of different sources and formats, and enhances the generalization ability of the model, so it can also adapt to different working environments and task demands.
The mechanical arm grasping prediction method acquires operation posture data of the mechanical arm end effector in real time and verifies the multimodal large model with an exponential moving average method, so that online learning and self-optimization of the model can be realized. This ensures that the model maintains high prediction accuracy even after long periods of operation.
The method improves the efficiency and reliability of industrial automated production by automatically predicting and correcting the grasping posture of the mechanical arm. It integrates techniques from multiple fields such as deep learning, computer vision and natural language processing, improves prediction precision, enhances generalization ability, and realizes real-time self-correction.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a robotic arm grasp prediction method;
Fig. 2 is a schematic diagram of an electronic device.
Detailed Description
The following detailed description of the robotic arm grasp prediction method of the present application, for purposes of explanation and not limitation, sets forth specific details, such as particular system configurations, techniques, etc., in order to provide a thorough understanding of embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that references to "one or more" herein mean one, two, or more, and references to "a plurality" herein mean two or more. In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. The term "and/or" merely describes an association relation between the associated objects and means that three kinds of relations may exist; for example, A and/or B may mean that A exists alone, A and B exist together, or B exists alone.
The statements of "one embodiment" or "some embodiments" and the like, described in this disclosure, mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present disclosure. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the present application are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to Fig. 1, a flowchart of the mechanical arm grasping prediction method in an embodiment is shown, where the method includes:
S101, constructing a mechanical arm grasping gesture correction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose.
In some embodiments, the Franka Panda robot, known for its high precision and flexibility, is used as the experimental platform and is suitable for complex mechanical arm pose prediction tasks.
In the embodiment, a high-resolution camera is installed in a working area of the mechanical arm and used for capturing RGB images of the pose of the actuator. At the same time, necessary sensors (e.g., force sensors, position sensors, etc.) are installed to obtain more comprehensive robot arm operational status data. The camera and the sensor are connected with a computer or a data processing center, so that real-time data transmission and processing are ensured.
In some specific embodiments, the Franka Panda robot is selected as the experimental platform, and a high-resolution camera is installed in the working area of the mechanical arm to ensure that RGB images of the end-effector pose can be captured clearly.
The robot arm grabbing gesture correction platform is configured to receive and process data from the cameras and sensors. And a data interface is established, so that the camera and the sensor can transmit data to a data processing center in real time and stably.
In some embodiments, SAPIEN, a physics-based simulation environment oriented toward household-scene robotics, is used. The mechanical arm grasping gesture correction platform sets up the interaction environment with the SAPIEN dataset and the PartNet-Mobility dataset, loads the object models in the dataset and the URDF files of the Franka Panda robot and end effector with the dataset loader provided by SAPIEN, and uses the efficient rasterization-based renderer VulkanRenderer. A contact point on the movable part is selected at random, the opposite direction of its normal vector is used as the effector direction to interact with the target, and if the operation is carried out successfully it is recorded as a successful sample.
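To make this setup concrete, the following is a minimal, illustrative sketch of how such a simulation scene could be assembled with SAPIEN's Python API (SAPIEN 2.x style). The URDF paths, camera parameters and timestep are assumptions for illustration only, and class or method names can differ between SAPIEN versions; this is not the actual platform code of the invention.
```python
# Minimal sketch of the simulation setup described above (SAPIEN 2.x style API).
# Paths, camera parameters and timestep are illustrative assumptions.
import sapien.core as sapien

engine = sapien.Engine()                      # physics engine
renderer = sapien.VulkanRenderer()            # efficient rasterization-based renderer
engine.set_renderer(renderer)

scene = engine.create_scene()
scene.set_timestep(1 / 240.0)
scene.add_ground(altitude=0.0)

# Load the Franka Panda arm and a PartNet-Mobility object from their URDF files.
loader = scene.create_urdf_loader()
loader.fix_root_link = True
robot = loader.load("franka_panda/panda.urdf")              # assumed local path
target = loader.load("partnet_mobility/179/mobility.urdf")  # assumed local path

# Add an RGB camera that observes the end-effector pose
# (camera API names vary across SAPIEN versions).
camera = scene.add_camera(name="wrist_cam", width=448, height=448,
                          fovy=1.57, near=0.01, far=10.0)
scene.step()
scene.update_render()
camera.take_picture()
rgb = camera.get_color_rgba()                 # H x W x 4 float image of the scene
```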
Optionally, about 10,000 successful operation samples covering 20 object categories are recorded by offline sampling.
S102, constructing a multimodal large model with LLaMA-Adapter.
In some embodiments, the mechanical arm grasping gesture correction platform uses LLaMA-Adapter as the base model, which supports efficient fine-tuning and can process multimodal input consisting of text information and image information.
In this embodiment, a pre-trained LLaMA model is loaded and most of its parameters are frozen; only the adaptation layers in LLaMA-Adapter are fine-tuned, which reduces computation, avoids overfitting, and improves self-correction efficiency.
The model in this embodiment takes RGB images and text instructions as inputs. The visual encoder of the CLIP model (or a similar model) is used to extract visual features of the image, and the pre-trained LLaMA tokenizer is used to process the text instructions.
For the multimodal large model constructed with LLaMA-Adapter, LLaMA-Adapter is trained with a small amount of self-instruct data to adapt it to the mechanical arm pose prediction task.
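As a rough illustration of the freeze-most, tune-adapters-only idea described above, the following PyTorch sketch freezes every parameter whose name does not look like an adapter parameter. The keyword match on "adapter" and the optimizer settings are assumptions, not the actual LLaMA-Adapter implementation.
```python
# Schematic PyTorch sketch of adapter-only fine-tuning: freeze the pre-trained
# weights and update only the (assumed) adapter / adaption-prompt modules.
import torch

def freeze_all_but_adapters(model: torch.nn.Module, adapter_keyword: str = "adapter"):
    """Freeze every parameter whose name does not contain `adapter_keyword`."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Usage (model construction omitted; `model` stands for the multimodal LLaMA-Adapter):
# trainable = freeze_all_but_adapters(model)
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```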
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with the pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content.
In some embodiments, the RGB image captured by the camera is passed as input to the visual encoder of the CLIP model. The CLIP model encodes the image and extracts high-level visual feature vectors, which are passed to the multimodal large model and combined with the text prompts processed by the pre-trained LLaMA tokenizer.
In this way, within the multimodal large model, the visual features and text prompts are fused by a specific fusion mechanism to form a unified input representation.
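A minimal sketch of this step, using the Hugging Face transformers CLIP interface for feature extraction and a LLaMA tokenizer for the prompt, is shown below. The checkpoint names, file paths and the final comment about fusion are illustrative assumptions rather than the exact pipeline of the embodiment.
```python
# Sketch of step S103: extract CLIP visual features from the RGB frame and
# tokenize the text prompt. Model identifiers and paths are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, AutoTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint

image = Image.open("end_effector_view.png")   # RGB frame from the workspace camera
prompt = "Predict the contact point and gripper direction for grasping the drawer handle."

with torch.no_grad():
    pixel = processor(images=image, return_tensors="pt")
    visual_feat = clip.get_image_features(**pixel)        # [1, d_clip] visual feature

text_ids = tokenizer(prompt, return_tensors="pt").input_ids  # token ids for the LLM

# In the multimodal model, visual_feat would be projected into the LLaMA embedding
# space and fused with the embeddings of text_ids to form the unified input.
```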
S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text and image information, based on the continuous-thinking fine-tuned reasoning grasping strategy.
In some embodiments, in LLaMA-Adapter the visual features extracted by CLIP are fused with the text prompts and processed through the adaptive prompt layers inside the model.
The model in this embodiment predicts the optimal operation posture of the mechanical arm end effector from the fused input information. Optionally, this may include the position of the contact point and the direction and force of the gripper.
This embodiment also fine-tunes the model with continuous thinking (e.g., a fast-slow system that mimics the human thinking process) to optimize the prediction results. For example, in the event of a prediction error, the model can reflect on its output and generate a new prediction.
In this embodiment, the multimodal large model predicts the optimal operating pose of the mechanical arm end effector based on the fused input representation. The prediction result includes key parameters such as the contact-point position and the direction and force of the gripper.
This embodiment fine-tunes the prediction strategy by introducing a continuous-thinking mechanism (e.g., a fast-slow system model). When the predicted result deviates from the actual execution result, the model reflects on its output and generates a new prediction strategy, and continuous learning and optimization improve the model's adaptability to specific scenes and tasks.
S105, acquiring the operation posture data of the mechanical arm end effector in real time and verifying the multimodal large model with an exponential moving average method.
In some embodiments, the operating pose and operational data of the robotic end effector are acquired in real-time by sensors. And comparing the actual operation data with the model prediction result, and smoothly evaluating the prediction performance of the model by using an Exponential Moving Average (EMA). EMA can help the model better adapt to data fluctuations in the short term while preserving long term trends.
According to the embodiment, the multi-mode large model can be adjusted as necessary according to the verification result. If the prediction error is large, some of the parameters of the model may be retrained or fine-tuned. Meanwhile, continuous strategy learning is performed by using a successfully corrected sample, so that the adaptability of the model to specific scene configuration is improved.
In some specific embodiments, the operating pose and operational data of the mechanical arm end effector are acquired in real time via the sensors, and the collected data are preprocessed and cleaned to ensure their accuracy and integrity.
The prediction performance of the multimodal large model is then evaluated smoothly with an exponential moving average (EMA): the actual operation data are compared with the model prediction results, and metrics such as prediction error and accuracy are calculated.
This embodiment also performs feedback adjustment and continuous optimization. The multimodal large model can be adjusted and optimized as necessary according to the verification result, continuous strategy learning is carried out with successfully corrected samples to improve the adaptability of the model to the specific scene configuration, and the model is updated and iterated regularly to meet the use requirements of the mechanical arm.
Through the steps above, a mechanical arm posture self-correction platform based on the multimodal large model can be constructed; the platform predicts and corrects the operation posture of the mechanical arm in real time and improves the accuracy and stability of mechanical arm operation.
In one embodiment of the present invention, a non-limiting example of a specific implementation of the mechanical arm grasping prediction method is given below.
In this embodiment, the multimodal pre-training process uses LLaMA-Adapter to construct a multimodal large language model (MLLM), extracts visual features of RGB images through the CLIP model, and encodes text prompts with the pre-trained LLaMA tokenizer.
The CLIP model of this embodiment includes a text encoder that uses a Text Transformer as its training network and an image encoder that uses a deep convolutional network as its training network, where the input of the image encoder has the form [n, h, w, c], n being the batch size and h, w, c the image dimensions,
for example 224 × 224 × 3. The input of the text encoder has the form [n, l], where the batch size n is the same as that of the image encoder because the inputs are image-text pairs, and l is the sequence length. The similarity between the text vectors and the image vectors is then computed to predict whether they form a pair, and CLIP is pre-trained with a contrastive learning method: the CLIP model is pre-trained on a large number of picture-text pairs to obtain the association between image content and natural language descriptions.
The pre-training stage of this embodiment specifically comprises the following steps:
① The input text and the input image are processed into feature vectors by their respective encoders;
② A relationship matrix is constructed. Each element in the relationship matrix is the cosine similarity between an image feature vector and a text feature vector. The elements on the main diagonal are matched pairs (the image and text features correspond fully), and the elements elsewhere are unmatched.
③ The loss function adopted in the contrastive learning method of the pre-training is:
L = -log( exp(q·k+/τ) / Σ_i exp(q·k_i/τ) )
where τ is the set temperature hyperparameter, q is the encoded query feature, k_i are the encoded samples, and k+ is the matching (positive) sample.
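The following sketch shows, under the assumptions of in-batch negatives and a temperature of 0.07, how the relationship matrix of step S1032 and an InfoNCE-style loss of step S1033 can be computed in PyTorch; it illustrates the general technique rather than the embodiment's actual training code.
```python
# Sketch of S1031-S1033: build the cosine-similarity relationship matrix and
# apply an InfoNCE-style contrastive loss. The temperature value is an assumption.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals the cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Relationship matrix: entry (i, j) is the similarity of image i and text j.
    logits = image_feats @ text_feats.t() / tau

    # Matching image-text pairs sit on the main diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)     # -log exp(q.k+/tau) / sum exp(q.k_i/tau)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 image/text feature pairs with 512-dimensional embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```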
In some specific embodiments, a continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) is configured to achieve robust and interpretable end-effector pose prediction using the reasoning capability of the MLLM.
During the pre-collection of training data, this embodiment captures RGB images and their corresponding successful manipulation end-effector poses in the simulator. During reasoning, the focus is on predicting the 2D coordinates x, y of the contact pixel on the image, which are then converted to 3D coordinates using depth information. At the same time, the gripper's Z-axis direction is obtained from its upward and forward directions according to the geometric relationship.
The MLLM of this embodiment aligns the visual features with the embedding space of the large language model (LLM) through a projection layer, enabling LLaMA to perform multimodal understanding and generate corresponding answers. During training, only the injected adapters in the LLM are fine-tuned while the main pre-trained parameters are kept unchanged, so as to preserve the strong capabilities of the existing MLLM and enhance the model's ability in manipulation and failure correction.
The reasoning capability of the MLLM can thus be used to configure the continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) for robust and interpretable end-effector pose prediction. The grasping strategy based on continuous-thinking fine-tuned reasoning includes operation-category understanding of the mechanical arm end effector, prior force-field reasoning, and end-effector-centric pose prediction.
For the operation-category classification of the mechanical arm end effector, operation targets of different classes can be classified according to their geometric attributes by adopting a deep learning method based on target class identification (OCI).
The present embodiment also uses prior force-field reasoning. In this stage, the grasping motion type is classified into "rotation" and "translation" and the corresponding affordance maps are gathered, so that the model becomes aware of which object regions can be manipulated. The movable part of the object is first found and then moved along its axis, and the affordance map A ∈ R^(H×W) is obtained as follows:
D ∈ R^(H×W) stores the Euclidean distance between the 3D position of each pixel before and after the movement, and A is the movability probability of each pixel.
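A small NumPy sketch of this prior force-field step is given below; it assumes per-pixel 3D point maps before and after the part movement and uses min-max normalization to turn the displacement distances into a movability probability map, which is one plausible reading of the text rather than the exact formula of the embodiment.
```python
# Sketch of the prior force-field / affordance step: move the articulated part,
# measure how far each pixel's 3D point travels, and normalize that distance
# into a movability probability map. Min-max normalization is an assumption.
import numpy as np

def affordance_map(points_before: np.ndarray, points_after: np.ndarray) -> np.ndarray:
    """points_before, points_after: H x W x 3 per-pixel 3D positions."""
    d = np.linalg.norm(points_after - points_before, axis=-1)   # D in R^(H x W)
    d_min, d_max = d.min(), d.max()
    if d_max - d_min < 1e-8:                                    # nothing moved
        return np.zeros_like(d)
    return (d - d_min) / (d_max - d_min)                        # A, values in [0, 1]

# Example with random per-pixel point maps of size 224 x 224:
A = affordance_map(np.random.rand(224, 224, 3), np.random.rand(224, 224, 3))
```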
In one embodiment of the application, for end-effector-centric pose prediction, after the training data are collected, the RGB images and corresponding end-effector poses are recorded as model inputs and result rewards, and the 2D coordinates [x, y] of the end-effector contact point are predicted from the RGB image and the text prompt.
Using the depth frame provided by a depth camera, the depth value is converted into the Z coordinate in space through the camera intrinsics. At the same time, the Z-axis direction of the gripper is obtained from its upward and forward directions according to the geometric relationship, and an accurate pose interpretation of the initial contact is generated by reasoning.
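The back-projection from the predicted contact pixel and its depth value to a 3D point follows the standard pinhole camera model; the sketch below uses placeholder intrinsic parameters (fx, fy, cx, cy) that would in practice come from the depth camera's calibration.
```python
# Sketch of converting the predicted 2D contact pixel [x, y] plus its depth
# value into a 3D point using standard pinhole camera intrinsics.
# fx, fy, cx, cy below are placeholder values, not a real camera calibration.
import numpy as np

def pixel_to_3d(x: float, y: float, depth: float,
                fx: float = 600.0, fy: float = 600.0,
                cx: float = 224.0, cy: float = 224.0) -> np.ndarray:
    X = (x - cx) * depth / fx
    Y = (y - cy) * depth / fy
    Z = depth
    return np.array([X, Y, Z])      # contact point in the camera frame

# Example: predicted pixel (310, 205) with a measured depth of 0.62 m.
contact_cam = pixel_to_3d(310, 205, 0.62)
```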
In this embodiment, since the relative positions of the robot and the object may change during each operation, a continuous strategy learning method is configured so that the mechanical arm grasping gesture prediction platform has manipulation and failure-correction capabilities. This approach aims to enhance pose prediction without expert feedback prompts, and therefore uses an exponential moving average (EMA) to continually learn from new data and from successfully corrected samples. The strategy formula is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ,
where τ is the time step and μ denotes the model of the invention. The update weight may be set to α = 0.99. The effectiveness of the EMA scheme for sequential action learning is evaluated by performing repeated closed-loop correction and sequential policy learning sessions for each scene configuration.
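The sketch below illustrates one plausible reading of this EMA update applied to model weights, where μ_(τ-1) is the running (EMA) model and the right-hand μ_τ is interpreted as the freshly fine-tuned weights obtained after the latest successful correction; this interpretation and the function name are assumptions.
```python
# Sketch of the EMA policy update mu_tau = alpha * mu_(tau-1) + (1 - alpha) * mu_tau,
# reading the right-hand mu_tau as the newly fine-tuned weights after the latest
# successful correction (an interpretation, since the text overloads the symbol).
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, new_model: torch.nn.Module,
               alpha: float = 0.99) -> None:
    for ema_p, new_p in zip(ema_model.parameters(), new_model.parameters()):
        ema_p.mul_(alpha).add_(new_p, alpha=1.0 - alpha)

# After each closed-loop correction round:
# ema_update(ema_policy, fine_tuned_policy, alpha=0.99)
```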
Therefore, the invention uses the multimodal large model to simultaneously predict the operation posture of the mechanical arm end effector and to automatically recognize and correct failed operation actions.
The application further provides electronic equipment, which is used for realizing the steps of the mechanical arm grabbing prediction method.
The electronic device of the present embodiment is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the embodiments of the application described and/or claimed herein.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 500 includes, but is not limited to, a network module 502, an audio output unit 503, an input unit 504, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 501, and a power module 511.
The processor 501 may include one or more processing units; for example, the processor 501 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The input unit 504 is used for receiving an audio or video signal. The input unit 504 may include a graphics processing unit (GPU), which processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode, and a microphone.
The user input unit 507 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel and other input devices.
The present application also provides a storage medium storing a program product capable of implementing the robot arm gripping prediction method described in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
A storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A mechanical arm grasping prediction method, characterized in that the method comprises: S101, constructing a mechanical arm grasping gesture prediction platform, using a Franka Panda robot as the mechanical arm gesture prediction model, and acquiring RGB images of the end-effector pose; S102, constructing a multimodal large model with LLaMA-Adapter; S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content; S104, predicting the operation posture of the mechanical arm end effector corresponding to the processed text information and image information, based on a continuous-thinking fine-tuned reasoning grasping strategy; S105, acquiring the operation posture data of the mechanical arm end effector in real time, and verifying the multimodal large model based on an exponential moving average method.
2. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S101, the SAPIEN dataset and the PartNet-Mobility dataset are used to build the mechanical arm grasping posture correction platform.
3. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S101, the mechanical arm grasping posture correction platform is configured with the efficient VulkanRenderer.
4. The mechanical arm grasping prediction method according to claim 2, characterized in that, in the method, the dataset loader provided with the SAPIEN dataset is used to load the object models in the dataset and the URDF files of the Franka Panda robot and its end effector.
5. The mechanical arm grasping prediction method according to claim 1, characterized in that, in step S103, the CLIP model is configured with a text encoder and an image encoder; the text encoder is selected as the training network, the image encoder uses a deep convolutional network as the training network, and the input of the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the image dimensions.
6. The mechanical arm grasping prediction method according to claim 5, characterized in that, in step S103, the CLIP model pre-trains on the text information and the image information to obtain the association between image content and natural language descriptions; the pre-training comprises the following steps: S1031, processing the text information and the image information into feature vectors; S1032, constructing a relationship matrix, wherein each element in the relationship matrix is the cosine similarity between an image feature vector and a text feature vector; S1033, adopting a contrastive learning loss function in the pre-training, where τ is the set hyperparameter, q is the encoded feature, k is the encoded sample, and k+ is the matching (positive) sample.
7. The mechanical arm grasping prediction method according to claim 5, characterized in that, in the method, the grasping motion type is divided into rotation and translation, and the movement direction of the operation posture of the mechanical arm end effector is obtained based on an affordance map model; the affordance map A ∈ R^(H×W) is obtained from D ∈ R^(H×W), the Euclidean distance of the mechanical arm's position before and after the movement, and A is the movability probability of each pixel.
8. The mechanical arm grasping prediction method according to claim 5, characterized in that, in step S105, the strategy formula of the exponential moving average method is μ_τ = α·μ_(τ-1) + (1 - α)·μ_τ, where τ is the time step, μ denotes the multimodal large model, and α = 0.99.
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the mechanical arm grasping prediction method according to any one of claims 1 to 8 are implemented.
10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the mechanical arm grasping prediction method according to any one of claims 1 to 8 are implemented.
CN202411168309.1A 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium Pending CN119168835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411168309.1A CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411168309.1A CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119168835A true CN119168835A (en) 2024-12-20

Family

ID=93880853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411168309.1A Pending CN119168835A (en) 2024-08-23 2024-08-23 A mechanical arm grasping prediction method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN119168835A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120580475A (en) * 2025-05-20 2025-09-02 北京智源人工智能研究院 Robot grasping posture prediction method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117773920A (en) * 2023-12-21 2024-03-29 浙江大学 A natural language driven robotic grasping method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117773920A (en) * 2023-12-21 2024-03-29 浙江大学 A natural language driven robotic grasping method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AI大道理: "CLIP: 万物分类 (视觉语言大模型)" [CLIP: classifying everything (vision-language large model)], Retrieved from the Internet <URL:https://blog.csdn.net/qq_42734492/article/details/134387789> *
JIAMING LIU ET AL.: "Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation", ARXIV, 27 May 2024 (2024-05-27), pages 1 - 18 *
XIAOQI LI ET AL.: "ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation", ARXIV, 24 December 2023 (2023-12-24) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120580475A (en) * 2025-05-20 2025-09-02 北京智源人工智能研究院 Robot grasping posture prediction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US20230311335A1 (en) Natural language control of a robot
US11325252B2 (en) Action prediction networks for robotic grasping
Simeonov et al. A long horizon planning framework for manipulating rigid pointcloud objects
Wang et al. Hierarchical policies for cluttered-scene grasping with latent plans
CN112512755B (en) Robotic manipulation using domain-invariant 3D representations predicted from 2.5D visual data
Wu et al. Pixel-attentive policy gradient for multi-fingered grasping in cluttered scenes
JP2023525676A (en) Training and/or utilizing machine learning models for use in natural language based robot control
CN115812180A (en) Robot-controlled offline learning using reward prediction model
Zhang et al. Modular deep q networks for sim-to-real transfer of visuo-motor policies
US20220402125A1 (en) System and method for determining a grasping hand model
Aslan et al. New CNN and hybrid CNN-LSTM models for learning object manipulation of humanoid robots from demonstration
Gao et al. An improved SAC-based deep reinforcement learning framework for collaborative pushing and grasping in underwater environments
CN118013838B (en) A Smart Flexible Assembly Method for 3C Products
CN119526422A (en) A method for interactive operation control of deformable objects based on a visual-touch-language-action multimodal model
CN114519813A (en) Mechanical arm target grabbing method and system
CN117772648B (en) Part sorting processing method, device, equipment and medium based on body intelligence
CN119168835A (en) A mechanical arm grasping prediction method, electronic device and storage medium
CN118664590A (en) Mechanical arm pushing and grabbing cooperative operation system based on language interaction and control method thereof
Park et al. Sim-to-real visual grasping via state representation learning based on combining pixel-level and feature-level domain adaptation
Peng et al. A pushing-grasping collaborative method based on deep Q-network algorithm in dual viewpoints
Tsai et al. Visually guided picking control of an omnidirectional mobile manipulator based on end-to-end multi-task imitation learning
CN112045680B (en) A cloth palletizing robot control system and control method based on behavior clone
US20240412063A1 (en) Demonstration-driven reinforcement learning
Luan et al. Dynamic hand gesture recognition for robot ARM teaching based on improved LRCN model
EP4643272A1 (en) Open-vocabulary robotic control using multi-modal language models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination