CN119168835A - A mechanical arm grasping prediction method, electronic device and storage medium - Google Patents
- Publication number
- CN119168835A (application number CN202411168309.1A)
- Authority
- CN
- China
- Prior art keywords
- model
- robot
- grasping
- mechanical arm
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/0014—Image feed-back for automatic industrial control, e.g. robot with camera
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J19/00—Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
- B25J19/02—Sensing devices
- B25J19/021—Optical sensing devices
- B25J19/023—Optical sensing devices including video camera means
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1694—Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
- B25J9/1697—Vision controlled systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/16—Automatic learning of transformation rules, e.g. from examples
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a mechanical arm grasping prediction method, an electronic device and a storage medium, which belong to the technical field of mechanical arm vision. A mechanical arm grasping posture correction platform is constructed to obtain RGB images of the actuator pose; LLaMA-Adapter is adopted to construct a multimodal large model; visual features of the RGB images are extracted through a CLIP model, and the visual-feature prompts are parsed with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content; based on a continuous-thinking fine-tuned reasoning grasping strategy, the operating posture of the mechanical arm end effector corresponding to the processed text information and image information is predicted; operating-posture operation data of the mechanical arm end effector are acquired in real time, and the multimodal large model is verified based on an exponential moving average method. The method provides a continual strategy learning method, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet the use requirements.
Description
Technical Field
The invention belongs to the technical field of mechanical arm vision, and particularly relates to a mechanical arm grasping prediction method based on a self-correcting multimodal large model, an electronic device and a storage medium.
Background
In the prior art, mechanical arm visual grasping technology is widely used in many fields. In practical applications, the environment of the mechanical arm is often complex and changeable, with interference factors such as illumination changes, occlusion and reflection; these factors degrade image quality and thus reduce the accuracy and reliability of visual feedback.
Rapid movement of the mechanical arm may cause image blurring, which seriously affects the accuracy of visual feedback, especially during high-speed grasping, increasing the risk of grasp failure. Current mechanical arm visual grasping systems tend to be optimized for a particular task or object and lack sufficient flexibility and adaptability. As a result, the manipulation strategy fails to meet the performance requirements of the motion, affecting the use of the mechanical arm.
Disclosure of Invention
The invention provides a mechanical arm grasping prediction method, which provides a continual strategy learning method, enhances the adaptability of the model to the current scene configuration, reduces the frequency of expert intervention, and enables the mechanical arm to meet the use requirements.
The method comprises the following steps:
S101, constructing a mechanical arm grasping posture prediction platform, using a Franka Panda robot as the mechanical arm posture prediction model, and acquiring RGB images of the actuator pose;
S102, constructing a multimodal large model using LLaMA-Adapter;
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with a pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content;
S104, predicting, based on a continuous-thinking fine-tuned reasoning grasping strategy, the operating posture of the mechanical arm end effector corresponding to the processed text information and image information;
S105, acquiring operating-posture operation data of the mechanical arm end effector in real time, and verifying the multimodal large model based on an exponential moving average method.
It should be further noted that in step S101, the SAPIEN dataset and the PartNet-Mobility dataset are used to build the mechanical arm grasping posture correction platform.
It should be further noted that, in step S101, the mechanical arm grasping posture correction platform is configured with the high-efficiency VulkanRenderer.
It should be further noted that, in the method, the dataset loader provided by SAPIEN is used to load the object models in the datasets and the URDF files of the Franka Panda robot and the actuator.
It should be further noted that, in step S103, the CLIP model is configured with a text encoder and an image encoder;
the text encoder uses a Text Transformer as its training network, the image encoder uses a deep convolutional network as its training network, and the input to the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the height, width and number of channels of the image.
In step S103, the CLIP model performs pre-training on the text information and the image information to obtain the association between the image content and the natural-language description;
the pre-training comprises the following steps:
S1031, processing the text information and the image information into feature vectors;
S1032, constructing a relation matrix, wherein each element of the relation matrix is the cosine similarity between an image feature vector and a text feature vector;
S1033, the contrastive learning method used for pre-training adopts the following loss function:
L = −log [ exp(q·k₊ / τ) / Σᵢ exp(q·kᵢ / τ) ]
where τ is the set temperature hyperparameter, q is the encoded feature, kᵢ are the encoded samples, and k₊ is the matching (positive) sample.
In the method, the grasping motion type is divided into rotation and translation, and the movement direction of the operating posture of the mechanical arm end effector is obtained based on an affordance map model;
the affordance map A ∈ R^{H×W} is obtained from a per-pixel displacement map D ∈ R^{H×W}: based on D, the Euclidean distance between the positions before and after the movement of the mechanical arm is calculated for each pixel, and A gives the movability probability of each pixel.
It should be further noted that, in step S105, the strategy formula of the exponential moving average method is μ_τ = α·μ_{τ-1} + (1 − α)·μ_τ,
where τ is the time step, μ represents the multimodal large model (the μ_τ on the right-hand side denotes the model after the latest update, and the μ_τ on the left-hand side the exponentially smoothed model carried forward), and α = 0.99.
According to another embodiment of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the robotic arm grasp prediction method when executing the program.
According to yet another embodiment of the present application, there is also provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the robotic arm grasp prediction method.
From the above technical solutions, the invention has the following advantages:
According to the mechanical arm grasping prediction method provided by the invention, grasping scenes can be understood more comprehensively through the visual features and text information of RGB images, so that the optimal operating posture of the mechanical arm end effector can be predicted more accurately. Compared with a single modality, the multimodal fusion approach can capture nuances in complex environments and improves the accuracy and robustness of prediction.
The invention utilizes the pre-trained LLaMA tokenizer and the CLIP model, can process input data of different sources and formats, enhances the generalization capability of the model, and can adapt to different working environments and task requirements.
According to the mechanical arm grasping prediction method, the operating-posture operation data of the mechanical arm end effector are acquired in real time, and the multimodal large model is verified using the exponential moving average method, so that online learning and self-optimization of the model can be realized. This ensures that the model maintains high prediction accuracy even after long-term operation.
The method improves the efficiency and reliability of industrial automated production by automatically predicting and correcting the grasping posture of the mechanical arm. It integrates techniques from multiple fields such as deep learning, computer vision and natural language processing, improves prediction precision, enhances generalization capability, and realizes real-time self-correction.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a robotic arm grasp prediction method;
Fig. 2 is a schematic diagram of an electronic device.
Detailed Description
The following detailed description of the robotic arm grasp prediction method of the present application, for purposes of explanation and not limitation, sets forth specific details, such as particular system configurations, techniques, etc., in order to provide a thorough understanding of embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that references to "one or more" herein mean one, two, or more, and references to "a plurality" herein mean two or more. In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. The term "and/or" merely describes an association relation between associated objects and indicates that three kinds of relations may exist; for example, A and/or B may mean that A exists alone, A and B exist together, or B exists alone.
The statements of "one embodiment" or "some embodiments" and the like, described in this disclosure, mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present disclosure. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the present application are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of a method for predicting a robot gripping operation in an embodiment is shown, where the method includes:
S101, constructing a mechanical arm grasping posture correction platform, using a Franka Panda robot as the mechanical arm posture prediction model, and acquiring RGB images of the actuator pose.
In some embodiments, the Franka Panda robot, known for its high precision and flexibility, is used as the experimental platform and is suitable for complex mechanical arm pose prediction tasks.
In the embodiment, a high-resolution camera is installed in a working area of the mechanical arm and used for capturing RGB images of the pose of the actuator. At the same time, necessary sensors (e.g., force sensors, position sensors, etc.) are installed to obtain more comprehensive robot arm operational status data. The camera and the sensor are connected with a computer or a data processing center, so that real-time data transmission and processing are ensured.
In some specific embodiments, the Franka Panda robot is selected as the experimental platform, and a high-resolution camera is installed in the working area of the mechanical arm to ensure that RGB images of the actuator pose can be clearly captured.
The mechanical arm grasping posture correction platform is configured to receive and process data from the camera and the sensors. A data interface is established so that the camera and the sensors can transmit data to the data processing center in real time and stably.
In some embodiments, SAPIEN is a physics-based simulation environment oriented toward household scenes and robots. The mechanical arm grasping posture correction platform may set up the interaction environment using the SAPIEN dataset and the PartNet-Mobility dataset, load the object models in the datasets and the URDF files of the Franka Panda robot and its end effector with the dataset loader provided by SAPIEN, and render the scene with the efficient rasterization-based VulkanRenderer. A contact point is randomly selected on the movable part, the opposite direction of its normal vector is used as the effector direction to interact with the target, and when the operation is successfully executed it is recorded as a successful sample.
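As a concrete illustration, a minimal sketch of setting up such a simulation platform is shown below, assuming the SAPIEN 2.x Python API; the file paths, timestep and renderer options are illustrative assumptions rather than the exact configuration of the platform.

```python
import sapien.core as sapien

# Create the physics engine and attach the high-efficiency Vulkan renderer.
engine = sapien.Engine()
renderer = sapien.VulkanRenderer()
engine.set_renderer(renderer)

scene = engine.create_scene()
scene.set_timestep(1 / 240.0)

# Load the Franka Panda robot (with its end effector) and a PartNet-Mobility object from URDF.
loader = scene.create_urdf_loader()
loader.fix_root_link = True
robot = loader.load("franka_panda/panda.urdf")                  # assumed local path
cabinet = loader.load("partnet_mobility/45092/mobility.urdf")   # assumed object URDF
```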
Optionally, about 10,000 successful operation samples are recorded by offline sampling, covering 20 object categories.
S102, constructing a multimodal large model using LLaMA-Adapter.
In some embodiments, the mechanical arm grasping posture correction platform uses LLaMA-Adapter as the base model, which supports efficient fine-tuning and can process multimodal text and image inputs.
In this embodiment, a pre-trained LLaMA model is loaded, most of its parameters are frozen, and only the adaptation layers in LLaMA-Adapter are fine-tuned, which reduces computation, avoids overfitting and improves self-correction efficiency.
The model in this embodiment takes RGB images and text instructions as inputs. The visual encoder of the CLIP model (or a similar model) extracts the visual features of the image, and the pre-trained LLaMA tokenizer processes the text instructions.
For the multimodal large model constructed with LLaMA-Adapter, LLaMA-Adapter is trained with a small amount of self-instruct data to adapt it to the robot pose prediction task.
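A minimal sketch of this adapter-only fine-tuning idea is given below; the toy module and the convention of selecting trainable parameters by the name "adapter" are illustrative assumptions, not the actual LLaMA-Adapter implementation.

```python
import torch
import torch.nn as nn

class ToyLlamaWithAdapter(nn.Module):
    """Stand-in for a pre-trained LLaMA block plus a lightweight adaptation layer."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        self.adapter = nn.Linear(hidden, hidden)   # the only part that will be fine-tuned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x) + self.adapter(x)

model = ToyLlamaWithAdapter()

# Freeze everything except the adapter parameters to reduce computation and avoid overfitting.
for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)   # only adapter weights are optimized
```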
S103, extracting visual features of the RGB image through the CLIP model of the multimodal large model, and parsing the visual-feature prompts with the pre-trained LLaMA tokenizer to obtain text information and image information of the RGB image content.
In some embodiments, the RGB image captured by the camera is passed as input to the visual encoder of the CLIP model. The CLIP model encodes the image and extracts high-level visual feature vectors. The visual feature vectors are passed to the multimodal large model and combined with the text prompts processed by the pre-trained LLaMA tokenizer.
In this way, within the multimodal large model, the visual features and text prompts are fused by a specific fusion mechanism to form a unified input representation.
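A minimal sketch of this feature-extraction step is shown below, assuming the Hugging Face transformers implementations of a CLIP vision encoder and a LLaMA tokenizer; the checkpoint names, image file and prompt are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor, LlamaTokenizer

# Assumed checkpoints; the patent does not specify which CLIP or LLaMA weights are used.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")

image = Image.open("end_effector_view.png")   # RGB image of the actuator pose (assumed file)
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    visual_features = vision_encoder(pixel_values).last_hidden_state   # [1, num_patches, dim]

prompt = "Predict the contact point and gripper direction for grasping the handle."
text_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Downstream, visual_features are projected into the LLM embedding space and fused with
# the embedded text tokens to form the unified input representation described above.
```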
S104, predicting, based on the continuous-thinking fine-tuned reasoning grasping strategy, the operating posture of the mechanical arm end effector corresponding to the processed text information and image information.
In some embodiments, in LLaMA-Adapter, the visual features extracted by CLIP are fused with the text prompts and processed through the adaptive prompt layers inside the model.
The model of this embodiment can predict the optimal operating posture of the mechanical arm end effector from the fused input information. Optionally, this may include the contact point position, the gripper direction and force, and so on.
The present embodiment also uses continuous thinking (for example, a fast-slow system that mimics the human thinking process) to fine-tune the model and optimize the prediction results. For example, when a prediction error occurs, the model can reflect on its own output and generate a new prediction.
In this embodiment, the multimodal large model predicts the optimal operating pose of the mechanical arm end effector based on the fused input representation. The prediction result comprises key parameters such as the contact point position and the gripper direction and force.
The present embodiment fine-tunes the prediction strategy by introducing a continuous-thinking mechanism (for example, a fast-slow system model). When the predicted result deviates from the actual execution result, the model reflects on itself and generates a new prediction strategy, and through continual learning and optimization the adaptability of the model to specific scenes and tasks is improved.
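For illustration, a hypothetical container for such a prediction result is sketched below; the field names, units and example values are assumptions rather than the patent's actual data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspPrediction:
    contact_xy: tuple          # predicted 2D contact pixel (x, y) on the RGB image
    contact_xyz: np.ndarray    # 3D contact point after depth back-projection
    gripper_z_axis: np.ndarray # approach (Z-axis) direction of the gripper
    gripper_force: float       # commanded grip force, e.g. in newtons

pred = GraspPrediction(
    contact_xy=(312, 207),
    contact_xyz=np.array([0.42, -0.05, 0.31]),
    gripper_z_axis=np.array([0.0, 0.0, -1.0]),
    gripper_force=12.0,
)
```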
S105, acquiring operating-posture operation data of the mechanical arm end effector in real time, and verifying the multimodal large model based on an exponential moving average method.
In some embodiments, the operating pose and operation data of the robotic end effector are acquired in real time by sensors. The actual operation data are compared with the model prediction results, and the prediction performance of the model is smoothly evaluated using an Exponential Moving Average (EMA). The EMA helps the model adapt to short-term data fluctuations while preserving long-term trends.
According to this embodiment, the multimodal large model can be adjusted as necessary according to the verification result. If the prediction error is large, some of the parameters of the model may be retrained or fine-tuned. Meanwhile, continual strategy learning is performed with the successfully corrected samples, improving the adaptability of the model to the specific scene configuration.
In some specific embodiments, the operating pose and operation data of the robotic end effector may be acquired in real time via sensors. The collected data are preprocessed and cleaned to ensure their accuracy and integrity.
Thereafter, the prediction performance of the multimodal large model is smoothly evaluated using the Exponential Moving Average (EMA). The actual operation data are compared and analyzed against the model prediction results, and indicators such as prediction error and accuracy are calculated.
The present embodiment may also perform feedback adjustment and continual optimization. The multimodal large model can be adjusted and optimized as necessary according to the verification result, and continual strategy learning is carried out with the successfully corrected samples to improve the adaptability of the model to the specific scene configuration. The model is updated and iterated regularly to meet the use requirements of the mechanical arm.
Through the above steps, a mechanical arm posture self-correction platform based on the multimodal large model can be constructed; the platform can predict and correct the operating posture of the mechanical arm in real time and improves the accuracy and stability of mechanical arm operation.
Based on the mechanical arm grasping prediction method, one embodiment of the present invention is illustrated below with a possible, non-limiting example.
In this embodiment, the multimodal pre-training process uses LLaMA-Adapter to construct a multimodal large language model (MLLM), extracts the visual features of RGB images through the CLIP model, and encodes the text prompts using a pre-trained LLaMA tokenizer.
The CLIP model of this embodiment includes a text encoder that uses a Text Transformer as its training network and an image encoder that uses a deep convolutional network as its training network. The input to the image encoder has the form [n, h, w, c], where n is the batch size and h, w, c are the size of the image, for example 224 × 224 × 3.
The input to the text encoder has the form [n, l], where the batch size is the same as that of the image encoder because the inputs are image-text pairs, and l is the sequence length. The similarity of each text vector and image vector is then computed to predict whether they form a pair, and CLIP is pre-trained with a contrastive learning method; the CLIP model is pre-trained on a large number of image-text pairs to obtain the association between image content and natural-language descriptions.
The pre-training stage of this embodiment specifically comprises the following steps:
① The input text and the input image are each processed into feature vectors by their respective encoders;
② A relation matrix is constructed. Each element of the relation matrix is the cosine similarity of an image feature vector with a text feature vector. The elements on the main diagonal of the matrix are matched pairs (the image and text features fully correspond), and the elements elsewhere are unmatched.
③ The contrastive learning method used for pre-training adopts the following loss function:
L = −log [ exp(q·k₊ / τ) / Σᵢ exp(q·kᵢ / τ) ]
where τ is the set temperature hyperparameter, q is the encoded feature, kᵢ are the encoded samples, and k₊ is the matching (positive) sample.
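A minimal sketch of this contrastive pre-training loss is shown below, assuming the image and text features have already been produced by the two encoders; the symmetric image-to-text / text-to-image formulation and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    # Normalize so that dot products become cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Relation matrix: cosine similarity of every image feature with every text feature.
    logits = image_feats @ text_feats.t() / tau            # [n, n]

    # Matching image-text pairs lie on the main diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random features for a batch of 8 image-text pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```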
In some specific embodiments, a continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) is configured to achieve robust and interpretable actuator pose prediction using the reasoning capability of the MLLM.
During the pre-collection of training data, the present embodiment captures RGB images and the corresponding successfully manipulated end-effector poses in the simulator. During reasoning, the focus is placed on the 2D coordinates x, y of the contact pixel in the predicted image, which are then converted to 3D coordinates using depth information. At the same time, the Z-axis (approach) direction of the gripper is obtained from its upward and forward directions according to the geometric relationship.
The MLLM of this embodiment aligns the visual features with the embedding space of the Large Language Model (LLM) through a projection layer, enabling LLaMA to perform multimodal understanding and generate corresponding answers. During training, only the injected adapters in the LLM are fine-tuned while the main pre-trained parameters are kept unchanged, preserving the strong capabilities of the existing MLLM while enhancing the model's abilities in manipulation and failure correction.
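A minimal sketch of this projection-and-fusion step is shown below; the feature dimensions and the simple concatenation scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

clip_dim, llm_dim = 1024, 4096                     # assumed CLIP and LLaMA hidden sizes
projector = nn.Linear(clip_dim, llm_dim)           # trainable projection into the LLM embedding space

visual_features = torch.randn(1, 257, clip_dim)    # [batch, patches, clip_dim] from the CLIP encoder
text_embeddings = torch.randn(1, 32, llm_dim)      # [batch, tokens, llm_dim] from the LLM embedding layer

visual_tokens = projector(visual_features)                          # aligned visual tokens
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)     # fused multimodal input sequence
```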
The reasoning capability of the MLLM may be utilized herein to configure the continuous-thinking fine-tuned reasoning grasping strategy (ManipLLM) to achieve robust and interpretable actuator pose prediction. The grasping strategy based on continuous-thinking fine-tuned reasoning can comprise operation-category understanding of the mechanical arm end effector, prior force-field reasoning, and pose prediction centered on the mechanical arm end effector.
For operation classification of the mechanical arm end effector, the operation targets of different classes can be classified according to geometric attributes by adopting a deep learning method based on target class identification (OCI).
The present embodiment also uses prior force-field reasoning. The stage of classifying the grasping motion type into "rotation" and "translation" and collecting the corresponding affordance maps is intended to make the model aware of which object regions can be manipulated. The movable part of the object is first found, and that part is moved along its axis. The affordance map A ∈ R^{H×W} is obtained as follows:
D ∈ R^{H×W} stores the Euclidean distance between the 3D position (corresponding to each pixel) before and after the movement, and A is the movability probability of each pixel.
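A minimal sketch of this affordance-map computation is given below, assuming the movability probability A is obtained by normalizing the per-pixel displacement D into [0, 1]; the normalization choice and the synthetic inputs are illustrative assumptions.

```python
import numpy as np

def affordance_map(xyz_before: np.ndarray, xyz_after: np.ndarray) -> np.ndarray:
    """xyz_before, xyz_after: [H, W, 3] per-pixel 3D positions before/after moving the part."""
    D = np.linalg.norm(xyz_after - xyz_before, axis=-1)   # [H, W] Euclidean displacement per pixel
    A = D / (D.max() + 1e-8)                               # movability probability per pixel
    return A

# Example with synthetic position maps for a 480 x 640 image.
before = np.random.rand(480, 640, 3)
after = before + 0.01 * np.random.rand(480, 640, 3)
A = affordance_map(before, after)
```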
In one embodiment of the application, for object-centric pose prediction, after the training data are collected, the RGB images and the corresponding actuator poses are recorded as model inputs and result rewards, and the 2D coordinates [x, y] of the end effector's contact point are predicted from the RGB image and the text prompt.
Using the depth frame provided by a depth camera, the depth value is converted to a 3D coordinate in space through the camera intrinsics. Meanwhile, the Z-axis direction of the gripper is obtained from its upward and forward directions according to the geometric relation, and an accurate pose interpretation of the initial contact actuator is generated by reasoning.
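A minimal sketch of back-projecting the predicted contact pixel to a 3D point with the camera intrinsics is shown below; the intrinsic values and example pixel are placeholders, not calibration data from the patent.

```python
import numpy as np

def pixel_to_3d(x: float, y: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project pixel (x, y) with depth (in metres) through a pinhole camera model."""
    X = (x - cx) * depth / fx
    Y = (y - cy) * depth / fy
    return np.array([X, Y, depth])

# Example with placeholder intrinsics for a 640 x 480 depth camera.
contact_3d = pixel_to_3d(x=312, y=207, depth=0.54, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```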
In this embodiment, since the relative positions of the robot and the object may change during each operation, a continual strategy learning method is configured so that the mechanical arm grasping posture prediction platform has manipulation and failure-correction capabilities. This approach aims to enhance the posture prediction capability without expert feedback cues, and therefore explores the use of an Exponential Moving Average (EMA) to continually learn from new data, i.e., from the successfully corrected samples. The strategy formula is μ_τ = α·μ_{τ-1} + (1 − α)·μ_τ,
where τ is the time step and μ represents the model of the invention (the μ_τ on the right-hand side denotes the model after the latest update from new data, and the μ_τ on the left-hand side the exponentially smoothed model); the update weight may be set to α = 0.99. The effectiveness of the EMA scheme in continual action learning is evaluated by performing repeated closed-loop correction and continual policy learning sessions for each scene configuration.
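A minimal sketch of this exponential-moving-average model update applied to PyTorch parameters is given below; using a toy policy head in place of the multimodal large model, and treating the right-hand μ_τ as the newly fine-tuned parameters, are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, new_model: nn.Module, alpha: float = 0.99) -> None:
    """mu_tau = alpha * mu_{tau-1} + (1 - alpha) * (newly updated parameters)."""
    for ema_p, new_p in zip(ema_model.parameters(), new_model.parameters()):
        ema_p.mul_(alpha).add_(new_p, alpha=1.0 - alpha)

# Example with a toy pose-prediction head standing in for the multimodal large model.
policy = nn.Linear(16, 6)
ema_policy = copy.deepcopy(policy)
# ... fine-tune `policy` on newly collected successfully corrected samples ...
ema_update(ema_policy, policy, alpha=0.99)
```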
Therefore, the invention realizes the use of the multimodal large model to simultaneously predict the operating posture of the mechanical arm end effector and to automatically recognize and correct failed operation actions.
The application further provides an electronic device, which is used for realizing the steps of the mechanical arm grasping prediction method.
The electronic device of the present embodiment is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the embodiments of the application described and/or claimed herein.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 500 includes, but is not limited to, a network module 502, an audio output unit 503, an input unit 504, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 501, and a power module 511.
The processor 501 may include one or more processing units; for example, the processor 501 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The input unit 504 is used for receiving an audio or video signal. The input unit 504 may include a graphics processing unit (GPU) and a microphone; the GPU processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode.
The user input unit 507 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel and other input devices.
The present application also provides a storage medium storing a program product capable of implementing the mechanical arm grasping prediction method described in this specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
A storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411168309.1A CN119168835A (en) | 2024-08-23 | 2024-08-23 | A mechanical arm grasping prediction method, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411168309.1A CN119168835A (en) | 2024-08-23 | 2024-08-23 | A mechanical arm grasping prediction method, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119168835A true CN119168835A (en) | 2024-12-20 |
Family
ID=93880853
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411168309.1A Pending CN119168835A (en) | 2024-08-23 | 2024-08-23 | A mechanical arm grasping prediction method, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119168835A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120580475A (en) * | 2025-05-20 | 2025-09-02 | 北京智源人工智能研究院 | Robot grasping posture prediction method, device, equipment and storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117773920A (en) * | 2023-12-21 | 2024-03-29 | 浙江大学 | A natural language driven robotic grasping method |
- 2024-08-23 CN CN202411168309.1A patent/CN119168835A/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117773920A (en) * | 2023-12-21 | 2024-03-29 | 浙江大学 | A natural language driven robotic grasping method |
Non-Patent Citations (3)
| Title |
|---|
| AI大道理: "CLIP: Classification of Everything (Vision-Language Large Model)", Retrieved from the Internet <URL:https://blog.csdn.net/qq_42734492/article/details/134387789> * |
| JIAMING LIU ET AL.: "Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation", ARXIV, 27 May 2024 (2024-05-27), pages 1 - 18 * |
| XIAOQI LI ET AL.: "ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation", ARXIV, 24 December 2023 (2023-12-24) * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120580475A (en) * | 2025-05-20 | 2025-09-02 | 北京智源人工智能研究院 | Robot grasping posture prediction method, device, equipment and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20230311335A1 (en) | Natural language control of a robot | |
| US11325252B2 (en) | Action prediction networks for robotic grasping | |
| Simeonov et al. | A long horizon planning framework for manipulating rigid pointcloud objects | |
| Wang et al. | Hierarchical policies for cluttered-scene grasping with latent plans | |
| CN112512755B (en) | Robotic manipulation using domain-invariant 3D representations predicted from 2.5D visual data | |
| Wu et al. | Pixel-attentive policy gradient for multi-fingered grasping in cluttered scenes | |
| JP2023525676A (en) | Training and/or utilizing machine learning models for use in natural language based robot control | |
| CN115812180A (en) | Robot-controlled offline learning using reward prediction model | |
| Zhang et al. | Modular deep q networks for sim-to-real transfer of visuo-motor policies | |
| US20220402125A1 (en) | System and method for determining a grasping hand model | |
| Aslan et al. | New CNN and hybrid CNN-LSTM models for learning object manipulation of humanoid robots from demonstration | |
| Gao et al. | An improved SAC-based deep reinforcement learning framework for collaborative pushing and grasping in underwater environments | |
| CN118013838B (en) | A Smart Flexible Assembly Method for 3C Products | |
| CN119526422A (en) | A method for interactive operation control of deformable objects based on a visual-touch-language-action multimodal model | |
| CN114519813A (en) | Mechanical arm target grabbing method and system | |
| CN117772648B (en) | Part sorting processing method, device, equipment and medium based on body intelligence | |
| CN119168835A (en) | A mechanical arm grasping prediction method, electronic device and storage medium | |
| CN118664590A (en) | Mechanical arm pushing and grabbing cooperative operation system based on language interaction and control method thereof | |
| Park et al. | Sim-to-real visual grasping via state representation learning based on combining pixel-level and feature-level domain adaptation | |
| Peng et al. | A pushing-grasping collaborative method based on deep Q-network algorithm in dual viewpoints | |
| Tsai et al. | Visually guided picking control of an omnidirectional mobile manipulator based on end-to-end multi-task imitation learning | |
| CN112045680B (en) | A cloth palletizing robot control system and control method based on behavior clone | |
| US20240412063A1 (en) | Demonstration-driven reinforcement learning | |
| Luan et al. | Dynamic hand gesture recognition for robot ARM teaching based on improved LRCN model | |
| EP4643272A1 (en) | Open-vocabulary robotic control using multi-modal language models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |