
CN120781874A - Embodied intelligence data acquisition method, device, medium and product - Google Patents

Embodied intelligence data acquisition method, device, medium and product

Info

Publication number
CN120781874A
Authority
CN
China
Prior art keywords
data
independent task
video stream
video
intelligent
Prior art date
Legal status
Granted
Application number
CN202511241789.4A
Other languages
Chinese (zh)
Other versions
CN120781874B (en)
Inventor
Shan Dongming
Huang Haiqing
Wang Changmian
Xu Zhiyun
Current Assignee
Shanghai Kupas Technology Co ltd
Original Assignee
Shanghai Kupas Technology Co ltd
Priority date: 2025-09-02
Filing date: 2025-09-02
Publication date: 2025-10-14 (CN120781874A); 2025-11-11 (CN120781874B, grant)
Application filed by Shanghai Kupas Technology Co ltd
Priority: CN202511241789.4A
Publication of CN120781874A
Application granted; publication of CN120781874B
Legal status: Active


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present application relate to the field of information technology and disclose a method, device, medium, and product for acquiring embodied intelligence data. The method comprises: collecting human demonstration data through an exoskeleton system, the human demonstration data comprising video stream data, motion data, and force sense data; importing the video stream data into a first large language-visual model, segmenting it into a plurality of independent task segments, and generating independent task segment descriptions; importing the independent task segment descriptions into a large language model to generate task annotations based on natural language descriptions; and exporting the independent task segments, the task annotations, and the corresponding motion and force sense data to generate embodied intelligence data. A low-cost, high-efficiency, large-scale pipeline for acquiring and processing embodied intelligence data is thus established, realizing automatic conversion of human operation data into high-quality VLA training data.

Description

Embodied intelligence data acquisition method, device, medium and product
Technical Field
The present application relates to the field of information technologies, and in particular, to a method, a device, a medium, and a product for acquiring embodied intelligence data.
Background
In recent years, embodied intelligence (Embodied AI), as a key carrier for bringing artificial intelligence into the real world, has become an important path toward artificial general intelligence (AGI). Embodied intelligence essentially fuses cognitive intelligence deeply with a physical execution system, enabling machines to complete complex tasks through coordinated perception, understanding, and action. With the rapid development of multimodal large models (MLMs) and vision-language-action (VLA) models, the perception, interaction, and reasoning abilities of embodied agents have reached unprecedented levels.
However, embodied intelligence research and applications still face significant challenges in data acquisition. High-quality training data is the foundation for building a powerful embodied intelligence system, but current data acquisition methods have significant limitations:
1. Traditional teleoperation data acquisition is costly and inefficient
Existing robot teleoperation systems, while capable of providing high-fidelity datasets, rely on specialized robotic equipment and highly trained operators; the high cost severely limits the scaling-up of data acquisition.
2. Data processing pipelines lack automation and standardization
Current data processing relies mainly on manual work, including steps such as data cleaning, spatio-temporal alignment, and segmentation and labeling; it is inefficient and prone to inconsistency. Data scarcity has been a continuing challenge in embodied intelligence research, and the collection of real-world robot data faces many technical and cost hurdles.
3. Existing data lack scene coverage and diversity
VLA models require extensive and diverse visual-language-action training data, including accurately spatio-temporally synchronized data, structured annotations, and metadata. Existing datasets, limited by acquisition equipment and sites, often lack sufficient diversity and scale, and struggle to support the training requirements of generalizable VLA models.
Therefore, how to establish a low-cost, high-efficiency, scalable method for acquiring and processing embodied intelligence data, realizing automatic conversion of human operation data into high-quality VLA training data, is a key technical problem to be solved in the embodied intelligence field.
Disclosure of Invention
An object of the present application is to provide a method, device, medium, and product for acquiring embodied intelligence data, at least to solve the problems of high acquisition cost and low efficiency of embodied intelligence datasets.
To achieve the above object, some embodiments of the present application provide the following aspects:
In a first aspect, the present application provides a method for acquiring embodied intelligence data, comprising the following steps:
collecting human demonstration data through an exoskeleton system, wherein the human demonstration data comprise video stream data, motion data, and force sense data;
importing the video stream data into a first large language-visual model, segmenting the video stream data into a plurality of independent task segments, and generating independent task segment descriptions;
importing the independent task segment descriptions into a large language model to generate task annotations based on natural language descriptions; and
exporting the independent task segments, the task annotations, and the corresponding motion data and force sense data to generate embodied intelligence data.
In a second aspect, some embodiments of the application also provide an electronic device comprising one or more processors and a memory storing computer program instructions that, when executed, cause the processor to perform the steps of the method as described above.
In a third aspect, some embodiments of the application also provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement a method as described above.
In a fourth aspect, some embodiments of the application also provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements the steps of the method as described above.
Compared with the related art, in the solution provided by embodiments of the present application, exoskeleton equipment is used to collect human operation data, which significantly reduces data acquisition cost and improves efficiency; the collected data are then combined with large language-visual models so that the data processing pipeline can be automated, producing a high-quality dataset directly usable for VLA model training. A low-cost, high-efficiency, large-scale method for acquiring and processing embodied intelligence data is thus established, realizing automatic conversion of human operation data into high-quality VLA training data.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings, in which elements with the same reference numerals denote similar elements. The figures are not drawn to scale unless otherwise indicated.
FIG. 1 is a flow chart of a method for acquiring embodied intelligence data according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of another method for acquiring embodied intelligence data according to an exemplary embodiment of the present disclosure;
FIG. 3 is an exemplary structural diagram of an electronic device according to some embodiments of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 is a flow chart of a method for acquiring embodied intelligence data according to an exemplary embodiment of the present disclosure; the method includes:
s101, acquiring artificial demonstration data through an exoskeleton system, wherein the artificial demonstration data comprise video stream data, motion data and force sense data.
Specifically, various manual demonstration data, such as taking cups, arranging desktops and the like, are collected by manually wearing an exoskeleton system. The exoskeleton system can simultaneously acquire data streams of vision data (including RGB-D image sequences, including third person viewing angles and palm viewing angles), motion data (joint angles, end effector pose, speed and acceleration information), force sense data (contact force, torque and grip force information). The exoskeleton can employ commercially available devices of the mature low cost exoskeleton system AirExo. The exoskeleton device is provided with a high-precision force sensor and a camera, and can record hand actions and gestures of an operator and visual action information in an operation process in real time. By utilizing mature low-cost exoskeleton equipment in the market to acquire human teaching data, the high hardware and operation cost required by the traditional robot teaching method are obviously reduced, the cost is reduced by more than 80% compared with that of a traditional robot teleoperation system, and meanwhile, the acquisition efficiency is improved by 3-5 times.
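For illustration only, the following minimal Python sketch shows how one time-aligned demonstration frame of the three modalities might be represented; the field names and shapes are expository assumptions, not the actual schema of the exoskeleton system.

from dataclasses import dataclass
import numpy as np

@dataclass
class DemoFrame:
    timestamp: float               # seconds on the shared global time reference
    rgb_third_person: np.ndarray   # (H, W, 3) third-person RGB image
    rgb_palm: np.ndarray           # (H, W, 3) palm-view RGB image
    depth: np.ndarray              # (H, W) depth map
    joint_angles: np.ndarray       # per-joint angles, rad
    ee_pose: np.ndarray            # end-effector pose, e.g. [x, y, z, qx, qy, qz, qw]
    ee_velocity: np.ndarray        # end-effector linear velocity, m/s
    contact_wrench: np.ndarray     # contact force/torque readings
    grip_force: float              # grip force, N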
S102, importing the video stream data into a first large language-visual model, segmenting the video stream data into a plurality of independent task segments, and generating the independent task segment descriptions.
Specifically, this step utilizes the powerful world knowledge and context-awareness capabilities of large language-visual models (VLMs) to identify "meaningful" task transition points in the human-view data stream as segment boundaries. The input is the time-aligned multimodal data stream, mainly the video stream. The video stream is segmented by methods such as content-semantic segmentation or action-unit detection; a first large language-visual model (such as Video-XL-2) performs video understanding on each short video clip and generates a scene description; the time scale of each video clip within the original video is then computed; and finally the independent task segments and their descriptions are obtained, for example: "[0.0s, 2.5s] the robot hand reaches toward the cup", "[2.5s, 4.5s] the robot hand grasps the cup", "[4.5s, 10.0s] the robot pours water".
S103, importing the independent task segment descriptions into a large language model to generate task annotations based on natural language descriptions.
Specifically, this step uses the LLM to polish the primitive labels into more natural subtask instructions, and finally feeds the resulting series of subtask instructions to the LLM as context for intent-understanding annotation. Leveraging the strong commonsense reasoning of the LLM, each independent task segment description is turned into a natural-language task instruction that better matches human expression habits; for example, the primitive label "the robot hand grasps the cup" becomes "pick up the cup on the table". A series of coherent natural-language subtask instructions is thus obtained; the LLM then analyzes these instructions and performs high-level induction and reasoning to annotate the core user intent behind the whole demonstration, i.e., the "skill" the robot needs to learn.
S104, exporting the independent task segments, the task annotations, and the corresponding motion data and force sense data to generate embodied intelligence data.
Finally, the independent task segments, the task annotations, and the motion and force sense data corresponding to each segment are packaged and output together to form embodied intelligence data; these records subsequently form an embodied intelligence dataset for later development and simulation.
In this embodiment, collecting human operation data through the exoskeleton device significantly reduces data acquisition cost and improves efficiency; combining the collected data with large language-visual models then automates the data processing pipeline and produces a high-quality dataset directly usable for VLA model training. A low-cost, high-efficiency, large-scale method for acquiring and processing embodied intelligence data is thus established, realizing automatic conversion of human operation data into high-quality VLA training data. The large-scale, high-quality, diverse, and semantically rich embodied data generated by this method directly meets the core training-data requirements of VLA models and effectively addresses the lack of large-scale datasets for existing VLA models. By exploiting human operation data, VLA models can learn a wider range of skills with stronger generalization, achieving few-shot generalization and cross-embodiment transfer, and thus perform better on unseen tasks and in novel environments. Such high-quality data are the cornerstone for VLA models to realize their general-intelligence potential.
In one embodiment, as shown in Fig. 2, importing the video stream data into the first large language-visual model, segmenting it into a plurality of independent task segments, and generating the independent task segment descriptions specifically includes:
S201, compressing the video stream data into a plurality of video clips using a frame-extraction algorithm.
Specifically, in this embodiment, a frame-extraction algorithm compresses consecutive video frames into short clips. The key principle is to reduce data volume by selectively retaining or discarding video frames while preserving the key information of the video (such as actions and scene changes) as much as possible, thereby providing compact, meaningful video clips for subsequent processing (such as VLM analysis).
Frames may be extracted uniformly from consecutive frames at a predetermined time interval (e.g., one frame every 0.25 s or 0.5 s) or at a fixed frame rate (e.g., 10 fps from a 30 fps video). For example, a 10-second, 30 fps video (300 frames in total) sampled at 1-second intervals (one frame every 30 frames) would retain 10 frames.
Dynamic threshold sampling may also be used: compute pixel differences (e.g., changes in gray values or RGB values) or feature differences (e.g., changes in edges, textures, or target contours) between two adjacent frames and quantify them numerically (the larger the difference, the more drastic the scene change). A threshold is set; when the difference between two frames exceeds the threshold, the current frame is retained (treated as a key change point), and when the difference is below the threshold, the current frame is discarded (treated as repeated content).
Key-target tracking may also be used: first identify the key target in the video (such as the robot hand) with an object detection algorithm, then track its motion state (position and posture changes). When the target's motion state changes significantly (e.g., from "stationary" to "moving", or from "reaching toward" to "grasping" the cup), frame extraction is triggered and that frame is retained. For example, in a robot pouring video, when the hand moves from above the cup to the pouring position, that frame is extracted and kept as the action turning point.
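For illustration, the following Python sketch (using OpenCV) implements the two simplest strategies described above, uniform interval sampling and difference-threshold sampling; the default threshold is an assumed value for exposition. Key-target tracking would additionally require an object detector and is not shown.

import cv2

def extract_frames(video_path, interval_s=None, diff_threshold=8.0):
    """Uniform sampling if interval_s is given, otherwise difference-threshold sampling."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = int(fps * interval_s) if interval_s else 1
    kept, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if interval_s is not None:
            if idx % step == 0:          # uniform: one frame every `step` frames
                kept.append((idx, frame))
        else:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # dynamic: keep the frame if the mean absolute grayscale difference
            # from the last kept frame exceeds the threshold (a key change point)
            if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
                kept.append((idx, frame))
                prev_gray = gray
        idx += 1
    cap.release()
    return kept, fps   # original frame indices allow recovering timestamps (see S203)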
S202, generating video clip description text for each video clip through the first large language-visual model.
Specifically, the first large language-visual model (e.g., Video-XL-2) performs video understanding on each short video clip and generates scene description text.
S203, calculating the time scale of each video clip within the video stream data according to the frame-extraction frequency of the frame-extraction algorithm.
Specifically, the core of this step is to establish, through the correspondence between the frame-extraction frequency and the original video frame rate, a mapping between each video clip and the time axis of the video stream data; in essence, it reversely recovers the temporal information lost during frame extraction. A clip obtained after frame extraction consists of several key frames, each of which has a unique frame index in the original video, from which its timestamp in the original video can be derived.
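As a minimal illustration, the back-mapping reduces to dividing each retained frame's original frame index by the source frame rate:

def frame_index_to_time(frame_index: int, fps: float) -> float:
    # A frame's timestamp in the original video is its index over the frame rate.
    return frame_index / fps

# e.g. in a 30 fps source video, retained frame 75 maps back to 2.5 s
assert frame_index_to_time(75, 30.0) == 2.5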
S204, segmenting the video stream data into the independent task segments according to the video clip description text and the time scales, and generating the independent task segment descriptions.
Specifically, from the video clip description text and the time scales, the independent task segments and their descriptions are finally obtained, for example: "[0.0s, 2.5s] the robot hand reaches toward the cup", "[2.5s, 4.5s] the robot hand grasps the cup", "[4.5s, 10.0s] the robot pours water".
In this embodiment, frame extraction reduces processing cost, while time conversion preserves key temporal information, multimodal association, and semantic soundness; the automatically segmented subtask segments thus genuinely match the human cognitive logic of task stages, providing a reliable basis for subsequent processing.
In one embodiment, importing the video stream data into the first large language-visual model, segmenting it into a plurality of independent task segments, and generating the independent task segment descriptions specifically further includes:
S205, extracting motion features from the motion data and recording their time points.
S206, searching for the motion features within the independent task segments, and adjusting the time scales of the independent task segments using the time points of the found motion features.
Specifically, the time scales obtained from video semantic understanding alone may not be accurate enough; this step achieves more precise segmentation by fusing high-level semantic boundaries with low-level temporal features. First, time-series sensor data aligned with the video stream (end-effector velocity, acceleration, joint angles, grip force, and the like) are processed to extract relevant motion features, and time points are recorded (e.g., motion start and stop are determined via a velocity threshold). For example, a kinematic threshold is set (speed > 0.05 m/s is considered "in motion", speed = 0 is considered "stationary"), and key time points such as "motion start", "motion stop", and "motion switch" are identified from threshold crossings. Then a small time window (e.g., ±0.2 s, i.e., 2.3 s to 2.7 s) is centered on each time-scale boundary of an independent task segment (e.g., 2.5 s), focusing on fine-grained features near the boundary. The found motion-feature time point (e.g., 2.48 s) replaces the coarse-grained boundary of the original VLM output (e.g., 2.5 s), yielding refined segmentation. Finer subtask segments are finally obtained, for example: "[0.0s, 2.48s] the robot hand reaches toward the cup", "[2.48s, 4.62s] the robot hand grasps the cup", "[4.62s, 10.0s] the robot pours water". The original video stream is then split at these time scales to obtain the final independent task segments, achieving segment boundaries accurate to the millisecond level.
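For illustration, the following Python sketch shows one possible boundary-refinement routine under the threshold and window values quoted above; it is an expository implementation, not the claimed method itself.

import numpy as np

def refine_boundary(t, speed, coarse_t, window=0.2, v_thresh=0.05):
    """t: timestamps (s); speed: end-effector speed (m/s) aligned with t."""
    idx = np.where((t >= coarse_t - window) & (t <= coarse_t + window))[0]
    # threshold crossings inside the window mark motion start/stop events
    crossings = [i for i in idx[:-1]
                 if (speed[i] - v_thresh) * (speed[i + 1] - v_thresh) < 0]
    if not crossings:
        return coarse_t                      # no kinematic event: keep VLM boundary
    # snap to the crossing nearest the coarse boundary, e.g. 2.5 s -> 2.48 s
    best = min(crossings, key=lambda i: abs(t[i] - coarse_t))
    return float(t[best])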
In this embodiment, through the cooperative logic of high-level semantic guidance and low-level feature correction, both the VLM's semantic understanding of task stages (ensuring the segmentation is meaningful) and the physical accuracy of the kinematic data (ensuring the segmentation is precise) are preserved, so the resulting subtask segments are intuitively understandable to humans while meeting the robot system's timing-accuracy requirements, laying a reliable foundation for task automation and intelligent analysis.
In one embodiment, segmenting the video stream data into the independent task segments according to the video clip description text and the time scales and generating the independent task segment descriptions specifically includes:
S207, inputting each independent task segment into a second large language-visual model, and generating labels for the actions and objects in the independent task segment.
Specifically, primitive labels are generated for the actions and objects in the independent task segments through the second large language-visual model. For example, basic action types such as grasping, placing, pushing, and pulling are automatically identified, and the objects involved in the operation are annotated. The second large language-visual model may be the same model as the first large language-visual model, a different model, or a model specifically optimized for a different scene.
S208, constructing a structured description of the action and the object through the second large language-visual model.
Specifically, a segmented independent task segment is input to the VLM (e.g., Qwen2.5-VL) together with a structured prompt, for example: "Describe the core action of the robot arm in this video in a short phrase, in the format skill + object." The VLM then generates a corresponding primitive label such as "pick up + cup".
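For illustration, a hedged Python sketch of this prompting step follows; query_vlm is a hypothetical placeholder for whatever inference interface serves the chosen model, since no specific API is prescribed here.

PRIMITIVE_PROMPT = (
    "Describe the core action of the robot arm in this video "
    "in a short phrase, in the format: skill + object."
)

def label_segment(segment_frames, query_vlm):
    # query_vlm is an assumed callable wrapping the chosen VLM endpoint
    raw = query_vlm(frames=segment_frames, prompt=PRIMITIVE_PROMPT)
    skill, _, obj = raw.partition("+")       # e.g. "pick up + cup"
    return {"skill": skill.strip(), "object": obj.strip()}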
S209, determining the structured description as the independent task segment description.
Specifically, the structured description is turned into a natural-language task instruction that better matches human expression habits; for example, the primitive label "pick up + cup" is polished into "pick up the cup on the table". The operation intent is then inferred from the context information to produce a complete operation-skill annotation. A series of coherent natural-language subtask instructions is thus obtained; the LLM then analyzes these instructions and performs high-level induction and reasoning to annotate the core user intent behind the whole demonstration, i.e., the "skill" the robot needs to learn.
In one embodiment, the quality of the output data is further controlled through human-machine collaborative annotation. On a visual annotation platform, annotators review the subtask instructions and intent annotations generated from the original video clips, the VLM primitives, and the LLM; they modify, confirm, or reject them, manually verifying and correcting the annotation results to ensure data quality and eliminate potential systematic errors. This human-machine collaboration mechanism preserves the efficiency of automation while guaranteeing annotation accuracy and consistency through manual review, further improving data quality and avoiding the hallucinations or biases that purely automatic annotation might introduce.
In this embodiment, by generating action labels (e.g., "approach", "grasp") and object labels (e.g., "cup", "table") and constructing a structured "skill + object" description (e.g., "pick up + cup"), the second VLM abstracts the visual content of the video into discrete semantic units with explicit logical relationships, converting the raw visual information of the independent task segments into standardized, reusable semantic symbols and providing a unified "language" for task understanding, analysis, and reuse. The large language model can then generate task annotations based on natural language descriptions from this unified "language".
To ensure "high quality" of the VLA model training data, the acquired data needs to be time-spatially aligned to ensure consistency in time and space for the different sensor data. The step of collecting artificial demonstration data through the exoskeleton system, wherein the artificial demonstration data comprises video stream data, motion data and force sense data, and preprocessing is needed before the step of collecting the artificial demonstration data.
In one embodiment, the embodied intelligence data acquisition method includes performing timestamp calibration on the sensors in the exoskeleton system.
Specifically, a global time reference is established and all sensor data are timestamped against it.
In one embodiment, the embodied intelligence data acquisition method further comprises calibrating the global coordinate system of the exoskeleton system using spatial markers.
Specifically, ArUco markers are arranged in the workspace, and the global coordinate system is calibrated using a relative state-action representation method.
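For illustration, the following Python sketch (using the OpenCV contrib aruco module) recovers an ArUco marker's pose in the camera frame, which can then serve as a shared global reference; the marker size, dictionary, and camera intrinsics are expository assumptions, and the relative state-action representation itself is not shown.

import cv2
import numpy as np

def detect_marker_pose(image, camera_matrix, dist_coeffs, marker_len=0.05):
    aruco = cv2.aruco
    detector = aruco.ArucoDetector(aruco.getPredefinedDictionary(aruco.DICT_4X4_50))
    corners, ids, _ = detector.detectMarkers(image)
    if ids is None:
        return None
    # 3D corners of the marker in its own frame (marker lies in the z = 0 plane)
    half = marker_len / 2.0
    obj_pts = np.array([[-half, half, 0], [half, half, 0],
                        [half, -half, 0], [-half, -half, 0]], dtype=np.float32)
    img_pts = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None      # marker pose in the camera frame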
In one embodiment, the embodied intelligence data acquisition method further comprises eliminating sensor delays in the exoskeleton system through a time-consistency algorithm.
Specifically, the time-consistency algorithm selects the left shoulder camera as the primary sensor and uses its timestamp sequence as the reference; for each reference time, the sample closest in time is selected from every other sensor, and the matched multi-sensor data are merged into one frame and output as aligned data. The effects of multi-sensor delays in the exoskeleton system are thereby eliminated.
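For illustration, a minimal Python sketch of this nearest-timestamp alignment is given below, with the left shoulder camera's timestamps as the reference sequence.

import numpy as np

def align_to_reference(ref_ts, sensor_ts, sensor_data):
    """ref_ts: reference timestamps; sensor_ts: sorted timestamps of another sensor."""
    pos = np.clip(np.searchsorted(sensor_ts, ref_ts), 1, len(sensor_ts) - 1)
    left, right = sensor_ts[pos - 1], sensor_ts[pos]
    # for each reference time, take whichever neighboring sample is closer
    nearest = np.where(ref_ts - left <= right - ref_ts, pos - 1, pos)
    return [sensor_data[i] for i in nearest]  # one aligned sample per reference frame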
In this embodiment, accurate spatio-temporal alignment ensures high consistency between the human operation data and the robot operation space, effectively solving the problems of modality mismatch and timing differences.
To ensure the purity and reliability of the human demonstration data, so that the VLA model learns from clean and accurate data, after the step of collecting human demonstration data through the exoskeleton system (the human demonstration data comprising video stream data, motion data, and force sense data), the embodied intelligence data acquisition method further comprises cleaning and denoising the collected data.
In one embodiment, the embodied intelligence data acquisition method further comprises identifying and removing anomalous data in the human demonstration data.
Specifically, anomalous data points caused by sensor faults or operation errors are identified and removed by checking whether sensor readings fall within a reasonable numerical range. For example, camera data showing an all-black, all-white, or solid-green image are treated as anomalous, and joint data in which the rotation angle or velocity exceeds or falls below the motor's feasible range are treated as outliers.
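For illustration, the following Python sketch implements such range checks; the numeric limits are expository assumptions to be set per camera and per joint.

import numpy as np

def image_is_abnormal(rgb, low=5.0, high=250.0):
    mean = float(np.asarray(rgb).mean())
    return mean < low or mean > high          # near all-black or all-white frame

def joint_is_abnormal(angle_rad, speed_rad_s,
                      angle_limits=(-3.14, 3.14), speed_limit=6.0):
    # a reading outside the motor's feasible angle or speed range is an outlier
    return not (angle_limits[0] <= angle_rad <= angle_limits[1]) \
        or abs(speed_rad_s) > speed_limit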
In one embodiment, the embodied intelligence data acquisition method further comprises denoising the human demonstration data.
Specifically, a Kalman filtering algorithm is applied to noisy multi-sensor measurements (e.g., IMU acceleration) to remove sensor noise.
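For illustration, a minimal one-dimensional constant-velocity Kalman filter sketch is given below; the process and measurement noise covariances are expository assumptions to be tuned per sensor.

import numpy as np

def kalman_smooth(z, dt=0.01, q=1e-3, r=1e-1):
    F = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity state transition
    H = np.array([[1.0, 0.0]])               # only the value itself is observed
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.array([[z[0]], [0.0]]), np.eye(2)
    out = []
    for zk in z:
        x, P = F @ x, F @ P @ F.T + Q                     # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)      # Kalman gain
        x = x + K @ (np.array([[zk]]) - H @ x)            # update with measurement
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return np.array(out)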
In one embodiment, the embodied intelligence data acquisition method further comprises performing an integrity check on the human demonstration data.
Specifically, a rule-based script checks whether the data stream is arranged continuously in increasing timestamp order; out-of-order frames are reordered, and detected lost frames are either compensated by linear interpolation or the synchronized frame group at that moment is discarded directly, resolving problems such as data loss and timing disorder.
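For illustration, the following Python sketch reorders a stream by timestamp and linearly interpolates small gaps; the nominal frame period and the maximum interpolatable gap are expository assumptions, and larger gaps are left for the caller to drop, as the text allows.

import numpy as np

def repair_stream(ts, values, period=0.05, max_gap_frames=2):
    order = np.argsort(ts)                    # reorder out-of-order frames
    ts, values = np.asarray(ts)[order], np.asarray(values)[order]
    fixed_t, fixed_v = [ts[0]], [values[0]]
    for t, v in zip(ts[1:], values[1:]):
        t0, v0 = fixed_t[-1], fixed_v[-1]
        missing = int(round((t - t0) / period)) - 1
        if 0 < missing <= max_gap_frames:     # small gap: linear interpolation
            for k in range(1, missing + 1):
                frac = k / (missing + 1)
                fixed_t.append(t0 + frac * (t - t0))
                fixed_v.append(v0 + frac * (v - v0))
        fixed_t.append(t)
        fixed_v.append(v)
    return np.array(fixed_t), np.array(fixed_v)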
In one embodiment, the embodied intelligence data acquisition method further comprises manual sampling-based quality inspection and secondary cleaning.
Specifically, the processed data are sampled for manual quality inspection, and data that do not meet the quality requirements are manually cleaned a second time.
In the above embodiments, the complete data cleaning and preprocessing pipeline ensures the purity and reliability of the training data, guaranteeing that the VLA model can learn from clean and accurate data.
In one embodiment, the embodied intelligence data acquisition method further comprises:
exporting the generated embodied intelligence data in conformity with the LeRobot standard format specification.
Specifically, through the above data acquisition, preprocessing, automatic segmentation, and annotation processes, structured, clean, and semantically rich multimodal data are finally generated. To make these data meet the training requirements of current mainstream VLA models while retaining good extensibility and compatibility, the data undergo a conversion and adaptation process to build a dataset structure conforming to the LeRobot standard format specification (the standard LeRobot dataset format, including visual data, state information, and action annotations). The format is as follows:
<dataset_root>/
├── data/
│ ├── chunk-000/
│ │ ├── episode_000000.parquet
│ │ ├── episode_000001.parquet
│ │ └── ...
│ └── chunk-001/
├── meta/
│ ├── episodes.jsonl
│ ├── info.json
│ ├── stats.safetensors
│ └── tasks.jsonl
└── videos/
├── chunk-000/
│ ├── observation.images.palm_camera/
│ │ ├── episode_000000.mp4
│ │ └── ...
│ └── observation.images.global_camera/
└── chunk-001/
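For illustration, a hedged Python sketch of writing one episode into this layout follows; the parquet columns and metadata keys are plausible assumptions for exposition, not an authoritative statement of the LeRobot schema.

import json
from pathlib import Path
import pandas as pd

def export_episode(root, episode_idx, frames, task_label):
    """frames: list of dicts, one per time step (state, action, timestamps)."""
    root = Path(root)
    data_dir = root / "data" / "chunk-000"
    meta_dir = root / "meta"
    data_dir.mkdir(parents=True, exist_ok=True)
    meta_dir.mkdir(parents=True, exist_ok=True)
    # one row per time step: state (motion + force) and action vectors
    pd.DataFrame(frames).to_parquet(data_dir / f"episode_{episode_idx:06d}.parquet")
    with open(meta_dir / "episodes.jsonl", "a") as f:
        f.write(json.dumps({"episode_index": episode_idx,
                            "task": task_label,
                            "length": len(frames)}) + "\n")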
In this embodiment, to make these data meet the training requirements of current mainstream VLA models while retaining good extensibility and compatibility, the data are converted into embodied intelligence data conforming to the LeRobot standard format through a conversion and adaptation process, clearing obstacles for subsequent VLA model training and embodied intelligence system applications.
In addition, some embodiments of the application also provide an electronic device. The electronic device may be a digital computer in various forms, such as a laptop computer, a desktop computer, a workstation, a personal digital assistant, a server, a blade server, a mainframe computer, and the like. The electronic device may also be mobile equipment in various forms, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
The electronic device comprises one or more processors and a memory storing computer program instructions that, when executed, cause the processor to perform the steps of a method as provided in any one or more of the embodiments described above. Fig. 3 discloses an exemplary structural diagram of the electronic device. The electronic device includes one or more processors 1101, memory 1102, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, a plurality of electronic devices may be connected, each providing a part of the necessary operations. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
The electronic device may further comprise input means 1103 and output means 1104. The processor 1101, memory 1102, input device 1103 and output device 1104 may be connected by a bus or other means, as illustrated by a bus connection.
The input device 1103 may receive input digital or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and like input devices. The output device 1104 may include a display device, auxiliary lighting (e.g., LEDs), and haptic feedback (e.g., a vibration motor), among others. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some embodiments, the display device may be a touch screen.
To provide for interaction with a user, the electronic device may be a computer. The computer has a display device (e.g., a cathode ray tube or LCD monitor) for displaying information to a user, and a keyboard and pointing device (e.g., a mouse) through which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback), and input from the user may be received in any form (e.g., voice input or tactile input).
In an embodiment of the present application, a computer readable medium has stored thereon a computer program/instruction which, when executed by a processor, implements the steps of the method provided by any one or more of the embodiments described above. The computer readable medium may be contained in the electronic device described in the above embodiment or may exist alone without being incorporated in the device. The computer-readable medium carries one or more computer-readable instructions.
Memory 1102 may be used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules. The processor 1101 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 1102 to implement program instructions/modules corresponding to the methods provided by any one or more of the embodiments of the present application.
The memory 1102 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the electronic device, etc. In addition, memory 1102 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1102 optionally includes memory remotely located relative to processor 1101, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The computer readable medium according to the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical fiber, a portable compact disc read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory, static random access memory, dynamic random access memory, other types of random access memory, read only memory, electrically erasable programmable read only memory, flash memory or other memory technology, compact disc read only memory, digital versatile disks or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network or a wide area network, or may be connected to an external computer (e.g., through the internet using an internet service provider).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. For example, an application specific integrated circuit, a general purpose computer, or any other similar hardware device may be employed. In some embodiments, the software program of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Embodiments of the present application provide a computer program product comprising one or more computer programs/instructions which, when executed by a processor, produce, in whole or in part, a process or function in accordance with embodiments of the present application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The scope of the application is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The words "first," "second," and the like are used merely to distinguish between descriptions and do not indicate any particular order, nor are they to be construed as indicating or implying relative importance.
The above embodiments are merely illustrative examples, but the scope of the present application is not limited thereto, and any person skilled in the art can easily mention variations or substitutions within the scope of the present application. The present application is therefore to be considered in all respects as illustrative and not restrictive, and the scope of the application is indicated by the appended claims.

Claims (10)

1. An embodied intelligence data acquisition method, characterized in that the embodied intelligence data acquisition method comprises:
collecting human demonstration data through an exoskeleton system, wherein the human demonstration data comprise video stream data, motion data, and force sense data;
importing the video stream data into a first large language-visual model, segmenting the video stream data into a plurality of independent task segments, and generating independent task segment descriptions;
importing the independent task segment descriptions into a large language model to generate task annotations based on natural language descriptions; and
exporting the independent task segments, the task annotations, and the corresponding motion data and force sense data to generate embodied intelligence data.
2. The embodied intelligence data acquisition method according to claim 1, wherein importing the video stream data into the first large language-visual model, segmenting it into a plurality of independent task segments, and generating the independent task segment descriptions specifically comprises:
compressing the video stream data into a plurality of video clips using a frame-extraction algorithm;
generating video clip description text from the video clips through the first large language-visual model;
calculating the time scale of each video clip within the video stream data according to the frame-extraction frequency of the frame-extraction algorithm; and
segmenting the video stream data into the independent task segments according to the video clip description text and the time scales, and generating the independent task segment descriptions.
3. The embodied intelligence data acquisition method according to claim 2, wherein importing the video stream data into the first large language-visual model, segmenting it into a plurality of independent task segments, and generating the independent task segment descriptions specifically further comprises:
extracting motion features from the motion data and recording their time points; and
searching for the motion features within the independent task segments, and adjusting the time scales of the independent task segments using the time points of the found motion features.
4. The embodied intelligence data acquisition method according to claim 2, wherein segmenting the video stream data into the independent task segments according to the video clip description text and the time scales and generating the independent task segment descriptions specifically comprises:
inputting each independent task segment into a second large language-visual model, and generating labels for the actions and objects in the independent task segment;
constructing a structured description of the action and the object through the second large language-visual model; and
determining the structured description as the independent task segment description.
5. The embodied intelligence data acquisition method according to claim 1, wherein, before the step of collecting human demonstration data through the exoskeleton system, the human demonstration data comprising video stream data, motion data, and force sense data, the method further comprises:
performing timestamp calibration on the sensors in the exoskeleton system;
and/or
calibrating the global coordinate system of the exoskeleton system using spatial markers;
and/or
eliminating sensor delays in the exoskeleton system through a time-consistency algorithm.
6. The embodied intelligence data acquisition method according to claim 1, wherein, after the step of collecting human demonstration data through the exoskeleton system, the human demonstration data comprising video stream data, motion data, and force sense data, the method further comprises:
identifying and removing anomalous data in the human demonstration data;
and/or
denoising the human demonstration data;
and/or
performing an integrity check on the human demonstration data.
7. The embodied intelligence data acquisition method according to claim 1, characterized in that the method further comprises:
exporting the generated embodied intelligence data in conformity with the LeRobot standard format specification.
8. An electronic device, the electronic device comprising:
One or more processors, and
A memory storing computer program instructions that, when executed, cause the processor to perform the steps of the method of any one of claims 1 to 7.
9. A computer readable medium having stored thereon a computer program/instruction, which when executed by a processor, implements the steps of the method according to any of claims 1 to 7.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 7.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant