
CN116766178A - Shaft hole assembly method and related equipment - Google Patents

Shaft hole assembly method and related equipment

Info

Publication number
CN116766178A
CN116766178A (application number CN202310518497.5A)
Authority
CN
China
Prior art keywords
current
training
shaft hole
action
assembly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310518497.5A
Other languages
Chinese (zh)
Inventor
严少华
徐德
郝甜甜
马旭淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202310518497.5A priority Critical patent/CN116766178A/en
Publication of CN116766178A publication Critical patent/CN116766178A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1628 Programme controls characterised by the control loop
    • B25J 9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J 9/1661 Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1679 Programme controls characterised by the tasks executed
    • B25J 9/1687 Assembly, peg and hole, palletising, straight line, weaving pattern movement
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a shaft hole assembly method and related equipment, relating to the field of automated assembly technology. The method includes performing multiple iterative operations, each of which includes: obtaining the current state of the shaft, the current state including current force information; inputting the current state into a shaft hole assembly model to obtain the current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model having been trained on multiple shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning; executing the current assembly action on the shaft; and, if the depth to which the shaft has entered the hole does not meet the preset assembly requirement, performing the next iterative operation. The shaft hole assembly model of the invention can be applied to shaft hole assembly tasks of different categories, thereby improving the versatility and practicality of shaft hole assembly.

Description

Shaft hole assembly method and related equipment

Technical field

The present invention relates to the field of automated assembly technology, and in particular to a shaft hole assembly method and related equipment.

Background

In industry, there is growing demand for robots that complete high-precision assembly tasks automatically. Shaft hole assembly is the core operation of most assembly tasks; common variants include single, dual, and triple shaft hole assembly.

Existing shaft hole assembly methods are usually suitable only for a single class of assembly tasks, within which the shaft hole size and the number of shaft holes are fixed. When either the size or the number of shaft holes changes, the method no longer applies, so such methods lack versatility and practicality.

It is therefore very important to provide a shaft hole assembly method that allows a robot to complete shaft hole assembly tasks of multiple categories.

Summary of the invention

The present invention provides a shaft hole assembly method and related equipment to overcome the limited versatility and practicality of shaft hole assembly in the prior art and to improve the versatility and practicality of shaft hole assembly.

The present invention provides a shaft hole assembly method, comprising:

performing multiple iterative operations, each iterative operation comprising:

obtaining the current state of the shaft, the current state including current force information;

inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on multiple shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning;

executing the current assembly action on the shaft; and

if the depth to which the shaft has entered the hole does not meet a preset assembly requirement, performing the next iterative operation.

In some embodiments, the shaft hole assembly model is trained by the following steps:

constructing multiple shaft hole assembly tasks of different categories; and

performing multiple iterative training operations, each iterative training operation comprising:

randomly selecting one shaft hole assembly task from the multiple shaft hole assembly tasks of different categories as the current training task;

executing the current training task and updating the parameters of the shaft hole assembly model;

after the current training task is completed, obtaining the cumulative reward corresponding to the current training task;

determining, from the cumulative reward corresponding to the current training task and the cumulative rewards corresponding to the preceding training tasks, whether the cumulative rewards of the training tasks have converged; and

if the cumulative rewards of the training tasks have not converged, performing the next iterative training operation.
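
The training steps above can be sketched in code. This is a minimal illustration under assumptions introduced here: the convergence criterion (a small spread over a window of recent cumulative rewards), the `run_task` callback, and all names are not specified by the source.

```python
import random

def rewards_converged(history, window=5, tol=1e-2):
    # Assumed convergence test: the cumulative rewards of the most
    # recent `window` training tasks differ by less than `tol`.
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol

def meta_train(tasks, run_task, max_iters=1000):
    # `run_task(task)` executes one training task, updates the shaft
    # hole assembly model in place, and returns the cumulative reward.
    history = []
    for _ in range(max_iters):
        task = random.choice(tasks)     # randomly select a training task
        history.append(run_task(task))  # execute it and record its reward
        if rewards_converged(history):  # stop once rewards have converged
            break
    return history
```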

In some embodiments, executing the current training task and updating the parameters of the shaft hole assembly model includes:

executing the current training task through multiple iterative training rounds and updating the parameters of the shaft hole assembly model in each round, each iterative training round including:

obtaining the current training state and the similarity between the previous assembly training action and the previous demonstration action, where the previous assembly training action is the training action output, from the previous training state, by the action selection network in the shaft hole assembly model during the previous iterative training round; the previous demonstration action is the demonstration action output by the assembly demonstration model from the previous training state; and the current training state includes the current force information of the training shaft;

inputting the similarity and the current training state into the action evaluation network in the shaft hole assembly model to obtain the current first loss function value;

updating the parameters of the action evaluation network based on the current first loss function value;

inputting the similarity and the current training state into the action selection network in the shaft hole assembly model to obtain the current second loss function value;

updating the parameters of the action selection network based on the current second loss function value; and

if the current training task is not yet completed, executing the next iterative training round.
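
The patent does not define the similarity measure between an assembly training action and a demonstration action, so the sketch below assumes a simple negative exponential of the Euclidean distance between the two 6-D action vectors; both the metric and the function names are illustrative only.

```python
import math

def action_similarity(a_train, a_demo):
    # Hypothetical measure: exp(-||a_train - a_demo||) equals 1.0 for
    # identical actions and decays toward 0 as the actions diverge.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_train, a_demo)))
    return math.exp(-dist)
```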

In some embodiments, inputting the similarity and the current training state into the action evaluation network in the shaft hole assembly model to obtain the current first loss function value includes:

inputting the similarity and the current training state into the action evaluation network to obtain the current state value and the current reward value;

obtaining the current advantage value from the current state value, the current reward value, and the next state value; and

obtaining the current first loss function value from the current advantage value.
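
A common reading of this step, shown as a sketch rather than the patent's exact formulation, is the one-step temporal-difference advantage A = r + γV(s′) − V(s), with the squared advantage serving as the critic's (first) loss; both choices are assumptions here.

```python
def advantage(reward, v_current, v_next, gamma=0.99):
    # One-step TD advantage: A = r + gamma * V(s') - V(s).
    return reward + gamma * v_next - v_current

def first_loss(reward, v_current, v_next, gamma=0.99):
    # Assumed critic loss: the squared advantage (squared TD error).
    return advantage(reward, v_current, v_next, gamma) ** 2
```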

In some embodiments, inputting the similarity and the current training state into the action selection network in the shaft hole assembly model to obtain the current second loss function value includes:

inputting the similarity and the current training state into the action selection network to obtain the current assembly training action; and

obtaining the current second loss function value from the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value.
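
Because the second loss depends on the ratio of the current and previous action probabilities together with the advantage value, it resembles the clipped surrogate objective of PPO. The sketch below is that PPO-style objective, offered as a plausible instantiation, not as the patent's exact loss.

```python
def second_loss(p_current, p_previous, adv, clip=0.2):
    # PPO-style clipped surrogate: the probability ratio is clipped to
    # [1 - clip, 1 + clip], and the loss is the negative of the more
    # pessimistic of the two advantage-weighted terms.
    ratio = p_current / p_previous
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return -min(ratio * adv, clipped * adv)
```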

In some embodiments, before obtaining the similarity between the previous assembly training action and the previous demonstration action, the method further includes:

obtaining teaching states and teaching actions; and

modeling the teaching states and the teaching actions to obtain the assembly demonstration model.
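
The patent does not fix the form of the assembly demonstration model. A minimal stand-in, assumed here purely for illustration, is nearest-neighbour regression over the taught (state, action) pairs:

```python
def build_demo_model(teach_states, teach_actions):
    # Return a function that maps a query state to the action taught
    # for the nearest teaching state (squared Euclidean distance).
    def demo_action(state):
        def dist(s):
            return sum((a - b) ** 2 for a, b in zip(s, state))
        best = min(range(len(teach_states)),
                   key=lambda i: dist(teach_states[i]))
        return teach_actions[best]
    return demo_action
```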

The present invention also provides a shaft hole assembly device, comprising:

an execution module configured to perform multiple iterative operations, each iterative operation comprising:

obtaining the current state of the shaft, the current state including current force information;

inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on multiple shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning;

executing the current assembly action on the shaft; and

if the depth to which the shaft has entered the hole does not meet a preset assembly requirement, performing the next iterative operation.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the shaft hole assembly methods described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the shaft hole assembly methods described above.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the shaft hole assembly methods described above.

The shaft hole assembly method and related equipment provided by the present invention learn multiple shaft hole assembly tasks of different categories through meta-reinforcement learning and use prior knowledge to improve the learning efficiency of meta-reinforcement learning. The resulting shaft hole assembly model can be applied to shaft hole assembly tasks of different categories, thereby improving the versatility and practicality of shaft hole assembly.

Brief description of the drawings

To explain the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is the first schematic flowchart of a shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 2 is the second schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 3 is the third schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 4 is the fourth schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 5 is the fifth schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 6 is the sixth schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention;

Figure 7 is a schematic structural diagram of a shaft hole assembly device provided by an exemplary embodiment of the present invention;

Figure 8 is a schematic diagram of the physical structure of an electronic device provided by an exemplary embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. It should be noted that, provided there is no conflict, the embodiments of the present invention and the features within them may be combined with one another. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present invention belongs. The terminology used in this description is for the purpose of describing specific embodiments only and is not intended to limit the invention.

It should be further noted that, as used herein, the terms "comprises", "includes", or any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.

In the present invention, "at least one" means one or more, and "multiple" means two or more. The terms "first", "second", "third", "fourth", and so on (if present) are used to distinguish similar objects and do not describe a specific order or sequence.

In the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs; rather, these words are intended to present the related concepts in a concrete manner.

Please refer to Figure 1, the first schematic flowchart of a shaft hole assembly method provided by an exemplary embodiment of the present invention. An embodiment of the present invention provides a shaft hole assembly method whose execution subject may be an intelligent agent, for example a robot. The shaft hole assembly method includes the following steps:

Step 110: perform multiple iterative operations.

Specifically, each iterative operation performed by the agent yields one assembly action. Since a single assembly action is usually not enough to assemble the shaft into the hole successfully, the agent performs multiple iterative operations to obtain multiple assembly actions, and the assembly succeeds through this sequence of actions. Each iterative operation may include steps 1101 to 1104.

Step 1101: obtain the current state of the shaft.

In the current iterative operation, the agent obtains the current state of the shaft through a force sensor mounted on the shaft; the current state includes the current force information.

The state s is expressed as follows:

s = [Fx, Fy, Fz, Mx, My, Mz]

where s denotes the state; Fx, Fy, and Fz denote the forces measured by the force sensor along the x, y, and z directions; and Mx, My, and Mz denote the moments about the x, y, and z directions. The x direction is the direction in which the shaft moves forward and backward, the y direction is the direction in which it moves left and right, and the z direction is the direction in which it moves up and down.
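
Assembling the 6-D state from sensor readings is straightforward; the dictionary-based sensor interface below is a hypothetical stand-in for a real force/torque sensor driver:

```python
def read_state(sensor):
    # Build s = [Fx, Fy, Fz, Mx, My, Mz] from one sensor sample.
    return [sensor["Fx"], sensor["Fy"], sensor["Fz"],
            sensor["Mx"], sensor["My"], sensor["Mz"]]
```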

In some embodiments, the numbers of shafts and holes are equal, one shaft corresponding to one hole. The number of shafts may be one, two, three, or more; the present invention does not specifically limit it.

Step 1102: input the current state into the shaft hole assembly model to obtain the current assembly action.

Specifically, in the related art, the Actor-Critic algorithm is used to construct the shaft hole assembly model. Actor-Critic is an algorithm of standard deep reinforcement learning (DRL).

Standard DRL solves a specific Markov decision process (MDP) with a learning algorithm to obtain an optimal policy, which guides the agent to make the best decision (which action to take in which state) for that specific task. The learning algorithms in standard DRL rely on a large number of interactions between the agent and the environment, so training is costly. Once the environment changes, the previously learned optimal policy no longer applies and must be retrained for the new environment, making learning inefficient.

Introducing meta-learning into standard DRL yields meta-reinforcement learning, a learning approach that can quickly adapt to different new tasks and obtain the corresponding optimal policies.

The shaft hole assembly model in the present invention is trained on multiple shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning.

The agent applies meta-reinforcement learning to learn multiple shaft hole assembly tasks of different categories and applies prior knowledge to improve the learning efficiency of meta-reinforcement learning, thereby obtaining a trained shaft hole assembly model. The trained model can therefore be applied to shaft hole assembly tasks of different categories. After training, the agent inputs the obtained current state into the shaft hole assembly model, which outputs the current assembly action. The current assembly action includes a translation amount and a rotation amount.

The assembly action a is expressed as follows:

a = [Δx, Δy, Δz, αx, αy, αz]

where a denotes the assembly action; Δx, Δy, and Δz denote the translations along the x, y, and z directions; and αx, αy, and αz denote the rotations about the x, y, and z directions.

In some embodiments, the agent inputs the current state into the shaft hole assembly model, which outputs an optimal action policy; the action with the highest probability under that policy is selected and de-normalized to obtain the current assembly action.
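
The argmax-and-de-normalize step can be sketched as follows. The linear mapping from a normalized range of [-1, 1] back to physical per-component limits is an assumption; the patent only states that the highest-probability action is de-normalized.

```python
def select_action(probs, actions, low, high):
    # Pick the action with the highest probability, then map each
    # normalized component from [-1, 1] to its physical range [lo, hi].
    best = max(range(len(probs)), key=probs.__getitem__)
    return [lo + (x + 1.0) * (hi - lo) / 2.0
            for x, lo, hi in zip(actions[best], low, high)]
```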

Step 1103: execute the current assembly action on the shaft.

Specifically, after obtaining the current assembly action, the agent executes it on the shaft, that is, it translates and rotates the shaft by the translation and rotation amounts in the current assembly action.

Step 1104: check whether the depth to which the shaft has entered the hole meets the preset assembly requirement.

Specifically, in some embodiments, the preset assembly requirement may include a preset minimum insertion depth and a preset maximum insertion depth.

After executing the current assembly action on the shaft, the agent checks whether the insertion depth of the shaft meets the preset assembly requirement, that is, whether the insertion depth lies between the preset minimum and maximum insertion depths.

If the insertion depth does not meet the preset assembly requirement, that is, if it is detected to be less than the preset minimum insertion depth or greater than the preset maximum insertion depth, the agent performs the next iterative operation, executing steps 1101 to 1104 again.

If the insertion depth meets the preset assembly requirement, that is, if it is detected to lie between the preset minimum and maximum insertion depths, the shaft hole assembly has succeeded and the agent ends the iteration.
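
Steps 1101 to 1104 can be combined into a single loop. The sketch below is a schematic outer loop under assumed callback interfaces (`get_state`, `policy`, `execute`, `get_depth`) and an added step budget to guarantee termination:

```python
def assemble(get_state, policy, execute, get_depth, d_min, d_max,
             max_steps=200):
    # Iterate: read state -> model action -> execute -> depth check.
    for step in range(max_steps):
        action = policy(get_state())   # shaft hole assembly model
        execute(action)                # translate and rotate the shaft
        if d_min <= get_depth() <= d_max:
            return True, step + 1      # preset assembly requirement met
    return False, max_steps            # budget exhausted, not assembled
```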

The shaft hole assembly method provided in this embodiment learns multiple shaft hole assembly tasks of different categories through meta-reinforcement learning and improves the learning efficiency of meta-reinforcement learning through prior knowledge. The resulting shaft hole assembly model can be applied to shaft hole assembly tasks of different categories, thereby improving the versatility and practicality of shaft hole assembly.

Please refer to Figure 2, the second schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention. This embodiment further improves on the preceding embodiment; the main improvement lies in the specific process by which the agent trains the shaft hole assembly model. The flow of this embodiment, shown in Figure 2, includes the following steps:

Step 210: construct multiple shaft hole assembly tasks of different categories.

Specifically, the agent's automatic assembly process can be viewed as an MDP, and each assembly task Tn can be expressed as a five-tuple <Sn, An, Pn, Rn, γ>, where Sn is the finite set of states in the n-th assembly task, a state in the set being written sk ∈ Sn; An is the finite set of actions in the n-th assembly task, an action in the set being written ak ∈ An, with ak an action executable in state sk; Pn is the state transition function of the n-th assembly task, under which executing action ak in state sk leads to state s′k with probability P(s′k | sk, ak); Rn is the reward function; and γ is the discount factor, 0 ≤ γ ≤ 1. How the shaft hole assembly model learns the Markov decision process corresponding to each assembly task is the key to automating multi-category shaft hole assembly tasks. The training process of the shaft hole assembly model is described in detail below.
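
The five-tuple can be written down directly; the container below is only a notational aid, with the transition represented as a callable rather than an explicit probability table:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AssemblyMDP:
    # One assembly task Tn as the five-tuple <Sn, An, Pn, Rn, gamma>.
    states: List            # Sn: finite set of states
    actions: List           # An: finite set of actions
    transition: Callable    # Pn: (s, a) -> next state (or distribution)
    reward: Callable        # Rn: (s, a) -> float
    gamma: float = 0.99     # discount factor, 0 <= gamma <= 1
```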

The agent builds multiple different environments, for example, a physics simulation environment, a mathematical simulation environment, and a real assembly environment. In each environment, multiple shaft hole assembly tasks of different categories are constructed according to the size and number of the shaft holes.

In some embodiments, the multiple shaft hole assembly tasks of different categories may include: tasks in different environments in which both the shaft hole size and the number of shaft holes differ; tasks in different environments with the same shaft hole size but different numbers of shaft holes; tasks in different environments with different shaft hole sizes but the same number of shaft holes; tasks in different environments with the same shaft hole size and the same number of shaft holes; tasks in the same environment with the same shaft hole size but different numbers of shaft holes; and tasks in the same environment with different shaft hole sizes but the same number of shaft holes; among others.

In some embodiments, the shaft hole configuration may be a single shaft hole, a dual shaft hole, a triple shaft hole, and so on.

In the mathematical simulation environment, point contact is assumed between the shaft and the hole, and the principle of elastic deformation is assumed to hold.

The three-dimensional force on the shaft tip is expressed as:

F = δ · Σ_{i=1}^{K} (p_i^1 + p_i^2)

where F denotes the three-dimensional force on the shaft tip, K the number of shaft holes, δ the elastic deformation coefficient, p_i^1 the maximum deformation vector of the i-th shaft at the upper part of the hole, and p_i^2 the maximum deformation vector of the i-th shaft at the lower part of the hole.

The three-dimensional moment on the shaft tip is expressed as:

M = δ · Σ_{i=1}^{K} (l_i^1 × p_i^1 + l_i^2 × p_i^2)

where M denotes the three-dimensional moment on the shaft tip, K the number of shaft holes, δ the elastic deformation coefficient, p_i^1 the maximum deformation vector of the i-th shaft at the upper part of the hole, l_i^1 the offset vector from the center point of the i-th shaft to the point of maximum deformation at the upper part of the hole, p_i^2 the maximum deformation vector of the i-th shaft at the lower part of the hole, and l_i^2 the offset vector from the center point of the i-th shaft to the point of maximum deformation at the lower part of the hole.

The state of the shaft can then be obtained from the three-dimensional force and the three-dimensional moment on the shaft tip.

Step 220: Perform multiple iterative training operations.

Specifically, in order for the shaft hole assembly model to learn from the multiple shaft hole assembly tasks of different categories, the agent performs multiple iterative training operations. Each iterative training operation may include the following steps 2201 to 2205.

Step 2201: Randomly select one shaft hole assembly task from the multiple shaft hole assembly tasks of different categories as the current training task.

Specifically, in each training iteration the agent randomly selects one shaft hole assembly task from the multiple shaft hole assembly tasks of different categories, each task being selected with equal probability. For example, if there are six shaft hole assembly tasks of different categories, each is selected with probability 1/6. The agent takes the selected shaft hole assembly task as the current training task.
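A minimal sketch of the uniform task draw described above (the task names are hypothetical placeholders):

```python
import random

def sample_task(tasks, rng):
    """Each of the N task categories is drawn with equal probability 1/N."""
    return rng.choice(tasks)

tasks = ["single-peg", "dual-peg", "triple-peg",
         "single-peg/sim", "dual-peg/sim", "triple-peg/sim"]
rng = random.Random(0)          # fixed seed for reproducibility
draws = [sample_task(tasks, rng) for _ in range(600)]
```

Over many iterations every category appears, with empirical frequency near 1/6, so no category is starved during meta-training.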

In some embodiments, since each selection is drawn at random from the multiple shaft hole assembly tasks of different categories, the current training task may or may not be the same as the training task of a previous iteration.

Step 2202: Execute the current training task and update the parameters of the shaft hole assembly model.

Specifically, in each training round the shaft hole assembly model outputs one assembly training action; the current training task is executed over multiple training rounds, and the parameters of the shaft hole assembly model are updated in each round.

Step 2203: Detect whether the current training task is completed.

Specifically, whether the current training task is completed can be detected by checking whether the shaft hole has been assembled successfully, or whether the force on the shaft exceeds the maximum radial force.

If it is detected that the shaft hole has been assembled successfully, or that the force on the shaft exceeds the maximum radial force, the current training task is completed and the agent executes step 2204.

If it is detected that the shaft hole has not been assembled successfully and the force on the shaft does not exceed the maximum radial force, the current training task is not completed and the agent returns to step 2202 to continue executing the current training task.
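The two termination conditions above can be sketched as a single predicate (the argument names are assumptions of this sketch):

```python
def task_finished(assembled: bool, radial_force: float, max_radial_force: float) -> bool:
    """The training task ends on successful assembly OR excessive radial force."""
    return assembled or radial_force > max_radial_force
```

Note that an episode ending on excessive force is still "finished" for bookkeeping purposes; success and failure are distinguished by the `assembled` flag, not by this predicate.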

Step 2204: Obtain the cumulative reward corresponding to the current training task.

Specifically, each training round yields one reward, and the rewards of the multiple training rounds are summed to obtain the cumulative reward corresponding to the current training task.

Step 2205: Detect whether the cumulative rewards of the training tasks have converged.

Specifically, the training tasks include the current training task and the preceding training tasks. The agent obtains the cumulative rewards corresponding to the preceding training tasks from the previous training iterations.

Based on the cumulative reward of the current training task and the cumulative rewards of the preceding training tasks, the agent determines whether the cumulative rewards of the training tasks have converged. Concretely, starting from the cumulative reward of the current training task, it checks whether the cumulative rewards of a preset number of training tasks are close to one another. If they are close, the cumulative rewards of the training tasks have converged; if they differ significantly, the cumulative rewards have not converged.
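One way to implement "the cumulative rewards of a preset number of training tasks are close" is a windowed range test; the window size and tolerance below are assumed parameters, not values from the patent:

```python
def rewards_converged(cumulative_rewards, window=5, tol=1.0):
    """True when the last `window` per-task cumulative rewards differ by at most `tol`."""
    if len(cumulative_rewards) < window:
        return False                     # not enough tasks observed yet
    recent = cumulative_rewards[-window:]
    return max(recent) - min(recent) <= tol
```

The list holds one cumulative reward per completed training task, newest last; training stops once the recent values plateau within the tolerance.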

If the cumulative rewards of the training tasks have converged, the agent ends the iterative training and obtains the trained shaft hole assembly model; if they have not converged, the agent performs the next training iteration, that is, executes steps 2201 to 2205 again.

The shaft hole assembly method provided in this embodiment constructs multiple shaft hole assembly tasks of different categories and randomly selects one of them for training, so that the trained shaft hole assembly model can adapt to shaft hole assembly tasks of different categories, improving the versatility and practicality of the shaft hole assembly model.

Please refer to FIG. 3, which is a third schematic flowchart of a shaft hole assembly method provided by an exemplary embodiment of the present invention. This embodiment elaborates on the previous embodiment, mainly describing the specific process by which the agent executes the current training task and updates the parameters of the shaft hole assembly model. As shown in FIG. 3, the process of this embodiment includes the following steps:

Step 310: Execute the current training task over multiple training rounds, updating the parameters of the shaft hole assembly model in each round.

Specifically, the shaft hole assembly model includes an action selection network and an action evaluation network. The action selection network is mainly used to train the action selection policy and thereby determine the optimal assembly action. The action evaluation network is used to score assembly actions and thus guide the model toward the optimal assembly action.

In some embodiments, the action selection network may be an Actor network, and the action evaluation network may be a Critic network.

Each training round corresponds to one assembly action; multiple training rounds yield multiple assembly actions, through which the current training task is executed. The parameters of the action selection network and the action evaluation network are updated in each training round.

Each training round may include the following steps 3101 to 3106.

Step 3101: Obtain the similarity between the previous assembly training action and the previous demonstration action, together with the current training state.

Specifically, the agent obtains the current training state through a force sensor. The current training state includes the current force information of the training shaft.

The agent obtains the previous assembly training action and the previous demonstration action. The previous assembly training action is the assembly training action output by the action selection network based on the previous training state; the previous demonstration action is the demonstration action output by the assembly demonstration model based on the previous training state. The assembly demonstration model is a Gaussian model built from prior knowledge.

The agent computes the degree of similarity between the previous assembly training action and the previous demonstration action, obtaining the similarity.

In some embodiments, the similarity is denoted m_{t-1}, representing the degree of similarity between the assembly training action output by the action selection network based on the training state s_{t-1} of the (t-1)-th training round and the demonstration action output by the assembly demonstration model based on the same training state s_{t-1}.
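The patent shows the similarity formula only as an image. As a stand-in, a bounded kernel over the action difference illustrates the idea; this exact form is an assumption of the sketch, not the patent's formula:

```python
import math

def action_similarity(a_train, a_demo, scale=1.0):
    """Hypothetical similarity m in (0, 1]: equal to 1 when the training action
    matches the demonstration action, decaying with their squared distance."""
    d2 = sum((x - y) ** 2 for x, y in zip(a_train, a_demo))
    return math.exp(-d2 / scale)
```

Any measure with these properties (maximal when the actions agree, smoothly decreasing as they diverge) would serve the same role of telling the critic how closely the policy tracks the demonstration.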

Step 3102: Input the similarity and the current training state into the action evaluation network of the shaft hole assembly model to obtain the current first loss function value.

Specifically, the agent inputs the similarity and the current training state into the action evaluation network of the shaft hole assembly model and obtains the data output by the action evaluation network.

In some embodiments, the data output by the action evaluation network may include data for evaluating the value of the current state, and the reward value corresponding to the similarity and the current training state, among others.

The agent obtains the current first loss function value from the data output by the action evaluation network.

Step 3103: Update the parameters of the action evaluation network based on the current first loss function value.

Specifically, the agent sets a preset first loss function value according to the permitted accuracy of the action evaluation network and compares the current first loss function value against it. If the current first loss function value is greater than the preset first loss function value, the parameters of the action evaluation network are updated; if it is smaller, the parameters of the action evaluation network are already optimal and need not be updated.

In some embodiments, the optimal parameters of the action evaluation network include the weight parameters θ_V.

Step 3104: Input the similarity and the current training state into the action selection network of the shaft hole assembly model to obtain the current second loss function value.

Specifically, the agent inputs the similarity and the current training state into the action selection network of the shaft hole assembly model and obtains the data output by the action selection network.

In some embodiments, the data output by the action selection network may include the current action selection policy and the current assembly training action, among others.

The agent obtains the current second loss function value from the data output by the action selection network together with the data output by the action evaluation network.

Step 3105: Update the parameters of the action selection network based on the current second loss function value.

Specifically, the agent sets a preset second loss function value according to the permitted accuracy of the action selection network and compares the current second loss function value against it. If the current second loss function value is greater than the preset second loss function value, the parameters of the action selection network are updated; if it is smaller, the parameters of the action selection network are already optimal and need not be updated.

In some embodiments, the optimal parameters of the action selection network include the weight parameters θ_A.

Step 3106: Detect whether the current training task is completed.

Specifically, whether the current training task is completed can be detected by checking whether the shaft hole has been assembled successfully, or whether the force on the shaft exceeds the maximum radial force.

If it is detected that the shaft hole has been assembled successfully, or that the force on the shaft exceeds the maximum radial force, the current training task is completed and the agent proceeds to the next training task.

If it is detected that the shaft hole has not been assembled successfully and the force on the shaft does not exceed the maximum radial force, the current training task is not completed and the agent executes the next training round, that is, steps 3101 to 3106 again.

The shaft hole assembly method provided in this embodiment inputs the similarity and the current training state into the action evaluation network to obtain the current first loss function value and updates the parameters of the action evaluation network accordingly, and inputs the similarity and the current training state into the action selection network to obtain the current second loss function value and updates the parameters of the action selection network accordingly, thereby training both the action evaluation network and the action selection network accurately.

Please refer to FIG. 4, which is a fourth schematic flowchart of a shaft hole assembly method provided by an exemplary embodiment of the present invention. This embodiment elaborates on the previous embodiment, mainly describing the specific process of inputting the similarity and the current training state into the action evaluation network of the shaft hole assembly model to obtain the current first loss function value. As shown in FIG. 4, the process of this embodiment includes the following steps:

Step 410: Input the similarity and the current training state into the action evaluation network to obtain the current state value and the current reward value.

Specifically, the agent inputs the similarity and the current training state into the action evaluation network, which outputs the current state value and the current reward value. Both the current state value and the current reward value correspond to the similarity and the current training state.

In the t-th training iteration, the state value output by the action evaluation network is V_t = E_{T_i∼p(T)}[V(s_t, m_{t-1}; θ_V)], where V(·) denotes the state value function, T_i the i-th training task, p(T) the probability of drawing a training task, s_t the training state of the t-th training round, m_{t-1} the similarity between the assembly training action output by the action selection network based on the training state s_{t-1} of the (t-1)-th training round and the demonstration action output by the assembly demonstration model based on the same state, and θ_V the weight parameters of the action evaluation network. A state value fused with the similarity characterizes the value of a state more effectively.

In the t-th training iteration, the reward value output by the action evaluation network is expressed as follows:

r_t = R(s_t, m_{t-1})

where r_t denotes the reward value of the t-th training round, R(·) the reward function, s_t the training state of the t-th training round, and m_{t-1} the similarity between the assembly training action output by the action selection network based on the training state s_{t-1} of the (t-1)-th training round and the demonstration action output by the assembly demonstration model based on the same state.

Step 420: Obtain the current advantage value from the current state value, the current reward value, and the next state value.

Specifically, the agent inputs the similarity between the current assembly training action and the current demonstration action, together with the next training state, into the action evaluation network, which outputs the next state value. Here, the current assembly training action is the assembly training action output by the action selection network based on the current training state, and the current demonstration action is the demonstration action output by the assembly demonstration model based on the current training state.

In the t-th training round, the advantage value is expressed as follows:

A_t = r_t + γ · V_{t+1} − V_t

where A_t denotes the advantage value of the t-th training round, r_t the reward value of the t-th training round, and γ the discount factor with 0 ≤ γ ≤ 1. V_{t+1} = E_{T_i∼p(T)}[V(s_{t+1}, m_t; θ_V)] is the state value of the (t+1)-th training round and V_t = E_{T_i∼p(T)}[V(s_t, m_{t-1}; θ_V)] the state value of the t-th training round, where V(·) denotes the state value function, T_i the i-th training task, p(T) the probability of drawing a training task, s_{t+1} and s_t the training states of the (t+1)-th and t-th training rounds, m_t the similarity between the assembly training action output by the action selection network based on training state s_t and the demonstration action output by the assembly demonstration model based on the same state, m_{t-1} the corresponding similarity for the (t-1)-th round, and θ_V the weight parameters of the action evaluation network.

After obtaining the next state value, the agent substitutes the current state value, the current reward value, and the next state value into the advantage expression to obtain the current advantage value.
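The substitution above amounts to a one-step temporal-difference advantage, which can be sketched directly (the function name is my own):

```python
def advantage(r_t: float, v_t: float, v_next: float, gamma: float = 0.99) -> float:
    """One-step advantage: A_t = r_t + gamma * V_{t+1} - V_t, with 0 <= gamma <= 1."""
    return r_t + gamma * v_next - v_t
```

A positive A_t means the executed assembly action performed better than the critic's estimate of the state, so the policy should make that action more likely.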

Step 430: Obtain the current first loss function value from the current advantage value.

Specifically, the first loss function is expressed as follows:

L(θ_V) = E_{T_i∼p(T)}[E_t[A_t²]]

where L(θ_V) denotes the first loss function, A_t the advantage value of the t-th training round, T_i the i-th training task, p(T) the probability of drawing a training task, and E_t[·] the mathematical expectation, taken over the training round index t, for the task T_i drawn with probability p(T).

After obtaining the current advantage value, the agent substitutes it into the expression of the first loss function to obtain the current first loss function value.
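The first-loss formula appears only as an image in the source; a common critic loss of this shape is the empirical mean of the squared advantage over training rounds, sketched here under that assumption:

```python
def first_loss(advantages):
    """Assumed critic loss L(theta_V): the empirical expectation over rounds t of A_t^2."""
    return sum(a * a for a in advantages) / len(advantages)
```

Minimizing this drives the value estimates toward the observed one-step returns, since a zero advantage means the critic predicted the round perfectly.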

The shaft hole assembly method provided in this embodiment inputs the similarity and the current training state into the action evaluation network to obtain the current state value and the current reward value, obtains the current advantage value from the current state value, the current reward value, and the next state value, and obtains the current first loss function value from the current advantage value. This yields an accurate current first loss function value, which helps update the parameters of the action evaluation network accurately and, in turn, helps obtain an accurate shaft hole assembly model.

Please refer to FIG. 5, which is a fifth schematic flowchart of a shaft hole assembly method provided by an exemplary embodiment of the present invention. This embodiment elaborates on the previous embodiment, mainly describing the specific process of inputting the similarity and the current training state into the action selection network of the shaft hole assembly model to obtain the current second loss function value. As shown in FIG. 5, the process of this embodiment includes the following steps:

Step 510: Input the similarity and the current training state into the action selection network to obtain the current assembly training action.

Specifically, the agent inputs the similarity and the current training state into the action selection network, which outputs the current action selection policy. An action selection policy is a set of assembly actions with their corresponding probabilities; it follows a Gaussian distribution, with assembly actions on the horizontal axis and probabilities on the vertical axis.

In the t-th training round, the action selection policy is π(a_t | s_t, m_{t-1}; θ_A), where θ_A denotes the weight parameters of the action selection network, a_t the assembly training action of the t-th training round, s_t the training state of the t-th training round, and m_{t-1} the similarity between the assembly training action output by the action selection network based on the training state s_{t-1} of the (t-1)-th training round and the demonstration action output by the assembly demonstration model based on the same state.

The agent takes the assembly action with the highest probability under the current action selection policy as the current assembly training action.
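For a Gaussian action selection policy, the most probable action is simply the distribution's mean. A one-dimensional sketch (the policy parameters and candidate actions are assumed values) makes this concrete:

```python
import math

def gaussian_pdf(a: float, mu: float, sigma: float) -> float:
    """Density of an action-selection policy modeled as N(mu, sigma^2)."""
    return math.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

mu, sigma = 0.3, 0.1                    # assumed policy parameters
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]  # candidate assembly actions
best_action = max(candidates, key=lambda a: gaussian_pdf(a, mu, sigma))
```

Scoring candidates by density and taking the argmax recovers the mean, which is why greedy execution of a Gaussian policy reduces to outputting its mean action.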

Step 520: Obtain the current second loss function value from the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value.

Specifically, the second loss function is expressed as follows:

L(θ_A) = E_{T_i∼p(T)}[E_t[min(r_t(θ) · A_t, clip(r_t(θ), 1 − ε, 1 + ε) · A_t)]]

where L(θ_A) denotes the second loss function, θ_A the weight parameters of the action selection network, T_i the i-th training task, p(T) the probability of drawing the current training task, E_t[·] the mathematical expectation, taken over the training round index t, for the task T_i drawn with probability p(T), r_t(θ) the ratio of the probabilities assigned to the assembly training action by the new and old action selection policies, A_t the advantage value of the t-th training round, and ε a hyperparameter, typically taken as 0.2.

in which

r_t(θ) = π(a_t | s_t, m_{t-1}; θ_A) / π(a_t | s_t, m_{t-1}; θ_A^old)

where r_t(θ) denotes the ratio of the probabilities assigned to the assembly training action by the new and old action selection policies, π(a_t | s_t, m_{t-1}; θ_A) the probability of the assembly training action under the new action selection policy, θ_A the weight parameters of the action selection network, π(a_t | s_t, m_{t-1}; θ_A^old) the probability of the assembly training action under the old action selection policy, and θ_A^old the weight parameters not yet updated, relative to θ_A.

The probability of the current assembly training action is the probability of the assembly training action under the new action selection policy; the probability of the previous assembly training action is the probability of the assembly training action under the old action selection policy.

The agent substitutes the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value into the second loss function to obtain the current second loss function value.
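The per-round term of this second loss, a clipped surrogate of the kind used in proximal policy optimization, matching the probability ratio r_t(θ) and the ε-clip described above, can be sketched as:

```python
def clipped_term(ratio: float, adv: float, eps: float = 0.2) -> float:
    """min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) for one training round."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)
```

The clip caps how much a single update can change the action probability: whenever the ratio drifts outside [1 − ε, 1 + ε], the term stops improving, so the new policy stays close to the old one.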

The shaft hole assembly method provided in this embodiment inputs the similarity and the current training state into the action selection network to obtain the current assembly training action, and obtains the current second loss function value from the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value. This yields an accurate current second loss function value, which helps update the parameters of the action selection network accurately and, in turn, helps obtain an accurate shaft hole assembly model.

Please refer to FIG. 6, which is the sixth schematic flowchart of the shaft hole assembly method provided by an exemplary embodiment of the present invention. This embodiment elaborates on the foregoing embodiments and mainly describes the specific process of obtaining the assembly demonstration model. As shown in FIG. 6, the process includes the following steps:

Step 610: obtain the teaching states and teaching actions.

Specifically, the prior knowledge may be teaching data from the shaft hole assembly process; the teaching data includes teaching states and teaching actions.

The agent obtains the teaching states and teaching actions through a learning-from-demonstration algorithm; that is, it imitates the shaft hole assembly process to acquire the teaching states and teaching actions.

Step 620: model the teaching states and teaching actions to obtain the assembly demonstration model.

Specifically, the initial assembly demonstration model is a Gaussian mixture model whose expression is:

p(ζ, α | θ) = Σ_{k=1}^{K} λ_k N(ζ, α; μ_k, Σ_k)

where α denotes the teaching action, ζ denotes the teaching state, λ_k denotes the weight coefficient of the k-th Gaussian component, μ_k denotes the mean of the k-th Gaussian component, Σ_k denotes the covariance of the k-th Gaussian component, θ = {θ_1, θ_2, …, θ_K} are the parameters of the Gaussian mixture model, and θ_k = {λ_k, μ_k, Σ_k}.

The agent inputs the teaching states and teaching actions into the initial assembly demonstration model and determines the parameters of the model, thereby obtaining the constructed assembly demonstration model.
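Once the parameters θ_k = {λ_k, μ_k, Σ_k} have been estimated (e.g., by expectation-maximization), the mixture density above can be evaluated directly over joint state-action vectors; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def demo_model_density(x, weights, means, covs):
    """p(zeta, alpha) = sum_k lambda_k * N(x; mu_k, Sigma_k),
    where x stacks the teaching state zeta and teaching action alpha."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```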

The shaft hole assembly method provided by this embodiment constructs the assembly demonstration model from the teaching states and teaching actions, which makes it possible to use the assembly demonstration model to obtain the similarity, thereby accelerating the efficiency with which the shaft hole assembly model learns multiple categories of skills.

The shaft hole assembly device provided by the present invention is described below; the shaft hole assembly device described below and the shaft hole assembly method described above may be referred to in correspondence with each other.

FIG. 7 is a schematic structural diagram of a shaft hole assembly device provided by an exemplary embodiment of the present invention. As shown in FIG. 7, the shaft hole assembly device includes an execution module 710.

The execution module 710 is configured to perform a plurality of iterative operations, each of which includes:

obtaining the current state of the shaft, the current state including current force information;

inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning;

performing the current assembly action on the shaft; and

performing the next iterative operation in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement.
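The iterative operation performed by the execution module can be sketched as a simple control loop; the callbacks get_state, policy, execute and depth_ok are illustrative stand-ins for the robot interface, and max_steps is an assumed safety budget not stated in the patent:

```python
def run_assembly(get_state, policy, execute, depth_ok, max_steps=200):
    """Iterative peg-in-hole loop: read the force state, query the trained
    model for a (translation, rotation) action, execute it on the shaft,
    and stop once the insertion depth meets the assembly requirement."""
    for step in range(max_steps):
        state = get_state()        # current state, incl. force information
        action = policy(state)     # (translation amount, rotation amount)
        execute(action)            # apply the action on the shaft
        if depth_ok():             # preset assembly requirement met?
            return step + 1        # number of iterations used
    return None                    # not assembled within the budget
```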

In some embodiments, the shaft hole assembly device further includes a construction module and a training module. Wherein:

The construction module is configured to construct a plurality of shaft hole assembly tasks of different categories.

The training module is configured to perform a plurality of iterative training operations. The training module includes an extraction sub-module, an execution sub-module, an acquisition sub-module, and a determination sub-module, wherein:

the extraction sub-module is configured to randomly extract one shaft hole assembly task from the plurality of shaft hole assembly tasks of different categories as the current training task;

the execution sub-module is configured to execute the current training task and update the parameters of the shaft hole assembly model;

the acquisition sub-module is configured to obtain, after the current training task is completed, the cumulative reward corresponding to the current training task;

the determination sub-module is configured to determine the convergence of the cumulative rewards corresponding to the training tasks according to the cumulative reward corresponding to the current training task and the previously obtained cumulative rewards corresponding to the preceding training tasks; and

the execution sub-module is further configured to perform the next iterative training in the case where the cumulative rewards corresponding to the training tasks have not converged.
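A minimal sketch of this iterative training operation; reading "the cumulative rewards have converged" as the recent rewards agreeing to within a tolerance over a sliding window is an assumption for illustration, as is the train_task callback that trains on one task and returns its cumulative reward:

```python
import random

def meta_train(tasks, train_task, tol=1e-3, window=5, max_iters=1000):
    """Outer training loop: randomly draw a task category, execute it
    (updating the assembly model), record its cumulative reward, and stop
    once the last `window` rewards lie within `tol` of each other."""
    rewards = []
    for _ in range(max_iters):
        task = random.choice(tasks)        # random shaft hole assembly task
        rewards.append(train_task(task))   # trains model, returns reward
        recent = rewards[-window:]
        if len(recent) == window and max(recent) - min(recent) < tol:
            break                          # cumulative rewards converged
    return rewards
```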

In some embodiments, the execution sub-module is specifically configured to execute the current training task through a plurality of iterative training rounds and to update the parameters of the shaft hole assembly model in each iterative training round. The execution sub-module includes a first acquisition unit, a second acquisition unit, a first update unit, a third acquisition unit, a second update unit, and an execution unit.

The first acquisition unit is configured to obtain the similarity between the previous assembly training action and the previous demonstration action, together with the current training state. The previous assembly training action is the training action output, in the previous iterative training round, by the action selection network in the shaft hole assembly model according to the previous training state; the previous demonstration action is the demonstration action output by the assembly demonstration model according to the previous training state; and the current training state includes the current force information of the training shaft.

The second acquisition unit is configured to input the similarity and the current training state into the action evaluation network in the shaft hole assembly model to obtain the current first loss function value.

The first update unit is configured to update the parameters of the action evaluation network based on the current first loss function value.

The third acquisition unit is configured to input the similarity and the current training state into the action selection network in the shaft hole assembly model to obtain the current second loss function value.

The second update unit is configured to update the parameters of the action selection network based on the current second loss function value.

The execution unit is configured to perform the next iterative training round in the case where the current training task is not completed.

In some embodiments, the second acquisition unit includes a first acquisition sub-unit, a second acquisition sub-unit, and a third acquisition sub-unit. Wherein:

The first acquisition sub-unit is configured to input the similarity and the current training state into the action evaluation network to obtain the current state value and the current reward value.

The second acquisition sub-unit is configured to obtain the current advantage value according to the current state value, the current reward value, and the next state value.

The third acquisition sub-unit is configured to obtain the current first loss function value according to the current advantage value.
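The current advantage value described above matches the one-step temporal-difference form A_t = r_t + γV(s_{t+1}) − V(s_t); taking the mean squared advantage as the first loss function is an assumption for illustration, since the patent only states that the first loss is obtained from the advantage value:

```python
def advantage(value, reward, next_value, gamma=0.99):
    """One-step advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * next_value - value

def critic_loss(values, rewards, next_values, gamma=0.99):
    """Assumed first loss: mean squared advantage (TD error) over a batch."""
    errs = [advantage(v, r, nv, gamma)
            for v, r, nv in zip(values, rewards, next_values)]
    return sum(e * e for e in errs) / len(errs)
```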

In some embodiments, the third acquisition unit includes a fourth acquisition sub-unit and a fifth acquisition sub-unit. Wherein:

The fourth acquisition sub-unit is configured to input the similarity and the current training state into the action selection network to obtain the current assembly training action.

The fifth acquisition sub-unit is configured to obtain the current second loss function value based on the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value.

In some embodiments, the shaft hole assembly device further includes a first acquisition module and a second acquisition module. Wherein:

The first acquisition module is configured to obtain the teaching states and teaching actions.

The second acquisition module is configured to model the teaching states and teaching actions to obtain the assembly demonstration model.

It should be noted here that the shaft hole assembly device provided by the present invention can implement all the method steps implemented by the foregoing method embodiments and can achieve the same technical effects; the parts of this embodiment that are identical to the method embodiments, and their beneficial effects, will not be described again in detail.

FIG. 8 is a schematic diagram of the physical structure of an electronic device provided by an exemplary embodiment of the present invention. As shown in FIG. 8, the electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840, where the processor 810, the communications interface 820, and the memory 830 communicate with one another through the communication bus 840. The processor 810 may invoke logical instructions in the memory 830 to execute the shaft hole assembly method, the method including: performing a plurality of iterative operations, each iterative operation including: obtaining the current state of the shaft, the current state including current force information; inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning; performing the current assembly action on the shaft; and performing the next iterative operation in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement.

In addition, the logical instructions in the memory 830 described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the shaft hole assembly method provided by the methods above, the method including: performing a plurality of iterative operations, each iterative operation including: obtaining the current state of the shaft, the current state including current force information; inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning; performing the current assembly action on the shaft; and performing the next iterative operation in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the shaft hole assembly method provided by the methods above is implemented, the method including: performing a plurality of iterative operations, each iterative operation including: obtaining the current state of the shaft, the current state including current force information; inputting the current state into a shaft hole assembly model to obtain a current assembly action, the current assembly action including a translation amount and a rotation amount, the shaft hole assembly model being trained on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning; performing the current assembly action on the shaft; and performing the next iterative operation in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the solution without creative effort.

Through the description of the embodiments above, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware. Based on this understanding, the part of the above technical solutions that in essence contributes to the prior art may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A shaft hole assembling method, characterized by comprising:
performing a plurality of iterative operations; each iterative operation includes:
acquiring the current state of the shaft; the current state comprises current force information;
inputting the current state into a shaft hole assembly model to obtain a current assembly action; the current assembly action comprises a translation amount and a rotation amount; the shaft hole assembly model is obtained by training on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning;
performing the current assembly action on the shaft;
in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement, the next iterative operation is performed.
2. The shaft hole assembly method according to claim 1, wherein the shaft hole assembly model is trained based on the steps of:
constructing a plurality of shaft hole assembly tasks of different categories;
performing a plurality of iterative training operations; each iterative training operation includes:
randomly extracting one shaft hole assembly task from the plurality of shaft hole assembly tasks of different categories as the current training task;
executing the current training task and updating parameters of the shaft hole assembly model;
after the current training task is completed, acquiring a cumulative reward corresponding to the current training task;
determining the convergence of the cumulative rewards corresponding to the training tasks according to the cumulative reward corresponding to the current training task and the obtained cumulative rewards corresponding to the preceding training tasks;
and executing the next iterative training under the condition that the accumulated rewards corresponding to the training tasks are not converged.
3. The shaft hole assembly method according to claim 2, wherein the performing the current training task and updating parameters of the shaft hole assembly model comprises:
executing the current training task through a plurality of iterative training rounds, and updating parameters of the shaft hole assembly model in each iterative training round; wherein each iterative training round comprises:
obtaining the similarity between the previous assembly training action and the previous demonstration action, and the current training state; the previous assembly training action is a training action output by an action selection network in the shaft hole assembly model according to the previous training state in the previous iterative training round; the previous demonstration action is a demonstration action output by the assembly demonstration model according to the previous training state; the current training state comprises current force information of a training shaft;
inputting the similarity and the current training state into an action evaluation network in the shaft hole assembly model to obtain a current first loss function value;
updating parameters of an action evaluation network based on the current first loss function value;
inputting the similarity and the current training state into an action selection network in the shaft hole assembly model to obtain a current second loss function value;
updating parameters of the action selection network based on the current second loss function value;
and executing the next iteration training round under the condition that the current training task is not completed.
4. The shaft hole assembly method according to claim 3, wherein the inputting the similarity and the current training state into the action evaluation network in the shaft hole assembly model to obtain a current first loss function value comprises:
inputting the similarity and the current training state into the action evaluation network to obtain a current state value and a current reward value;
acquiring a current advantage value according to the current state value, the current reward value and the next state value;
and acquiring the current first loss function value according to the current advantage value.
5. The shaft hole assembly method according to claim 4, wherein the inputting the similarity and the current training state into the action selection network in the shaft hole assembly model to obtain a current second loss function value comprises:
inputting the similarity and the current training state into the action selection network to obtain a current assembly training action;
and acquiring the current second loss function value based on the probability of the current assembly training action, the probability of the previous assembly training action, and the current advantage value.
6. The shaft hole assembly method according to claim 3, further comprising, before the obtaining of the similarity between the previous assembly training action and the previous demonstration action:
acquiring a teaching state and a teaching action;
modeling the teaching state and the teaching action to obtain an assembly demonstration model.
7. A shaft hole assembling device, characterized by comprising:
the execution module is used for executing a plurality of iterative operations; each iterative operation includes:
acquiring the current state of the shaft; the current state comprises current force information;
inputting the current state into a shaft hole assembly model to obtain a current assembly action; the current assembly action comprises a translation amount and a rotation amount; the shaft hole assembly model is obtained by training on a plurality of shaft hole assembly tasks of different categories based on prior knowledge and meta-reinforcement learning;
performing the current assembly action on the shaft;
in the case where the depth to which the shaft is inserted into the hole does not meet the preset assembly requirement, the next iterative operation is performed.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the shaft hole assembly method according to any one of claims 1 to 6 when executing the computer program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the shaft hole fitting method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the shaft hole assembly method according to any one of claims 1 to 6.
CN202310518497.5A 2023-05-09 2023-05-09 Shaft hole assembly method and related equipment Pending CN116766178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310518497.5A CN116766178A (en) 2023-05-09 2023-05-09 Shaft hole assembly method and related equipment

Publications (1)

Publication Number Publication Date
CN116766178A true CN116766178A (en) 2023-09-19

Family

ID=87993861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310518497.5A Pending CN116766178A (en) 2023-05-09 2023-05-09 Shaft hole assembly method and related equipment

Country Status (1)

Country Link
CN (1) CN116766178A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942512A (en) * 2019-11-27 2020-03-31 大连理工大学 Meta-learning-based indoor scene reconstruction method
CN111890357A (en) * 2020-07-01 2020-11-06 广州中国科学院先进技术研究所 An intelligent robot grasping method based on action demonstration and teaching
CN113674324A (en) * 2021-08-27 2021-11-19 常州唯实智能物联创新中心有限公司 Class-level 6D pose tracking method, system and device based on meta-learning
US20220105624A1 (en) * 2019-01-23 2022-04-07 Google Llc Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning
CN115338610A (en) * 2022-07-04 2022-11-15 中国科学院自动化研究所 Biaxial hole assembling method and device, electronic device and storage medium
CN115674204A (en) * 2022-11-03 2023-02-03 湘潭大学 A robot shaft hole assembly method based on deep reinforcement learning and admittance control


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination