
CN116884000A - Relational referring expression understanding method and device based on visual language and relation detection - Google Patents

Relational referring expression understanding method and device based on visual language and relation detection

Info

Publication number
CN116884000A
CN116884000A
Authority
CN
China
Prior art keywords
relation
similarity
reference object
language
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310837433.1A
Other languages
Chinese (zh)
Other versions
CN116884000B (en)
Inventor
宋伟
金天磊
张格格
袭向明
孟启炜
谢冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310837433.1A priority Critical patent/CN116884000B/en
Publication of CN116884000A publication Critical patent/CN116884000A/en
Application granted granted Critical
Publication of CN116884000B publication Critical patent/CN116884000B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational referring expression understanding method and device based on visual language and relation detection, comprising the following steps: decomposing the relational referring expression into a reference object, a referring relation and a referred object; obtaining candidate entities in the scene image by using a target detection algorithm; calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities and the similarity between the language features of the referred object and the visual features of the candidate entities, to obtain the reference-object similarities and the referred-object similarities; calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation; building an adjacency list from the reference-object similarities, the referred-object similarities and the referring-relation probabilities; and calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the highest composite probability determining the entity to which the relational referring expression refers. The invention improves the human-robot interaction capability of service robots with respect to relational referring expression understanding.

Description

Relational referring expression understanding method and device based on visual language and relation detection
Technical Field
The invention belongs to the technical field of computer vision and human-robot interaction, and particularly relates to a relational referring expression understanding method and device based on visual language and relation detection.
Background
During language interaction with humans, entities in a scene are often described through diverse referring expressions; the service robot must locate the objects in the scene according to the referring expression uttered by the human and find the entity the expression refers to. Referring expressions take several forms, for example pronoun referring expressions such as "that thing" and "this cup", noun-phrase referring expressions such as "yellow cup" and "cola", and relational referring expressions such as "the cup beside the cola" and "the white cup on the table".
With the continuous development of the field of human-robot interaction, research on referring expression understanding has become particularly important. Existing referring expression understanding methods mainly include methods based on image-text retrieval, methods based on visual grounding, and methods based on gaze estimation and limb behavior recognition.
In referring expression understanding based on image-text retrieval, traditional image-text retrieval methods are limited by their datasets and mainly handle referring expressions consisting of common adjective-noun phrases and proper nouns. Further research trains visual language models in an open environment with a cross-modal retrieval approach, so that the entity described by a text can be found among objects common in daily life. However, existing image-text retrieval methods lack an analysis of the relations between objects and therefore cannot understand relational noun phrases or pronouns.
In referring expression understanding based on gaze estimation and limb behavior recognition, the robot infers the gaze direction or limb direction of the human to determine the specific object of interaction and thereby understand the intention; this approach is mainly used for understanding pronoun referring expressions.
In referring expression understanding based on visual grounding, visual grounding methods can analyze the whole image and mainly handle adjective-noun phrases. They place high requirements on the dataset, are often limited to the descriptive texts contained in it, and cannot handle proper nouns outside the dataset.
In daily life, humans habitually use relational referring expressions to describe specific entities in a scene. A relational referring expression is generally composed of three parts, a reference object, a relation and a referred object, so during understanding the robot has to analyze all three parts jointly. Current research on referring expression understanding is mostly realized with methods such as face orientation and pose estimation, or handles noun-phrase referring expressions with visual language models, while relational referring expressions are largely neglected.
Disclosure of Invention
In view of the above, the present invention aims to provide a relational referring expression understanding method and device based on visual language and relation detection, which use a visual language model to associate the reference object and the referred object with entities in the scene, use a relation detection algorithm to detect the referring relation, build an adjacency list and compute composite probabilities, thereby understanding the relational referring expression as a whole and improving the robot's ability to understand human natural language in human-robot interaction.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
In a first aspect, an embodiment of the present invention provides a relational referring expression understanding method based on visual language and relation detection, comprising the following steps:
step 1, decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
step 2, obtaining candidate entities in the scene image by using a target detection algorithm;
step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers.
In step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation comprises:
the relation detection algorithm outputs the probabilities of different relation categories between every two objects;
the relation category is the relation category corresponding to the language expression of the referring relation;
the relation probabilities between every two candidate entities on the relation category of the referring relation, calculated with the relation detection algorithm, are the referring-relation probabilities.
In step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category means building the adjacency list with the reference-object similarities as the abscissa, the referred-object similarities as the ordinate, and the referring-relation probabilities as the values.
In step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the candidate entity associated with the highest composite probability being the entity to which the relational referring expression refers, comprises:
multiplying each referring-relation probability in the adjacency list by the reference-object similarity of the corresponding abscissa and the referred-object similarity of the corresponding ordinate to obtain the composite probabilities;
taking the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability as the entity to which the relational referring expression finally refers.
In step 1, a relational referring expression is a human language expression that describes a target object by means of its relation to another salient object;
the other salient object is the reference object;
the relation between the other salient object and the target object is the referring relation;
the target object is the referred object.
In step 2, the target detection algorithm detects the real objects observable in the scene and extracts a picture of each of them as a candidate entity.
In step 3, the visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one to obtain the reference-object similarities and the referred-object similarities.
In a second aspect, an embodiment of the present invention provides a relational referring expression understanding device based on visual language and relation detection, comprising a language detection unit, a target detection unit, a visual language analysis unit, a relation detection analysis unit, an adjacency list construction unit and an understanding result output unit;
the language detection unit is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
In a third aspect, an embodiment of the present invention further provides a relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to implement, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection provided in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a computer, implements the relational referring expression understanding method based on visual language and relation detection provided in the first aspect.
Compared with the prior art, the beneficial effects of the invention include at least the following:
the similarities between the reference object, the referred object and the candidate entities in the scene are calculated with a visual language model, the relation probabilities of the referring relation between the candidate entities are detected with a relation detection algorithm, an adjacency list is built with the reference-object similarities and the referred-object similarities as abscissa and ordinate and the referring-relation probabilities as values, and composite probabilities are calculated from the adjacency list to determine the entity to which the relational referring expression refers. This fills the research gap in relational referring expression understanding and improves a service robot's ability to understand human natural language instructions in human-robot interaction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a relational referring expression understanding method based on visual language and relation detection provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of acquiring candidate entities from a scene image provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of calculating the reference-object similarities and the referred-object similarities provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of building an adjacency list and calculating composite probabilities provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of a relational referring expression understanding device based on visual language and relation detection provided by an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
The idea of the invention is as follows: since prior-art research on referring expression understanding neglects relational referring expressions, a relational referring expression understanding method based on a visual language model and relation detection is provided for selecting, from the scene image, the real object corresponding to the referred object, thereby improving a service robot's ability to understand human natural language instructions in human-robot interaction.
FIG. 1 is a flow chart of a relational referring expression understanding method based on visual language and relation detection provided by an embodiment of the invention. As shown in FIG. 1, this embodiment provides a relational referring expression understanding method based on visual language and relation detection, comprising the following steps:
step 1, decomposing the relation index expression into three parts of a reference object, a reference relation and an index object. The relation refers to the relation between other obvious objects and the target object in the expression process of the human language, wherein the other obvious objects are reference objects, the relation between the other obvious objects and the target object is reference relation, and the target object is a reference object.
In the example, as shown in fig. 2, the relationship refers to the expression of "white cup beside the Jiaduobao herbal tea", and the relationship refers to the decomposition of the representation into: the reference object "Jiaduobao herbal tea", the reference relationship "beside" refers to the object "white cup".
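The patent does not specify how this decomposition is carried out. As a purely illustrative sketch in Python, assuming a small hand-written keyword list of referring relations such as "beside" (a real implementation could instead use a syntactic parser), the decomposition could be approximated as follows:

```python
import re

# Hypothetical relation keywords; these are assumptions, not part of the patent.
RELATION_KEYWORDS = ["beside", "next to", "on", "under", "left of", "right of"]

def decompose_expression(expression: str):
    """Split a relational referring expression into
    (reference_object, referring_relation, referred_object)."""
    for relation in RELATION_KEYWORDS:
        # English word order: "<referred object> <relation> the <reference object>"
        match = re.match(rf"(?:the\s+)?(.+?)\s+{relation}\s+(?:the\s+)?(.+)",
                         expression, re.IGNORECASE)
        if match:
            referred_object, reference_object = match.group(1), match.group(2)
            return reference_object.strip(), relation, referred_object.strip()
    raise ValueError("no referring relation found in expression")

print(decompose_expression("the white cup beside the Jiaduobao herbal tea"))
# ('Jiaduobao herbal tea', 'beside', 'white cup')
```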
Step 2, obtaining candidate entities in the scene image by using a target detection algorithm. The target detection algorithm detects the real objects observable in the scene and extracts their pictures, yielding N candidate entities.
In the embodiment, as shown in FIG. 2, a number of real objects are detected in the scene image, and 6 candidate entities are obtained from the detected object pictures.
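The patent leaves the choice of target detection algorithm open. A minimal sketch, assuming an off-the-shelf pretrained detector (here torchvision's Faster R-CNN, which is an assumption rather than the patent's prescribed detector), could crop the candidate entities as follows:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Assumption: any off-the-shelf detector works; torchvision's Faster R-CNN is used here.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_candidate_entities(image: Image.Image, score_threshold: float = 0.7):
    """Detect real objects in the scene image and return their cropped pictures."""
    with torch.no_grad():
        prediction = detector([to_tensor(image)])[0]
    crops = []
    for box, score in zip(prediction["boxes"], prediction["scores"]):
        if score >= score_threshold:
            x1, y1, x2, y2 = box.int().tolist()
            crops.append(image.crop((x1, y1, x2, y2)))
    return crops  # the N candidate entities

# candidates = extract_candidate_entities(Image.open("scene.jpg"))
```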
Step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities. The visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the N candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one, yielding N reference-object similarities and N referred-object similarities.
In the embodiment, as shown in FIG. 3, the reference object "Jiaduobao herbal tea" and the referred object "white cup" are input into the visual language model together with the pictures of the 6 candidate entities. The visual language model outputs the similarity between "Jiaduobao herbal tea" and each candidate entity, i.e., 6 reference-object similarities, and the similarity between "white cup" and each candidate entity, i.e., 6 referred-object similarities.
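The visual language model is likewise not named in the patent. A sketch assuming OpenAI's CLIP as the text and image encoder (an assumption made purely for illustration) could compute the two sets of similarities as follows:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def text_image_similarities(phrase: str, entity_crops):
    """Cosine similarity between one phrase and each candidate entity crop."""
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([phrase]).to(device))
        image_feats = model.encode_image(
            torch.stack([preprocess(crop) for crop in entity_crops]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feat.T).squeeze(-1)  # shape (N,)

# reference_sims = text_image_similarities("Jiaduobao herbal tea", candidates)
# referred_sims  = text_image_similarities("white cup", candidates)
```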
Step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation. The relation detection algorithm outputs the probabilities of different relation categories between every two objects, the relation category being the one corresponding to the language expression of the referring relation. The relation probabilities between every two candidate entities on this relation category are calculated, excluding the relation of a candidate entity with itself, yielding N×(N-1) referring-relation probabilities.
In the embodiment, the referring relation is "beside"; the relation detection model is used to detect the relation probabilities of the relation category "beside" between every two of the 6 candidate entities, and 6×5 relation probabilities are output. It should be noted that relation detection distinguishes subject and object, i.e., the relation probability from object 1 to object 2 is not necessarily equal to that from object 2 to object 1.
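The relation detection algorithm is not specified in the patent either. Assuming a hypothetical `relation_model(subject_crop, object_crop)` that returns a mapping from relation category to probability, the N×(N-1) referring-relation probabilities could be collected as follows:

```python
import numpy as np

def referring_relation_probabilities(entity_crops, relation_model, relation_category: str):
    """Probability of the referring relation's category for every ordered pair
    of distinct candidate entities (subject != object)."""
    n = len(entity_crops)
    probs = np.zeros((n, n))
    for i in range(n):          # subject: candidate in the referred-object role
        for j in range(n):      # object: candidate in the reference-object role
            if i == j:
                continue        # no relation of an entity with itself
            distribution = relation_model(entity_crops[i], entity_crops[j])
            probs[i, j] = distribution[relation_category]
    return probs  # N x N matrix with zero diagonal, i.e. N*(N-1) probabilities
```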
Step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category. The abscissa of the adjacency list is the reference-object similarity, the ordinate is the referred-object similarity, and the values are the referring-relation probabilities.
In the embodiment, as shown in FIG. 4, the 6 reference-object similarities of the reference object "Jiaduobao herbal tea" are taken as the abscissa, the 6 referred-object similarities of the referred object "white cup" are taken as the ordinate, the relation category of the referring relation is "beside", and the values of the adjacency list are the referring-relation probabilities between the candidate entities for the category "beside".
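As a concrete representation of the adjacency list, a minimal sketch (continuing the assumptions above) that keeps the abscissa, ordinate and values together could be:

```python
import numpy as np

def build_adjacency_list(reference_sims, referred_sims, relation_probs):
    """Adjacency list for one relation category:
    abscissa = reference-object similarities (one per candidate entity),
    ordinate = referred-object similarities (one per candidate entity),
    values   = referring-relation probabilities between candidate entity pairs."""
    return {
        "abscissa": np.asarray(reference_sims),   # shape (N,)
        "ordinate": np.asarray(referred_sims),    # shape (N,)
        "values": np.asarray(relation_probs),     # shape (N, N), zero diagonal
    }
```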
Step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers. The N×(N-1) referring-relation probabilities in the adjacency list are multiplied, with respective weights, by the N reference-object similarities of the corresponding abscissa and the N referred-object similarities of the corresponding ordinate, yielding N×(N-1) composite probabilities; the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability is taken as the entity to which the relational referring expression finally refers.
In the embodiment, as shown in FIG. 4, the weights of the reference-object similarity, the referred-object similarity and the referring-relation probability are all set to 1, i.e., the three values are multiplied directly to obtain the composite probability. In the adjacency list of this embodiment, although the candidate entities of both the fifth and the sixth row could be regarded as a "white cup", comparing all composite probabilities in the adjacency list shows that the composite probability computed from the referring-relation probability in the fifth row and first column is the largest; therefore the candidate entity at that position (the "white cup" of the fifth row) is the entity to which the relational referring expression "the white cup beside the Jiaduobao herbal tea" finally refers.
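Continuing the sketch above, with all three weights set to 1 as in the embodiment, the composite probabilities and the selected entity could be computed as:

```python
import numpy as np

def understand_relational_expression(adjacency):
    """Composite probability = referred-object similarity (row)
    x reference-object similarity (column) x referring-relation probability."""
    composite = (adjacency["ordinate"][:, None]
                 * adjacency["abscissa"][None, :]
                 * adjacency["values"])           # shape (N, N)
    row, col = np.unravel_index(np.argmax(composite), composite.shape)
    return row  # index of the candidate entity the expression refers to

# entity_index = understand_relational_expression(
#     build_adjacency_list(reference_sims, referred_sims, relation_probs))
```

The row index of the maximum corresponds to the referred-object similarity on the ordinate, i.e., the candidate entity (the "white cup" of the fifth row in FIG. 4) to which the expression refers.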
In summary, the relational referring expression understanding method based on a visual language model and relation detection uses a visual language model and a relation detection algorithm, builds an adjacency list and calculates composite probabilities from it, and thereby obtains the real object referred to by the relational referring expression in a language instruction; it is therefore suitable for robots understanding human language instructions in human-robot interaction.
Based on the same inventive concept, this embodiment also provides a relational referring expression understanding device 500 based on a visual language model and relation detection, as shown in FIG. 5, comprising a language detection unit 501, a target detection unit 502, a visual language analysis unit 503, a relation detection analysis unit 504, an adjacency list construction unit 505 and an understanding result output unit 506;
the language detection unit 501 is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit 502 is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit 503 is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit 504 is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit 505 is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit 506 is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
Based on the same inventive concept, this embodiment also provides a relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection provided by the embodiment of the invention.
Based on the same inventive concept, this embodiment also provides a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a computer, the relational referring expression understanding method based on visual language and relation detection provided by the embodiment of the invention is implemented.
It should be noted that the relational referring expression understanding device based on visual language and relation detection and the computer-readable storage medium provided in the foregoing embodiments belong to the same concept as the embodiment of the relational referring expression understanding method based on visual language and relation detection; their detailed implementation is described in the method embodiment and is not repeated here.
The foregoing is only a detailed description of preferred embodiments of the invention and is not intended to limit the invention; any modifications, additions, substitutions and equivalents made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A relational referring expression understanding method based on visual language and relation detection, characterized by comprising the following steps:
step 1, decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
step 2, obtaining candidate entities in the scene image by using a target detection algorithm;
step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers.
2. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation comprises:
the relation detection algorithm outputs the probabilities of different relation categories between every two objects;
the relation category is the relation category corresponding to the language expression of the referring relation;
the relation probabilities between every two candidate entities on the relation category of the referring relation, calculated with the relation detection algorithm, are the referring-relation probabilities.
3. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category means building the adjacency list with the reference-object similarities as the abscissa, the referred-object similarities as the ordinate, and the referring-relation probabilities as the values.
4. The relational referring expression understanding method based on visual language and relation detection according to claim 3, wherein in step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the candidate entity associated with the highest composite probability being the entity to which the relational referring expression refers, comprises:
multiplying each referring-relation probability in the adjacency list by the reference-object similarity of the corresponding abscissa and the referred-object similarity of the corresponding ordinate to obtain the composite probabilities;
taking the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability as the entity to which the relational referring expression finally refers.
5. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 1, a relational referring expression describes, in human language, a target object by means of its relation to another salient object;
the other salient object is the reference object;
the relation between the other salient object and the target object is the referring relation;
the target object is the referred object.
6. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 2, the target detection algorithm detects the real objects observable in the scene and extracts a picture of each of them as a candidate entity.
7. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 3, the visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one to obtain the reference-object similarities and the referred-object similarities.
8. A relational referring expression understanding device based on visual language and relation detection, characterized by comprising: a language detection unit, a target detection unit, a visual language analysis unit, a relation detection analysis unit, an adjacency list construction unit and an understanding result output unit;
the language detection unit is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
9. A relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is used for storing a computer program and the processor is used for implementing, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a computer, implements the relational referring expression understanding method based on visual language and relation detection of any one of claims 1-7.
CN202310837433.1A 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection Active CN116884000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310837433.1A CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310837433.1A CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Publications (2)

Publication Number Publication Date
CN116884000A true CN116884000A (en) 2023-10-13
CN116884000B CN116884000B (en) 2025-12-05

Family

ID=88258027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310837433.1A Active CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Country Status (1)

Country Link
CN (1) CN116884000B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172528A (en) * 2024-03-28 2024-06-11 南京工业大学 REC method based on cross-modal model and spatial index relation modeling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A Semantic-Sensitive Approximate Query Method for RDF Knowledge Graph
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 A Video Security Detection Method Based on Reference Understanding
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
US20210090449A1 (en) * 2019-09-23 2021-03-25 Revealit Corporation Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video
CN116258931A (en) * 2022-12-14 2023-06-13 之江实验室 Visual reference expression understanding method and system based on ViT and sliding window attention fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A Semantic-Sensitive Approximate Query Method for RDF Knowledge Graph
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 A Video Security Detection Method Based on Reference Understanding
US20210090449A1 (en) * 2019-09-23 2021-03-25 Revealit Corporation Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN116258931A (en) * 2022-12-14 2023-06-13 之江实验室 Visual reference expression understanding method and system based on ViT and sliding window attention fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FETHIYE IRMAK DOĞAN 等: "Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments", 《2019 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS)》, 8 November 2019 (2019-11-08), pages 4992 - 4999, XP033695949, DOI: 10.1109/IROS40897.2019.8968510 *
LICHENG YU 等: "MAttNet: Modular Attention Network for Referring Expression Comprehension", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 23 June 2018 (2018-06-23), pages 1307 - 1315, XP033476093, DOI: 10.1109/CVPR.2018.00142 *
RONGHANG HU 等: "Modeling Relationships in Referential Expressions with Compositional Modular Networks", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 26 July 2017 (2017-07-26), pages 4418 - 4427, XP033249796, DOI: 10.1109/CVPR.2017.470 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172528A (en) * 2024-03-28 2024-06-11 南京工业大学 REC method based on cross-modal model and spatial index relation modeling
CN118172528B (en) * 2024-03-28 2024-08-13 南京工业大学 REC method based on cross-modal model and spatial reference relationship modeling

Also Published As

Publication number Publication date
CN116884000B (en) 2025-12-05

Similar Documents

Publication Publication Date Title
CN108536681B (en) Intelligent question-answering method, device, equipment and storage medium based on emotion analysis
CN106875941B (en) Voice semantic recognition method of service robot
KR102033435B1 (en) System and Method for Question and answer of Natural Language and Paraphrase Module
JP2020135852A (en) Image-based data processing methods, devices, electronics, computer-readable storage media and computer programs
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
EP3951617A1 (en) Video description information generation method, video processing method, and corresponding devices
Xie et al. Topic enhanced deep structured semantic models for knowledge base question answering
CN109388700A (en) Intention identification method and system
CN105589844A (en) Missing semantic supplementing method for multi-round question-answering system
CN107515900B (en) Intelligent robot and event memo system and method thereof
CN111259130B (en) Method and apparatus for providing reply sentence in dialog
CN110222560A (en) A kind of text people search's method being embedded in similitude loss function
CN113658690A (en) A kind of intelligent medical guidance method, device, storage medium and electronic equipment
CN119598321B (en) A multimodal network rumor detection method based on large language model feature enhancement
CN116884000B (en) Methods and apparatus for understanding relational pointer expression based on visual language and relation detection
CN114169447B (en) Event detection method based on self-attention convolutional bidirectional gated recurrent unit network
Poornima et al. Review on text and speech conversion techniques based on hand gesture
CN118673465A (en) Open vocabulary target detection method, system, equipment and medium
US11386292B2 (en) Method and system for auto multiple image captioning
CN108268443B (en) Method and device for determining topic transfer and obtaining reply text
CN116562278B (en) Word similarity detection method and system
CN119691669A (en) Machine learning-based industrial safety health risk prediction method, equipment and medium
CN119557695A (en) A multimodal false information detection method based on semantic consistency
CN119027997A (en) Emotion recognition method and device based on facial expression and human posture
Kundeti et al. A Real-Time English Audio to Indian Sign Language Converter for Enhanced Communication Accessibility

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant