
CN116884000A - Relational referring expression understanding method and device based on visual language and relation detection - Google Patents

Relational referring expression understanding method and device based on visual language and relation detection

Info

Publication number
CN116884000A
CN116884000A
Authority
CN
China
Prior art keywords
relation
similarity
reference object
language
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310837433.1A
Other languages
Chinese (zh)
Other versions
CN116884000B (en)
Inventor
宋伟
金天磊
张格格
袭向明
孟启炜
谢冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310837433.1A priority Critical patent/CN116884000B/en
Publication of CN116884000A publication Critical patent/CN116884000A/en
Application granted granted Critical
Publication of CN116884000B publication Critical patent/CN116884000B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational referring expression understanding method and device based on visual language and relation detection, comprising the following steps: decomposing the relational referring expression into a reference object, a referring relation and a referred object; obtaining candidate entities in the scene image by using a target detection algorithm; calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities and the similarity between the language features of the referred object and the visual features of the candidate entities, to obtain the reference-object similarities and the referred-object similarities; calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation; building an adjacency list from the reference-object similarities, the referred-object similarities and the referring-relation probabilities; and calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the highest composite probability determining the entity to which the relational referring expression refers. The invention improves the human-robot interaction capability of service robots with respect to relational referring expression understanding.

Description

Relational referring expression understanding method and device based on visual language and relation detection
Technical Field
The invention belongs to the technical field of computer vision and human-robot interaction, and particularly relates to a relational referring expression understanding method and device based on visual language and relation detection.
Background
During language interaction with humans, entities in a scene are often described through diverse referring expressions; the service robot must locate the objects in the scene according to the referring expression uttered by the human and find the entity the expression refers to. Referring expressions take several forms, for example pronoun referring expressions such as "that thing" and "this cup", noun-phrase referring expressions such as "yellow cup" and "cola", and relational referring expressions such as "the cup beside the cola" and "the white cup on the table".
With the continuous development of the field of human-robot interaction, research on referring expression understanding has become particularly important. Existing referring expression understanding methods mainly include methods based on image-text retrieval, methods based on visual grounding, and methods based on gaze estimation and limb behavior recognition.
In referring expression understanding based on image-text retrieval, traditional image-text retrieval methods are limited by their datasets and mainly handle referring expressions consisting of common adjective-noun phrases and proper nouns. Further research trains visual language models in an open environment with a cross-modal retrieval approach, so that the entity described by a text can be found among objects common in daily life. However, existing image-text retrieval methods lack an analysis of the relations between objects and therefore cannot understand relational noun phrases or pronouns.
In referring expression understanding based on gaze estimation and limb behavior recognition, the robot infers the gaze direction or limb direction of the human to determine the specific object of interaction and thereby understand the intention; this approach is mainly used for understanding pronoun referring expressions.
In referring expression understanding based on visual grounding, visual grounding methods can analyze the whole image and mainly handle adjective-noun phrases. They place high requirements on the dataset, are often limited to the descriptive texts contained in it, and cannot handle proper nouns outside the dataset.
In daily life, humans habitually use relational referring expressions to describe specific entities in a scene. A relational referring expression is generally composed of three parts, a reference object, a relation and a referred object, so during understanding the robot has to analyze all three parts jointly. Current research on referring expression understanding is mostly realized with methods such as face orientation and pose estimation, or handles noun-phrase referring expressions with visual language models, while relational referring expressions are largely neglected.
Disclosure of Invention
In view of the above, the present invention aims to provide a relational referring expression understanding method and device based on visual language and relation detection, which use a visual language model to associate the reference object and the referred object with entities in the scene, use a relation detection algorithm to detect the referring relation, build an adjacency list and compute composite probabilities, thereby understanding the relational referring expression as a whole and improving the robot's ability to understand human natural language in human-robot interaction.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
In a first aspect, an embodiment of the present invention provides a relational referring expression understanding method based on visual language and relation detection, comprising the following steps:
step 1, decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
step 2, obtaining candidate entities in the scene image by using a target detection algorithm;
step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers.
In step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation comprises:
the relation detection algorithm outputs the probabilities of different relation categories between every two objects;
the relation category is the relation category corresponding to the language expression of the referring relation;
the relation probabilities between every two candidate entities on the relation category of the referring relation, calculated with the relation detection algorithm, are the referring-relation probabilities.
In step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category means building the adjacency list with the reference-object similarities as the abscissa, the referred-object similarities as the ordinate, and the referring-relation probabilities as the values.
In step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the candidate entity associated with the highest composite probability being the entity to which the relational referring expression refers, comprises:
multiplying each referring-relation probability in the adjacency list by the reference-object similarity of the corresponding abscissa and the referred-object similarity of the corresponding ordinate to obtain the composite probabilities;
taking the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability as the entity to which the relational referring expression finally refers.
In step 1, a relational referring expression is a human language expression that describes a target object by means of its relation to another salient object;
the other salient object is the reference object;
the relation between the other salient object and the target object is the referring relation;
the target object is the referred object.
In step 2, the target detection algorithm detects the real objects observable in the scene and extracts a picture of each of them as a candidate entity.
In step 3, the visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one to obtain the reference-object similarities and the referred-object similarities.
In a second aspect, an embodiment of the present invention provides a relational referring expression understanding device based on visual language and relation detection, comprising a language detection unit, a target detection unit, a visual language analysis unit, a relation detection analysis unit, an adjacency list construction unit and an understanding result output unit;
the language detection unit is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
In a third aspect, an embodiment of the present invention further provides a relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to implement, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection provided in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a computer, implements the relational referring expression understanding method based on visual language and relation detection provided in the first aspect.
Compared with the prior art, the beneficial effects of the invention include at least the following:
the similarities between the reference object, the referred object and the candidate entities in the scene are calculated with a visual language model, the relation probabilities of the referring relation between the candidate entities are detected with a relation detection algorithm, an adjacency list is built with the reference-object similarities and the referred-object similarities as abscissa and ordinate and the referring-relation probabilities as values, and composite probabilities are calculated from the adjacency list to determine the entity to which the relational referring expression refers. This fills the research gap in relational referring expression understanding and improves a service robot's ability to understand human natural language instructions in human-robot interaction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a relational referring expression understanding method based on visual language and relation detection provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of acquiring candidate entities from a scene image provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of calculating the reference-object similarities and the referred-object similarities provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of building an adjacency list and calculating composite probabilities provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of a relational referring expression understanding device based on visual language and relation detection provided by an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
The idea of the invention is as follows: since prior-art research on referring expression understanding neglects relational referring expressions, a relational referring expression understanding method based on a visual language model and relation detection is provided for selecting, from the scene image, the real object corresponding to the referred object, thereby improving a service robot's ability to understand human natural language instructions in human-robot interaction.
FIG. 1 is a flow chart of a relational referring expression understanding method based on visual language and relation detection provided by an embodiment of the invention. As shown in FIG. 1, this embodiment provides a relational referring expression understanding method based on visual language and relation detection, comprising the following steps:
step 1, decomposing the relation index expression into three parts of a reference object, a reference relation and an index object. The relation refers to the relation between other obvious objects and the target object in the expression process of the human language, wherein the other obvious objects are reference objects, the relation between the other obvious objects and the target object is reference relation, and the target object is a reference object.
In the example, as shown in fig. 2, the relationship refers to the expression of "white cup beside the Jiaduobao herbal tea", and the relationship refers to the decomposition of the representation into: the reference object "Jiaduobao herbal tea", the reference relationship "beside" refers to the object "white cup".
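The patent does not specify how this decomposition is carried out. As a purely illustrative sketch in Python, assuming a small hand-written keyword list of referring relations such as "beside" (a real implementation could instead use a syntactic parser), the decomposition could be approximated as follows:

```python
import re

# Hypothetical relation keywords; these are assumptions, not part of the patent.
RELATION_KEYWORDS = ["beside", "next to", "on", "under", "left of", "right of"]

def decompose_expression(expression: str):
    """Split a relational referring expression into
    (reference_object, referring_relation, referred_object)."""
    for relation in RELATION_KEYWORDS:
        # English word order: "<referred object> <relation> the <reference object>"
        match = re.match(rf"(?:the\s+)?(.+?)\s+{relation}\s+(?:the\s+)?(.+)",
                         expression, re.IGNORECASE)
        if match:
            referred_object, reference_object = match.group(1), match.group(2)
            return reference_object.strip(), relation, referred_object.strip()
    raise ValueError("no referring relation found in expression")

print(decompose_expression("the white cup beside the Jiaduobao herbal tea"))
# ('Jiaduobao herbal tea', 'beside', 'white cup')
```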
Step 2, obtaining candidate entities in the scene image by using a target detection algorithm. The target detection algorithm detects the real objects observable in the scene and extracts their pictures, yielding N candidate entities.
In the embodiment, as shown in FIG. 2, a number of real objects are detected in the scene image, and 6 candidate entities are obtained from the detected object pictures.
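The patent leaves the choice of target detection algorithm open. A minimal sketch, assuming an off-the-shelf pretrained detector (here torchvision's Faster R-CNN, which is an assumption rather than the patent's prescribed detector), could crop the candidate entities as follows:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Assumption: any off-the-shelf detector works; torchvision's Faster R-CNN is used here.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_candidate_entities(image: Image.Image, score_threshold: float = 0.7):
    """Detect real objects in the scene image and return their cropped pictures."""
    with torch.no_grad():
        prediction = detector([to_tensor(image)])[0]
    crops = []
    for box, score in zip(prediction["boxes"], prediction["scores"]):
        if score >= score_threshold:
            x1, y1, x2, y2 = box.int().tolist()
            crops.append(image.crop((x1, y1, x2, y2)))
    return crops  # the N candidate entities

# candidates = extract_candidate_entities(Image.open("scene.jpg"))
```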
Step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities. The visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the N candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one, yielding N reference-object similarities and N referred-object similarities.
In the embodiment, as shown in FIG. 3, the reference object "Jiaduobao herbal tea" and the referred object "white cup" are input into the visual language model together with the pictures of the 6 candidate entities. The visual language model outputs the similarity between "Jiaduobao herbal tea" and each candidate entity, i.e., 6 reference-object similarities, and the similarity between "white cup" and each candidate entity, i.e., 6 referred-object similarities.
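The visual language model is likewise not named in the patent. A sketch assuming OpenAI's CLIP as the text and image encoder (an assumption made purely for illustration) could compute the two sets of similarities as follows:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def text_image_similarities(phrase: str, entity_crops):
    """Cosine similarity between one phrase and each candidate entity crop."""
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([phrase]).to(device))
        image_feats = model.encode_image(
            torch.stack([preprocess(crop) for crop in entity_crops]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feat.T).squeeze(-1)  # shape (N,)

# reference_sims = text_image_similarities("Jiaduobao herbal tea", candidates)
# referred_sims  = text_image_similarities("white cup", candidates)
```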
Step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation. The relation detection algorithm outputs the probabilities of different relation categories between every two objects, the relation category being the one corresponding to the language expression of the referring relation. The relation probabilities between every two candidate entities on this relation category are calculated, excluding the relation of a candidate entity with itself, yielding N×(N-1) referring-relation probabilities.
In the embodiment, the referring relation is "beside"; the relation detection model is used to detect the relation probabilities of the relation category "beside" between every two of the 6 candidate entities, and 6×5 relation probabilities are output. It should be noted that relation detection distinguishes subject and object, i.e., the relation probability from object 1 to object 2 is not necessarily equal to that from object 2 to object 1.
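The relation detection algorithm is not specified in the patent either. Assuming a hypothetical `relation_model(subject_crop, object_crop)` that returns a mapping from relation category to probability, the N×(N-1) referring-relation probabilities could be collected as follows:

```python
import numpy as np

def referring_relation_probabilities(entity_crops, relation_model, relation_category: str):
    """Probability of the referring relation's category for every ordered pair
    of distinct candidate entities (subject != object)."""
    n = len(entity_crops)
    probs = np.zeros((n, n))
    for i in range(n):          # subject: candidate in the referred-object role
        for j in range(n):      # object: candidate in the reference-object role
            if i == j:
                continue        # no relation of an entity with itself
            distribution = relation_model(entity_crops[i], entity_crops[j])
            probs[i, j] = distribution[relation_category]
    return probs  # N x N matrix with zero diagonal, i.e. N*(N-1) probabilities
```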
Step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category. The abscissa of the adjacency list is the reference-object similarity, the ordinate is the referred-object similarity, and the values are the referring-relation probabilities.
In the embodiment, as shown in FIG. 4, the 6 reference-object similarities of the reference object "Jiaduobao herbal tea" are taken as the abscissa, the 6 referred-object similarities of the referred object "white cup" are taken as the ordinate, the relation category of the referring relation is "beside", and the values of the adjacency list are the referring-relation probabilities between the candidate entities for the category "beside".
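As a concrete representation of the adjacency list, a minimal sketch (continuing the assumptions above) that keeps the abscissa, ordinate and values together could be:

```python
import numpy as np

def build_adjacency_list(reference_sims, referred_sims, relation_probs):
    """Adjacency list for one relation category:
    abscissa = reference-object similarities (one per candidate entity),
    ordinate = referred-object similarities (one per candidate entity),
    values   = referring-relation probabilities between candidate entity pairs."""
    return {
        "abscissa": np.asarray(reference_sims),   # shape (N,)
        "ordinate": np.asarray(referred_sims),    # shape (N,)
        "values": np.asarray(relation_probs),     # shape (N, N), zero diagonal
    }
```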
Step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers. The N×(N-1) referring-relation probabilities in the adjacency list are multiplied, with respective weights, by the N reference-object similarities of the corresponding abscissa and the N referred-object similarities of the corresponding ordinate, yielding N×(N-1) composite probabilities; the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability is taken as the entity to which the relational referring expression finally refers.
In the embodiment, as shown in FIG. 4, the weights of the reference-object similarity, the referred-object similarity and the referring-relation probability are all set to 1, i.e., the three values are multiplied directly to obtain the composite probability. In the adjacency list of this embodiment, although the candidate entities of both the fifth and the sixth row could be regarded as a "white cup", comparing all composite probabilities in the adjacency list shows that the composite probability computed from the referring-relation probability in the fifth row and first column is the largest; therefore the candidate entity at that position (the "white cup" of the fifth row) is the entity to which the relational referring expression "the white cup beside the Jiaduobao herbal tea" finally refers.
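Continuing the sketch above, with all three weights set to 1 as in the embodiment, the composite probabilities and the selected entity could be computed as:

```python
import numpy as np

def understand_relational_expression(adjacency):
    """Composite probability = referred-object similarity (row)
    x reference-object similarity (column) x referring-relation probability."""
    composite = (adjacency["ordinate"][:, None]
                 * adjacency["abscissa"][None, :]
                 * adjacency["values"])           # shape (N, N)
    row, col = np.unravel_index(np.argmax(composite), composite.shape)
    return row  # index of the candidate entity the expression refers to

# entity_index = understand_relational_expression(
#     build_adjacency_list(reference_sims, referred_sims, relation_probs))
```

The row index of the maximum corresponds to the referred-object similarity on the ordinate, i.e., the candidate entity (the "white cup" of the fifth row in FIG. 4) to which the expression refers.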
In summary, the relational referring expression understanding method based on a visual language model and relation detection uses a visual language model and a relation detection algorithm, builds an adjacency list and calculates composite probabilities from it, and thereby obtains the real object referred to by the relational referring expression in a language instruction; it is therefore suitable for robots understanding human language instructions in human-robot interaction.
Based on the same inventive concept, this embodiment also provides a relational referring expression understanding device 500 based on a visual language model and relation detection, as shown in FIG. 5, comprising a language detection unit 501, a target detection unit 502, a visual language analysis unit 503, a relation detection analysis unit 504, an adjacency list construction unit 505 and an understanding result output unit 506;
the language detection unit 501 is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit 502 is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit 503 is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit 504 is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit 505 is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit 506 is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
Based on the same inventive concept, this embodiment also provides a relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection provided by the embodiment of the invention.
Based on the same inventive concept, this embodiment also provides a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a computer, the relational referring expression understanding method based on visual language and relation detection provided by the embodiment of the invention is implemented.
It should be noted that the relational referring expression understanding device based on visual language and relation detection and the computer-readable storage medium provided in the foregoing embodiments belong to the same concept as the embodiment of the relational referring expression understanding method based on visual language and relation detection; their detailed implementation is described in the method embodiment and is not repeated here.
The foregoing is only a detailed description of preferred embodiments of the invention and is not intended to limit the invention; any modifications, additions, substitutions and equivalents made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A relational referring expression understanding method based on visual language and relation detection, characterized by comprising the following steps:
step 1, decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
step 2, obtaining candidate entities in the scene image by using a target detection algorithm;
step 3, calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the referred-object similarity of the highest composite probability is the entity to which the relational referring expression refers.
2. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 4, calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation comprises:
the relation detection algorithm outputs the probabilities of different relation categories between every two objects;
the relation category is the relation category corresponding to the language expression of the referring relation;
the relation probabilities between every two candidate entities on the relation category of the referring relation, calculated with the relation detection algorithm, are the referring-relation probabilities.
3. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 5, combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category means building the adjacency list with the reference-object similarities as the abscissa, the referred-object similarities as the ordinate, and the referring-relation probabilities as the values.
4. The relational referring expression understanding method based on visual language and relation detection according to claim 3, wherein in step 6, calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability, the candidate entity associated with the highest composite probability being the entity to which the relational referring expression refers, comprises:
multiplying each referring-relation probability in the adjacency list by the reference-object similarity of the corresponding abscissa and the referred-object similarity of the corresponding ordinate to obtain the composite probabilities;
taking the candidate entity corresponding to the referred-object similarity associated with the maximum composite probability as the entity to which the relational referring expression finally refers.
5. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 1, a relational referring expression describes, in human language, a target object by means of its relation to another salient object;
the other salient object is the reference object;
the relation between the other salient object and the target object is the referring relation;
the target object is the referred object.
6. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 2, the target detection algorithm detects the real objects observable in the scene and extracts a picture of each of them as a candidate entity.
7. The relational referring expression understanding method based on visual language and relation detection according to claim 1, wherein in step 3, the visual language model converts the language names of the reference object and the referred object into language features, converts the pictures of the candidate entities into visual features, and calculates the similarity between each language feature and each visual feature one by one to obtain the reference-object similarities and the referred-object similarities.
8. A relational referring expression understanding device based on visual language and relation detection, characterized by comprising: a language detection unit, a target detection unit, a visual language analysis unit, a relation detection analysis unit, an adjacency list construction unit and an understanding result output unit;
the language detection unit is used for decomposing the relational referring expression into three parts: a reference object, a referring relation and a referred object;
the target detection unit is used for obtaining candidate entities in the scene image by using a target detection algorithm;
the visual language analysis unit is used for calculating, with a visual language model, the similarity between the language features of the reference object and the visual features of the candidate entities to obtain the reference-object similarities, and the similarity between the language features of the referred object and the visual features of the candidate entities to obtain the referred-object similarities;
the relation detection analysis unit is used for calculating, with a relation detection algorithm, the referring-relation probabilities between candidate entities on the relation category of the referring relation;
the adjacency list construction unit is used for combining the reference-object similarities, the referred-object similarities and the referring-relation probabilities into an adjacency list for the relation category;
the understanding result output unit is used for calculating, from the adjacency list, the composite probability of the reference-object similarity, the referred-object similarity and the referring-relation probability; the candidate entity associated with the highest composite probability is the entity to which the relational referring expression refers.
9. A relational referring expression understanding device based on visual language and relation detection, comprising a memory and a processor, wherein the memory is used for storing a computer program and the processor is used for implementing, when executing the computer program, the relational referring expression understanding method based on visual language and relation detection of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a computer, implements the relational referring expression understanding method based on visual language and relation detection of any one of claims 1-7.
CN202310837433.1A 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection Active CN116884000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310837433.1A CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310837433.1A CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Publications (2)

Publication Number Publication Date
CN116884000A true CN116884000A (en) 2023-10-13
CN116884000B CN116884000B (en) 2025-12-05

Family

ID=88258027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310837433.1A Active CN116884000B (en) 2023-07-10 2023-07-10 Relational referring expression understanding method and device based on visual language and relation detection

Country Status (1)

Country Link
CN (1) CN116884000B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172528A (en) * 2024-03-28 2024-06-11 南京工业大学 REC method based on cross-modal model and spatial index relation modeling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A Semantic-Sensitive Approximate Query Method for RDF Knowledge Graph
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 A Video Security Detection Method Based on Reference Understanding
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
US20210090449A1 (en) * 2019-09-23 2021-03-25 Revealit Corporation Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video
CN116258931A (en) * 2022-12-14 2023-06-13 之江实验室 Visual reference expression understanding method and system based on ViT and sliding window attention fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A Semantic-Sensitive Approximate Query Method for RDF Knowledge Graph
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 A Video Security Detection Method Based on Reference Understanding
US20210090449A1 (en) * 2019-09-23 2021-03-25 Revealit Corporation Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN116258931A (en) * 2022-12-14 2023-06-13 之江实验室 Visual reference expression understanding method and system based on ViT and sliding window attention fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FETHIYE IRMAK DOĞAN 等: "Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments", 《2019 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS)》, 8 November 2019 (2019-11-08), pages 4992 - 4999, XP033695949, DOI: 10.1109/IROS40897.2019.8968510 *
LICHENG YU 等: "MAttNet: Modular Attention Network for Referring Expression Comprehension", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 23 June 2018 (2018-06-23), pages 1307 - 1315, XP033476093, DOI: 10.1109/CVPR.2018.00142 *
RONGHANG HU 等: "Modeling Relationships in Referential Expressions with Compositional Modular Networks", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 26 July 2017 (2017-07-26), pages 4418 - 4427, XP033249796, DOI: 10.1109/CVPR.2017.470 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172528A (en) * 2024-03-28 2024-06-11 南京工业大学 REC method based on cross-modal model and spatial index relation modeling
CN118172528B (en) * 2024-03-28 2024-08-13 南京工业大学 REC method based on cross-modal model and spatial reference relationship modeling

Also Published As

Publication number Publication date
CN116884000B (en) 2025-12-05

Similar Documents

Publication Publication Date Title
CN108536681B (en) Intelligent question-answering method, device, equipment and storage medium based on emotion analysis
CN106875941B (en) Voice semantic recognition method of service robot
KR102033435B1 (en) System and Method for Question and answer of Natural Language and Paraphrase Module
JP2020135852A (en) Image-based data processing methods, devices, electronics, computer-readable storage media and computer programs
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
EP3951617A1 (en) Video description information generation method, video processing method, and corresponding devices
Xie et al. Topic enhanced deep structured semantic models for knowledge base question answering
CN109388700A (en) Intention identification method and system
CN105589844A (en) Missing semantic supplementing method for multi-round question-answering system
CN107515900B (en) Intelligent robot and event memo system and method thereof
CN111259130B (en) Method and apparatus for providing reply sentence in dialog
CN110222560A (en) A kind of text people search's method being embedded in similitude loss function
CN113658690A (en) A kind of intelligent medical guidance method, device, storage medium and electronic equipment
CN119598321B (en) A multimodal network rumor detection method based on large language model feature enhancement
CN116884000B (en) Methods and apparatus for understanding relational pointer expression based on visual language and relation detection
CN114169447B (en) Event detection method based on self-attention convolutional bidirectional gated recurrent unit network
Poornima et al. Review on text and speech conversion techniques based on hand gesture
CN118673465A (en) Open vocabulary target detection method, system, equipment and medium
US11386292B2 (en) Method and system for auto multiple image captioning
CN108268443B (en) Method and device for determining topic transfer and obtaining reply text
CN116562278B (en) Word similarity detection method and system
CN119691669A (en) Machine learning-based industrial safety health risk prediction method, equipment and medium
CN119557695A (en) A multimodal false information detection method based on semantic consistency
CN119027997A (en) Emotion recognition method and device based on facial expression and human posture
Kundeti et al. A Real-Time English Audio to Indian Sign Language Converter for Enhanced Communication Accessibility

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant