WO2025028676A1 - Procédé d'étiquetage pour contenu et système associé - Google Patents
Tagging method for content and associated system (Procédé d'étiquetage pour contenu et système associé)
- Publication number
- WO2025028676A1 (PCT/KR2023/011140)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- content
- tag
- candidate
- tags
- prompt
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings (video data)
- G06F16/532—Query formulation, e.g. graphical querying (still image data)
- G06F16/538—Presentation of query results (still image data)
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually (still image data)
- G06F16/5866—Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information (still image data)
- G06F16/71—Indexing; Data structures therefor; Storage structures (video data)
- G06F16/732—Query formulation (video data)
- G06F16/7343—Query language or query format (video data)
- G06F16/738—Presentation of query results (video data)
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually (video data)
- G06F16/783—Retrieval characterised by using metadata automatically derived from the content (video data)
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
- G06N3/08—Learning methods (neural networks)
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Definitions
- OTT (Over The Top) service providers perform tagging, which sets keywords indicating the characteristics and substance of a piece of content as tags for that content. These tags are used for various purposes such as content search, grouping similar content, and providing information.
- a technical problem to be solved by some embodiments of the present disclosure is to provide a method and system capable of accurately performing automatic tagging of content.
- a method for tagging content may include: generating a candidate tag set including a plurality of candidate tags related to content through an image analysis module; configuring a prompt based on the candidate tag set, context data for the content, and a query statement; and generating a tag set for the content based on a response of a language model to the prompt.
- the context data includes metadata of the content, wherein the metadata may include at least one of a title, a genre, and a summary of the content.
- the context data may include a script of the content.
- the image analysis module includes an object detection model and an action recognition model
- the step of generating the candidate tag set may include a step of detecting an object appearing in the content through the object detection model and generating a first candidate tag representing the detected object, and a step of recognizing an action of an object appearing in the content through the action recognition model and generating a second candidate tag representing the recognized action.
- the query may include a phrase requesting that significant tags matching the content be selected from the set of candidate tags by referencing the context data.
- the step of configuring the prompt may include the steps of: obtaining specific content tagged with a set of correct tags, extracting candidate tags from the specific content, configuring a first prompt having the extracted candidate tags, context data of the specific content, and a query as prompt elements, evaluating a first response of the language model to the first prompt using the set of correct tags, configuring a second prompt by changing at least one of an arrangement and a content of the prompt elements, evaluating a second response of the language model to the second prompt using the set of correct tags, generating a prompt template for the language model using a result of evaluating the first response and a result of evaluating the second response, and configuring the prompt using the prompt template.
- the response of the language model includes at least one tag
- the step of generating the set of tags may include the step of removing a tag that is not present in the set of candidate tags from among the at least one tag, and constructing the set of tags using the remaining tags.
- the tagging method may further include the steps of constructing a tag discriminator using the candidate tag set and the tag set, the tag discriminator being a model that determines whether an input candidate tag is an important tag that matches relevant content, obtaining additional content, and performing tagging on the additional content using the image analysis module to which the tag discriminator is added without the aid of the language model.
- the tag discriminator may be configured to further receive image features of the relevant content extracted via the image analysis module and determine whether the input candidate tag is an important tag.
- the tag set is tagged to the content
- the tagging method may further include a step of creating a content collection for the common tag by grouping other contents having a common tag with the content among a plurality of previously registered contents.
- a tagging system includes one or more processors and a memory storing a computer program executed by the one or more processors, wherein the computer program may include instructions for an operation of generating a candidate tag set including a plurality of candidate tags related to content through an image analysis module, an operation of configuring a prompt based on the candidate tag set, context data for the content, and a query statement, and an operation of generating a tag set for the content based on a response of a language model to the prompt.
- a computer program may be stored in a computer-readable recording medium, coupled with a computing device, to execute the steps of: generating a candidate tag set including a plurality of candidate tags related to content through an image analysis module; configuring a prompt based on the candidate tag set, context data for the content, and a query statement; and generating a tag set for the content based on a response of a language model to the prompt.
- a set of candidate tags for content is generated through an image analysis module, and important tags (i.e., tags that match the content) can be selected from the set of candidate tags through a language model that understands the context (or content) of the content.
- high-quality tags that well represent the characteristics, content, etc. of the content can be easily generated (extracted), and the tagging task can be easily automated.
- the human and time costs required for the tagging task can be greatly reduced accordingly.
- since the image analysis module is configured to include an object detection model, an action recognition model, etc., various candidate tags for the content can be extracted easily and completely.
- the hallucination problem of the language model can be easily prevented.
- the problem of generating tags that do not exactly match the content due to the hallucination phenomenon of the language model can be easily prevented.
- the hallucination problem of the language model can be more easily prevented by verifying the tags included in the response of the language model using the candidate tag set.
- prompt templates optimized for language models can be automatically generated using content that has a set of correct tags, resulting in further improved tag quality.
- the image analysis module can be strengthened by using the tag set of the content obtained through the language model as training data.
- a tag discriminator that determines whether the candidate tag is an important tag that matches the content can be built using the training data, and the tag discriminator can be added to the image analysis module.
- tagging of the content can be performed without the help of the language model through the strengthened image analysis module. In this case, the resources required for the tagging task can be greatly reduced, and the tagging task can be completely free from the hallucination problem of the language model.
- FIG. 1 is an exemplary diagram schematically illustrating the operation of a tagging system according to some embodiments of the present disclosure.
- FIG. 2 is an exemplary diagram to further explain the operation of a tagging system according to some embodiments of the present disclosure.
- FIG. 3 is an exemplary flowchart illustrating a tagging method for content according to some embodiments of the present disclosure.
- FIG. 4 is an exemplary diagram illustrating a method for extracting candidate tags according to some embodiments of the present disclosure.
- FIGS. 5 and 6 illustrate prompt templates that may be referenced in some embodiments of the present disclosure.
- FIG. 7 is an exemplary diagram illustrating a method for generating a prompt template according to some embodiments of the present disclosure.
- FIG. 8 is an exemplary diagram illustrating a tag verification method according to some embodiments of the present disclosure.
- FIGS. 9 and 10 are exemplary diagrams illustrating a tagging module/model operation method according to some embodiments of the present disclosure.
- FIG. 11 is an exemplary diagram illustrating a tagging module/model operation method according to some other embodiments of the present disclosure.
- FIGS. 12 and 13 are exemplary drawings for explaining a method of creating a content collection according to some application examples of the present disclosure.
- FIG. 14 illustrates an exemplary computing device that may implement a tagging system according to some embodiments of the present disclosure.
- FIG. 1 is an exemplary diagram schematically illustrating the operation of a tagging system (10) according to some embodiments of the present disclosure.
- the tagging system (10) is a computing device/system that can automatically generate tags for content (11).
- the tagging system (10) can automatically generate a tag set (13) that matches the content (11) by using context data (12) of the content (11).
- the tag set (13) thus generated can be tagged (set) to the content (11).
- the tag set (13) can include at least one tag (e.g., keyword tag, etc.), and FIG. 1 illustrates an example in which the tag set (13) is composed of multiple tags.
- a tag may be, for example, a keyword indicating the substance of the content (11) or its characteristics (e.g., topic, atmosphere, director, character, actor, production date, etc.).
- a tag may be a keyword indicating other information, or may have the form of a phrase, sentence, etc.
- a tag may also be named with terms such as ‘label’, ‘annotation’, etc. depending on the case.
- Content (11) is content having visual elements (or content capable of image analysis), and may include various types/forms of content without limitation.
- content (11) may be video content, image content, etc., but the scope of the present disclosure is not limited thereto.
- Context data (12) refers to data (e.g., text data) that describes the content (11) or helps in understanding (e.g., context/content understanding) the content (11).
- the context data (12) may include, for example, metadata and scripts of the content (11). However, the scope of the present disclosure is not limited thereto.
- the context data (12) may also be named as ‘description data’, ‘reference data’, etc., depending on the case.
- the metadata of the content (11) may include various information related to the content (11) without limitation, such as the title, genre, running time, director, actor, character, production date, summary, review, etc. of the content (11).
- the script of the content (11) may also be included in the category of metadata.
- the script of the content (11) refers to a text that describes the details of the content (11).
- the script of the content (11) may be, for example, a subtitle, a caption, a synopsis, a scenario, etc., but the scope of the present disclosure is not limited thereto.
- the script may be prepared in advance or may be generated through a speech-to-text or speech recognition technique.
- the tagging system (10) can generate a tag set (13) of content (11) using an image analysis module (21) and a language model (22). More specifically, the tagging system (10) can generate a candidate tag set (23) including a plurality of candidate tags (illustrated as ‘tags’ in the drawing) through the image analysis module (21), and can analyze the candidate tag set (23) and context data (12) through the language model (22) to generate (determine) a final tag set (13) of content (11).
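- As a non-limiting illustration of this flow, the sketch below uses hypothetical `image_module` and `language_model` callables standing in for the image analysis module (21) and the language model (22); the helper names, prompt wording, and answer parsing are illustrative assumptions rather than the disclosed implementation.

```python
from dataclasses import dataclass

@dataclass
class ContextData:
    metadata: dict  # e.g. {"title": ..., "genre": ..., "summary": ...}
    script: str     # subtitles / synopsis, possibly produced by speech-to-text

def build_prompt(candidate_tags, context, query):
    # Arrange the prompt elements (title, candidate tags, query, context) as plain text.
    return (
        f"Title: {context.metadata.get('title', '')}\n"
        f"Candidate tags: {', '.join(sorted(candidate_tags))}\n"
        f"{query}\n"
        f"Context: {context.metadata} {context.script}"
    )

def generate_tag_set(content, context, image_module, language_model):
    """Overall flow: image analysis -> prompt -> language model -> verified tag set."""
    candidate_tags = set(image_module(content))            # candidate tag set (23)
    prompt = build_prompt(
        candidate_tags, context,
        query="Referring to the context data, select the important tags that "
              "match the content from the candidate tags above.",
    )
    response = language_model(prompt)                      # free-text answer of the language model
    selected = {t.strip() for t in response.split(",")}
    return {t for t in selected if t in candidate_tags}    # drop tags absent from the candidate set

# Toy usage with stand-in callables.
ctx = ContextData(metadata={"title": "Ocean Documentary"}, script="Whales migrate north...")
print(generate_tag_set(
    content="video.mp4",
    context=ctx,
    image_module=lambda c: ["whale", "ocean", "car"],
    language_model=lambda p: "whale, ocean, sunset",
))  # {'whale', 'ocean'} -- 'sunset' is removed because it is not a candidate tag
```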
- the tagging system (10) may control the language model (22) to generate tags that match the content (11) based on the set of candidate tags (23) and/or context data (12) (e.g., by entering a prompt requesting tag generation).
- a language model (22) refers to a neural network model that has acquired universal understanding of language (or natural language/text) by learning a large amount of texts (e.g., texts from various domains).
- the language model (22) can be viewed as a large-scale model with question and answer capabilities based on a text interface, or as a model that can ‘generate’ a response to a question. Therefore, it may be named as a ‘large-scale language model (LLM)’, ‘generative model’, ‘question-answering model’, ‘conversational model’, ‘tagging model’, etc. depending on the case.
- the term ‘model’ may be used interchangeably with terms such as ‘module’, ‘AI (Artificial Intelligence)’, etc.
- the tagging system (10) may provide a set of tags (13) for the content (11) to a separate device (not shown).
- the tagging system (10) may perform tagging tasks only by using the enhanced image analysis module (21) after gradually strengthening the image analysis module (21) using the tag set (13) generated with the help of the language model (22). For this, refer to the descriptions of FIGS. 9 to 11.
- the tagging system (10) may perform tagging operations on various contents and group contents with common tags to create a content collection. For this, refer to the descriptions of FIGS. 12 and 13.
- the tagging system (10) described above may be implemented in at least one computing device.
- all functions of the tagging system (10) may be implemented in one computing device, or a first function of the tagging system (10) may be implemented in a first computing device and a second function may be implemented in a second computing device.
- a specific function of the tagging system (10) may be implemented in a plurality of computing devices.
- a computing device may include any device having computing (processing) capabilities; for an example of such a device, refer to FIG. 14.
- a computing device is a collection of various interacting components (e.g., memory, processor, etc.), so it may sometimes be called a ‘computing system’.
- the term ‘computing system’ can also encompass the concept of a collection of multiple interacting computing devices.
- FIG. 3 is an exemplary flowchart schematically illustrating a tagging method for content according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the purpose of the present disclosure, and it is to be understood that some steps may be added or deleted as necessary.
- the tagging method may start at step S31 of acquiring content and context data thereof.
- the content may be, for example, video content including a plurality of frame images, but the scope of the present disclosure is not limited thereto. However, for the convenience of understanding, the following description will continue by assuming that the content is ‘video content’.
- the context data may include, for example, metadata and scripts of the content.
- a candidate tag set for the corresponding content can be generated through the image analysis module (21).
- the candidate tag set can include a plurality of candidate tags related to the corresponding content.
- the image analysis module (21) may be configured to include various types of modules/models that analyze images to extract information.
- the image analysis module (21) may include a module based on an image analysis algorithm, an object detection model, an image captioning model (or a scene recognition model), a scene segmentation model, an expression (emotion) recognition model, etc., but the scope of the present disclosure is not limited thereto.
- the image analysis module (21) may also be named, for example, a ‘tagging module’, depending on the case.
- the specific method of generating a candidate tag set (or extracting candidate tags) in step S32 may vary depending on the embodiment.
- the tagging system (10) may analyze each frame image of the content (i.e., video content) through the image analysis module (21) to extract candidate tags. For example, the tagging system (10) may extract candidate tags for each frame image. The candidate tags extracted in this way may constitute a candidate tag set.
- the tagging system (10) may select a main frame image from among frame images of the content (i.e., video content) and analyze the main frame image through the image analysis module (21) to extract candidate tags. For example, the tagging system (10) may select a main frame image based on the number of objects, the degree of movement of the objects, the result of object action recognition, the result of scene recognition, the similarity between the result of scene recognition and the title of the content, etc.
- the scope of the present disclosure is not limited thereto.
- the tagging system (10) can classify frame images of content (i.e., video content) by scene through a scene segmentation model. And, the tagging system (10) can extract candidate tags by scene. For example, the tagging system (10) can select a representative frame image by scene and analyze the representative frame image to extract candidate tags of the corresponding scene. Alternatively, the tagging system (10) can extract candidate tags from each of the frame images belonging to a scene and combine them to use them as candidate tags of the corresponding scene.
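- A minimal sketch of this scene-wise extraction is given below; `scene_of` and `analyze_frame` are hypothetical stand-ins for a scene segmentation model and the image analysis module (21), since no particular model is prescribed here.

```python
from collections import defaultdict

def candidate_tags_by_scene(frames, scene_of, analyze_frame, pick_representative=None):
    """Group frame images by scene and extract candidate tags per scene.

    scene_of: frame -> scene id (stand-in for a scene segmentation model)
    analyze_frame: frame -> iterable of candidate tags (stand-in for the image analysis module)
    pick_representative: optional callable choosing one frame per scene; if omitted,
        tags extracted from every frame in the scene are merged instead.
    """
    scenes = defaultdict(list)
    for frame in frames:
        scenes[scene_of(frame)].append(frame)

    tags_per_scene = {}
    for scene_id, scene_frames in scenes.items():
        if pick_representative is not None:
            tags = set(analyze_frame(pick_representative(scene_frames)))
        else:
            tags = set()
            for frame in scene_frames:
                tags |= set(analyze_frame(frame))
        tags_per_scene[scene_id] = tags
    return tags_per_scene
```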
- the tagging system (10) may detect objects appearing in content through an object detection model and determine (generate) tags (e.g., keywords) representing the detected objects as candidate tags for the corresponding content.
- the tagging system (10) may recognize the behavior of an object appearing in content through an action recognition model and determine (generate) a tag (e.g., keyword) representing the recognized behavior as a candidate tag for the corresponding content.
- the tagging system (10) may generate a caption for the content (e.g., for each frame image) through an image captioning model (or a scene recognition model) and generate candidate tags for the content based on the caption.
- the candidate tags may be, for example, the caption itself, keywords extracted from the caption, results of sentiment analysis for the caption, etc., but are not limited thereto.
- the image captioning model refers to a model that outputs a caption (i.e., a description) of an input image as text in a natural language form.
- a candidate tag set may be generated based on various combinations of the above-described embodiments.
- the tagging system (10) may generate candidate tags related to objects appearing in frame images of video content (43) through an object detection model (41) and may generate candidate tags related to actions of objects appearing in the frame images through an action recognition model (42).
- a candidate tag set (44) of the video content (43) may be generated.
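- The combination shown in FIG. 4 can be sketched as follows; `detect_objects` and `recognize_actions` are hypothetical stand-ins for the object detection model (41) and the action recognition model (42).

```python
def build_candidate_tag_set(frames, detect_objects, recognize_actions):
    """Merge object tags and action tags into one candidate tag set (cf. FIG. 4)."""
    candidate_tags = set()
    for frame in frames:
        candidate_tags |= set(detect_objects(frame))    # first candidate tags: detected objects
    candidate_tags |= set(recognize_actions(frames))    # second candidate tags: recognized actions
    return candidate_tags

# Toy usage with dummy models.
print(build_candidate_tag_set(
    frames=["frame_1", "frame_2"],
    detect_objects=lambda f: ["dog", "ball"],
    recognize_actions=lambda fs: ["running"],
))  # {'dog', 'ball', 'running'}
```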
- a prompt of a language model (22) can be configured based on a candidate tag set, context data, and a query statement.
- the tagging system (10) can configure a prompt by arranging prompt elements such as a candidate tag set, context data, and a query statement according to a predefined prompt template.
- the prompt template will be described in more detail later with reference to FIGS. 5 to 7.
- the specific way in which the prompt is configured in step S33 may vary depending on the embodiment.
- the tagging system (10) may divide the candidate tag set into a plurality of subsets based on a judgment that the size of the candidate tag set (e.g., the total number of candidate tags, the total length, etc.) is greater than a threshold. Then, the tagging system (10) may configure a plurality of prompts based on each of the plurality of subsets. For example, the tagging system (10) may configure a first prompt based on a first subset, a common query statement, and first context data, and may configure a second prompt based on a second subset, a common query statement, and second context data.
- the first context data may be the same as the second context data, or at least part of them may be different from each other.
- the first context data may include a script of a portion related to the first subset in the entire script (e.g., a script of a frame image (or scene) from which tags of the first subset are extracted, a script of an adjacent frame image (or scene), etc.)
- the second context data may include a script of a portion related to the second subset in the entire script.
- the language model (22) may be a model that has a limit on the maximum length of a prompt (e.g., a maximum number of tokens).
- the tagging system (10) may divide the set of candidate tags into a plurality of subsets based on the maximum length of the prompt, and construct a plurality of prompts (i.e., prompts shorter than the maximum length) based on each of the subsets.
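- One way to realize such splitting is sketched below; the whitespace-based token count is a crude stand-in for the target model's tokenizer, and the greedy packing strategy is an assumption made for illustration.

```python
def split_candidates_for_prompts(candidate_tags, fixed_prompt_text, max_tokens, count_tokens=None):
    """Split a candidate tag set into subsets so that each resulting prompt stays under max_tokens.

    fixed_prompt_text: the part of the prompt shared by every subset (query, context data, etc.).
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())       # crude stand-in tokenizer

    budget = max_tokens - count_tokens(fixed_prompt_text)   # room left for candidate tags
    subsets, current, current_cost = [], [], 0
    for tag in sorted(candidate_tags):
        cost = count_tokens(tag)
        if current and current_cost + cost > budget:
            subsets.append(current)
            current, current_cost = [], 0
        current.append(tag)
        current_cost += cost
    if current:
        subsets.append(current)
    return subsets

print(split_candidates_for_prompts({"dog", "ball", "park", "running"}, "select tags ...", max_tokens=6))
# [['ball', 'dog', 'park'], ['running']]
```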
- the prompt may be constructed based on various combinations of the embodiments described above.
- FIGS. 5 and 6 illustrate prompt templates (50, 60) that may be referenced in some embodiments of the present disclosure.
- a prompt template (50) means a template that defines prompt elements (51 to 54) to be included in a prompt, their arrangement order, content of a query statement (53), etc.
- FIG. 5 illustrates an example in which a content title (51), a set of candidate tags (52), a query statement (53), and context data (54) of content are defined as prompt elements (for the arrangement order, refer to the order of the prompt elements illustrated in FIG. 5).
- the query (53) may be written to include a phrase requesting selection of important tags matching the content from the candidate tag set (52) by referring to the context (54) of the content, for example.
- when the query (53) is written in this way, the hallucination problem of the language model (22) can be easily prevented.
- the problem of the language model (22) generating and outputting tags (e.g., keywords) that do not exist in the candidate tag set (52) can be easily prevented.
- for an example of an actual prompt template, refer to the prompt template (60) illustrated in FIG. 6.
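- As a rough illustration only (the template (60) of FIG. 6 is not reproduced here), a template following the element order of FIG. 5 might look like the sketch below; the exact wording, the comma-separated answer format, and the choice of three tags are assumptions drawn from the examples discussed in this disclosure.

```python
PROMPT_TEMPLATE = (
    "Content title: {title}\n"
    "Candidate tags: {candidate_tags}\n"
    "Question: Referring to the context data below, select the {k} most important tags "
    "that match the content from the candidate tags. Answer only with tags taken from "
    "the candidate tag list, separated by commas.\n"
    "Context data:\n{context}\n"
)

def fill_prompt(title, candidate_tags, context_text, k=3):
    # Arrange the prompt elements in the order: title, candidate tag set, query, context data.
    return PROMPT_TEMPLATE.format(
        title=title,
        candidate_tags=", ".join(sorted(candidate_tags)),
        k=k,
        context=context_text,
    )

print(fill_prompt("Ocean Documentary", {"whale", "ocean", "car"}, "A film crew follows whales ..."))
```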
- the process of defining and designing a prompt template optimized for a language model (22) can be performed manually or in an automated manner.
- the prompt template can be defined in advance by the prompt engineer.
- the prompt engineer can write a query requesting to select a specified number of important tags (e.g., the three most important tags) from the candidate tag set so that the response of the language model (22) can be evaluated quickly and accurately.
- the prompt engineer can input the prompt based on the query into the language model (22) in various forms and evaluate the response of the language model (22) for each prompt.
- the prompt engineer can synthesize the evaluation results to define a prompt template optimized for the language model (22).
- a prompt template can be automatically generated using at least one content that is tagged (set) with a set of correct tags.
- the tagging system (10) can extract candidate tags (73, i.e., a set of candidate tags) from a content (71) given a set of correct tags (72).
- the tagging system (10) can input various forms of prompts (e.g., 74, 76) based on the candidate tags (73) into the language model (22), and can evaluate a response (e.g., see 75, 77) of the language model (22) using the set of correct tags (72) (e.g., evaluating how well it matches the set of correct tags (72)).
- the tagging system (10) can construct a first prompt (e.g., 74) based on candidate tags (73), context data of content (71), and a query sentence, and can evaluate a response (e.g., 75) of a language model (22) to the first prompt (e.g., 74) using a set of correct tags (72).
- the tagging system (10) can construct a second prompt (e.g., 76) by changing at least one of the arrangement and content (e.g., content of a query sentence, etc.) of prompt elements, and can evaluate a response (e.g., 77) of a language model (22) to the second prompt (e.g., 76) using a set of correct tags (72).
- the tagging system (10) can repeat these processes for other content and comprehensively consider the evaluation results to generate a prompt template optimized for the language model (22) (e.g., generating a prompt template using the prompt with the highest comprehensive evaluation score).
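- A simplified version of this automated search is sketched below; the F1 overlap score is one possible way to evaluate how well a response matches the correct tag set (an assumption made here), and the template variants are supplied as callables that reorder or reword the prompt elements.

```python
def f1_against_gold(predicted_tags, gold_tags):
    # Overlap between the tags parsed from a response and the correct tag set.
    predicted, gold = set(predicted_tags), set(gold_tags)
    if not predicted or not gold:
        return 0.0
    precision = len(predicted & gold) / len(predicted)
    recall = len(predicted & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def pick_best_template(template_variants, labeled_examples, language_model, parse_tags):
    """Score each prompt-template variant on content whose correct tag set is known.

    template_variants: list of callables (example -> prompt text).
    labeled_examples: list of (example, correct_tag_set) pairs.
    """
    scores = []
    for build_prompt in template_variants:
        total = sum(
            f1_against_gold(parse_tags(language_model(build_prompt(example))), gold)
            for example, gold in labeled_examples
        )
        scores.append(total / max(len(labeled_examples), 1))
    best = max(range(len(template_variants)), key=scores.__getitem__)
    return template_variants[best], scores
```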
- a tag set of the corresponding content may be generated based on the response of the language model (22) to the prompt.
- the tagging system (10) may determine tags included in the response of the language model (22) as the tag set of the corresponding content, and may generate the tag set of the corresponding content through additional processing of the tags included in the response. Examples of such processing may include merging similar tags, tag expansion based on similar words, and removal of tags whose similarity with the content title (or other metadata) is below a criterion, but the scope of the present disclosure is not limited thereto.
- the tagging system (10) may verify tags (84) included in a response (i.e., a response to a prompt (83)) using a candidate tag set (82) of the content (81).
- the tagging system (10) may generate (configure) a tag set of the content (81) based on the verified tags (84).
- This verification process may be understood as being intended to prevent a problem in which tag quality is deteriorated due to a hallucination phenomenon of the language model (22).
- the tagging system (10) may remove tags that do not exist in the candidate tag set (82) among the tags (84) and generate a tag set of the content (81) using the remaining tags.
- the tagging system (10) may remove tags whose similarity with the candidate tag (see 82) is below a criterion value among the tags (84) and generate a tag set of the content (81) using the remaining tags.
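- Both verification variants can be combined as in the sketch below; exact membership covers the first case, while the `difflib` ratio with an assumed 0.8 threshold stands in for the similarity criterion, which is not fixed here.

```python
import difflib

def verify_tags(response_tags, candidate_tags, min_similarity=0.8):
    """Keep only response tags that are in, or sufficiently similar to, the candidate tag set."""
    verified = []
    for tag in response_tags:
        if tag in candidate_tags:
            verified.append(tag)            # exact match with a candidate tag
            continue
        best = max(
            (difflib.SequenceMatcher(None, tag, cand).ratio() for cand in candidate_tags),
            default=0.0,
        )
        if best >= min_similarity:
            verified.append(tag)            # close enough to some candidate tag
    return verified

print(verify_tags(["whale", "whales", "sunset"], {"whale", "ocean"}))  # ['whale', 'whales']
```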
- according to the tagging method described above, a candidate tag set for content is generated through an image analysis module (21), and important tags (i.e., tags that match the content) can be selected from the candidate tag set through a language model (22) that understands the context (or content) of the content.
- high-quality tags that well represent the characteristics, content, etc. of the content can be easily generated (extracted), and the tagging task can be easily automated.
- the human and time costs required for the tagging task can be greatly reduced accordingly.
- since the image analysis module (21) is configured to include an object detection model (41), an action recognition model (42), etc., various candidate tags for the content can be extracted easily and without omission.
- the hallucination problem of the language model (22) can be easily prevented.
- the problem of generating tags that do not exactly match the content due to the hallucination phenomenon of the language model (22) can be easily prevented.
- the hallucination problem of the language model (22) can be more easily prevented.
- the tagging system (10) can gradually strengthen the image analysis module (21) by using a tag set generated with the help of the language model (22). And, when the image analysis module (21) is sufficiently strengthened, the tagging system (10) can perform tagging work on content only by using the image analysis module (21) without the help of the language model (22).
- these embodiments will be described with reference to FIGS. 9 to 11.
- FIGS. 9 and 10 are exemplary drawings for explaining a tagging module/model operation method according to some embodiments of the present disclosure.
- the tagging system (10) can perform tagging work on content by using the image analysis module (21) and the language model (22) together.
- the tagging system (10) can strengthen the image analysis module (21) by using the tag set generated through the language model (22).
- the tag discriminator (101) may be configured to further receive, as input, image features of the content (102) (e.g., feature maps extracted through an object detection model (41) or the image analysis module (21)) in order to accurately determine whether the input tag matches the content (102).
- the tag discriminator (101) may be configured to further receive at least a portion of the context data of the content (102) as input.
- the tagging system (10) may determine (predict) whether a candidate tag input through the tag discriminator (101) is an important tag matching the content (102) and compare the determination result with a correct tag set (104) (e.g., determine whether the candidate tag exists in the correct tag set (104)) to calculate a loss (105, ‘L’).
- the tagging system (10) can update the weight parameters of the tag discriminator (101) based on the loss (105).
- a tag discriminator (101) equipped with important tag discriminatory ability can be constructed.
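- The following PyTorch sketch shows one possible shape for the tag discriminator (101) and its training step; the embedding sizes, the two-layer MLP, and the binary cross-entropy loss are assumptions, since only the inputs (candidate tag, image features) and the supervision (presence in the correct tag set (104)) are specified above.

```python
import torch
from torch import nn

class TagDiscriminator(nn.Module):
    """Binary classifier: does this candidate tag match the content? (sketch of 101)"""
    def __init__(self, tag_dim=128, image_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tag_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tag_emb, image_feat):
        return self.net(torch.cat([tag_emb, image_feat], dim=-1)).squeeze(-1)

# Training sketch: the label is 1 if the candidate tag is in the correct tag set
# (here, the tag set obtained via the language model), else 0. Dummy tensors are used.
model = TagDiscriminator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

tag_emb = torch.randn(32, 128)      # embeddings of candidate tags
image_feat = torch.randn(32, 256)   # image features from the image analysis module
labels = torch.randint(0, 2, (32,)).float()

logits = model(tag_emb, image_feat)
loss = loss_fn(logits, labels)      # loss (105, 'L')
optimizer.zero_grad()
loss.backward()
optimizer.step()
```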
- the image analysis module (21) can be strengthened.
- the tagging system (10) can perform tagging work on the content using only the enhanced image analysis module (21) (e.g., the image analysis module (21) to which the tag discriminator (101) is added).
- the resources invested in the tagging work can be greatly reduced (because the language model (22), which requires considerable resources, is not used), and the tagging work can be completely free from the hallucination problem of the language model (22).
- the ‘T2’ time point may be, for example, a time point at which the performance of the enhanced image analysis module (21) becomes higher than a reference value, and the performance of the enhanced image analysis module (21) may be judged (measured) using a tag set generated through a language model (22).
- the tagging system (10) may judge the performance of the image analysis module (21) based on the difference between the first tag set generated through the language model (22) and the second tag set generated through the image analysis module (21) (e.g., a smaller difference means that the performance of the enhanced image analysis module (21) is excellent enough to approach the language model (22)).
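- One concrete (assumed) way to measure this difference is the average Jaccard agreement between the two tag sets over evaluation contents, switching off the language model once agreement exceeds a reference value:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def language_model_still_needed(eval_pairs, agreement_threshold=0.8):
    """Decide whether the 'T2' time point has been reached.

    eval_pairs: list of (tags_from_language_model, tags_from_image_module) for the same
    contents. The 0.8 reference value is an assumption for illustration.
    """
    if not eval_pairs:
        return True
    mean_agreement = sum(jaccard(lm, im) for lm, im in eval_pairs) / len(eval_pairs)
    return mean_agreement < agreement_threshold

pairs = [({"whale", "ocean"}, {"whale", "ocean"}), ({"dog", "park"}, {"dog"})]
print(language_model_still_needed(pairs))  # True: mean agreement 0.75 is still below 0.8
```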
- the tagging system (10) may verify the tag set of the content obtained through the enhanced image analysis module (21) by using the language model (22). For example, the tagging system (10) may request the language model (22) to determine (verify) whether the tag set matches the context data of the content (e.g., by constructing a prompt based on the tag set, the context data, and a query requesting verification), or may perform verification by directly comparing the tag set with a tag set generated through the language model (22).
- FIG. 11 is an exemplary diagram illustrating a tagging module/model operation method according to some other embodiments of the present disclosure.
- the tagging system (10) can perform tagging work on content by using an object detection model (41) and a language model (22) together.
- the tagging system (10) can also perform tagging work by using another model (e.g., an action recognition model (42)).
- the tagging system (10) can generate a tag set by using a language model (22) for each frame image or scene, and can convert the tag set into a caption in a natural language form by using the language model (22) again.
- the tagging system (10) can build an image captioning model (111) by using frame images (or frame images belonging to a scene) of content and captions as training data.
- the image captioning model (111) built in this way can be added to the image analysis module (21), and as a result, the image analysis module (21) can be strengthened.
- the tagging system (10) can perform tagging tasks by using the image captioning model (111) and the language model (22) together. For example, the tagging system (10) can configure a prompt based on the caption generated through the image captioning model (111), the context data of the content, and a query (e.g., a query requesting to select important tags by referring to the context of the content or requesting processing of the caption), and can generate a tag set of the content based on the response of the language model (22) to the prompt. In addition, the tagging system (10) can perform additional learning on the image captioning model (111) in a similar manner as described above. As a result, the image captioning model (111) can be strengthened.
- the ‘T2’ point in time may be, for example, a point in time when the performance of the image captioning model (111) exceeds the reference value, and the performance of the image captioning model (111) may be judged using the caption generated by the language model (22) (see the description of FIG. 9).
- the scope of the present disclosure is not limited thereto.
- the tagging system (10) can perform tagging work on the content using only the enhanced image captioning model (111). For example, the tagging system (10) can generate a caption for the content (e.g., a caption for a frame image) through the enhanced image captioning model (111) and generate a tag set for the content based on the caption (e.g., extracting keywords from the caption, etc.).
- the enhanced image captioning model (111) can accurately generate a caption that well represents the substance of the content, without the help of a language model (22) (or the context data of the content), through an accurate understanding of the frame images.
- the ‘T3’ time point can also be determined based on the performance of the enhanced image captioning model (111), similar to the ‘T2’ time point, for example.
- the scope of the present disclosure is not limited thereto.
- the tagging system (10) may verify the caption (or tag set) of the content obtained through the enhanced image captioning model (111) by using the language model (22). For example, the tagging system (10) may request the language model (22) to determine (verify) whether the caption (or tag set) matches the context data of the content (e.g., construct a prompt based on the caption, the context data, and a query requesting verification), and may perform verification by directly comparing the caption (or tag set) with the caption (or tag set) generated through the language model (22).
- the image analysis module (21) can be strengthened by using the tag set of the content obtained through the language model (22) as training data.
- a tag discriminator (101) that determines whether a candidate tag is an important tag that matches the content can be constructed by using the training data, and the tag discriminator (101) can be added to the image analysis module (21).
- tagging work on the content can be performed without the help of the language model (22) through the strengthened image analysis module (21). In this case, the resources required for the tagging work can be greatly reduced, and the tagging work can be completely free from the hallucination problem of the language model (22).
- hereinafter, a method for creating a content collection according to some application examples of the present disclosure will be described with reference to FIGS. 12 and 13.
- This method may be performed by a computing device (not shown) other than the tagging system (10), but for the convenience of understanding, it will be described below assuming that it is performed by the tagging system (10).
- FIGS. 12 and 13 assume that the contents are contents of an OTT (Over The Top) service and that tags (i.e., tag sets) have already been set for the contents through the tagging method described above.
- a content collection means a collection of contents having common characteristics (e.g., a collection of contents having the same or similar topics, moods, characters, etc.).
- the tagging system (10) can automatically create a content collection (e.g., 128) by grouping pre-registered contents (e.g., 121 to 123) based on a common tag (e.g., 127).
- the tagging system (10) can create (configure) a content collection (128) for a specific common tag (127) by grouping contents (e.g., 121, 122, etc.) having a specific common tag (127, ‘a’) among pre-registered contents (e.g., 121 to 123).
- the method of designating the specific common tag (127) may be any method.
- the tagging system (10) can generate a content collection based on the embedding similarity between contents (e.g., 121 to 123). Specifically, the tagging system (10) can generate an embedding (e.g., an embedding vector) for the contents (e.g., 121) based on a tag set (e.g., 124). For example, the tagging system (10) can generate a content embedding by embedding the tag set (e.g., 124) through various word/text embedding techniques. Next, the tagging system (10) can group contents whose embedding similarity is higher than a criterion value to generate a content collection.
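- The embedding-and-grouping step can be sketched as below; the bag-of-tags vectors and the 0.5 similarity threshold are deliberately simple stand-ins for the word/text embedding techniques and the criterion value mentioned above.

```python
import math

def embed_tag_set(tag_set, vocab):
    # Toy bag-of-tags embedding; a real system would use word/text embedding techniques.
    return [1.0 if tag in tag_set else 0.0 for tag in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_collections(contents, threshold=0.5):
    """Greedily group contents whose tag-set embeddings are similar.

    contents: dict mapping content id -> set of tags.
    """
    vocab = sorted({t for tags in contents.values() for t in tags})
    emb = {cid: embed_tag_set(tags, vocab) for cid, tags in contents.items()}
    collections, assigned = [], set()
    for cid in contents:
        if cid in assigned:
            continue
        group = {cid} | {
            other for other in contents
            if other not in assigned and other != cid
            and cosine(emb[cid], emb[other]) >= threshold
        }
        assigned |= group
        collections.append(group)
    return collections

print(build_collections({
    "c1": {"ocean", "whale"},
    "c2": {"ocean", "whale", "documentary"},
    "c3": {"comedy", "office"},
}))  # [{'c1', 'c2'}, {'c3'}]
```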
- the tagging system (10) can designate a representative content (e.g., a popular content, etc.) and group contents whose embedding similarity with the representative content is higher than a criterion value to generate a content collection.
- tags for the content collection can be selected from tags set for the representative content.
- the tagging system (10) may generate a content collection based on various combinations of the examples described above.
- the tagging system (10) may generate a content collection (e.g., 128) by grouping contents having a common tag (e.g., 127) among contents having an embedding similarity higher than a criterion value.
- FIG. 13 illustrates an OTT service page (130).
- various content collections (e.g., 128, 131) can be provided to users through the OTT service page (130).
- user satisfaction with the OTT service can be greatly improved.
- so far, a method for creating a content collection according to some application examples of the present disclosure has been described with reference to FIGS. 12 and 13.
- an exemplary computing device (140) capable of implementing a tagging system (10) according to some embodiments of the present disclosure will be described with reference to FIG. 14.
- FIG. 14 is an exemplary hardware configuration diagram showing a computing device (140).
- the computing device (140) may include one or more processors (141), a bus (143), a communication interface (144), a memory (142) for loading a computer program executed by the processor (141), and a storage (147) for storing a computer program (146).
- other general components may be included in addition to the components illustrated in FIG. 14. That is, the computing device (140) may further include various components in addition to the components illustrated in FIG. 14. In addition, in some cases, the computing device (140) may be configured in a form in which some of the components illustrated in FIG. 14 are omitted. Hereinafter, each component of the computing device (140) will be described.
- the processor (141) can control the overall operation of each component of the computing device (140).
- the processor (141) can be configured to include at least one of a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphics Processing Unit), an NPU (Neural Processing Unit), or any other type of processor well known in the technical field of the present disclosure.
- the processor (141) can perform operations for at least one application or program for executing operations/methods according to embodiments of the present disclosure.
- the computing device (140) can have one or more processors.
- the memory (142) can store various data, commands, and/or information.
- the memory (142) can load a computer program (146) from the storage (147) to execute operations/methods according to embodiments of the present disclosure.
- the memory (142) may be implemented as a volatile memory such as RAM, but the technical scope of the present disclosure is not limited thereto.
- the bus (143) can provide a communication function between components of the computing device (140).
- the bus (143) can be implemented as various types of buses such as an address bus, a data bus, and a control bus.
- the communication interface (144) can support wired and wireless Internet communication of the computing device (140).
- the communication interface (144) can support various communication methods other than Internet communication.
- the communication interface (144) can be configured to include a communication module well known in the technical field of the present disclosure.
- the storage (147) can non-transitorily store one or more computer programs (146).
- the storage (147) can be configured to include non-volatile memory such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a flash memory, a hard disk, a removable disk, or any form of computer-readable recording medium well known in the art to which the present disclosure pertains.
- the computer program (146) may include instructions that cause the processor (141) to perform operations/methods according to various embodiments of the present disclosure when loaded into the memory (142). That is, the processor (141) may perform operations/methods according to various embodiments of the present disclosure by executing the loaded instructions.
- the computer program (146) may include instructions that cause the image analysis module (21) to generate a set of candidate tags including a plurality of candidate tags related to the content, construct a prompt based on the set of candidate tags, contextual data about the content, and a query statement, and generate a set of tags for the content based on a response of the language model (22) to the prompt.
- the computer program (146) may include instructions to perform at least some of the steps/operations described with reference to FIGS. 1 through 13.
- a tagging system (10) may be implemented via a computing device (140).
- the computing device (140) illustrated in FIG. 14 may mean a virtual machine implemented based on cloud technology.
- the computing device (140) may be a virtual machine operating on one or more physical servers included in a server farm.
- the processor (141), memory (142), and storage (147) illustrated in FIG. 14 may be virtual hardware, and the communication interface (144) may also be implemented as a virtualized networking element such as a virtual switch.
- a computer program recorded on a computer-readable recording medium can be transmitted to another computing device through a network such as the Internet and installed on said other computing device, thereby allowing it to be used on said other computing device.
Abstract
The present disclosure relates to a tagging method for content and a system therefor. A tagging method according to some embodiments may include the steps of: generating a candidate tag set including a plurality of candidate tags related to content through an image analysis module; configuring a prompt based on the candidate tag set, context data for the content, and a query statement; and generating a tag set for the content based on a response of a language model to the prompt. According to the method, high-quality tags that well represent the characteristics, details, and the like of the content can be easily created, and the tagging task for the content can be easily automated.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020230098584A KR102705765B1 (ko) | 2023-07-28 | 2023-07-28 | 콘텐츠를 위한 태깅 방법 및 그 시스템 |
| KR10-2023-0098584 | 2023-07-28 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025028676A1 (fr) | 2025-02-06 |
Family
ID=92757366
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2023/011140 (WO2025028676A1, pending) | Procédé d'étiquetage pour contenu et système associé | 2023-07-28 | 2023-07-31 |
Country Status (3)
| Country | Link |
|---|---|
| KR (2) | KR102705765B1 (fr) |
| TW (1) | TWI883500B (fr) |
| WO (1) | WO2025028676A1 (fr) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102878133B1 (ko) * | 2024-11-06 | 2025-10-30 | 주식회사 인핸스 | 대규모 언어 모델 기반 반정형 데이터 자동분류 시스템 |
| KR102815032B1 (ko) * | 2024-11-14 | 2025-05-30 | 주식회사 일만백만 | 인공지능 기반 아티클 원문으로부터 관련 콘텐츠를 자동으로 생성하는 방법 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20090090613A (ko) * | 2008-02-21 | 2009-08-26 | 주식회사 케이티 | 멀티모달 대화형 이미지 관리 시스템 및 방법 |
| KR20090101863A (ko) * | 2008-03-24 | 2009-09-29 | 강민수 | 디지털 콘텐츠 내용 맞춤형 키워드 광고를 위한 상업적 태그 집합 생성 방법 |
| KR101607468B1 (ko) * | 2015-02-27 | 2016-03-30 | 고려대학교 산학협력단 | 콘텐츠에 대한 키워드 태깅 방법 및 시스템 |
| KR20200054613A (ko) * | 2018-11-12 | 2020-05-20 | 주식회사 코난테크놀로지 | 동영상 메타데이터 태깅 시스템 및 그 방법 |
| KR20210156283A (ko) * | 2019-04-19 | 2021-12-24 | 삼성전자주식회사 | 프롬프트 정보 처리 장치 및 방법 |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10929665B2 (en) * | 2018-12-21 | 2021-02-23 | Samsung Electronics Co., Ltd. | System and method for providing dominant scene classification by semantic segmentation |
| US11755643B2 (en) * | 2020-07-06 | 2023-09-12 | Microsoft Technology Licensing, Llc | Metadata generation for video indexing |
| CN112765403A (zh) * | 2021-01-11 | 2021-05-07 | 北京达佳互联信息技术有限公司 | 一种视频分类方法、装置、电子设备及存储介质 |
| CN115114479B (zh) * | 2022-04-18 | 2025-03-14 | 腾讯科技(深圳)有限公司 | 视频标签的生成方法和装置、存储介质及电子设备 |
- 2023
- 2023-07-28 KR KR1020230098584A patent/KR102705765B1/ko active Active
- 2023-07-31 WO PCT/KR2023/011140 patent/WO2025028676A1/fr active Pending
- 2023-08-04 TW TW112129432A patent/TWI883500B/zh active
- 2024
- 2024-09-06 KR KR1020240121383A patent/KR20250018144A/ko active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| TWI883500B (zh) | 2025-05-11 |
| KR102705765B1 (ko) | 2024-09-11 |
| KR20250018144A (ko) | 2025-02-04 |
| TW202533070A (zh) | 2025-08-16 |
| TW202505401A (zh) | 2025-02-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23947681; Country of ref document: EP; Kind code of ref document: A1 |