Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that, in the description of the present invention and the claims and the above figures, the terms "first," "second," and the like are used merely to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a multi-tag matching RAG intelligent question-answering method provided by an embodiment of the invention. The method is suitable for constructing a vector database of document text and tag data, quickly finding information related to user requirements, and analyzing and summarizing that information with a large language model so as to return answers to user questions. The method may be executed by a multi-tag generation matching RAG intelligent question-answering device, which may be implemented in hardware and/or software and configured in an electronic device. As shown in fig. 1, the method includes:
S110, acquiring original text data corresponding to a target service scene, and carrying out vectorization processing on the original text data to determine a text data set to be processed.
The target service scenario may be understood as a service scenario associated with a user requirement; the user may determine, according to the requirement, a service scenario for which a text vector database is to be constructed, for example, a financial question-answering assistant, a corporate regulation assistant, and the like. The original text data may be raw, unprocessed text data, such as a company reimbursement system file or a company recruitment system file. A text data set to be processed may be understood as a data set consisting of text vectors of a plurality of text blocks, obtained by processing the original text data. It should be noted that Retrieval-Augmented Generation (RAG) combines a language model with information retrieval technology: when the model needs to generate text or answer questions, related information is retrieved from a document set, and the retrieved information is then used to guide text generation, thereby improving the quality and accuracy of prediction.
Specifically, relevant business data and label data are collected and integrated in a targeted manner according to the usage scenario and the user requirements. Depending on the usage scenario, the question-and-answer data files and classification labels (such as business trip reimbursement, invoice management, and the like) required in the current scenario are collected, and specific label information (such as user permission, access level, and the like) may be collected according to user requirements. It should be noted that the original text data may comprise different data types, including but not limited to structured data and unstructured data, and the file format may be TXT, DOC, DOCX, or other types. The extracted original text data is then vectorized to obtain the text data set to be processed corresponding to the original text data.
On the basis of the above technical scheme, vectorizing the original text data to determine the text data set to be processed comprises: determining a text format corresponding to the original text data; extracting text content from the original text data according to the text format; dividing the text content according to a segmentation identifier to determine each text block corresponding to the text content; assigning a text block identifier to each text block; and determining the text data set to be processed based on each text block and the text block identifier assigned to it.
The text format may be the file format of the original text data, and may include DOCX, PDF, and XLSX formats. Text content may be understood as the text information in the original text data. The segmentation identifier may be a preset identifier for segmenting the text; various text symbols, such as punctuation marks like periods and question marks, may be used as segmentation identifiers. A text block may be a marked text segment that is processed as a whole. A text block identifier may be understood as an identifier that uniquely identifies a text block, and may be, for example, a numeric ID.
Specifically, after text parsing, text segmentation, vectorization, and label extraction and labeling are performed on the collected data, the text blocks, semantic vectors, and label sets are stored in a vector database for subsequent multi-path recall based on user questions and for knowledge input to the large language model when an answer is generated. For example, the format of the text material is first identified, and different tools are used to identify and extract text content depending on the format: text extraction may be performed with the PDFMiner or PyPDF libraries for PDF documents, while content reading may be performed with the python-docx library for Word documents. Text information in pictures and tables that may exist in the document is further processed, and the text content is then subdivided into smaller text blocks through a recursive, dynamic chunking strategy, using various text symbols (such as periods, question marks, and other punctuation) as segmentation identifiers, until the preset block-size requirement is met. It should be noted that each text block after segmentation is assigned a unique identifier (ID) to facilitate subsequent processing and tracing. The resulting structured data takes the form [{"id": id_1, "data": text block_1}, {"id": id_2, "data": text block_2}, ...]. The segmented text data is then vectorized using a pre-trained word vector embedding model based on supervised learning and contrastive learning (such as BERT, RoBERTa, and the like). Each text block may be input into the embedding model to generate a corresponding high-dimensional vector representation capable of capturing the semantic information in the text. The generated vector is then stored in a vector database together with the text identifier and the original text block, ensuring that relevant information can be retrieved and matched efficiently during subsequent searches. The processed data structure becomes [{"id": id_1, "data": text block_1, "vector": vector_1}, {"id": id_2, "data": text block_2, "vector": vector_2}, ..., {"id": id_n, "data": text block_n, "vector": vector_n}].
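By way of illustration only, a minimal Python sketch of the parse-split-embed pipeline described above might look as follows. The punctuation set used as segmentation identifiers, the block-size limit, the embedding model name, and the sample file name are all illustrative assumptions, not part of the embodiment:

```python
import re
import uuid

from pypdf import PdfReader                              # PDF text extraction (PyPDF)
from sentence_transformers import SentenceTransformer    # pre-trained embedding model

SPLIT_PATTERN = r"(?<=[.?!])"    # periods/question marks as segmentation identifiers (assumption)
MAX_BLOCK_LEN = 300              # preset block-size requirement (assumption)

def extract_text(path: str) -> str:
    """Pick an extraction tool according to the identified file format."""
    if path.lower().endswith(".pdf"):
        return "".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if path.lower().endswith(".docx"):
        from docx import Document                        # python-docx for Word documents
        return "\n".join(p.text for p in Document(path).paragraphs)
    with open(path, encoding="utf-8") as f:              # TXT and similar formats
        return f.read()

def split_blocks(text: str, max_len: int = MAX_BLOCK_LEN) -> list[str]:
    """Split on segmentation identifiers until blocks meet the size requirement."""
    blocks, current = [], ""
    for sentence in re.split(SPLIT_PATTERN, text):
        if current and len(current) + len(sentence) > max_len:
            blocks.append(current)
            current = ""
        current += sentence
    if current:
        blocks.append(current)
    return blocks

model = SentenceTransformer("all-MiniLM-L6-v2")          # stand-in embedding model
records = []
for block in split_blocks(extract_text("reimbursement_policy.pdf")):  # hypothetical file
    records.append({
        "id": str(uuid.uuid4()),                         # unique text block identifier
        "data": block,
        "vector": model.encode(block).tolist(),          # high-dimensional semantic vector
    })
```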
S120, determining a text label set corresponding to the text data set to be processed, and establishing a text vector database based on the text label set and the text data set to be processed.
Wherein the text tag set may be a data set of tags corresponding to text data. The text vector database is a database for storing text blocks, semantic vectors, and tag sets, and supports efficient query and retrieval.
Specifically, after label extraction and labeling are performed on the segmented text, the text block data is bound to the corresponding label set based on the text block's unique ID, adding an additional information dimension to each data item and improving query efficiency and accuracy. The final data takes the form [{"id": id_1, "data": text block_1, "vector": vector_1, "label": [label_1, label_2, ..., label_n]}, {"id": id_2, "data": text block_2, "vector": vector_2, "label": [label_1, label_2, ..., label_n]}, ..., {"id": id_n, "data": text block_n, "vector": vector_n, "label": [label_1, label_2, ..., label_n]}]. A suitable vector database is then selected based on project requirements, such as PostgreSQL, Elasticsearch, or a specialized vector search engine (e.g., Faiss), and the processed data items are stored in the selected vector database. The indexing structures and query algorithms of the vector database or vector search engine ensure that the system can quickly respond to a user's search request and provide accurate result matching.
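A minimal sketch of binding labels to text blocks and indexing the vectors, here assuming Faiss as the vector search engine; the illustrative label set, the embedding dimension, and the choice of an inner-product index are assumptions:

```python
import numpy as np
import faiss

DIM = 384                                         # embedding dimension (assumption)
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))  # inner product == cosine on normalized vectors

metadata = {}                                     # integer id -> {"id", "data", "label"}
for i, rec in enumerate(records):                 # `records` from the previous sketch
    rec["label"] = ["business trip reimbursement"]         # illustrative label set
    vec = np.asarray(rec["vector"], dtype="float32")[None, :]
    faiss.normalize_L2(vec)                       # normalize so inner product is cosine
    index.add_with_ids(vec, np.asarray([i], dtype="int64"))
    metadata[i] = {"id": rec["id"], "data": rec["data"], "label": rec["label"]}
```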
On the basis of the above technical scheme, determining the text label set corresponding to the text data set to be processed comprises: when a preset text label set corresponding to the original text data exists, inputting the text data set to be processed and the preset text label set into a preset semantic retrieval model to obtain a candidate text label set; and inputting the candidate text label set and the text data set to be processed into a large language model to determine the text label set corresponding to the text data set to be processed.
The preset text label set may be a pre-established label set corresponding to the original text; it may be obtained by manually labeling the original text, or it may be a set of text labels obtained by processing the original text with a pre-trained semantic recognition model. The preset semantic retrieval model may be a pre-trained large language model used for retrieval according to text semantics.
Specifically, when a preset text label set corresponding to the original text data exists, the text data set to be processed and the preset text label set are input together into the preset semantic retrieval model. Based on the semantic similarity between text contents, the model retrieves the labels most relevant to the text data set to be processed from the preset label set and generates a candidate text label set containing the labels highly relevant to the text data set to be processed. The candidate text label set and the text data set to be processed are then input into a large language model, such as GPT-4 or BERT, to determine the most accurate text labels. For example, as shown in fig. 2, when a preset candidate tag library exists, a pre-trained semantic retrieval model is used to perform preliminary screening on the question text or knowledge text segment, recalling the candidate labels most relevant to the question text or knowledge text from the preset candidates. A dedicated tag-screening prompt word is then constructed, and the recalled candidate tag set together with the question text or knowledge text is input into the large language model to accurately locate the correct tags to be finally adopted. Generation of a hierarchical, multi-level tag structure is supported, ensuring that each tag accurately reflects the core content of a document or sentence.
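The two-stage tag matching might be sketched as follows, assuming a sentence-transformers model as the preset semantic retrieval model; the model name, the preset tag list, and the `call_llm` helper standing in for the large language model are all hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("all-MiniLM-L6-v2")   # assumed semantic retrieval model
PRESET_TAGS = ["business trip reimbursement", "invoice management", "user permission"]

def recall_candidate_tags(text: str, top_k: int = 5) -> list[str]:
    """Preliminary screening: recall the preset tags most similar to the text."""
    tag_vecs = retriever.encode(PRESET_TAGS, normalize_embeddings=True)
    text_vec = retriever.encode(text, normalize_embeddings=True)
    scores = tag_vecs @ text_vec                       # cosine similarities
    return [PRESET_TAGS[i] for i in np.argsort(-scores)[:top_k]]

def confirm_tags(text: str, candidates: list[str]) -> str:
    """Tag-screening prompt word sent to the large language model."""
    prompt = (
        "From the candidate tags below, select only those that accurately "
        f"reflect the core content of the text.\nCandidates: {candidates}\n"
        f"Text: {text}"
    )
    return call_llm(prompt)   # `call_llm` is a hypothetical client for a GPT-4-class model
```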
On the basis of the above technical scheme, determining the text label set corresponding to the text data set to be processed comprises: when no preset text label set corresponding to the original text data exists, constructing a tag extraction prompt word corresponding to the text data set to be processed; and inputting the tag extraction prompt word and the text data set to be processed into a large language model to determine the text label set corresponding to the text data set to be processed.
The tag extraction prompt word is used to instruct the large language model to perform tag extraction, and may be a prompt word corresponding to the target service scenario.
Specifically, when no preset text label set is available, a tag extraction prompt word is constructed. The tag extraction prompt word is a phrase or question related to the content of the text data set to be processed that can guide the large language model to generate related tags. For example, if the text data set to be processed concerns technology news, the tag extraction prompt word may include "What is the subject of this news?". The constructed tag extraction prompt word and the text data set to be processed are then input into a large language model, which generates candidate tags according to the input prompt word and text content. For example, when no preset text label set corresponding to the original text data exists, a specific tag extraction prompt word is constructed and input into the large language model together with the question text or knowledge text segment provided by the user. The large language model extracts the most relevant category and content tags based on this information. This process supports generating a hierarchical, multi-level tag structure from documents, text blocks, or even single sentences. The finally extracted tags are bound to the original question text or knowledge text block for further processing.
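A sketch of constructing such a tag extraction prompt word is shown below; the scene description, the prompt wording, and the hypothetical `call_llm` helper are illustrative assumptions:

```python
def extract_tags_without_presets(text: str, scene: str = "corporate regulation assistant") -> str:
    """Construct a tag extraction prompt word and let the LLM generate tags."""
    prompt = (
        f"You are labeling documents for a {scene}.\n"          # scene is an assumption
        "Extract the most relevant category and content tags from the text below, "
        "as a hierarchical list from general to specific.\n"
        f"Text: {text}\nTags:"
    )
    return call_llm(prompt)    # hypothetical LLM client, as in the previous sketch
```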
When it is detected that a custom tag is provided for a specific document, custom labeling of the document data can be performed according to that tag. For example, when data containing sensitive information is processed, tags such as permission management tags related to the data can be automatically generated and accurately attached to the corresponding question text or knowledge text block IDs.
And S130, under the condition that a to-be-processed problem is received, determining a problem text label and a problem text vector corresponding to the to-be-processed problem.
The problem to be processed may be a problem associated with the target business scenario entered by the user. The problem text label can be understood as a text label obtained by extracting a label of a problem text to be processed, and the corresponding problem text vector can be a text vector corresponding to the problem text obtained by vectorizing the problem text.
Specifically, when a problem to be processed is received, the problem is deeply understood and analyzed. By identifying the subject, key information, and type of the problem, the most representative keywords or phrases can be extracted from it; the extracted keywords are matched against the existing tag library, and the best-fitting tags are selected. The problem is cleaned (e.g., stop words and punctuation marks are removed), followed by word segmentation, stemming, or lemmatization, and the text is converted into feature vectors by means of bag-of-words models, TF-IDF, word embeddings (such as Word2Vec, GloVe, BERT, and the like), and similar techniques. Finally, the determined problem text label and the generated problem text vector are integrated into a complete problem representation containing both the semantic information and the numerical features of the problem. It should be noted that the above method for processing problem text is also applicable to processing document data.
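A condensed sketch of these question-processing steps, assuming a small illustrative stop-word list and reusing the hypothetical embedding model and tag-recall helper from the earlier sketches:

```python
import re
from sentence_transformers import SentenceTransformer

STOP_WORDS = {"the", "a", "an", "of", "is"}          # illustrative stop-word list
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model

def represent_question(question: str) -> dict:
    """Clean, tokenize, tag, and vectorize a problem to be processed."""
    cleaned = re.sub(r"[^\w\s]", " ", question.lower())           # strip punctuation
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]  # remove stop words
    return {
        "text": question,
        "tags": recall_candidate_tags(" ".join(tokens)),          # match against the tag library
        "vector": embedder.encode(question, normalize_embeddings=True),
    }
```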
S140, determining a text to be applied from the text vector database based on the text of the question to be processed, the text label of the question and the text vector of the question.
The text of the question to be processed may be text data of the question to be processed. The text to be applied is text for constructing an answer corresponding to the question.
Specifically, the similarity between the problem text vector and the text vectors in the database can be calculated using a similarity measurement method, such as cosine similarity, Euclidean distance, or Manhattan distance. The texts most similar to the problem to be processed are screened out according to a preset similarity threshold, and on the basis of the similarity calculation the results can be further filtered using the problem text label. For example, if the problem to be processed concerns a "machine learning algorithm", only texts in the database that are also labeled "machine learning algorithm" may be considered. The texts in the database are ranked according to the similarity scores and the tag filtering results, and the text that best meets the requirements of the problem to be processed is selected as the text to be applied.
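A minimal sketch combining tag filtering with cosine similarity over the stored records; the similarity threshold is an assumed value:

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_texts_to_apply(question: dict, records: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep blocks sharing a tag with the question, then rank by similarity."""
    scored = []
    for rec in records:
        if not set(rec["label"]) & set(question["tags"]):   # tag filtering step
            continue
        sim = cosine(question["vector"], rec["vector"])
        if sim >= threshold:                                # preset similarity threshold
            scored.append((sim, rec))
    return [rec for _, rec in sorted(scored, key=lambda x: -x[0])]
```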
S150, generating a target question answer corresponding to the to-be-processed question according to the to-be-applied text and a preset prompt word template.
The preset prompting word template can be a preset prompting word corresponding to a service scene.
Specifically, according to the usage scenario, a preset prompt word template is used to combine the user question with the text to be applied to generate a corresponding answer: the text to be applied is formatted with the preset prompt word template to generate the target question answer corresponding to the problem to be processed. The answer-generation prompt word template must be written in advance according to the specific application scenario as a basic framework. The template is designed to accommodate the user problem and the related text blocks recalled from the knowledge base; by filling in these contents appropriately, a complete, structured prompt word is formed so that the large language model can accurately understand and respond to the user's query requirements. After the constructed answer-generation prompt word is input into the large language model, an accurate and comprehensive answer is generated through comprehensive analysis of the user question and its related background knowledge.
On the basis of the technical scheme, the method for generating the target question answer corresponding to the to-be-processed question according to the to-be-applied text and the preset prompt word template comprises the steps of obtaining the preset prompt word template corresponding to a target service scene, generating an answer generation prompt word according to the preset prompt word template and the to-be-applied text, inputting the answer generation prompt word into a large language model, and generating the target question answer corresponding to the to-be-processed question.
Specifically, an appropriate preset prompt word template is selected or constructed according to the target business scenario. Such a template is usually designed based on business logic, common problem types, or historical data analysis, and is used to guide the large language model to generate answers conforming to a specific format or style. For example, for a customer consultation scenario, the template may include phrases such as "According to the provided information, your answer is..." or "Regarding your question, the solution is...". The text to be applied is combined with the preset prompt word template to generate the answer-generation prompt word: key information in the text to be applied can be inserted into the corresponding positions of the template, or the text to be applied can be appropriately rewritten according to the structure and style of the template. The generated answer-generation prompt word is then input into a pre-trained large language model, such as the GPT series (GPT-3, GPT-4, etc.), which generates the target question answer corresponding to the problem to be processed based on the answer-generation prompt word.
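A sketch of filling a preset prompt word template with the recalled texts and the user question; the template wording and the hypothetical `call_llm` helper are illustrative, not a prescribed template:

```python
ANSWER_TEMPLATE = (                               # illustrative preset prompt word template
    "Answer the user's question using only the reference material below.\n"
    "Reference material:\n{context}\n\n"
    "Question: {question}\n"
    "According to the provided information, your answer is:"
)

def generate_answer(question: str, texts_to_apply: list[str]) -> str:
    """Fill the template to build the answer-generation prompt word."""
    prompt = ANSWER_TEMPLATE.format(
        context="\n---\n".join(texts_to_apply),
        question=question,
    )
    return call_llm(prompt)   # hypothetical client for a GPT-series model
```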
The technical scheme provided by the embodiment of the invention can be further described with reference to fig. 3. As shown in fig. 3, service data is first obtained based on the current usage scenario; text parsing, document cutting, text label extraction and labeling, and text block vectorization are performed on the service data, and the final text blocks, vectors, and label sets are stored together in a vector database. The user then inputs a question, which undergoes word segmentation, label extraction and labeling, and vectorization; multi-path retrieval is performed in the vector database to recall related text blocks, which are finally combined with preset prompt words and sent to the large language model to generate the final answer.
According to the above technical scheme, original text data corresponding to a target service scene is obtained and vectorized to determine a text data set to be processed; a text label set corresponding to the text data set to be processed is determined, and a text vector database is built based on the text label set and the text data set to be processed. Further, when a problem to be processed is received, a problem text label and a problem text vector corresponding to the problem to be processed are determined; a text to be applied is determined from the text vector database based on the problem text, the problem text label, and the problem text vector; and a target problem answer corresponding to the problem to be processed is generated according to the text to be applied and a preset prompt word template. On this basis, the information related to the user's needs is quickly found by constructing a vector database of document text and label data, and is analyzed and summarized with a large language model so as to return an answer to the user's question. By customizing the data sources and data labels according to the service scenario and combining them with vector retrieval, the fine granularity and accuracy of the retrieval results are improved.
Example 2
Fig. 4 is a flowchart of a multi-tag matching RAG intelligent question-answering method according to an embodiment of the present invention, in which the step of determining a text to be applied from the text vector database based on the problem text to be processed, the problem text label, and the problem text vector is further refined on the basis of the above technical scheme. As shown in fig. 4, the method includes:
S210, calculating, based on a preset text matching algorithm, a relevance score between the problem text to be processed and each text block in the text vector database, and determining, from the text vector database according to the relevance scores, a text block set to be processed corresponding to the problem text to be processed.
The preset text matching algorithm may be a preset algorithm for calculating text similarity, for example, may be a cosine similarity, a Jaccard similarity, an edit distance (such as a Levenshtein distance), a BM25 algorithm, and the like.
Specifically, a relevance score between the problem text to be processed and each text block in the text vector database is calculated using the selected text matching algorithm. The relevance score is a numerical value between 0 and 1 indicating the similarity or matching degree between two pieces of text. A threshold can be set according to specific requirements so that only text blocks scoring above it are selected, or all text blocks can be sorted by relevance score and the top N highest-scoring blocks selected as the text block set to be processed. Illustratively, text block data is retrieved from the vector database based on the user question using the BM25 algorithm:

\[ \text{score}(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} \]

where \(Q\) represents the query statement, \(D\) represents a document, \(q_i\) represents the \(i\)-th word in the query statement, \(n\) represents the total number of words in the query statement, \(|D|\) represents the document length, and \(\text{avgdl}\) represents the average length of the documents in the document set. \(IDF(q_i)\) is the inverse document frequency, used to measure the importance of a word in the document set, and is calculated as

\[ IDF(q_i) = \log\left(\frac{N}{\text{df}(q_i)} + 1\right) \]

where \(N\) is the total number of documents and \(\text{df}(q_i)\) is the number of documents containing the word \(q_i\). \(f(q_i, D)\) is the term frequency (TF), i.e., the number of occurrences of the \(i\)-th word in the document, used to measure the importance of a word within the document. \(k_1\) and \(b\) are tuning parameters that control, respectively, the influence of individual word counts in the document and the influence of document length: \(k_1 \in [1.2, 2.0]\) scales the term frequency of feature words (the larger \(k_1\), the greater the influence of raw term frequency on relevance), and \(b\) is called the document length normalization factor (the larger \(b\), the greater the influence of document length on relevance). The text blocks are sorted according to the retrieval scores calculated by this formula to obtain a text block ID ranking; these text blocks serve as the basis for the subsequent generation of answers.
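A self-contained sketch of BM25 scoring matching the formula above; whitespace tokenization and the parameter defaults are simplifying assumptions:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N                  # average document length
    df = Counter(term for d in docs for term in set(d))    # document frequency df(q_i)
    scores = []
    for d in docs:
        tf = Counter(d)                                    # term frequency f(q_i, D)
        score = 0.0
        for q in query:
            idf = math.log(N / (df[q] or 1) + 1)           # IDF(q_i)
            denom = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Example: rank three tokenized text blocks for a tokenized user question.
ranking = bm25_scores(["invoice", "reimbursement"],
                      [["invoice", "rules"], ["travel", "reimbursement", "invoice"], ["hiring"]])
```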
S220, screening out, from the text vector database based on the problem text label, text blocks to be verified that match the problem text label; calculating a relevance score between the problem text to be processed and each text block to be verified according to the preset text matching algorithm; and determining, from the text blocks to be verified based on the relevance scores, a text block set to be processed corresponding to the problem text to be processed.
The text block to be verified can be a text block matched with the text label of the problem, which is obtained by screening from a text vector database in a label matching mode.
Specifically, text blocks matching the problem text label are screened out of the text vector database according to the problem text label and used as text blocks to be verified. A relevance score between the problem text to be processed and each text block to be verified is calculated according to the preset text matching algorithm, and the text blocks with the higher relevance scores are then selected from the text blocks to be verified as the text block set to be processed. Illustratively, the text blocks to be verified can be arranged in descending order of the calculated relevance scores, the text block set to be processed can be selected according to a preset selection count, and the text block ID of each text block in the set can be obtained.
In another possible implementation, the number of label matches between each text block in the text vector database and the problem text label may be determined, and the text block set to be processed corresponding to the problem text label may be determined from the text vector database according to the number of label matches.
The number of label matches may be the number of labels shared by a text block and the problem's label set.
Specifically, for each text block in the text vector database, the number of tag matches between the text block and the problem text label is calculated; for example, the text block's tag set is compared with the tag set carried by the problem to find the number of common tags. The text blocks with the highest number of matches with the problem text label are then screened out of the text vector database according to the determined screening criteria to form the text block set to be processed. Illustratively, based on the tags of the user question, matching tag data is searched in the vector database; the text blocks are sorted according to the number of tags each shares with the user question's labels, and the text block ID ranking is obtained.
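This label-matching retrieval path might be sketched as follows, reusing the `metadata` mapping from the earlier storage sketch:

```python
def rank_by_tag_matches(question_tags: list[str], metadata: dict) -> list[int]:
    """Rank text block IDs by the number of tags shared with the question."""
    counts = {
        block_id: len(set(item["label"]) & set(question_tags))
        for block_id, item in metadata.items()
    }
    # Highest number of common tags first; blocks with no common tag are dropped.
    return [bid for bid, c in sorted(counts.items(), key=lambda x: -x[1]) if c > 0]
```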
S230, calculating the vector similarity between each text block in the text vector database and the problem text vector based on a vector search engine, and determining a text block set to be processed corresponding to the problem text vector from the text vector database according to the vector similarity.
Vector similarity can be understood as a score of the degree of similarity between the problem text vector and a text block vector.
Specifically, a similarity calculation function provided by a vector search engine, such as cosine similarity, Euclidean distance, or dot product, is used to calculate the similarity between the problem text vector and each text block vector in the text vector database, and the text blocks most relevant to the problem text vector are screened from the database according to a determined threshold or sorting result to form the text block set to be processed. Illustratively, the vector representation of the user question is used to search the vector database for the semantically most similar text block vectors: the vector search engine in the database calculates the similarity between the vectors, the results are sorted from high to low by similarity score, and the text block ID ranking is finally obtained.
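A sketch of this vector retrieval path, reusing the Faiss index assumed in the earlier storage sketch:

```python
import numpy as np
import faiss   # the `index` below comes from the earlier storage sketch

def rank_by_vector_similarity(question_vector, top_k: int = 10) -> list[int]:
    """Return text block IDs sorted by similarity score, highest first."""
    q = np.asarray(question_vector, dtype="float32")[None, :]
    faiss.normalize_L2(q)                    # cosine via normalized inner product
    scores, ids = index.search(q, top_k)     # engine-side similarity computation
    return [int(i) for i in ids[0] if i != -1]
```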
S240, determining the text to be applied based on the text block set to be processed.
On the basis of the above technical scheme, determining the text to be applied based on the text block set to be processed comprises: determining the text block identifier corresponding to each text block in the text block set to be processed; performing de-duplication on the text block set according to the text block identifiers; determining a composite score for each text block in the de-duplicated set according to a composite scoring algorithm; determining target text blocks from the set according to the composite scores; and acquiring the text to be applied from the text vector database based on the text block identifiers of the target text blocks.
Wherein the text block identification may be a numerical identification corresponding to each text block in the vector database. The composite scoring algorithm may be an algorithm for determining a composite match score corresponding to each text block.
Specifically, the text block sets to be processed obtained from the multi-path retrieval in the above steps are combined, and de-duplication is performed on all text block IDs in the combined set. The RRF (Reciprocal Rank Fusion) algorithm is then used to fuse the text block rankings of the individual retrieval paths. The weighted RRF formula is as follows:

\[ RRF_w(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + r(d)} \]

where \(RRF_w(d)\) represents the weighted multi-path retrieval score; \(\sum_{r \in R}\) denotes summation over all retrieval paths \(r\); \(w_r\) is the weight given to the \(r\)-th retrieval path, used to adjust the importance of the different retrieval paths in the final result; and \(\frac{1}{k + r(d)}\) is the reciprocal rank score of the original RRF algorithm, where \(k\) is a constant and \(r(d)\) represents the rank of document \(d\) in the \(r\)-th retrieval path. The composite score of each text block is calculated with this formula, the text block IDs ranked in the top K positions are selected according to the preset TOP_K number, and the corresponding text content is recalled as the final retrieval result.
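A minimal sketch of the weighted RRF fusion over de-duplicated text block IDs; the path weights, the constant k, and the TOP_K value are assumptions:

```python
def weighted_rrf(ranked_paths: dict[str, list[int]],
                 weights: dict[str, float],
                 k: int = 60, top_k: int = 5) -> list[int]:
    """Fuse per-path text block ID rankings with weighted Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}                    # dict keying de-duplicates block IDs
    for path, ranking in ranked_paths.items():       # each retrieval path r
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[path] / (k + rank)
    # Composite score per text block; keep the TOP_K highest-ranked IDs.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: fuse the three retrieval paths of this embodiment.
fused_ids = weighted_rrf(
    {"bm25": [3, 1, 2], "tags": [1, 4], "vector": [1, 2, 5]},
    weights={"bm25": 1.0, "tags": 0.8, "vector": 1.2},
)
```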
On the basis of the above technical scheme, in order to further optimize the quality and response speed of the retrieval results, a Reranker may be applied before or after the RRF algorithm as required. When the Reranker is applied before RRF, the retrieval results of each retrieval path are first rearranged by the Reranker, and the rearranged results of each path are then sent to the RRF algorithm for composite ranking. It should be noted that placing the Reranker before RRF is suitable when the user wants to obtain more content: applying the Reranker independently on each retrieval path allows the result quality of each path to be tuned more finely.
When the Reranker is applied after RRF, the retrieval results of all paths are first fused by the RRF algorithm, and the fused results are then rearranged by the Reranker. It should be noted that placing the Reranker after RRF is suitable when the user wants to obtain high-quality results more quickly: reducing the amount of data processed by the Reranker can significantly improve the response speed of the system.
On the basis of the above technical scheme, the Rerankers adopted therein are further described. A statistics-based Reranker aggregates the candidate result lists from multiple sources and recalculates scores for all results, using a weighted score or the Reciprocal Rank Fusion (RRF) algorithm over the multiple recall paths. The formula is

\[ \text{NewScore}(d) = \sum_{r \in R} w_r' \cdot \frac{1}{k' + r'(d)} \]

where \(w_r'\) is the weight given to the \(r\)-th retrieval path, \(k'\) is a constant, and \(r'(d)\) represents the rank of document \(d\) after Reranker processing. It should be noted that the statistics-based Reranker is computationally simple and highly efficient, making it suitable for traditional, latency-sensitive retrieval systems.
Further, a Reranker based on a deep learning model (a Cross-Encoder Reranker) may be included, which uses a specially trained neural network to analyze the relevance between the problem and a document and score their semantic similarity. The formula is

\[ \text{NewScore}(d) = f(\text{Query}, d) \]

where \(f\) is a trained deep learning model whose inputs are the query (\(\text{Query}\)) and the document (\(d\)) and whose output is a score representing their relevance. It should be noted that this score generally depends only on the question and the text content of the document, not on the document's score or relative position in the recall results, so the method is suitable for both single-path and multi-path recall. Through the selection and configuration of these two Reranker modules, a user can flexibly adjust the performance and quality of the retrieval system according to actual requirements.
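A sketch of a Cross-Encoder Reranker using the sentence-transformers CrossEncoder class; the model checkpoint name is an assumption:

```python
from sentence_transformers import CrossEncoder

# Checkpoint is illustrative; any trained cross-encoder fits NewScore(d) = f(Query, d).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    """Score (query, document) pairs directly, independent of recall rank."""
    scores = reranker.predict([(query, d) for d in docs])
    order = sorted(range(len(docs)), key=lambda i: -float(scores[i]))
    return [docs[i] for i in order[:top_k]]
```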
According to the technical scheme provided by the embodiment of the invention, the accuracy and adaptability of the system are significantly improved by combining traditional vector retrieval technology with a diversified text label mechanism. First, by introducing a multi-level text label system, the system can classify and label data more finely, covering not only explicit characteristics such as keywords and field identifiers but also implicit information such as permission levels and semantic categories. This multi-dimensional label mechanism enables the system to identify user intent more accurately when processing domain-specific queries and to quickly retrieve, from massive data, a candidate answer set highly relevant to the query. In addition, the diversified text labels allow more accurate retrieval and matching on a limited data set, alleviating to some extent the dependence on large amounts of high-quality training data. On this basis, the stability and extensibility of the question-answering system are improved, retrieval of internal data becomes faster and more accurate, and the system is easy to extend to different application scenarios.
Example 3
Fig. 5 is a schematic structural diagram of a multi-tag generation matching RAG intelligent question-answering device according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes a text data processing module 510, a vector database creation module 520, a question text processing module 530, a text to be applied determination module 540, and a question answer generation module 550, wherein,
The text data processing module 510 is configured to obtain original text data corresponding to a target service scene, and perform vectorization processing on the original text data to determine a text data set to be processed;
A vector database creation module 520, configured to determine a text tag set corresponding to the text data set to be processed, and create a text vector database based on the text tag set and the text data set to be processed;
A question text processing module 530, configured to determine a question text label and a question text vector corresponding to a question to be processed, in the case that the question to be processed is received;
a text to be applied determining module 540, configured to determine a text to be applied from the text vector database based on the question text to be processed, the question text label, and the question text vector;
and the question answer generating module 550 is configured to generate a target question answer corresponding to the to-be-processed question according to the to-be-applied text and a preset prompt word template.
On the basis of the above technical scheme, the vector database creation module is used for, when a preset text label set corresponding to the original text data exists, inputting the text data set to be processed and the preset text label set into a preset semantic retrieval model to obtain a candidate text label set, and inputting the candidate text label set and the text data set to be processed into a large language model to determine the text label set corresponding to the text data set to be processed.
On the basis of the above technical scheme, the vector database creation module is further used for, when no preset text label set corresponding to the original text data exists, constructing a tag extraction prompt word corresponding to the text data set to be processed, and inputting the tag extraction prompt word and the text data set to be processed into a large language model to determine the text label set corresponding to the text data set to be processed.
On the basis of the above technical scheme, the text to be applied determining module is used for: calculating, based on a preset text matching algorithm, a relevance score between the problem text to be processed and each text block in the text vector database, and determining, from the text vector database according to the relevance scores, a text block set to be processed corresponding to the problem text; screening out, from the text vector database based on the problem text label, text blocks to be verified that match the problem text label, calculating a relevance score between the problem text to be processed and each text block to be verified according to the preset text matching algorithm, and determining, from the text blocks to be verified based on the relevance scores, a text block set to be processed corresponding to the problem text; calculating, based on a vector search engine, the vector similarity between each text block in the text vector database and the problem text vector, and determining, from the text vector database according to the vector similarity, a text block set to be processed corresponding to the problem text vector; and determining the text to be applied based on the text block sets to be processed.
On the basis of the technical scheme, the text to be applied determining module is used for determining text block identifiers corresponding to all text blocks in the text block set to be processed, performing de-duplication processing on the text block set to be processed according to the text block identifiers, determining comprehensive scores of all text blocks in the text block set to be processed after de-duplication processing according to a comprehensive scoring algorithm, determining target text blocks from the text block set to be processed according to the comprehensive scores, and acquiring the text to be applied from the text vector database based on the text block identifiers of the target text blocks.
On the basis of the above technical scheme, the text data processing module is used for determining a text format corresponding to the original text data, extracting text content from the original text data according to the text format, dividing the text content according to a segmentation identifier to determine each text block corresponding to the text content, assigning a text block identifier to each text block, and determining the text data set to be processed based on each text block and the text block identifier assigned to it.
On the basis of the technical scheme, the question answer generation module is used for acquiring a preset prompt word template corresponding to a target service scene, generating an answer generation prompt word according to the preset prompt word template and the determined text to be applied, inputting the answer generation prompt word into a large language model, and generating a target question answer corresponding to the to-be-processed question.
According to the above technical scheme, original text data corresponding to a target service scene is obtained and vectorized to determine a text data set to be processed; a text label set corresponding to the text data set to be processed is determined, and a text vector database is built based on the text label set and the text data set to be processed. Further, when a problem to be processed is received, a problem text label and a problem text vector corresponding to the problem to be processed are determined; a text to be applied is determined from the text vector database based on the problem text, the problem text label, and the problem text vector; and a target problem answer corresponding to the problem to be processed is generated according to the determined text to be applied and a preset prompt word template. On this basis, the information related to the user's needs is quickly found by constructing a vector database of document text and label data, and is analyzed and summarized with a large language model so as to return an answer to the user's question; by customizing the data sources and data labels according to the service scenario and combining them with vector retrieval, the granularity and accuracy of the search results are improved.
The RAG intelligent question-answering device for generating and matching the multiple tags provided by the embodiment of the invention can execute the RAG intelligent question-answering method for generating and matching the multiple tags provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example 4
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, in which a computer program executable by the at least one processor is stored. The processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. Various programs and data required for the operation of the electronic device 10 may also be stored in the RAM 13. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including an input unit 16, such as a keyboard, mouse, etc., an output unit 17, such as various types of displays, speakers, etc., a storage unit 18, such as a magnetic disk, optical disk, etc., and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the multi-tag generation matching RAG intelligent question-answering method.
In some embodiments, the multi-tag matching RAG intelligent question-answering method can be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the multi-tag generation matching RAG intelligent question-answering method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the multi-tag generation matching RAG intelligent question-answering method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.