
CN119377369A - A multimodal RAG, device, equipment and storage medium based on a large model - Google Patents

A multimodal RAG, device, equipment and storage medium based on a large model

Info

Publication number
CN119377369A
CN119377369A (application CN202411662967.6A)
Authority
CN
China
Prior art keywords
recall
multimodal
index
rag
courseware
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411662967.6A
Other languages
Chinese (zh)
Inventor
杨修远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chengzhitong Information Technology Co ltd
Original Assignee
Shenzhen Chengzhitong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chengzhitong Information Technology Co ltd filed Critical Shenzhen Chengzhitong Information Technology Co ltd
Priority to CN202411662967.6A priority Critical patent/CN119377369A/en
Publication of CN119377369A publication Critical patent/CN119377369A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3347 Query execution using vector based model
    • G06F16/3349 Reuse of stored results of previous queries
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention discloses a multimodal RAG, device, equipment and storage medium based on a large model, including the following steps: preprocessing text information in a large-scale corpus using multimodal technology; constructing courseware content fragments according to the preprocessed text information and storing them in the created index; based on the preprocessed keywords in the questions issued by the user, recalling the courseware content fragments related to the questions in combination with the index to generate the first recall result; according to the result of the first recall, expanding the content according to the preset rules, and performing a second recall; sorting the results of the second recall in multiple dimensions; integrating the sorted results with the user questions, judging the relevance through the large model, and further screening the search results. Compared with the prior art, the present invention greatly increases the accuracy of the recalled content and improves the generation efficiency by introducing multimodal technology, a second recall strategy and the use of a large model.

Description

Multimodal RAG method, apparatus, device and storage medium based on a large model
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multimodal RAG method, apparatus, device and storage medium based on a large model.
Background
In the field of artificial intelligence, large-model RAG (retrieval-augmented generation) has become a research hotspot in recent years. It combines two key techniques, retrieval and generation, and has brought major progress to natural language processing tasks. RAG performs information retrieval over a large-scale corpus, supplying rich background knowledge and context to the generation process and thereby improving the accuracy and diversity of the generated results. RAG is widely applied in text generation, dialogue systems, question-answering systems, and related fields.
The main RAG workflow includes:
1. Preprocessing: the large-scale corpus is preprocessed, including word segmentation, stop-word removal, and vocabulary construction. These operations help extract the effective information in the text and lay the foundation for the subsequent retrieval and generation stages.
2. Retrieval: during generation, RAG retrieves the relevant text fragments from the corpus according to the current context. Retrieval is typically based on a similarity measure, such as cosine similarity. The retrieved results serve as reference and supplementary material for the generation process.
3. Generation: given the retrieval results, RAG uses a generative model (e.g., a Transformer or GPT) to produce new text. Generation jointly considers the current context, the retrieved fragments, and the model's own knowledge, yielding more accurate and varied text.
4. Post-processing: the generated text is finally post-processed, including deduplication and grammar correction, which improves the quality of the output.
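The workflow above can be made concrete with a minimal, illustrative Python sketch of stages 1 and 2, using bag-of-words cosine similarity (all names are hypothetical; a production system would use a learned embedding model for retrieval and a generative LLM for stages 3 and 4):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is"}

def preprocess(text):
    # Stage 1: tokenize and remove stop words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    # Stage 2: rank corpus fragments by similarity to the query.
    q = Counter(preprocess(query))
    ranked = sorted(corpus, key=lambda d: cosine(q, Counter(preprocess(d))), reverse=True)
    return ranked[:k]

corpus = [
    "RAG combines retrieval and generation for question answering",
    "Transformers are trained on large text corpora",
]
context = retrieve("how does retrieval augmented generation answer questions", corpus)
# Stage 3 would feed `context` plus the question to a generative model;
# stage 4 would post-process the generated text.
print(context[0])
```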
However, this workflow has two problems: (1) the preprocessing stage usually supports only simple text parsing and cannot fully and accurately extract the information in the source material; (2) a simple recall cannot accurately retrieve problem-relevant text fragments that span regions or documents. Together, the limitations of preprocessing and the shortcomings of the retrieval process degrade the whole RAG workflow, hurting both retrieval accuracy and generation efficiency.
Disclosure of Invention
The invention aims to provide a multimodal RAG method, apparatus, device and storage medium based on a large model, addressing the technical problems that the preprocessing limitations and retrieval deficiencies of the existing RAG workflow reduce retrieval accuracy and make generation inefficient.
In order to solve the technical problems, the aim of the invention is realized by the following technical scheme:
The first aspect of the present application provides a multimodal RAG method based on a large model, comprising the steps of:
preprocessing text information in a large-scale corpus using multimodal techniques;
constructing courseware content fragments from the preprocessed text information and storing them in the created index;
based on the preprocessed keywords in the question issued by the user, recalling courseware content fragments related to the question in combination with the index to generate a first recall result;
expanding the content of the first recall result according to preset rules and performing a second recall;
sorting the second-recall results along multiple dimensions;
integrating the sorted results with the user's question and judging relevance with the large model to further filter the search results.
In one possible implementation, the step of preprocessing the text information in the large-scale corpus using multimodal techniques includes:
using training courseware from different platforms as the corpus, the training courseware comprising documents in different formats as well as audio and video;
effectively extracting the text in the corpus, including by using a variety of libraries and tools.
In one possible implementation, the step of constructing courseware content fragments from the preprocessed text information and storing them in the created index includes:
defining the mapping of the index;
converting the courseware content fragments into vector representations.
In one possible implementation, the step of converting the courseware content fragments into a vector representation includes:
loading a pre-trained word embedding model and converting each word in the text into a corresponding vector representation;
storing the vector representation in the index using a vector field;
defining the type, dimension, and distance metric of the vector field.
In one possible implementation, the step of recalling, based on the preprocessed keywords in the question issued by the user and in combination with the index, the courseware content fragments related to the question to generate the first recall result includes:
preprocessing the keywords by removing stop words, stemming, judging intent, extracting keywords, and weighting them;
querying the processed keywords using the inverted index;
computing the similarity between the query-question vector and the courseware content fragment vectors;
recalling the courseware content fragments related to the question according to the similarity.
In one possible implementation, the step of expanding the content according to preset rules includes:
setting a forward expansion of N pages and a backward expansion of M pages;
adjusting the values of N and M according to actual business requirements and the characteristics of the courseware content.
In one possible implementation, the second-recall results are sorted along multiple dimensions, where the dimensions include intent recognition, word aggregation density, keyword weights, and fuzzy matching.
A second aspect of the present application provides a large-model-based multimodal RAG device, the device comprising:
a preprocessing unit for preprocessing text information in a large-scale corpus using multimodal techniques;
an index construction unit for constructing courseware content fragments from the preprocessed text information and storing them in the created index;
a first recall unit for recalling courseware content fragments related to the question, based on the preprocessed keywords in the question issued by the user and in combination with the index, to generate a first recall result;
a second recall unit for expanding the content according to preset rules based on the first recall result and performing a second recall;
a sorting unit for sorting the second-recall results along multiple dimensions;
a text generation unit for integrating the sorted results with the user's question, judging relevance with the large model, and further filtering the search results.
A third aspect of the application provides a computer device comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method as described in the first aspect above when executing the computer program.
A fourth aspect of the application provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement a method as described in the first aspect above.
Compared with the prior art, the invention provides a large-model-based multimodal RAG method, apparatus, device and storage medium that: preprocess text information in a large-scale corpus using multimodal techniques; construct courseware content fragments from the preprocessed text and store them in the created index; recall courseware content fragments related to the user's question, based on its preprocessed keywords and the index, to produce a first recall result; expand the content according to preset rules and perform a second recall; sort the second-recall results along multiple dimensions; and integrate the sorted results with the user's question, using the large model to judge relevance and further filter the search results. By introducing multimodal techniques, a second-recall strategy, and a large model, the invention greatly improves the accuracy of the recalled content and the efficiency of generation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 2 is a first schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 3 is a second schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 4 is a third schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 5 is a fourth schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 6 is a fifth schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 7 is a sixth schematic flowchart of the large-model-based multimodal RAG according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of the large-model-based multimodal RAG device according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the invention provides a large-model-based multimodal RAG method that can be applied in the application environment shown in FIG. 1, in which a client communicates with a server over a network. The method: preprocesses text information in a large-scale corpus at the client; constructs courseware content fragments from the preprocessed text and stores them in the created index; recalls courseware content fragments related to the user's question, based on its preprocessed keywords together with the index and a query algorithm, to generate a first recall result; expands the content according to preset rules based on the first recall result and performs a second recall; sorts the second-recall results along multiple dimensions; integrates the sorted results with the user's question and judges relevance with the large model to further filter the search results; and feeds the generated text back to the client. Clients may be, but are not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster. The invention is described in detail below with reference to specific embodiments.
The embodiment of the application provides a multi-mode RAG based on a large model.
Fig. 2 is a schematic flow chart of a multi-mode RAG based on a large model according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps S110 to S160.
S110, preprocessing text information in a large-scale corpus using multimodal techniques.
Specifically, the multimodal techniques in this embodiment combine image recognition, speech recognition, natural language processing, and related technologies to comprehensively process the text, image, and audio information in the corpus. Preprocessing includes word segmentation, stop-word removal, part-of-speech tagging, entity recognition, and semantic understanding, improving the efficiency and accuracy of subsequent processing.
S120, constructing courseware content fragments from the preprocessed text information and storing them in the created index.
Specifically, since ES is a distributed search engine, an index is typically broken into multiple parts, called shards, in order to handle large amounts of data and scale horizontally. Shards can be distributed across different nodes, enabling parallel processing and querying of the data. Each shard is an independent, self-contained index unit with its own inverted index, documents, mapping, and settings. In this embodiment, the preprocessed text information is divided into smaller units according to certain rules (such as paragraphs or chapters), and the courseware content is stored as fragments using the distributed storage and indexing capabilities of ES, creating the corresponding index.
Here ES refers to Elasticsearch, an open-source search engine built on the Lucene library that provides real-time search and analysis. Elasticsearch is inherently distributed and fast at retrieval, so it is often used to implement full-text search. In addition, Elasticsearch offers natural language processing (NLP) capabilities such as word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, and text summarization, which make it more intelligent and efficient at processing text information.
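As a sketch of this step, the following illustrates splitting preprocessed text into paragraph-level fragments and the shard settings for the index that stores them (the index name, field names, splitting rule, and shard counts are assumptions; the client calls need a running Elasticsearch cluster and are shown commented out):

```python
# Split preprocessed courseware text into paragraph-level fragments (rule assumed).
def make_fragments(text, courseware_id):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"courseware_id": courseware_id, "page": i + 1, "content": p}
        for i, p in enumerate(paragraphs)
    ]

# Illustrative shard/replica settings for the courseware index.
index_settings = {
    "settings": {
        "number_of_shards": 3,    # split the index into 3 shards for horizontal scaling
        "number_of_replicas": 1,  # one replica per shard for fault tolerance
    }
}

fragments = make_fragments("Intro to RAG.\n\nHow recall works.", "cw-001")
# With a running cluster, the index would be created and populated like this:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="courseware_fragments", body=index_settings)
# for frag in fragments:
#     es.index(index="courseware_fragments", document=frag)
print(len(fragments))  # → 2
```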
S130, based on the preprocessed keywords in the question issued by the user, recalling courseware content fragments related to the question in combination with the index to generate a first recall result.
The system preprocesses the question issued by the user (word segmentation, stop-word removal, and so on), then uses the ES query DSL (Domain Specific Language) to express the query conditions, combined with a KNN (K-Nearest Neighbors) query algorithm, to construct the query statement and recall the courseware content fragments related to the question. The KNN algorithm finds the K vectors closest to the query vector by vector distance. This step uses keyword queries against the ES index as the primary recall path, supplemented by KNN semantic queries to expand the set of relevant fragments recalled.
It should be understood that "recall" here refers to the process of retrieving or extracting, from stored data, the data items that match or relate to a query condition, according to a particular query condition or algorithm.
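A hybrid keyword-plus-KNN recall of this kind might be expressed as the following Elasticsearch search body (a sketch: the field names, query text, vector values, and parameter choices are illustrative assumptions; the top-level `knn` option requires Elasticsearch 8.x, and the actual search call needs a running cluster):

```python
# Hypothetical hybrid recall: keyword match as the primary path, kNN as a supplement.
query_vector = [0.12, -0.03, 0.88]  # would come from the embedding model

search_body = {
    "query": {                      # primary: keyword recall via the inverted index
        "match": {"content": "gradient descent"}
    },
    "knn": {                        # supplement: semantic recall over the vector field
        "field": "content_vector",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    "size": 10,
}
# With a running cluster:
# es.search(index="courseware_fragments", body=search_body)
print(search_body["knn"]["k"])
```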
S140, expanding the content according to preset rules and performing a second recall based on the first recall result.
Specifically, the expansion window of preceding and following pages is determined from the courseware id, the page number of the first-recall hit, and the actual business scenario (for example, questions that span documents or pages), and the second recall is then performed.
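The expansion rule can be made concrete as a small helper that, given a first-recall hit, returns the page window to fetch for the second recall (the function name, the N/M parameters, and the clamping behaviour are assumptions for illustration):

```python
def expand_pages(hit_page, n_before, m_after, total_pages):
    """Return the page range for secondary recall around a first-recall hit,
    clamped to the document's bounds (pages are 1-indexed)."""
    start = max(1, hit_page - n_before)
    end = min(total_pages, hit_page + m_after)
    return list(range(start, end + 1))

# A hit on page 3 of a 20-page courseware, expanded 2 pages back and 3 forward:
print(expand_pages(3, 2, 3, 20))  # → [1, 2, 3, 4, 5, 6]
```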
S150, sorting the second-recall results along multiple dimensions.
S160, integrating the sorted results with the user's question, judging relevance with the large model, and further filtering the search results.
In a more specific embodiment, as shown in FIG. 3, performing step S110 further includes performing steps S111 to S112.
S111, using training courseware from different platforms as the corpus, the training courseware comprising documents in different formats as well as audio and video.
S112, effectively extracting the text in the corpus, including by using a variety of libraries and tools.
Specifically, the training courseware in this embodiment is sourced from the bird-aware platform. When the training courseware is in docx format, the python-docx library is used to open the docx file and extract its text content. For descriptive charts inside the docx, OCR combined with OpenCV is used to extract the text in the images, which is then analyzed together with the plain-text content of the document.
When the training courseware is in pptx format, the python-pptx library is used to read the text content, slide layout, and other information in the presentation. For elements such as charts and pictures, the charts are converted into data using plotly tools and analyzed together with the slides' text descriptions. For audio and video elements in the presentation, metadata such as audio duration and video key frames are parsed and associated with the slide text. For example, a concept on a slide may be interpreted according to the audio content, and the emphasis of the text describing that concept determined from the video content.
When the training courseware is in pdf format, pypdf is used to parse the pdf's text content, and Tesseract OCR is used to process image characters in the pdf. Interactive elements in the pdf (hyperlinks, forms, etc.) are parsed and associated with the text content. The structure of the pdf is parsed as well, including page layout, chapter divisions, and the identification of titles and subtitles. This structural information is then analyzed in association with the audio and video to obtain more comprehensive document information.
When the training courseware is in xlsx format, the openpyxl library is used to read the data in the xlsx file. If the spreadsheet records data related to multimedia (such as video play counts, or the relation between audio duration and some ranking), the tabular data is analyzed in association with the actual audio/video files; for example, audio and video may be categorized or filtered according to data in the spreadsheet. Relationships among the data in the xlsx files are analyzed, for example by looking for duplicate data or computing correlations. If the tabular data is associated with other file formats (docx or ppt), they are combined using multimodal techniques; for example, charts in a ppt may be generated from data in an xlsx, with the consistency of the data relationships maintained during parsing.
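The per-format handling above amounts to a dispatch on file extension. A minimal sketch (the function name and dispatch table are hypothetical; the actual extraction with each library is omitted since it requires real courseware files):

```python
import os

# Hypothetical dispatch table mapping courseware file extensions to the
# extraction libraries named in this embodiment.
EXTRACTORS = {
    ".docx": "python-docx",   # text paragraphs; OpenCV + OCR for embedded charts
    ".pptx": "python-pptx",   # slide text and layout; plotly for chart data
    ".pdf":  "pypdf",         # page text; Tesseract OCR for image characters
    ".xlsx": "openpyxl",      # spreadsheet rows linked to audio/video metadata
}

def pick_extractor(path):
    """Return the extraction library to use for a given courseware file."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"unsupported courseware format: {ext}")

print(pick_extractor("lesson_03.pptx"))  # → python-pptx
```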
As shown in FIG. 4, in a more specific embodiment, performing step S120 further includes performing steps S121 to S122.
S121, defining the mapping of the index.
Specifically, when creating the index, its mapping must be defined, including field types, analyzers, lowercase conversion, and other settings. These determine how the documents stored in the index are indexed and queried; by setting these parameters sensibly, the storage structure of the index can be optimized, search speed improved, unnecessary resource waste reduced, and a more accurate and efficient search experience provided to the user.
The field type is the data type of a field, specifically text, keyword, number, date, or boolean; different field types affect how Elasticsearch stores, indexes, and queries the data. The analyzer (tokenizer) is the core component Elasticsearch uses to segment text, splitting a passage into keywords for subsequent search and matching. Lowercase conversion means Elasticsearch converts text to lowercase, enabling case-insensitive search.
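A sketch of such a mapping, combining field types, a custom analyzer, and a lowercase filter (the index, field, and analyzer names are assumptions for illustration; creating the index needs a running cluster, so the call is commented out):

```python
# Illustrative index mapping: field types, an analyzer, and lowercase conversion.
courseware_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                "courseware_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # split text into word tokens
                    "filter": ["lowercase"],  # case-insensitive matching
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "courseware_analyzer"},
            "courseware_id": {"type": "keyword"},  # exact-match identifier
            "page": {"type": "integer"},
            "created_at": {"type": "date"},
            "is_published": {"type": "boolean"},
        }
    },
}
# With a running cluster:
# es.indices.create(index="courseware_fragments", body=courseware_mapping)
print(sorted(courseware_mapping["mappings"]["properties"]))
```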
S122, converting the courseware content fragments into vector representations.
Specifically, this embodiment performs vector-based queries, so the courseware content fragments need to be converted to vector representations. This involves converting the text information into high-dimensional vectors using a text vectorization technique (e.g., TF-IDF, Word2Vec, or BERT).
As shown in FIG. 5, in a more specific embodiment, performing step S122 (converting the courseware content fragments into vector representations) further includes performing steps S1221 to S1223.
S1221, loading a pre-trained word embedding model and converting each word in the text into a corresponding vector representation.
S1222, storing the vector representation in ES using a vector field.
S1223, defining the type, dimension, and distance metric of the vector field.
Specifically, a suitable pre-trained word embedding model is first selected; candidates include Word2Vec, GloVe, and BERT, and the model's training data, dimensionality, and performance should be considered when choosing. The selected pre-trained model is loaded with an appropriate library (e.g., gensim or transformers), ensuring its weights and parameters are loaded correctly for the subsequent word-vector conversion. The courseware text is then segmented into individual lexical units, and each segmented word is converted into a corresponding vector representation with the loaded model. These vectors represent the position of the vocabulary in semantic space and can be used for subsequent similarity computation and retrieval. An index is created in Elasticsearch to store the vector representations of the courseware content; vector fields are configured for the queries, with parameters such as vector dimension and distance metric.
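The conversion of a fragment's words into a single stored vector can be sketched as follows. Here a tiny hand-written embedding table stands in for a real pre-trained model such as Word2Vec or GloVe, and the fragment vector is the average of its word vectors (the averaging scheme, table values, and field definition are illustrative assumptions):

```python
# Toy 3-dimensional word-embedding table standing in for a pre-trained model.
EMBEDDINGS = {
    "neural":  [1.0, 0.0, 0.0],
    "network": [0.0, 1.0, 0.0],
    "poetry":  [0.0, 0.0, 1.0],
}

def embed_fragment(tokens, dim=3):
    """Average the word vectors of a fragment's tokens (unknown words skipped)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# The resulting vector is stored in a vector field whose type, dimension, and
# distance metric must match the index mapping, e.g.:
vector_field = {"type": "dense_vector", "dims": 3, "similarity": "cosine"}

print(embed_fragment(["neural", "network"]))  # → [0.5, 0.5, 0.0]
```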
In a more specific embodiment, as shown in FIG. 6, performing step S130 further specifically includes performing steps S131-S134.
S131, preprocessing keywords by removing stop words, extracting stems, judging intention, extracting keywords and weighting the keywords;
Specifically, a "stop word" is a word that appears frequently in text but contributes little to its meaning, such as "the" or "of". In the keyword preprocessing stage, the stop words are removed first to reduce the complexity of subsequent processing and improve the purity of the keywords.
Stemming reduces words to a base form, for example mapping "runs" and "running" to the same stem "run". This allows words with different surface forms but the same meaning to be treated as the same keyword, improving keyword coverage.
Intent judgment determines the user's query intent by semantic analysis of the text, which helps the subsequent extraction of keywords that are more relevant to that intent.
Keyword extraction uses natural language processing techniques to extract keywords from the text after stop-word removal, stemming, and intent judgment. These keywords should accurately reflect the subject matter and content of the text.
Keyword weighting assigns different weights to keywords according to their importance in the text, computed from statistics such as occurrence frequency and position. Weighted keywords receive higher priority in subsequent processing.
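The preprocessing steps above can be sketched in a few lines. The stop-word list, the crude suffix-stripping stemmer, and the frequency-based weighting below are simplified stand-ins for a real pipeline, which would use a proper stemmer (e.g., Porter) and richer weighting signals such as position:

```python
# Minimal keyword preprocessing sketch; stop words, stemmer, and
# weighting scheme are illustrative simplifications.
STOP_WORDS = {"the", "of", "a", "is", "in", "and"}

def crude_stem(word):
    """Strip a few common English suffixes (a real system would use Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess_keywords(query):
    """Remove stop words, stem the remainder, and weight stems by frequency."""
    tokens = [w for w in query.lower().split() if w not in STOP_WORDS]
    stems = [crude_stem(w) for w in tokens]
    weights = {}
    for stem in stems:
        weights[stem] = weights.get(stem, 0) + 1
    total = sum(weights.values())
    return {stem: count / total for stem, count in weights.items()}
```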
S132, inquiring the processed keywords by using the inverted index;
In particular, an inverted index is a data structure that records which documents each keyword appears in, and where within those documents. The preprocessed keywords are looked up in the inverted index, which typically means searching the index for entries that exactly match the query keywords. For each matching keyword, the inverted index returns a list of documents (and possibly position information) containing that keyword. If the query contains multiple keywords, the per-keyword result sets must be combined, typically with Boolean logic (AND, OR, NOT). For example, for the query "machine learning AND deep learning", the system must find documents that contain both "machine learning" and "deep learning". The combined results are then sorted by some ranking criterion (e.g., relevance score, document quality, timestamp), and the sorted result set is returned to the user, usually with the document title, abstract, URL, and similar information.
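A minimal illustration of inverted-index construction and AND combination of posting lists; the toy documents and ids are made up for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and_query(index, keywords):
    """Combine per-keyword posting lists with AND semantics."""
    postings = [index.get(k, set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)
```

A query such as "machine AND learning" then returns only documents whose posting lists intersect; OR and NOT combinations would use set union and difference instead.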
S133, calculating the similarity between the query problem vector and the courseware content fragment vector;
Specifically, after the query question is converted into a vector representation, its similarity to the courseware content fragment vectors is calculated. This can be done by converting text to vectors with an embedding model (e.g., BERT or GPT) and applying a metric such as cosine similarity.
S134, recalling courseware content fragments related to the problems according to the similarity.
Specifically, the courseware content fragments with the highest calculated similarity are selected as the first recall result. These fragments have higher semantic relevance to the query question and are therefore more likely to provide useful information to the user.
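The similarity-and-recall steps S133-S134 can be sketched as follows. The 2-dimensional fragment vectors and the top-k cutoff are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_top_k(query_vec, fragment_vecs, k=2):
    """Return the ids of the k fragments most similar to the query vector."""
    scored = sorted(
        fragment_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [frag_id for frag_id, _ in scored[:k]]
```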
In a more specific embodiment, as shown in fig. 7, performing the content augmentation according to the preset rule in step S140 further specifically includes performing steps S141-S142.
S141, setting a forward expansion of N pages and a backward expansion of M pages;
S142, adjusting the values of N and M according to actual business requirements and the characteristics of the courseware content.
Specifically, by setting a rule that expands N pages forward and M pages backward, the system can provide the user with the pages before and after the current query hit. This helps the user better understand the context of the current page and improves the coherence and depth of learning. The values of N and M can also be adjusted according to actual business requirements and the characteristics of the courseware content, which means the system is flexible and can be customized for different scenarios, for example when queries need to span documents or when an answer spills across page boundaries.
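A possible sketch of the page-expansion rule follows. Note that the source does not fix whether "forward" means earlier or later pages; the convention below (forward = earlier pages, backward = later pages) is an assumption:

```python
def expand_pages(hit_page, n_forward, m_backward, total_pages):
    """Return the page range around a first-recall hit, clamped to the
    document bounds.

    Assumption: "forward N" expands toward earlier pages and "backward M"
    toward later pages; swap the two arguments for the opposite convention.
    """
    start = max(1, hit_page - n_forward)
    end = min(total_pages, hit_page + m_backward)
    return list(range(start, end + 1))
```

The expanded page list, together with the hit courseware id, would then feed the secondary recall query.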
In a more specific embodiment, the multiple dimensions used in step S150 include intent recognition, word aggregation density, keyword weight, and fuzzy matching degree.
Specifically, in search engines the initial ranking is typically based on factors such as the degree of match between documents and the query, field values, and custom scripts. However, to provide more accurate search results that match the user's expectations, a secondary ranking is often required to further optimize the order in which results are presented. In this embodiment, the Elasticsearch results are re-ranked using multiple dimensions: intent recognition, word aggregation density, the TF-IDF weights of keywords in specific fields, and fuzzy matching degree.
Intent recognition analyzes the user's query with an intent classification scheme (e.g., a three-way intent taxonomy such as Do-Know-Go). According to the intent analysis result, different weights or priorities are assigned to the search results. For example, pages from the target website may be ranked first for a navigational search, while for an informational search the ranking may be adjusted according to how specific the query is (direct, indirect, etc.).
Word aggregation density involves calculating, for each document in the search results, the frequency and distribution of particular words or phrases. The search results are then sorted by this density; in general, documents with a higher word aggregation density are more likely to contain the information the user wants.
Keyword weight refers to the TF-IDF weight of domain-specific keywords: domain-specific keywords are extracted, their TF-IDF weights in the search results are calculated, and the results are sorted by these weights. TF-IDF reflects the importance and distinctiveness of a keyword within a document, so higher-weighted documents are more likely to contain valuable information related to the user's query.
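A minimal TF-IDF computation over a toy corpus; the smoothing constant in the IDF term is one common choice and not necessarily what Elasticsearch uses internally:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF weight of a term in one document against a small corpus.

    tf  = term frequency within the document;
    idf = smoothed inverse document frequency over the corpus.
    """
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / (1 + df)) + 1  # smoothed idf variant
    return tf * idf
```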
Fuzzy matching uses Elasticsearch's fuzzy matching capabilities (e.g., the match query) to compute the degree of match between each search result and the query, and the results are sorted by this degree. Documents with a higher fuzzy match are more likely to satisfy the user's query even when they do not exactly match its literal terms.
In summary, the ranking results from the above dimensions are integrated into a comprehensive score, and the search results are finally sorted by this score.
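The final aggregation might be sketched as a weighted sum over the four dimensions; the dimension names and the weights below are illustrative assumptions, not values from the source:

```python
def composite_score(scores, weights):
    """Weighted sum of per-dimension scores (all assumed normalized to [0, 1])."""
    return sum(weights[d] * scores.get(d, 0.0) for d in weights)

def rerank(result_scores, weights):
    """Sort documents by their composite score, highest first."""
    return sorted(
        result_scores,
        key=lambda doc: composite_score(result_scores[doc], weights),
        reverse=True,
    )
```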
Compared with the prior art, the method and device of the present application solve the preprocessing and recall problems in the existing RAG pipeline and have the following beneficial effects:
1) Most existing systems in the industry support retrieval over plain text only. This scheme uses multimodal techniques to extract text comprehensively and accurately from data in various formats, so less text information is lost and the recalled content is more accurate and complete;
2) To address inaccurate Elasticsearch recall and easily missed information, a two-stage recall strategy is adopted. Related documents are first recalled comprehensively via keywords and vectors; then, using the course ids and page numbers hit in the first stage, the extent of forward and backward page expansion is determined according to actual business needs (e.g., the need to span documents or answers that spill across pages), and a secondary recall is performed. This improves recall accuracy and reduces missed information;
3) The relevance and accuracy of the recalled content are ensured by multi-dimensional scoring and by using the large model to re-rank the recalled documents.
Fig. 8 is a schematic block diagram of a large-model-based multi-modal RAG device according to an embodiment of the present application. As shown in fig. 8, the present application further provides a large-model-based multi-modal RAG device 200 corresponding to the above large-model-based multi-modal RAG. The large-model-based multi-modal RAG device 200 includes units for performing the large-model-based multi-modal RAG described above, and may be configured in a terminal such as a desktop computer, a tablet computer, or a laptop computer.
Specifically, referring to fig. 8, the large model-based multi-modal RAG device 200 includes:
The preprocessing unit 210 is configured to preprocess text information in a large-scale corpus by using a multi-modal technique.
The index construction unit 220 is configured to construct a courseware content fragment according to the preprocessed text information, and store the fragment in the created index.
The index construction unit 220 is further specifically configured to define a mapping of the index and convert the courseware content fragments into vector representations. Specifically, this includes loading a pre-trained word embedding model to convert each word in the text into a corresponding vector representation, storing the vector representation in the index using a vector field, and defining the type, dimension, and distance metric of the vector field.
The first recall unit 230 is configured to recall, based on the pre-processed keywords in the question issued by the user, the courseware content fragments related to the question in combination with the index to generate a first recall result.
The secondary recall unit 240 is configured to perform content expansion according to a preset rule and perform a secondary recall based on the first recall result. Specifically, this includes setting a forward expansion of N pages and a backward expansion of M pages, and adjusting the values of N and M according to actual business requirements and the characteristics of the courseware content.
The sorting unit 250 is configured to sort the secondary recall result in a multi-dimension manner, where the multi-dimension manner includes intent recognition, word aggregation density, keyword weight, and fuzzy matching degree.
The text generating unit 260 is configured to integrate the ranked results with the user questions, perform relevance judgment through the large model, and further screen the search results.
Furthermore, the preprocessing unit 210 is specifically configured to use training courseware from different platforms as a corpus, where the training courseware includes documents in different formats and audio/video, and effectively extract text in the corpus, where the extraction method includes extracting using multiple different libraries and tools.
The first recall unit 230 is specifically configured to perform keyword preprocessing by removing stop words, stem extraction, intention judgment, keyword extraction and keyword weighting, query the processed keywords by using inverted indexes, calculate the similarity between a query problem vector and a courseware content slicing vector, and recall the courseware content slicing related to the problem according to the similarity.
It should be noted that, as those skilled in the art can clearly understand, the specific implementation process of the foregoing large-model-based multi-modal RAG device and each unit may refer to the corresponding description in the foregoing method embodiment; for convenience and brevity of description, it is not repeated here.
The above described large model based multi-modal RAG device 200 may be implemented in the form of a computer program which may be run on a computer apparatus as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 300 may be a terminal or a server, where the terminal may be an electronic device having a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster formed by a plurality of servers.
With reference to FIG. 9, the computer device 300 includes a processor 302, a memory, and a network interface 305, which are connected by a system bus 301, wherein the memory may include a non-volatile storage medium 303 and an internal memory 304.
The non-volatile storage medium 303 may store an operating system 3031 and a computer program 3032. The computer program 3032 includes program instructions that, when executed, cause the processor 302 to perform a large model-based multi-modal RAG.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall computer device 300.
The internal memory 304 provides an environment for the execution of a computer program 3032 in the non-volatile storage medium 303, which computer program 3032, when executed by the processor 302, causes the processor 302 to execute a large model-based multi-modal RAG.
The network interface 305 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 9 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 300 to which the present inventive arrangements may be applied, and that a particular computer device 300 may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
It should be appreciated that in embodiments of the present application, the processor 302 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program includes program instructions. The program instructions, when executed by the processor, cause the processor to perform the steps of:
S110, preprocessing text information in a large-scale corpus using multimodal techniques;
S120, constructing courseware content fragments from the preprocessed text information and storing them in a created index;
S130, based on preprocessed keywords in a question issued by a user, recalling courseware content fragments related to the question in combination with the index to generate a first recall result;
S140, expanding the content according to preset rules and performing a secondary recall based on the first recall result;
S150, sorting the secondary recall results in multiple dimensions;
S160, integrating the sorted results with the user question, judging relevance through the large model, and further filtering the search results.
The storage medium may be a U-disk, a removable hard disk, a Read-only memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media capable of storing program codes.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Any third-party software tools or components appearing in the embodiments of the present application are presented by way of example only and are not representative of actual use.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application.
The present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present application, and these modifications and substitutions are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A large-model-based multimodal RAG method, characterized by comprising the following steps:
preprocessing text information in a large-scale corpus using multimodal techniques;
constructing courseware content fragments from the preprocessed text information and storing them in a created index;
based on preprocessed keywords in a question issued by a user, recalling courseware content fragments related to the question in combination with the index to generate a first recall result;
expanding the content according to preset rules based on the first recall result and performing a secondary recall;
sorting the secondary recall results in multiple dimensions;
integrating the sorted results with the user question, judging relevance through a large model, and further filtering the search results.
2. The large-model-based multimodal RAG method according to claim 1, characterized in that the step of preprocessing text information in a large-scale corpus using multimodal techniques comprises:
using training courseware from different platforms as the corpus, the training courseware including documents in different formats as well as audio and video;
effectively extracting the text in the corpus, the extraction including the use of a variety of different libraries and tools.
3. The large-model-based multimodal RAG method according to claim 1, characterized in that the step of constructing courseware content fragments from the preprocessed text information and storing them in the created index comprises:
defining a mapping of the index;
converting the courseware content fragments into vector representations.
4. The large-model-based multimodal RAG method according to claim 3, characterized in that the step of converting the courseware content fragments into vector representations comprises:
loading a pre-trained word embedding model and converting each word in the text into a corresponding vector representation;
storing the vector representation in the index using a vector field;
defining the type, dimension, and distance metric of the vector field.
5. The large-model-based multimodal RAG method according to claim 1, characterized in that the step of recalling, based on preprocessed keywords in the question issued by the user, courseware content fragments related to the question in combination with the index to generate a first recall result comprises:
preprocessing the keywords by removing stop words, stemming, intent judgment, keyword extraction, and keyword weighting;
querying the processed keywords using an inverted index;
calculating the similarity between the query question vector and the courseware content fragment vectors;
recalling courseware content fragments related to the question according to the similarity.
6. The large-model-based multimodal RAG method according to claim 1, characterized in that expanding the content according to preset rules comprises:
setting a forward expansion of N pages and a backward expansion of M pages;
adjusting the values of N and M according to actual business requirements and the characteristics of the courseware content.
7. The large-model-based multimodal RAG method according to claim 1, characterized in that the results of the secondary recall are sorted in multiple dimensions, wherein the dimensions include: intent recognition, word aggregation density, keyword weight, and fuzzy matching degree.
8. A large-model-based multimodal RAG device, characterized in that the device comprises:
a preprocessing unit for preprocessing text information in a large-scale corpus using multimodal techniques;
an index construction unit for constructing courseware content fragments from the preprocessed text information and storing them in a created index;
a first recall unit for recalling, based on preprocessed keywords in a question issued by a user, courseware content fragments related to the question in combination with the index to generate a first recall result;
a secondary recall unit for expanding the content according to preset rules based on the first recall result and performing a secondary recall;
a sorting unit for sorting the secondary recall results in multiple dimensions;
a text generation unit for integrating the sorted results with the user question, judging relevance through a large model, and further filtering the search results.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the method according to any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the method according to any one of claims 1-7.
CN202411662967.6A 2024-11-20 2024-11-20 A multimodal RAG, device, equipment and storage medium based on a large model Pending CN119377369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411662967.6A CN119377369A (en) 2024-11-20 2024-11-20 A multimodal RAG, device, equipment and storage medium based on a large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411662967.6A CN119377369A (en) 2024-11-20 2024-11-20 A multimodal RAG, device, equipment and storage medium based on a large model

Publications (1)

Publication Number Publication Date
CN119377369A true CN119377369A (en) 2025-01-28

Family

ID=94328808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411662967.6A Pending CN119377369A (en) 2024-11-20 2024-11-20 A multimodal RAG, device, equipment and storage medium based on a large model

Country Status (1)

Country Link
CN (1) CN119377369A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120407516A (en) * 2025-07-01 2025-08-01 济南浪潮数据技术有限公司 Search optimization method, device, equipment, medium and product in search system


Similar Documents

Publication Publication Date Title
JP7635940B2 (en) Method, computer device, and computer program for providing domain-specific dialogue utilizing language models
US20210374168A1 (en) Semantic cluster formation in deep learning intelligent assistants
US8341112B2 (en) Annotation by search
CN111858912A (en) Abstract generation method based on single long text
CN112307182B (en) An Extended Query Method for Pseudo-Relevant Feedback Based on Question Answering System
CN101968819B (en) Audio and video intelligent cataloging information acquisition method facing wide area network
CN110442777A (en) Pseudo-linear filter model information search method and system based on BERT
CN112948562A (en) Question and answer processing method and device, computer equipment and readable storage medium
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN112905768A (en) Data interaction method, device and storage medium
CN115017903B (en) Method and system for extracting key phrases from document hierarchy combined with global and local information
CN119557500B (en) A method and system for accurate search of Internet massive data based on AI technology
CN118964524A (en) An intelligent search engine system based on Simbert algorithm
CN115794995A (en) Target answer obtaining method and related device, electronic equipment and storage medium
CN118747293A (en) Document writing intelligent recall method and device and document generation method and device
CN115344668A (en) A multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN118964533A (en) Retrieval enhancement generation method and system supporting multi-language knowledge base
CN119377369A (en) A multimodal RAG, device, equipment and storage medium based on a large model
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
KR20230119398A (en) Video editing automation system
CN119441507A (en) A method for constructing RAG knowledge base based on layout analysis and query generation
CN117235208B (en) Query expansion method and system based on visual semantic similarity
CN119961436B (en) A text data enhancement method to improve vector retrieval performance
US12475152B1 (en) Insight-based research synthesis and processing platform
CN119988595B (en) Content search method, device, equipment, and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 1601 Qianhai Free Trade Building, 3048 Xinghai Avenue, Nanshan Street, Qianhai Shenzhen Hong Kong Cooperation Zone, Shenzhen, Guangdong Province 518000

Applicant after: Shenzhen Zhiniao Education Technology Co.,Ltd.

Address before: 1601 Qianhai Free Trade Building, 3048 Xinghai Avenue, Nanshan Street, Qianhai Shenzhen Hong Kong Cooperation Zone, Shenzhen, Guangdong Province 518000

Applicant before: Shenzhen Chengzhitong Information Technology Co.,Ltd.

Country or region before: China

CB02 Change of applicant information