WO2025092056A1 - Question-and-answer data generation method and apparatus, and computer device and storage medium - Google Patents
Question-and-answer data generation method and apparatus, and computer device and storage medium
- Publication number
- WO2025092056A1 (PCT/CN2024/107859)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- answer
- question
- target
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Definitions
- the present disclosure relates to the field of data processing, and in particular to a method, device, computer equipment and storage medium for generating question and answer data.
- the present disclosure provides a method, apparatus, computer device and storage medium for generating question and answer data to solve the problems of low efficiency and poor data quality when performing QA data mining on documents in related technologies.
- the present disclosure provides a method for generating question-answer data, the method comprising:
- acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data;
- obtaining target question data corresponding to the cut document block according to the text data in the cut document block and the target question model;
- acquiring full data, wherein the full data is content information associated with the target question data contained in a complete document composed of a plurality of cut document blocks;
- obtaining target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model;
- before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
- the initial question model is optimized according to the candidate questions to obtain the target question model.
- determining a questioning strategy based on text data in the cut document block includes:
- before obtaining target answer data corresponding to the target question data according to the segmented document blocks, the target question data, the full data, and the target answer model, the method further includes:
- the initial answer model is optimized according to the candidate answer data to obtain the target answer model.
- determining an answer strategy based on text data and target question data in the segmented document block includes:
- target answer data corresponding to the target question data is obtained according to the segmented document blocks, the target question data, the full amount of data, and the target answer model, including:
- according to the segmented document blocks, the target question data, the full amount of data and the target answer model, multiple answer data are obtained;
- the method further includes:
- the question and answer data is scored using the question and answer data scoring model to obtain a third score for the question and answer data;
- the question and answer data is cleaned according to the third score to obtain target question and answer data that meets the preset requirements;
- the data sample is expanded according to the target question and answer data to obtain a third preset number of question and answer data.
- the cut document block includes text data after image recognition.
- the present disclosure provides a device for generating question-answer data, the device comprising:
- a first acquisition module is used to acquire a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data;
- the first obtaining module is used to obtain the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model;
- a second acquisition module is used to acquire full data, wherein the full data is content information associated with the target question data contained in a complete document composed of a plurality of cut document blocks;
- the second obtaining module is used to obtain target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full amount of data and the target answer model;
- the third module is used to generate question-answer data based on the target question data and the target answer data.
- the present disclosure provides a computer device, comprising: a memory and a processor, the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the generation of question and answer data of the first aspect or any corresponding embodiment thereof.
- the present disclosure provides a computer-readable storage medium having computer instructions stored thereon, the computer instructions being used to enable a computer to execute the method for generating question and answer data of the first aspect or any corresponding embodiment thereof.
- FIG. 1 is a flow chart of a method for generating question and answer data according to some embodiments of the present disclosure.
- FIG. 2 is a schematic diagram of the complete flow of a method for generating question and answer data according to some embodiments of the present disclosure.
- FIG. 3 is a structural block diagram of a device for generating question and answer data according to some embodiments of the present disclosure.
- FIG. 4 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present disclosure.
- the present disclosure provides an embodiment of a method for generating question and answer data. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from that shown here.
- FIG. 1 is a flow chart of a method for generating question and answer data according to some embodiments of the present disclosure. As shown in FIG. 1 , the method can be applied to a server side. The method flow includes the following steps:
- Step S101: obtaining a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data.
- the server side obtains the document from which question and answer (QA) data is to be extracted (or mined) and cuts it, for example into 10 equal parts, to obtain a plurality of cut document blocks. With equal cutting, each cut document block contains the same first preset amount of text data; for example, if the document contains 100,000 words of text data in total, each of the 10 cut document blocks contains 10,000 words of text data.
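The equal-proportion cutting described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name `cut_document` is hypothetical, and the sketch counts characters rather than words for simplicity.

```python
def cut_document(text: str, num_blocks: int = 10) -> list[str]:
    """Cut text into at most num_blocks blocks of equal length (assumed unit: characters)."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    # Ceiling division: every block has block_size characters except possibly the last.
    block_size = -(-len(text) // num_blocks)
    if block_size == 0:
        return []
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


document = "x" * 100_000  # stand-in for a document with 100,000 units of text
blocks = cut_document(document, num_blocks=10)
print(len(blocks), len(blocks[0]))
```

When the document length is not divisible by the number of blocks, the last block is shorter than the others; a real system might instead cut on sentence or paragraph boundaries.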
- Step S102: obtaining target question data corresponding to the cut document block according to the text data in the cut document block and the target question model.
- the text data in each cut document block is input into a trained target question model to obtain target question data corresponding to the cut document block.
- Step S103: obtaining full data, wherein the full data is content information associated with the target question data contained in a complete document composed of a plurality of cut document blocks.
- Context information can also be encoded by using sentence or paragraph representation methods. This approach can help the model capture long-range context information and incorporate it into the processing flow when making predictions.
- the answer to a question in a document may depend on the content of another document.
- full data is needed.
- the full data is stored in a vector database (a complete document database composed of multiple cut document blocks).
- after the target question data is vectorized, it is matched against the data in the vector database.
- vector similarity is used to filter out the content information most relevant to the target question data, for example 10 data fragments. The target question data and the 10 related data fragments are then passed to the model that selects and applies the answer strategy. After the answer strategy is applied, the corresponding target answer data is obtained.
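The similarity-based retrieval of full data can be sketched as follows. This is an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, an in-memory list stands in for the vector database, and the names `embed`, `cosine` and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption, not a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, fragments: list[str], top_k: int = 10) -> list[str]:
    """Return the top_k fragments most similar to the question."""
    q = embed(question)
    ranked = sorted(fragments, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:top_k]


store = [
    "the warranty period is two years",
    "the device supports wireless charging",
    "warranty period claims require a receipt",
]
top = retrieve("how long is the warranty period", store, top_k=2)
print(top)
```

In practice the fragments would be embedded once and stored in the vector database, so that only the question is vectorized at query time.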
- the full amount of data can be determined in the following ways: 1. Through the Embedding model + vector database, information can be searched and extracted across documents. 2. Document links: If there are clear links between documents, documents can be captured and processed in sequence according to the links to form a contextual understanding. 3. Compound document modeling: Establish a multi-input model and input multiple related documents at the same time for understanding and reasoning. 4. Answer extraction and synthesis: When the answer span is large, answer extraction will be performed first, all possible answer fragments will be extracted, and then answer synthesis will be performed to splice these fragments in an appropriate way. In the extraction stage, multiple answer fragments can be set and each answer fragment can be given a score.
- the answer with the highest score can be selected, or the answers with higher scores can be reasonably spliced.
- Paragraph-level processing: when the answer may span multiple paragraphs, the paragraph is used as the processing unit, and the information of multiple paragraphs is considered at the same time to generate the answer. Documents can be cut using sliding windows or other methods to capture a wider range of context.
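A sliding window over paragraphs can be sketched as below. The window size, the stride and the function name `sliding_windows` are assumptions chosen for illustration; the source does not fix these values.

```python
def sliding_windows(paragraphs: list[str], window: int = 3, stride: int = 1) -> list[list[str]]:
    """Group paragraphs into overlapping windows so each window carries neighboring context."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not paragraphs:
        return []
    if len(paragraphs) <= window:
        return [list(paragraphs)]
    starts = list(range(0, len(paragraphs) - window + 1, stride))
    windows = [paragraphs[s:s + window] for s in starts]
    # If the stride skipped the tail, add a final window that ends at the last paragraph.
    if starts[-1] + window < len(paragraphs):
        windows.append(paragraphs[-window:])
    return windows


paras = ["p0", "p1", "p2", "p3", "p4", "p5"]
print(sliding_windows(paras, window=3, stride=2))
```

A stride smaller than the window makes consecutive windows overlap, which is what lets an answer that spans a paragraph boundary appear whole in at least one window.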
- Step S104: obtaining target answer data corresponding to the target question data according to the segmented document blocks, the target question data, the full data and the target answer model.
- the embodiment of the present disclosure needs to consider the text data of each cut document block, the target question data output by the target question model, and all the full data associated with the target question data, and then input these data that need to be considered into the trained target answer model to obtain the target answer data corresponding to the target question data.
- Step S105: generating question and answer data according to the target question data and the target answer data.
- each cut document block corresponds to a target question data and a target answer data.
- the question-answer data consisting of a target question data and a target answer data is the final question-answer data mined from each cut document block.
- the final question and answer data mined from all the cut document blocks can be used to form a QA library for storing these question and answer data.
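Steps S102 to S105 can be summarized in a short sketch. The three callables below are hypothetical stand-ins for the trained target question model, the full-data retrieval step and the trained target answer model; only the control flow is taken from the description.

```python
from typing import Callable


def generate_qa(
    blocks: list[str],
    question_model: Callable[[str], str],
    retrieve_full_data: Callable[[str], list[str]],
    answer_model: Callable[[str, str, list[str]], str],
) -> list[dict[str, str]]:
    """Produce one question-answer pair per cut document block."""
    qa_library = []
    for block in blocks:
        question = question_model(block)                   # S102: target question data
        full_data = retrieve_full_data(question)           # S103: associated full data
        answer = answer_model(block, question, full_data)  # S104: target answer data
        qa_library.append({"question": question, "answer": answer})  # S105
    return qa_library


# Stub models used only to exercise the control flow (assumptions, not real models).
qa_library = generate_qa(
    ["block one", "block two"],
    question_model=lambda block: f"What does '{block}' describe?",
    retrieve_full_data=lambda question: ["related fragment"],
    answer_model=lambda block, question, full: f"{block} (with {len(full)} related fragment)",
)
print(qa_library)
```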
- before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
- the initial question model is optimized according to the candidate questions to obtain the target question model.
- the initial question model needs to be continuously optimized to obtain the final target question model.
- the QA structured data needs to be screened.
- the text suitable for extracting QA structured data has the following characteristics:
- the question strategy of each cut document block is determined, so as to extract multiple pieces of question data (hereinafter also referred to as "questions") from each cut document block.
- the present embodiment sets up a question scoring model in advance to score the multiple questions obtained.
- the scoring criteria of the question scoring model are designed as follows: 1. Difficulty of the question: select challenging questions. 2. Diversity of questions: ensure that the questions come from different fields and knowledge points. 3. Precision of question: measures whether the question accurately describes a specific concept or fact. 4. Quality of question presentation: ensure that the question has good grammar and clarity. 5. Clarity of question: whether the question is clear and easy to understand. 6. Relevance of question: whether the question is closely related to the given topic or data set. 7. Information content of question: whether the question involves meaningful information. 8. Complexity of question: whether the question is difficult enough that it cannot be answered by a simple search.
- Questions are input into the question scoring model to obtain a score value for each question. Questions are sorted according to the score value to select high-quality candidate questions. It is understandable that there may be multiple high-quality candidate questions, in which case multiple cycles are required, and a scoring threshold is set at the same time, and only questions above the threshold are retained. In addition, the scoring threshold can be adjusted according to the needs of the embodiment of the present disclosure to obtain questions of different quantities and qualities.
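The score, sort and threshold step can be sketched as follows. The scores in `toy_scores` are hypothetical values standing in for the output of a trained question scoring model, and `select_candidate_questions` is an illustrative name.

```python
from typing import Callable


def select_candidate_questions(
    questions: list[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> list[tuple[str, float]]:
    """Score each question, keep those above the threshold, and sort best first."""
    scored = [(q, score_fn(q)) for q in questions]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores; a real system would call the question scoring model here.
toy_scores = {
    "What is it?": 0.2,
    "Which standard defines the test procedure?": 0.7,
    "Why does the valve close when pressure drops?": 0.9,
}
candidates = select_candidate_questions(list(toy_scores), toy_scores.__getitem__, threshold=0.5)
print(candidates)
```

Raising the threshold yields fewer, higher-scoring questions; lowering it yields more questions of more varied quality.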
- the retained questions are optimized, including modifying the question wording and adding key information.
- the selected and optimized candidate questions are used as training samples and input into the initial question model for training.
- the initial question model is continuously optimized to improve the question quality.
- the system will generate a high-quality question set and finally obtain the target question model.
- the embodiment of the present disclosure also proposes some optimization directions:
- Multi-factor scoring: In the design of the question scoring model, in addition to the scoring criteria mentioned above, multiple factors can also be combined to score the questions. For example, the NLP model can be used to perform semantic analysis on the questions, and the relevance, information content, and complexity of the questions can be combined to give a comprehensive score.
- Adaptive threshold: In the question screening stage, an adaptive threshold strategy will be used to dynamically adjust the scoring threshold. For example, the scoring threshold is dynamically adjusted based on factors such as the number of questions, question difficulty, and question category. This helps to more accurately screen out high-quality questions and improve the quality of the question library.
- Balanced sample weights: In the return training phase, try to use the method of balanced sample weights to avoid model overfitting. Different weights are assigned to questions based on factors such as the importance and difficulty of the questions, balancing the impact of different types of questions in the training process.
- Cross-validation: During the question collection and preprocessing phase, cross-validation is used to increase the question pool. For example, when dealing with different datasets on the same topic, a subset of questions can be extracted from each dataset and then scored and filtered. This approach can provide a wider set of questions, which helps improve the quality of the question bank.
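One possible form of the adaptive threshold mentioned above is sketched here. It is only an assumed instance: the threshold is derived from the score distribution so that roughly a fixed fraction of questions is retained, with a minimum quality floor. The source names other factors (number of questions, difficulty, category) without specifying a formula, and `adaptive_threshold`, `keep_fraction` and `floor` are hypothetical names.

```python
def adaptive_threshold(scores: list[float], keep_fraction: float = 0.3, floor: float = 0.0) -> float:
    """Return a threshold such that about keep_fraction of scores are >= it, never below floor."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not scores:
        return floor
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return max(floor, ranked[keep - 1])


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
threshold = adaptive_threshold(scores, keep_fraction=0.3)
retained = [s for s in scores if s >= threshold]
print(threshold, retained)
```

Because the threshold follows the distribution, a batch of uniformly weak questions would still pass about 30 percent of them unless `floor` is set; the floor is what guards absolute quality.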
- high-quality questions are screened out through the question scoring model, so that the initial question model can continuously adapt to the mining strategy and style of a specific enterprise or industry, and improve the accuracy of the trained target question model.
- relevant strategies and algorithms are used for dynamic optimization, so as to learn and improve in the continuous generation of high-quality and authentic synthetic data.
- determining a questioning strategy based on text data in the segmented document block includes:
- the corresponding questioning strategy is determined, as shown in Table 2 (wherein Table 2 includes both the question extraction strategy and the answer extraction strategy corresponding to the question):
- Table 2 only shows the questioning strategies and answer extraction strategies for educational materials, such as textbooks, handouts, tutorials and other text types. For other text types, such as technical documents (product manuals, API documents, development guides), research papers (academic papers, industry reports, etc.), regulations and policy documents, and news reports and articles, the corresponding questioning strategies and answer extraction strategies are likewise obtained based on the current text type.
- the details are not repeated in the embodiments of the present disclosure.
- Dynamically adjust the questioning strategy: Dynamically adjust the questioning strategy based on the quality, domain, and background knowledge of the text. For example, adjust the threshold for keyword extraction, use different entity types and relationship types, etc.
- Adaptive strategy adjustment: When processing a large amount of text, the model may need to automatically adapt to the characteristics and difficulty of different texts. To this end, develop adaptive questioning strategies, such as dynamically adjusting the questioning model according to the complexity and domain of the text to achieve higher quality question mining.
- User interaction: In order to improve the effectiveness and flexibility of the questioning strategy, user interaction is introduced to enable users to participate in the adjustment of the questioning strategy, provide feedback and receive correction suggestions. For example, after the question is generated, users are asked to evaluate it and the questioning strategy is adjusted according to user feedback.
- questioning strategies, answering strategies, etc. are flexibly selected and adjusted according to different texts and industry requirements to make them more suitable for actual application scenarios.
- before obtaining target answer data corresponding to the target question data according to the segmented document blocks, the target question data, the full data, and the target answer model, the method further includes:
- the initial answer model is optimized according to the candidate answer data to obtain the target answer model.
- the answer strategy can be determined (see Table 2), and then multiple answer data can be obtained. Collecting these answer data requires a series of preprocessing of the answers, including cleaning irrelevant words, correcting spelling errors, etc., to optimize the expression of the answers.
- the goal of the answer scoring model is to sort the quality of multiple answer data according to specific criteria.
- the scoring criteria include:
- Answer accuracy: evaluate whether the content of the answer is accurate and how well it matches the question.
- Answer completeness: evaluate whether the answer has obtained complete information and whether it can fully answer the question.
- Answer expression quality: check the language expression of the answer, including grammatical accuracy, comprehensibility, etc.
- the second scores of multiple answers are sorted, a threshold is set, and multiple candidate answer data with scores exceeding the threshold are retained.
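A sketch of how the three criteria might be combined into a single second score and thresholded is given below. The weights, the 0 to 1 score range and all names (`AnswerScores`, `WEIGHTS`, `second_score`, `retain_candidates`) are assumptions for illustration; the source does not specify how the criteria are combined.

```python
from dataclasses import dataclass


@dataclass
class AnswerScores:
    accuracy: float      # how well the answer matches the question, in [0, 1]
    completeness: float  # whether the answer fully covers the question, in [0, 1]
    expression: float    # grammar and comprehensibility, in [0, 1]


# Assumed weights; a real answer scoring model would learn or tune these.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "expression": 0.2}


def second_score(s: AnswerScores) -> float:
    """Weighted sum of the three criteria."""
    return (
        WEIGHTS["accuracy"] * s.accuracy
        + WEIGHTS["completeness"] * s.completeness
        + WEIGHTS["expression"] * s.expression
    )


def retain_candidates(answers: dict[str, AnswerScores], threshold: float) -> list[str]:
    """Sort answers by second score and keep those exceeding the threshold."""
    ranked = sorted(answers, key=lambda a: second_score(answers[a]), reverse=True)
    return [a for a in ranked if second_score(answers[a]) > threshold]


answers = {
    "short answer": AnswerScores(0.6, 0.3, 0.9),
    "thorough answer": AnswerScores(0.9, 0.9, 0.8),
    "wrong answer": AnswerScores(0.1, 0.5, 0.9),
}
print(retain_candidates(answers, threshold=0.5))
```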
- the multiple candidate answer data after screening and optimization are used as training samples and input into the initial answer model for further training, cyclic iteration, and optimization of the initial answer model performance.
- the system will continue to generate high-quality answer sets to prepare for the construction of the QA library and subsequent processes, and then obtain the final trained target answer model.
- the target answer data obtained by the target answer model should be accurate, complete and clearly expressed.
- Multi-task learning: During the training process, a multi-task learning strategy is used to optimize the accuracy, completeness, and presentation quality of the answer at the same time. This is achieved by sharing hidden layer parameters and using soft parameter sharing between tasks.
- Real-time fine-tuning: For real-time application scenarios, a real-time fine-tuning strategy is adopted, that is, the answer scoring model is updated in real time according to the newly generated answers. This will enable the model to adapt to the changing data distribution and improve the quality of the answers.
- iterative optimization and the use of advanced machine learning methods can lay a solid foundation for building a high-quality question-answer library.
- the questioning strategy, answering strategy, etc. can be flexibly selected and adjusted to make them more suitable for actual application scenarios.
- determining an answer strategy based on text data and target question data in the segmented document block includes:
- an answer strategy can be determined based on the text type and the target question data extraction strategy. Then, according to the answer strategy, multiple answers related to the target question data can be obtained, and these answers are scored to obtain a high-quality answer set consisting of multiple candidate answer data. The initial answer model is then continuously iterated and optimized to obtain a trained target answer model. It should be noted that the target answer model should output only one target answer data for each target question data.
- obtaining target answer data corresponding to the target question data according to the segmented document blocks, the target question data, the full amount of data, and the target answer model includes:
- according to the segmented document blocks, the target question data, the full amount of data and the target answer model, multiple answer data are obtained;
- n answers may be generated. These answers may be generated by different initial answer models, or by the same initial answer model under slightly different input conditions. Break the screened answers into smaller information units (i.e. target fields). These units may be sentences or phrases describing specific facts. Compare each two information units, and if they describe the same facts or details, only keep the answer data with the higher score. If they describe different facts or details, both are retained.
- the selected information units are combined to form a new answer.
- auxiliary strategies are used to determine the order of the information units, which include but are not limited to their order in the original answer, or according to a certain logical order.
- the reconstructed answer is post-processed, including checking grammar, adjusting word order, and correcting spelling errors, etc., to obtain the final target answer data.
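The splitting, pairwise comparison and recombination of information units can be sketched as follows. Several choices here are assumptions: units are sentences, "same fact" is approximated by word-set Jaccard similarity with a hypothetical threshold of 0.6, and units are ordered by first appearance. Grammar and spelling post-processing is omitted.

```python
import re


def split_units(answer: str) -> list[str]:
    """Split an answer into sentence-level information units."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", answer.strip()) if u.strip()]


def same_fact(a: str, b: str, threshold: float = 0.6) -> bool:
    """Approximate 'describes the same fact' by Jaccard similarity of word sets."""
    wa, wb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    union = wa | wb
    return bool(union) and len(wa & wb) / len(union) >= threshold


def fuse_answers(scored_answers: list[tuple[str, float]]) -> str:
    """Merge answers: for units stating the same fact keep the higher-scored one."""
    kept: list[tuple[str, float]] = []
    for answer, score in scored_answers:
        for unit in split_units(answer):
            for i, (other, other_score) in enumerate(kept):
                if same_fact(unit, other):
                    if score > other_score:
                        kept[i] = (unit, score)  # replace with the better-scored unit
                    break
            else:
                kept.append((unit, score))  # a new fact: retain it
    return " ".join(unit for unit, _ in kept)


fused = fuse_answers([
    ("The battery lasts ten hours. It charges via USB-C.", 0.9),
    ("The battery lasts about ten hours. The case is waterproof.", 0.7),
])
print(fused)
```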
- the method further includes:
- the question and answer data is scored using the question and answer data scoring model to obtain a third score for the question and answer data;
- the question and answer data is cleaned according to the third score to obtain target question and answer data that meets the preset requirements;
- the data sample is expanded according to the target question and answer data to obtain a third preset number of question and answer data.
- the embodiment of the present disclosure sets up a question-answer data scoring model in advance to check the quality of the obtained QA.
- the core indicators of QA data are:
- Consistency: Are similar or repeated questions and answers extracted from different parts or different documents consistent? Ensure that the answers provided are consistent across documents or document parts.
- the question and answer data of each cut document block is scored to obtain the third score of the question and answer data.
- the scoring criteria are:
- Rule-based checking: Use regular expressions or other rules to check the format of questions and answers. Automatically check the completeness of answers, such as ensuring that answers are not truncated.
- Sentence embedding comparison: Use a pre-trained sentence embedding model to convert questions and answers into vectors. Compare the embeddings of questions and answers to evaluate their relevance or similarity.
- Feedback loop: Automatically generate questions using a language model and validate answers against documents. Compare automatically generated questions with actual extracted questions to assess their quality.
- Diversity and duplication check: Use Jaccard similarity, cosine similarity or other text similarity methods to check the duplication between the extracted QA data.
- Statistical analysis: Automatically count the frequency of certain keywords or words to determine if there are too many repeated or missing topics.
- Use TF-IDF (term frequency–inverse document frequency, a common weighting technique used in information retrieval and data mining) or other techniques to identify unusual or rare words that appear in questions or answers.
- Contextual consistency: Use a pre-trained language model to evaluate the consistency of the answer in context, ensuring that the answer is relevant to the text around it.
- Error analysis: Use automated tools, such as grammar checkers or text classifiers, to identify possible text errors or inconsistencies. Also design an automated system to continuously fine-tune and optimize the data extraction model based on feedback and errors.
- Reference dataset comparison: Compare with a seed dataset of verified quality, using an automated scoring system to assess the quality of the QA data.
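Two of the criteria above, rule-based checking and the duplication check, lend themselves to a short sketch. The specific rules (a question ends with a question mark, an untruncated answer ends with closing punctuation) and the 0.8 Jaccard threshold are assumptions for English text, and the function names are hypothetical.

```python
import re


def passes_rules(question: str, answer: str) -> bool:
    """Rule-based format check: question ends with '?', answer is non-empty and not truncated."""
    q_ok = bool(re.fullmatch(r".+\?", question.strip(), flags=re.DOTALL))
    a = answer.strip()
    a_ok = bool(a) and a[-1] in ".!?\"')"
    return q_ok and a_ok


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two texts."""
    wa, wb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def find_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs of questions whose Jaccard similarity reaches the threshold."""
    return [
        (i, j)
        for i in range(len(questions))
        for j in range(i + 1, len(questions))
        if jaccard(questions[i], questions[j]) >= threshold
    ]


questions = [
    "What is the warranty period?",
    "what is the warranty period",
    "How do I reset it?",
]
print(find_duplicates(questions))
print(passes_rules("What is X?", "X is a"))
```

The pairwise comparison is quadratic in the number of questions; for a large QA library an approximate method such as MinHash would likely be needed.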
- the disclosed embodiment will clean the question and answer data according to the third score, including removing low-quality question and answer data, removing sensitive privacy data, etc., to obtain target question and answer data that meets the preset requirements to ensure the high quality of the question and answer data.
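The cleaning step can be sketched as a filter on the third score plus a sensitive-data check. The regular expressions below are simple illustrative patterns for e-mail addresses and phone-like numbers, not a complete privacy filter, and `clean_qa` and the dictionary keys are hypothetical.

```python
import re

# Illustrative patterns only; real privacy filtering needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3,4}[-.\s]?\d{4}\b")


def contains_sensitive(text: str) -> bool:
    """Detect e-mail addresses or phone-like digit sequences."""
    return bool(EMAIL.search(text) or PHONE.search(text))


def clean_qa(pairs: list[dict], min_score: float) -> list[dict]:
    """Drop pairs whose third score is too low or that contain sensitive data."""
    cleaned = []
    for pair in pairs:
        if pair["score"] < min_score:
            continue  # low-quality pair
        if contains_sensitive(pair["question"]) or contains_sensitive(pair["answer"]):
            continue  # sensitive privacy data
        cleaned.append(pair)
    return cleaned


pairs = [
    {"question": "What is the return window?", "answer": "Thirty days from delivery.", "score": 0.9},
    {"question": "Who do I contact?", "answer": "Email jane.doe@example.com.", "score": 0.95},
    {"question": "What is it?", "answer": "A thing.", "score": 0.2},
]
target_qa = clean_qa(pairs, min_score=0.6)
print(target_qa)
```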
- the data sample is expanded according to the target question and answer data obtained after cleaning to ensure the sufficiency of the data sample of high-quality QA data.
- the embodiment of the present disclosure also proposes some auxiliary strategies to support the application of the QA library in more scenarios.
- Incremental update strategy: As the business develops, the enterprise may generate new document data. At this time, an incremental update mechanism can be designed to regularly update and optimize the existing QA library to maintain its timeliness and relevance.
- User feedback: When the QA library is used in scenarios such as customer support, user feedback on the answers to questions can be collected to continuously optimize and update the QA library. For example, users can provide feedback on the relevance, accuracy, and ease of understanding of answers, and the system will continuously optimize based on this feedback.
- Topic models: Use topic models to cluster questions so as to maintain a good diversity of questions in the QA database. This helps avoid excessive concentration on a single topic and ensures that the QA database covers multiple fields and knowledge points.
- Visual data evaluation: Provide visualization tools to present the process and results of QA scoring and screening. They can display key indicators such as the distribution of questions and answers, quality score change trends, etc., to help companies gain insight into the overall situation of data quality.
- Integrate external knowledge base: Combine the QA library with an external knowledge base to expand the coverage of questions and answers and improve adaptability to user queries.
- strategies such as multi-task learning and reinforcement learning are applied to improve data quality.
- the robustness of the model is improved through methods such as model integration and data enhancement, and the impact of uncertainty on model performance is reduced.
- the cut document block includes text data after image recognition.
- after the server side obtains the document from which question and answer (QA) data is to be extracted (or mined), it can first determine whether the document is an image document or a non-image document. If it is an image document, the image document is first recognized to obtain the corresponding text content, and the document is then cut, for example into 10 equal parts, to obtain multiple cut document blocks, so that each cut document block includes the text data after the image recognition.
- if the unstructured data is a picture, the picture is processed and visual encoder feature extraction is applied. Then, OCR technology is used to extract text from the picture, and the context information of the extracted text is obtained at the same time.
- the extracted text and context information are used for text preprocessing and document segmentation, such as segmenting into n document blocks of less than 10k. If the unstructured data is not a picture, the text in the document is preprocessed and the document is segmented into n document blocks of less than 10k.
- the questioning strategy is selected according to the text type of the text in the document block (such as regular documents, special professional documents, news reports, literary works, etc. in Figure 2).
- a variety of questioning strategies and questioning models are applied to extract n questions, and the n questions are input into the question scoring model for scoring. This cycle is repeated N times to screen out high-quality questions, which are used to optimize the questioning model. Finally, an optimal question is output and passed both to the QA library and to the answer strategy determination module.
- the answer strategy is determined based on the text type in each document block (such as regular documents, special professional documents, news reports, literary works, etc. in Figure 2) and the input best question. Then, multiple answer strategies, each document block, and the full amount of data associated with the best question obtained by vector processing of unstructured data are input into the answer model to obtain multiple answer data. Then, the multiple answer data are input into the answer scoring model. After N rounds of cycles, high-quality answers are screened out to learn and optimize the answer model; n answers are integrated to form an optimal answer (that is, the best answer), which is input into the QA library. It can be seen that the QA library stores QA data corresponding to one best answer for one best question.
- the QA scoring model is used to score each QA data in the QA library for quality inspection.
- the QA data can be cleaned to remove low-quality data and sensitive privacy data.
- the cleaned QA data can be expanded. Finally, all models can be fine-tuned based on the expanded and cleaned QA data.
- data processing is completed locally, and enterprise data will not be exposed to third parties, which greatly increases data security.
- these models are continuously adapted to the mining strategy and style of a specific enterprise or industry, thereby achieving knowledge transfer and building an enterprise-specific mining model.
- a trade-off is made based on hardware resources and performance requirements, and the appropriate model architecture and parameter settings are selected, which can reduce the demand for computing resources without affecting the quality of synthetic data.
- a device for generating question and answer data is also provided, which is used to implement the above-mentioned embodiments and preferred implementation modes; what has already been described will not be repeated.
- the term "module" can be a combination of software and/or hardware that implements a predetermined function.
- although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.
- This embodiment provides a device for generating question-answer data, as shown in FIG3 , including:
- a first acquisition module 301 is used to acquire a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data;
- the first obtaining module 302 is used to obtain target question data corresponding to the cut document block according to the text data in the cut document block and the target question model;
- the second acquisition module 303 is used to acquire full data, wherein the full data is content information associated with the target question data contained in a complete document composed of a plurality of cut document blocks;
- the second obtaining module 304 is used to obtain target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full amount of data and the target answer model;
- the third obtaining module 305 is used to generate question-answer data according to the target question data and the target answer data.
- the device further comprises:
- a first determination module is used to determine a questioning strategy based on the text data in the cut document block before obtaining target question data corresponding to the cut document block based on the text data in the cut document block and the target question model;
- An extraction module used to process text data according to a questioning strategy and extract multiple questioning questions
- a third acquisition module is used to obtain first scores for multiple questions according to the question scoring model
- a second determination module is used to determine a plurality of candidate questions from a plurality of questions according to the first score
- the fourth obtaining module is used to optimize the initial question model according to the candidate questions to obtain the target question model.
- the first determining module includes:
- a first determining unit used to determine the text type of the text data
- the second determining unit is used to determine a corresponding questioning strategy according to the text type.
- the device further comprises:
- a third determination module is used to determine an answer strategy based on the text data in the cut document block and the target question data before obtaining the target answer data corresponding to the target question data based on the cut document block, the target question data, the full amount of data and the target answer model;
- the fifth obtaining module is used to process the text data according to the answer strategy to obtain a plurality of answer data
- a fourth acquisition module used for acquiring a second score for the plurality of answer data according to the answer scoring model
- a fourth determination module configured to determine a plurality of candidate answer data from the plurality of answer data according to the second score
- the sixth obtaining module is used to optimize the initial answer model according to the candidate answer data to obtain the target answer model.
- the third determining module includes:
- a third determining unit used to determine the text type of the text data
- the fourth determination unit is used to determine the answer strategy according to the text type and the target question data.
- the second obtaining module 304 includes:
- An obtaining unit used for obtaining a plurality of answer data according to the cut document blocks, the target question data, the full amount of data and the target answer model;
- Splitting unit is used to split the answer data into the smallest unit to obtain multiple target fields
- a comparing unit used for comparing the description content of the target field in each second preset number of answer data
- a retaining unit used for retaining the description content corresponding to the answer data with the highest second score
- the integration unit is used to integrate the description content to obtain target answer data.
- the device further comprises:
- a scoring module for scoring the question and answer data using a question and answer data scoring model after obtaining the question and answer data about the cut document block, to obtain a third score for the question and answer data;
- a cleaning module used to clean the question and answer data according to the third score to obtain target question and answer data that meets preset requirements
- the expansion module is used to expand the data sample according to the target question and answer data to obtain a third preset number of question and answer data.
- the cut document block includes text data after image recognition.
- the question and answer data generating device in this embodiment is presented in the form of a functional unit, where the unit refers to an ASIC circuit, a processor and memory that executes one or more software or fixed programs, and/or other devices that can provide the above functions.
- the embodiment of the present disclosure also provides a computer device having the device for generating question and answer data shown in FIG. 3 above.
- FIG 4 is a schematic diagram of the structure of a computer device provided by an optional embodiment of the present disclosure.
- the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting various components, including high-speed interfaces and low-speed interfaces.
- the various components are connected to each other using different buses for communication, and can be installed on a common motherboard or installed in other ways as needed.
- the processor can process instructions executed in the computer device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to the interface).
- multiple processors and/or multiple buses can be used together with multiple memories.
- multiple computer devices can be connected, and each device provides some necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system).
- a processor 10 is taken as an example.
- the processor 10 may be a central processing unit, a network processor or a combination thereof.
- the processor 10 may further include a hardware chip.
- the hardware chip may be a dedicated integrated circuit, a programmable logic device or a combination thereof.
- the programmable logic device may be a complex programmable logic device, a field programmable gate array, a general purpose array logic or any combination thereof.
- the memory 20 stores instructions executable by at least one processor 10, so that the at least one processor 10 executes the method shown in the above embodiment.
- the memory 20 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function; the data storage area may store data created by the use of a computer device based on the presentation of a small program landing page, etc.
- the memory 20 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one disk storage device, a flash memory device, or another non-transitory solid-state storage device.
- the memory 20 may optionally include a memory remotely arranged relative to the processor 10, and these remote memories may be connected to the computer device via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the memory 20 may include a volatile memory, such as a random access memory; the memory may also include a non-volatile memory, such as a flash memory, a hard disk or a solid state drive; the memory 20 may also include a combination of the above types of memory.
- the computer device further comprises a communication interface 30 for the computer device to communicate with other devices or a communication network.
- the embodiments of the present disclosure also provide a computer-readable storage medium.
- the above-mentioned method according to the embodiments of the present disclosure can be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded through a network, and stored on a local storage medium, so that the method described herein can be carried out by such software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware.
- the storage medium can be a magnetic disk, an optical disk, a read-only storage memory, a random access memory, a flash memory, a hard disk or a solid-state drive, etc.; further, the storage medium can also include a combination of the above-mentioned types of memory.
- a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component that can store or receive software or computer code. When the software or computer code is accessed and executed by a computer, a processor, or hardware, the method shown in the above embodiment is implemented.
Abstract
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese invention patent application No. 202311434678.6, entitled "Method, device, computer equipment and storage medium for generating question and answer data" and filed on October 31, 2023, which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of data processing, and in particular to a method, apparatus, computer device and storage medium for generating question and answer data.
Most enterprises hold large amounts of existing text data accumulated over years of business operations. An enterprise that wants to experiment with large-model intelligence needs to make use of this data. However, the data has not been cleaned, organized, or labeled, so technicians mine the required QA (Question and Answer) data (that is, question and answer data) from existing documents inefficiently, and the mined data is of low quality.
At present, to ensure that high-quality QA data can be mined from existing documents, manual annotation is usually used. However, this approach incurs high labor and time costs and cannot guarantee timeliness. Manual annotation also depends on the professional competence of the annotators, so the quality of the mined data is uneven, and a considerable amount of poor-quality data may result.
Therefore, the related art suffers from low efficiency and poor data quality when mining QA data from documents.
Summary of the invention
In view of this, the present disclosure provides a method, apparatus, computer device and storage medium for generating question and answer data, to solve the problems of low efficiency and poor data quality in the related art when mining QA data from documents.
In a first aspect, the present disclosure provides a method for generating question and answer data, the method comprising:
acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data;
obtaining target question data corresponding to the cut document block according to the text data in the cut document block and a target question model;
acquiring full data, wherein the full data is content information associated with the target question data contained in a complete document composed of the plurality of cut document blocks;
obtaining target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and a target answer model;
generating question and answer data according to the target question data and the target answer data.
In an optional implementation, before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
determining a questioning strategy according to the text data in the cut document block;
processing the text data according to the questioning strategy and extracting a plurality of questions;
obtaining first scores for the plurality of questions according to a question scoring model;
determining a plurality of candidate questions from the plurality of questions according to the first scores;
optimizing an initial question model according to the candidate questions to obtain the target question model.
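The score-and-screen step above can be sketched as follows. This is only a minimal illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the question scoring model, and `top_k` is a hypothetical selection rule, since the disclosure does not specify how many candidates are kept or how the first score is computed.

```python
from typing import Callable, List


def select_candidate_questions(
    questions: List[str],
    score_fn: Callable[[str], float],  # hypothetical stand-in for the question scoring model
    top_k: int = 3,                    # hypothetical cut-off; not given in the disclosure
) -> List[str]:
    """Return the top_k questions ranked by their first score, highest first."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Score every extracted question once.
    scored = [(score_fn(q), q) for q in questions]
    # Sort by score only, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:top_k]]
```

The selected candidates would then serve as the data used to optimize the initial question model; that optimization step is outside the scope of this sketch.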
In an optional implementation, determining the questioning strategy according to the text data in the cut document block includes:
determining the text type of the text data;
determining the corresponding questioning strategy according to the text type.
In an optional implementation, before obtaining the target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model, the method further includes:
determining an answer strategy according to the text data in the cut document block and the target question data;
processing the text data according to the answer strategy to obtain a plurality of answer data;
obtaining second scores for the plurality of answer data according to an answer scoring model;
determining a plurality of candidate answer data from the plurality of answer data according to the second scores;
optimizing an initial answer model according to the candidate answer data to obtain the target answer model.
In an optional implementation, determining the answer strategy according to the text data in the cut document block and the target question data includes:
determining the text type of the text data;
determining the answer strategy according to the text type and the target question data.
In an optional implementation, obtaining the target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model includes:
obtaining a plurality of answer data according to the cut document blocks, the target question data, the full data and the target answer model;
splitting the answer data into minimum units to obtain a plurality of target fields;
comparing the description content of the target fields within every second preset number of answer data;
retaining the description content corresponding to the answer data with the highest second score;
integrating the retained description content to obtain the target answer data.
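The field-level integration above can be illustrated with a short sketch. It assumes, hypothetically, that each answer has already been split into a mapping from target field to description content and carries its second score; how the split into minimum units is performed is not specified in the disclosure, and the name `integrate_answers` is illustrative only.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (second score, {target field: description content}).
ScoredAnswer = Tuple[float, Dict[str, str]]


def integrate_answers(answers: List[ScoredAnswer]) -> Dict[str, str]:
    """For each target field, keep the description from the highest-scoring answer."""
    best: Dict[str, Tuple[float, str]] = {}
    for score, fields in answers:
        for field, description in fields.items():
            # Replace the stored description only when this answer scores strictly higher.
            if field not in best or score > best[field][0]:
                best[field] = (score, description)
    # Drop the scores and return the integrated field -> description mapping.
    return {field: description for field, (_, description) in best.items()}
```

A field that appears in only one answer is kept as is, so the integrated result can cover more fields than any single answer.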
In an optional implementation, after obtaining the question and answer data for the cut document block, the method further includes:
scoring the question and answer data using a question and answer data scoring model to obtain a third score for the question and answer data;
cleaning the question and answer data according to the third score to obtain target question and answer data that meets preset requirements;
expanding the data samples according to the target question and answer data to obtain a third preset number of question and answer data.
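The scoring and cleaning steps above can be sketched as a simple threshold filter. This is an assumption-laden illustration: `score_fn` is a hypothetical stand-in for the question and answer data scoring model, and `min_score` is a hypothetical way of expressing the "preset requirements". Removal of sensitive data and sample expansion are not shown.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def clean_qa_pairs(
    qa_pairs: List[QAPair],
    score_fn: Callable[[str, str], float],  # hypothetical QA scoring model
    min_score: float = 0.7,                 # hypothetical threshold for the preset requirement
) -> List[QAPair]:
    """Keep only the QA pairs whose third score reaches min_score."""
    kept = []
    for question, answer in qa_pairs:
        if score_fn(question, answer) >= min_score:
            kept.append((question, answer))
    return kept
```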
In an optional implementation, the cut document block includes text data obtained by image recognition.
In a second aspect, the present disclosure provides a device for generating question and answer data, the device comprising:
a first acquisition module, used to acquire a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data;
a first obtaining module, used to obtain target question data corresponding to the cut document block according to the text data in the cut document block and a target question model;
a second acquisition module, used to acquire full data, wherein the full data is content information associated with the target question data contained in a complete document composed of the plurality of cut document blocks;
a second obtaining module, used to obtain target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and a target answer model;
a third obtaining module, used to generate question and answer data according to the target question data and the target answer data.
In a third aspect, the present disclosure provides a computer device, comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method for generating question and answer data of the first aspect or any corresponding embodiment thereof.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having computer instructions stored thereon, the computer instructions being used to cause a computer to execute the method for generating question and answer data of the first aspect or any corresponding embodiment thereof.
To illustrate the specific embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for generating question and answer data according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram of the complete flow of a method for generating question and answer data according to some embodiments of the present disclosure;
FIG. 3 is a structural block diagram of a device for generating question and answer data according to some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present disclosure.
To make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments in the present disclosure without creative effort fall within the scope of protection of the present disclosure.
Most enterprises hold large amounts of existing text data accumulated over years of business operations. An enterprise that wants to experiment with large-model intelligence needs to make use of this data. However, the data has not been cleaned, organized, or labeled, so it can only be used for pre-training. Pre-training is very expensive, and the amount of data that most enterprises can accumulate is not large enough, so the benefit of pre-training is limited.
The best strategy for an enterprise is therefore to fine-tune a model, which requires cleaning past business data and extracting QA data from it. However, because of data security concerns, it is difficult for an enterprise to hand its data to a third-party labeling company for processing. At present, to ensure that high-quality QA data can be mined from existing documents, manual annotation is usually used. However, this approach incurs high labor and time costs and cannot guarantee timeliness. Manual annotation also depends on the professional competence of the annotators, so the quality of the mined data is uneven, and a considerable amount of poor-quality data may result.
To solve the above problems, an embodiment of the present disclosure proposes a method for generating question and answer data. It should be noted that the steps shown in the flow charts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow charts, in some cases the steps shown or described can be executed in a different order.
This embodiment provides a method for generating question and answer data. FIG. 1 is a flow chart of a method for generating question and answer data according to some embodiments of the present disclosure. As shown in FIG. 1, the method can be applied on the server side and includes the following steps:
Step S101: acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset number of text data.
Optionally, in the embodiment of the present disclosure, the server side obtains the document from which question and answer (QA) data is to be extracted (or mined) and cuts it, for example into 10 equal parts, to obtain a plurality of cut document blocks. With equal cutting, each cut document block contains the same first preset amount of text data. For example, if the document contains 100,000 characters of text in total and is cut into 10 parts, each cut document block contains 10,000 characters.
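The equal-proportion cutting described above can be sketched as follows. This is a minimal illustration that measures length in characters and ignores sentence or paragraph boundaries, which a practical implementation would probably respect; the function name `cut_document` is hypothetical.

```python
from typing import List


def cut_document(text: str, num_blocks: int = 10) -> List[str]:
    """Cut text into at most num_blocks contiguous blocks of near-equal length."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    # Every block gets `base` characters; the first `extra` blocks get one more.
    base, extra = divmod(len(text), num_blocks)
    blocks = []
    start = 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append(text[start:start + size])
        start += size
    # Drop empty blocks, which occur when the text is shorter than num_blocks.
    return [block for block in blocks if block]
```

With a 100,000-character document and `num_blocks=10`, every block contains exactly 10,000 characters, matching the example in the text.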
In addition, before the document is cut, the text in the acquired document may contain many problems. In this case, the text needs to be preprocessed to obtain correct text in a unified format. Table 1 lists some of the problems that may exist in the text and the corresponding solutions:
Table 1
In addition to the problems in the above table, there may be other text problems. For example, when text is extracted automatically, grammatical errors and typos in the document can lead to misinterpretation; the corresponding solutions can include error detection and correction, semantic disambiguation, and so on. It should be noted that the above content is only an example, and the embodiments of the present disclosure include but are not limited to the above text problems and data preprocessing methods.
Step S102: obtaining target question data corresponding to the cut document block according to the text data in the cut document block and the target question model.
Optionally, the text data in each cut document block is input into the trained target question model to obtain the target question data corresponding to that cut document block.
Step S103: acquiring full data, wherein the full data is content information associated with the target question data contained in a complete document composed of the plurality of cut document blocks.
Optionally, some answers may require context to be fully understood, and simply extracting questions and answers may lose important background information. In this case, the context range used during processing can be widened so that context closely related to the question and answer is taken into account. This allows the model to capture the background information it needs to better understand the question and answer. Context information can also be encoded using sentence or paragraph representation methods, which helps the model capture long-range context and incorporate it into the processing flow at prediction time.
However, if only local context is used to obtain question and answer data, the result may not be comprehensive. For example, the answer to a question in one document may depend on the content of another document, in which case the full data is needed. The full data is stored in a vector database (a complete document database composed of the plurality of cut document blocks). The question is vectorized and matched against the data in the vector database, and vector similarity is used to select the content information most relevant to the target question data, for example 10 pieces of data. The target question data and the 10 related data fragments are then passed together to the model that selects an answer strategy for the question. After the answer strategy is applied, the corresponding target answer data is obtained.
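The similarity-based selection described above can be sketched with plain Python. Several things here are assumptions: the vectors are taken as already computed by some embedding model, cosine similarity is used as the similarity measure (the disclosure only refers to vector similarity), and an in-memory list stands in for the vector database. The name `retrieve_full_data` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_full_data(
    question_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],  # (fragment text, fragment vector)
    top_k: int = 10,
) -> List[str]:
    """Return the top_k fragments most similar to the question vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]
```

The returned fragments, together with the target question data, correspond to the input that is passed on to answer strategy selection.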
In this case, the full data may be determined in the following ways:
1. Embedding model plus vector database: information can be searched and extracted across documents.
2. Document links: if there are explicit links between documents, the documents can be fetched and processed in sequence by following the links, forming a contextual understanding.
3. Compound document modeling: a multi-input model is built, and multiple related documents are input at the same time for understanding and reasoning.
4. Answer extraction and synthesis: when the answer spans a large range, answer extraction is performed first to extract all possible answer fragments, and answer synthesis then splices these fragments together in an appropriate way. In the extraction stage, multiple answer fragments can be set and each fragment given a score; in the synthesis stage, the fragment with the highest score is selected, or the higher-scoring fragments are spliced together.
5. Paragraph-level processing: when the answer may span multiple paragraphs, the paragraph is used as the processing unit and the information of multiple paragraphs is considered together when generating the answer. The document can be cut with a sliding window or another method in order to capture a wider range of context.
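As an illustration of the sliding-window cutting mentioned for paragraph-level processing, a minimal sketch is given below. The function name `sliding_window_chunks` and the `window` and `stride` parameters are hypothetical, and the values shown are illustrative only.

```python
def sliding_window_chunks(paragraphs, window=3, stride=2):
    """Group consecutive paragraphs into overlapping windows.

    Each window holds up to `window` paragraphs, and the start of the window
    advances by `stride` paragraphs, so neighbouring windows overlap whenever
    stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while start < len(paragraphs):
        chunks.append(paragraphs[start:start + window])
        if start + window >= len(paragraphs):
            break  # the last paragraph has been covered
        start += stride
    return chunks


chunks = sliding_window_chunks(["p1", "p2", "p3", "p4", "p5"], window=3, stride=2)
```

Because adjacent windows share paragraphs, an answer that straddles a paragraph boundary is more likely to fall entirely within at least one window.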
It can be understood that the full data already contains the context information associated with the target question data.
Step S104: obtain target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model.
Optionally, when determining the target answer data corresponding to the target question data of each cut document block, the embodiments of the present disclosure take into account the text data of that cut document block, the target question data output by the target question model, and all of the full data associated with the target question data. These inputs are fed into the trained target answer model to obtain the target answer data corresponding to the target question data.
Step S105: generate question-answer data according to the target question data and the target answer data.
Optionally, each cut document block corresponds to one piece of target question data and one piece of target answer data. The question-answer data formed from this pair is the final question-answer data mined from that cut document block.
In addition, the final question-answer data mined from all of the cut document blocks may be assembled into a QA library that stores the question-answer data.
In the embodiments of the present disclosure, a plurality of cut document blocks are obtained, wherein each cut document block contains a first preset number of pieces of text data; target question data corresponding to each cut document block is obtained according to the text data in the cut document block and the target question model; full data is obtained, wherein the full data is the content information associated with the target question data that is contained in the complete document composed of the plurality of cut document blocks; target answer data corresponding to the target question data is obtained according to the cut document blocks, the target question data, the full data and the target answer model; and question-answer data is generated according to the target question data and the target answer data. In this way, the embodiments of the present disclosure can mine high-quality question-answer data from existing documents efficiently and automatically, which greatly reduces labor and time costs and addresses the low efficiency and poor data quality of manual mining in the related art.
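The overall flow summarized above can be sketched as a simple loop over the cut document blocks. This is only an outline under stated assumptions: the three callables `ask`, `retrieve` and `answer` are hypothetical stand-ins for the target question model, the full-data retrieval and the target answer model, and their signatures are chosen for illustration.

```python
from typing import Callable, List, Tuple


def generate_qa_pairs(
    chunks: List[str],
    ask: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, str, List[str]], str],
) -> List[Tuple[str, str]]:
    """Produce one (question, answer) pair per cut document block."""
    qa_pairs = []
    for chunk in chunks:
        question = ask(chunk)                     # question generation for this block
        related = retrieve(question)              # step S103: full data for the question
        reply = answer(chunk, question, related)  # step S104: target answer data
        qa_pairs.append((question, reply))        # step S105: question-answer data
    return qa_pairs


# Stub callables so that the sketch runs without any model.
pairs = generate_qa_pairs(
    ["alpha", "beta"],
    ask=lambda chunk: "Q:" + chunk,
    retrieve=lambda question: [question.upper()],
    answer=lambda chunk, question, related: chunk + "|" + related[0],
)
```

Passing the models in as callables keeps the loop independent of any particular model or database.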
In some optional implementations, before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
determining a questioning strategy according to the text data in the cut document block;
processing the text data according to the questioning strategy to extract a plurality of questions;
obtaining first scores for the plurality of questions according to a question scoring model;
determining a plurality of candidate questions from the plurality of questions according to the first scores;
optimizing an initial question model according to the candidate questions to obtain the target question model.
Optionally, in the embodiments of the present disclosure, the initial question model is optimized repeatedly before the trained target question model is obtained. Before the initial question model is optimized, the QA structured data is screened. Text suitable for extracting QA structured data has the following characteristics:
a. Rich content: the text should contain a wealth of information points and descriptions.
b. Clear logic: the text should be well structured and accurately worded, so that the questions and answers extracted from it are accurate.
c. Factual and objective: the text is factual and objective in nature, which helps generate precise questions and answers of reference value.
d. Domain and task relevance: the ideal text type is highly relevant to a specific domain and task, so that the extracted QA data has practical application value.
At this point, the questioning strategy of each cut document block is determined according to the above characteristics of text suitable for extracting QA structured data, and a plurality of questions (hereinafter also referred to as "questions") are extracted from each cut document block.
In order to mine high-quality questions, so that related answers can be constructed in the subsequent process, the embodiments of the present disclosure set up a question scoring model in advance to score the plurality of questions obtained. The scoring criteria of the question scoring model are designed as follows:
1. Difficulty of the question: challenging questions are selected.
2. Diversity of the questions: the questions should come from different fields and knowledge points.
3. Precision of the question: whether the question accurately describes a specific concept or fact.
4. Wording quality of the question: the question should be grammatical and clear.
5. Clarity of the question: whether the question is unambiguous and easy to understand.
6. Relevance of the question: whether the question is closely related to the given topic or data set.
7. Information content of the question: whether the question involves meaningful information.
8. Complexity of the question: whether the question is difficult enough that it cannot be answered by a simple lookup.
The plurality of questions are input into the question scoring model to obtain a score for each question. The questions are sorted by score so that high-quality candidate questions can be selected. It can be understood that there may be multiple high-quality candidate questions; in that case, multiple rounds are performed with a scoring threshold, and only questions scoring above the threshold are retained. In addition, the embodiments of the present disclosure may adjust the scoring threshold as required to obtain questions of different quantity and quality.
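The score, sort and threshold steps can be sketched as below. This is a minimal sketch: `select_candidates` is a hypothetical name, and the scoring function is a stub lookup table rather than a question scoring model.

```python
def select_candidates(questions, score_fn, threshold):
    """Score every question, keep those above the threshold, best first."""
    scored = [(score_fn(q), q) for q in questions]
    kept = [(s, q) for s, q in scored if s > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in kept]


# Stub scores standing in for the output of a question scoring model.
toy_scores = {
    "What is it?": 0.2,
    "Which protocol does the gateway use for retries?": 0.9,
    "How is the cache invalidated after a write?": 0.7,
}
candidates = select_candidates(list(toy_scores), toy_scores.get, threshold=0.5)
```

Raising or lowering `threshold` trades the number of retained questions against their quality, as described above.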
Based on the feedback of the question scoring model, the retained questions are optimized, which includes revising the wording of a question and supplementing key information.
The screened and optimized candidate questions are used as training samples and input into the initial question model for training. The initial question model is optimized in this way to improve question quality; after multiple rounds of iteration, the system generates a high-quality question set and the target question model is finally obtained.
In addition, for the above process of training the initial question model to obtain the target question model, the embodiments of the present disclosure also propose several optimization directions:
a. Multi-factor scoring: in the design of the question scoring model, in addition to the scoring criteria mentioned above, multiple factors can be combined to score a question. For example, an NLP model is used to perform semantic analysis on the question, and factors such as relevance, information content and complexity are combined into a comprehensive score.
b. Adaptive threshold: in the question screening stage, an adaptive threshold strategy is used to adjust the scoring threshold dynamically, for example according to the number of questions, their difficulty and their category. This screens out high-quality questions more precisely and improves the quality of the question library.
c. Balanced sample weights: when the candidate questions are fed back for training, balanced sample weights may be used to avoid overfitting. Questions are given different weights according to factors such as importance and difficulty, which balances the influence of different categories of questions during training.
d. Introducing expert knowledge: in the question optimization stage, the knowledge of domain experts may be introduced and the questions optimized through expert feedback. Experts correct and revise the questions to improve their quality, and the question scoring model can be further optimized based on their feedback.
e. Cross-validation: in the question collection and preprocessing stage, a cross-validation approach is used to increase the diversity of the question library. For example, when processing different data sets on the same topic, a subset of questions can be drawn from each data set and then scored and screened. This yields a broader question set and helps improve the quality of the question library.
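One possible form of the adaptive threshold in item b is sketched below. The disclosure does not specify the adjustment rule, so the quantile-with-floor rule used here is purely an assumption, as are the name `adaptive_threshold` and the parameters `keep_fraction` and `floor`.

```python
import math


def adaptive_threshold(scores, keep_fraction=0.25, floor=0.5):
    """Derive a score threshold from the current batch of question scores.

    The threshold is the score of the question ranked at `keep_fraction` of
    the batch, but never lower than `floor`, so a weak batch cannot drag the
    threshold down. Questions scoring at or above the returned value are kept.
    """
    if not scores:
        return floor
    ranked = sorted(scores, reverse=True)
    keep_count = max(1, math.ceil(len(ranked) * keep_fraction))
    cutoff = ranked[keep_count - 1]
    return max(cutoff, floor)


strong_batch = adaptive_threshold([0.9, 0.2, 0.7, 0.4], keep_fraction=0.5, floor=0.5)
weak_batch = adaptive_threshold([0.3, 0.2], keep_fraction=0.5, floor=0.5)
```

With this rule the threshold rises when a batch contains many high-scoring questions and falls back to the floor when it does not.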
In the embodiments of the present disclosure, high-quality questions are screened out by the question scoring model, so that the initial question model continuously adapts to the mining strategy and style of a specific enterprise or industry, which improves the accuracy of the trained target question model. Relevant strategies and algorithms are also used for dynamic optimization, so that the model learns and improves while continuously generating high-quality, realistic synthetic data.
In some optional implementations, determining the questioning strategy according to the text data in the cut document block includes:
determining the text type of the text data;
determining the corresponding questioning strategy according to the text type.
Optionally, based on a feature analysis of the text types of text data, the following text types are well suited to extracting QA structured data:
a. Educational materials: textbooks, handouts, tutorials and the like contain rich knowledge and explanatory information.
b. Technical documents: product manuals, API documents, development guides and the like provide detailed technical descriptions and operating methods.
c. Research papers: research texts such as academic papers and industry reports usually contain rich analysis and conclusions, especially the abstract, which often describes the research question and the conclusions.
d. Regulations and policy documents: official regulations, policy documents and legal documents are usually rigorously structured and highly factual.
e. News reports and articles: these contain many descriptions of facts, events and opinions, which helps extract QA data related to current affairs.
f. Q&A communities and forums: the questions, answers and discussions on such platforms usually already contain explicit questions and answers, which can be extracted directly to build a QA library.
g. Manuals and guides: these documents usually contain specific questions and corresponding answers, such as user guides, FAQs and product manuals.
The corresponding questioning strategy is determined according to the above text types, as shown in Table 2 (Table 2 includes both the question extraction strategy and the answer extraction strategy corresponding to the question):
Table 2
Table 2 only lists the questioning strategies and answer extraction strategies determined for educational materials (such as textbooks, handouts and tutorials), technical documents (such as product manuals, API documents and development guides) and research papers (such as academic papers and industry reports). For regulations and policy documents, news reports and articles, and the other text types, the corresponding questioning strategy and answer extraction strategy are obtained from the current text type in the same way, and are not repeated here.
When determining the corresponding questioning strategy according to the text type, the following auxiliary strategies are also available:
a. Using different question models: depending on the text type, different question models may be needed to extract questions in a targeted manner. For example, a model pre-trained for the technical field is used when processing technical documents, and a model pre-trained for the legal field is used when processing regulatory documents.
b. Dynamically adjusting the questioning strategy: the questioning strategy is adjusted dynamically according to the quality, domain and background knowledge of the text, for example by adjusting the threshold for keyword extraction or using different entity types and relationship types.
c. Increasing interpretability: the question generation process is kept interpretable so that users can track and understand the strategy behind each generated question. To this end, the questioning strategy is visualized, the reason for generating each question is provided, and the degree of attention the model pays to different parts of the text is shown.
d. Adaptive strategy adjustment: when processing a large amount of text, the model may need to adapt automatically to the characteristics and difficulty of different texts. To this end, an adaptive questioning strategy is developed, for example by dynamically adjusting the question model according to the complexity and domain of the text, in order to achieve higher-quality question mining.
e. User interaction: to improve the effectiveness and flexibility of the questioning strategy, user interaction is introduced so that users can take part in adjusting the questioning strategy, provide feedback and receive correction suggestions. For example, users rate the questions after they are generated, and the questioning strategy is adjusted according to their feedback.
f. Combining unsupervised and supervised methods: combining unsupervised and supervised methods when selecting a questioning strategy improves the robustness of the model. Unsupervised methods help extract the latent structure in the text, while supervised methods generate more precise questions based on existing annotated data.
g. Balancing question difficulty and diversity: the difficulty and diversity of questions need to be balanced when selecting a questioning strategy. Simple, direct questions may be easier for users to understand but may not cover the full content of the text, while complex and diverse questions cover more dimensions but may be harder for users to understand. When building a QA database, it is recommended to consider both difficulty and diversity so as not to over-emphasize one type of question.
In the embodiments of the present disclosure, the questioning strategy, the answer strategy and the like are flexibly selected and adjusted for different texts and industry requirements, so that they better fit actual application scenarios.
In some optional implementations, before obtaining the target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model, the method further includes:
determining an answer strategy according to the text data in the cut document block and the target question data;
processing the text data according to the answer strategy to obtain a plurality of pieces of answer data;
obtaining second scores for the plurality of pieces of answer data according to an answer scoring model;
determining a plurality of pieces of candidate answer data from the plurality of pieces of answer data according to the second scores;
optimizing an initial answer model according to the candidate answer data to obtain the target answer model.
Optionally, the answer strategy can be determined from the text data in the cut document block and the target question data (see Table 2), and a plurality of pieces of answer data are then obtained. Collecting this answer data involves a series of preprocessing steps on the answers, including removing irrelevant words and correcting spelling errors, in order to improve the wording of the answers.
An answer scoring model is set up in advance to score the generated answers. High-quality answers are retained, and the scoring information is fed back to the initial answer model being trained in order to improve its performance. A common natural language processing answer scoring model may be chosen for this purpose. The goal of the answer scoring model is to rank the plurality of pieces of answer data by quality according to specific criteria, which include: accuracy of the answer, that is, whether the content of the answer is accurate and how well it matches the question; completeness of the answer, that is, whether the answer contains complete information and fully answers the question; and wording quality of the answer, that is, the language of the answer, including grammatical accuracy and comprehensibility.
The answer scoring model is then used to score each answer, and this score reflects the overall quality of the answer.
The second scores of the answers are sorted, a threshold is set, and the pieces of candidate answer data whose scores exceed the threshold are retained. The screened and optimized candidate answer data is used as training samples and input into the initial answer model for further training, and this is iterated to improve the performance of the initial answer model.
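A simple way to combine the three criteria into a single second score is a weighted sum, sketched below. The weights, the `WEIGHTS` table and the function names are assumptions made for illustration; the disclosure does not state how the criteria are combined, and the per-criterion values would come from the answer scoring model.

```python
# Hypothetical weights for the three criteria named above.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "expression": 0.2}


def composite_score(criteria):
    """Weighted sum of per-criterion scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)


def rank_answers(answers, threshold):
    """Return (score, text) pairs whose composite score exceeds the threshold, best first."""
    scored = [(composite_score(a["criteria"]), a["text"]) for a in answers]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


answers = [
    {"text": "A", "criteria": {"accuracy": 1.0, "completeness": 1.0, "expression": 1.0}},
    {"text": "B", "criteria": {"accuracy": 0.2, "completeness": 0.2, "expression": 0.2}},
    {"text": "C", "criteria": {"accuracy": 0.8, "completeness": 0.8, "expression": 0.8}},
]
ranked_texts = [text for _, text in rank_answers(answers, threshold=0.6)]
```

The retained answers correspond to the candidate answer data that is fed back into training.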
The above steps are iterated as needed. During the iterations, the system continuously generates high-quality answer sets in preparation for building the QA library and for the subsequent process, and the finally trained target answer model is obtained. The target answer data produced by the target answer model should be accurate, complete and clearly worded.
The following auxiliary strategies are also available while training the initial answer model:
a. Multi-task learning: a multi-task learning strategy is used during training to optimize the accuracy, completeness and wording quality of the answers at the same time. This is implemented through techniques such as sharing hidden-layer parameters and soft parameter sharing between tasks.
b. Reinforcement learning: a reinforcement learning strategy is introduced, for example an Actor-Critic algorithm or Deep Q-Learning, to determine the best answer and further improve quality in the answer generation stage.
c. Combining a knowledge graph: the answer scoring model refers to an external knowledge graph to enhance the accuracy and trustworthiness of the answers. The knowledge graph is used to further verify and support the correctness of an answer and, where possible, to provide additional references.
d. Real-time fine-tuning: for real-time application scenarios, a real-time fine-tuning strategy is adopted, that is, the answer scoring model is updated in real time according to newly generated answers. This allows the model to adapt to a changing data distribution and improves the quality of the answers.
e. Interpretability: interpretability mechanisms, such as attention mechanisms or model sensitivity analysis, are added to the answer scoring model in order to check the quality of the answers and to provide a basis for analysis and adjustment.
In the embodiments of the present disclosure, iterative optimization and advanced machine learning methods lay a solid foundation for building a high-quality question-answer library. The questioning strategy, the answer strategy and the like are also flexibly selected and adjusted so that they better fit actual application scenarios.
In some optional implementations, determining the answer strategy according to the text data in the cut document block and the target question data includes:
determining the text type of the text data;
determining the answer strategy according to the text type and the target question data.
Optionally, as in Table 2, the answer strategy can be determined according to the text type and the extraction strategy of the target question data. A plurality of answers related to the target question data are obtained according to the answer strategy and then scored, which yields a high-quality answer set composed of a plurality of pieces of candidate answer data. The initial answer model is then iteratively optimized to obtain the trained target answer model. It should be noted that the target answer model outputs a single answer, namely the target answer data.
In some optional implementations, obtaining the target answer data corresponding to the target question data according to the cut document blocks, the target question data, the full data and the target answer model includes:
obtaining a plurality of pieces of answer data according to the cut document blocks, the target question data, the full data and the target answer model;
splitting the answer data into minimum units to obtain a plurality of target fields;
comparing the description content of the target fields in every second preset number of pieces of answer data;
retaining the description content corresponding to the answer data with the highest second score;
integrating the description content to obtain the target answer data.
Optionally, n answers may be generated for each question. These answers may be generated by different initial answer models, or by the same initial answer model under slightly different input conditions. The screened answers are broken down into smaller information units (that is, target fields), which may be sentences or phrases describing a specific fact. The information units are compared in pairs: if two units describe the same fact or detail, only the unit from the higher-scoring answer data is retained; if they describe different facts or details, both are retained.
The selected information units are combined to form a new answer. Auxiliary strategies are used to decide the order of the information units, including but not limited to their order in the original answers or some logical order. Finally, the reconstructed answer is post-processed, which includes checking grammar, adjusting word order and correcting spelling errors, to obtain the final target answer data.
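The split, compare and recombine procedure can be sketched as follows. This is a deliberately simplified illustration: sentences serve as the information units, two units are treated as describing the same fact only when their normalized text is identical (a real system would need a semantic comparison), and units are ordered by first appearance. The names `split_units`, `fact_key` and `synthesize` are hypothetical.

```python
import re


def split_units(answer_text):
    """Split an answer into sentence-level information units."""
    parts = re.split(r"(?<=[.!?])\s+", answer_text.strip())
    return [p for p in parts if p]


def fact_key(unit):
    """Crude identity key for a unit: lower-cased text with punctuation removed."""
    return re.sub(r"[^a-z0-9 ]", "", unit.lower()).strip()


def synthesize(scored_answers):
    """Merge (score, answer_text) pairs into a single answer.

    When two units share a key, the unit from the higher-scoring answer is
    kept. Units are emitted in order of first appearance.
    """
    best = {}   # fact key -> (score, unit text)
    order = []  # fact keys in order of first appearance
    for score, text in scored_answers:
        for unit in split_units(text):
            key = fact_key(unit)
            if key not in best:
                best[key] = (score, unit)
                order.append(key)
            elif score > best[key][0]:
                best[key] = (score, unit)
    return " ".join(best[key][1] for key in order)


merged = synthesize([
    (0.6, "The device uses USB-C. It ships in May."),
    (0.9, "The device uses usb-c! Battery life is ten hours."),
])
```

The grammar and spelling post-processing described above would be applied to the merged text afterwards.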
In the embodiments of the present disclosure, synthesizing a plurality of pieces of answer data efficiently ensures the quality and comprehensiveness of the answers and, compared with using a single answer, greatly improves the quality of the QA library.
In some optional implementations, after obtaining the question-answer data for the cut document blocks, the method further includes:
scoring the question-answer data with a question-answer data scoring model to obtain a third score of the question-answer data;
cleaning the question-answer data according to the third score to obtain target question-answer data that meets preset requirements;
expanding the data samples according to the target question-answer data to obtain a third preset number of pieces of question-answer data.
Optionally, the embodiments of the present disclosure set up a question-answer data scoring model in advance to check the quality of the obtained QA data. The core indicators of QA data are:
a. Completeness of questions and answers: whether the extracted questions and answers are complete, without being truncated or missing key parts, so that the answers provided are meaningful to users.
b. Consistency: whether similar or repeated questions and answers extracted from different parts or different documents agree, so that the answers provided are consistent across documents and document parts.
c. Relevance: whether the extracted questions and answers are relevant to the context or topic, so that the QA data is relevant to the specific query or task.
d. Readability: whether the questions and answers are easy to read and understand, so that the final model can understand the extracted content well.
The question-answer data of each cut document block is scored on the basis of the above core indicators to obtain the third score of the question-answer data. The scoring methods are:
基于规则的检查:使用正则表达式或其他规则检查问题和答案的格式。自动检测答案的完整性,例如确保答案没有被截断。Rule-based checking: Use regular expressions or other rules to check the format of questions and answers. Automatically check the completeness of answers, such as ensuring that answers are not truncated.
句子嵌入比较:使用预训练的句子嵌入模型将问题和答案转化为向量。比较问题和答案的嵌入来评估其关联性或相似性。Sentence Embedding Comparison: Use a pre-trained sentence embedding model to convert questions and answers into vectors. Compare the embeddings of questions and answers to evaluate their relevance or similarity.
反馈循环:使用语言模型自动生成问题,针对文档提供的答案进行验证。比较自动生成的问题与实际提取的问题,以评估其质量。Feedback loop: Automatically generate questions using language model, validate answers against documents. Compare automatically generated questions with actual extracted questions to assess their quality.
多样性和重复度检查:使用Jaccard相似度、余弦相似度或其他文本相似度方法来检查提取的QA数据之间的重复度。Diversity and duplication check: Use Jaccard similarity, cosine similarity or other text similarity methods to check the duplication between the extracted QA data.
统计分析:自动统计某些关键字或词汇的出现频率,以确定是否有过多的重复或缺失的主题。使用TF-IDF(term frequency–inverse document frequency, 用于信息检索与数据挖掘的常用加权技术)或其他技术识别出现在问题或答案中的异常或罕见的词汇。Statistical analysis: Automatically count the frequency of certain keywords or words to determine if there are too many repeated or missing topics. Use TF-IDF (term frequency–inverse document frequency, Common weighting techniques used in information retrieval and data mining) or other techniques to identify unusual or rare words that appear in questions or answers.
上下文一致性:使用预训练的语言模型评估答案在上下文中的一致性,确保答案与其周围的文本相关。Contextual consistency: Use a pre-trained language model to evaluate the consistency of the answer in context, ensuring that the answer is relevant to the text around it.
错误分析:使用自动化工具,如语法检查器或文本分类器,识别可能的文本错误或不一致性。同时设计一个自动化系统,根据反馈和错误持续微调和优化数据提取模型。Error analysis: Use automated tools, such as grammar checkers or text classifiers, to identify possible text errors or inconsistencies. Also design an automated system to continuously fine-tune and optimize the data extraction model based on feedback and errors.
参考数据集比较:与一个质量经过验证的种子数据集进行对比,使用自动评分系统来评估QA数据的质量。Reference dataset comparison: Comparison with a seed dataset of verified quality, using an automated scoring system to assess the quality of the QA data.
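Two of the criteria above, the rule-based completeness check and the Jaccard duplication check, can be sketched as follows. This is a minimal illustration rather than the scoring model of the disclosure: the function names, the "ends with terminal punctuation" heuristic, whitespace tokenization, and the 0.8 threshold are all assumptions made for the example.

```python
import re

# Assumed heuristic: a complete sentence ends with terminal punctuation
# (ASCII or full-width). Real rules would be tuned per corpus.
_TERMINAL = re.compile(r"[.!?。！？]$")


def looks_truncated(text: str) -> bool:
    """Return True if the text is empty or lacks terminal punctuation."""
    stripped = text.strip()
    return not stripped or _TERMINAL.search(stripped) is None


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def find_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs whose Jaccard similarity meets the threshold."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Whitespace tokenization is only adequate for space-delimited languages; for Chinese text a character n-gram or word-segmentation step would be needed before computing the token sets.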
The embodiments of the present disclosure then clean the question-answer data according to the third score, including removing low-quality question-answer data and removing sensitive private data, to obtain target question-answer data that meets the preset requirements, thereby ensuring the high quality of the question-answer data. In addition, the data samples are expanded based on the target question-answer data obtained after cleaning, to ensure that there are sufficient samples of high-quality QA data.
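The cleaning step can be sketched as a filter over scored QA pairs. The sketch below rests on several assumptions that the source does not specify: the `QAPair` record, the 0.6 score threshold, the two regular expressions used to flag sensitive data, and the choice to drop a flagged pair entirely instead of redacting it.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    score: float  # the "third score" produced by the QA scoring model


# Hypothetical privacy patterns: an e-mail address and an 11-digit number.
_SENSITIVE = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    re.compile(r"\b\d{11}\b"),
]


def contains_sensitive(text: str) -> bool:
    """Return True if any assumed privacy pattern occurs in the text."""
    return any(pattern.search(text) for pattern in _SENSITIVE)


def clean(pairs: list[QAPair], min_score: float = 0.6) -> list[QAPair]:
    """Keep pairs that meet the score threshold and contain no flagged data."""
    return [
        pair
        for pair in pairs
        if pair.score >= min_score
        and not contains_sensitive(pair.question + " " + pair.answer)
    ]
```

The pairs returned by `clean` correspond to the target question-answer data, which would then be passed to whatever sample expansion method is chosen.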
In addition, after multiple QA data items have been generated and the QA library has been obtained, the embodiments of the present disclosure also propose several auxiliary strategies to support the use of the QA library in more scenarios.
a. Incremental update strategy: as the business develops, the enterprise may generate new document data. An incremental update mechanism can be designed to update and optimize the existing QA library periodically, keeping it current and relevant.
b. User feedback integration: if the QA library is to be used in scenarios such as customer support, user feedback on the answers can be collected to continuously optimize and update the QA library. For example, users can give feedback on the relevance, accuracy, and ease of understanding of answers, and the system is continuously optimized based on this feedback.
c. Hierarchical data annotation: the questions and answers in the QA library are annotated hierarchically, for example by category and difficulty level, so that relevant information can be matched and displayed more precisely for user queries, improving the user experience.
d. Topic model introduction: use a topic model to cluster questions so that the QA library maintains good question diversity. This helps avoid excessive concentration on a single topic and ensures that the QA library covers multiple fields and knowledge points.
e. Visual data evaluation: provide visualization tools that present the process and results of QA score-based screening. Key indicators such as the distribution of questions and answers and the trend of quality scores can be displayed to give the enterprise an overview of data quality.
f. External knowledge base integration: combine the QA library with external knowledge bases to extend the coverage of questions and answers and improve adaptability to user queries.
In the embodiments of the present disclosure, strategies such as multi-task learning and reinforcement learning are applied to improve data quality. In addition, methods such as model ensembling and data augmentation are used to improve the robustness of the models and reduce the impact of uncertainty on model performance.
In some optional implementations, the cut document block includes text data obtained by image recognition.
Optionally, in the embodiments of the present disclosure, after the server side obtains a document from which question-answer data (i.e., QA) is to be extracted (or mined), it may first determine whether the document is an image document or a non-image document. If it is an image document, the image document is first recognized to obtain the corresponding text content, and the document is then cut, for example into 10 equal parts, to obtain multiple cut document blocks. Each cut document block therefore includes text data obtained by image recognition.
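The "cut into 10 equal parts" example can be illustrated with a short character-based splitter. This is only a sketch under the assumption that the recognized text is a single string and that cutting by character count is acceptable; the function name `split_equal` is hypothetical, and a practical splitter would likely cut on sentence or paragraph boundaries instead.

```python
def split_equal(text: str, parts: int = 10) -> list[str]:
    """Cut text into `parts` consecutive blocks whose lengths differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    size, extra = divmod(len(text), parts)
    blocks, start = [], 0
    for index in range(parts):
        # The first `extra` blocks absorb the remainder, one character each.
        end = start + size + (1 if index < extra else 0)
        blocks.append(text[start:end])
        start = end
    return blocks
```

If the text is shorter than `parts` characters, some trailing blocks are empty strings and can be discarded by the caller.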
FIG. 2 is a schematic diagram of the complete flow of the method for generating question-answer data according to some embodiments of the present disclosure. As shown in FIG. 2, the flow is as follows:
Unstructured data is obtained, such as industry books, articles, and papers.
If the unstructured data is an image, image processing is performed: features are extracted with a visual encoder, OCR is then used to extract the text from the image, and the context information of the extracted text is obtained at the same time. The extracted text and the context information are then preprocessed together and the document is cut, for example into n document blocks of <10k each. If the unstructured data is not an image, the text in the document is preprocessed and the document is cut into n document blocks of <10k each. A questioning strategy is selected according to the text type of the text in the document block (for example, the regular documents, special professional documents, news reports, and literary works shown in FIG. 2). Multiple questioning strategies and the questioning model (i.e., the initial questioning model) are applied to extract n questions, and the n questions are input into the question scoring model for scoring. This is repeated for N rounds, and the high-quality questions screened out are used to optimize the questioning model. Finally, one best question is output and passed both to the QA library and to the answer strategy determination module.
When determining the answer strategy, the answer strategy is determined from the text type in each document block (for example, the regular documents, special professional documents, news reports, and literary works shown in FIG. 2) together with the input best question. The multiple answer strategies, the document blocks, and the full data associated with the best question, which is obtained by vector processing of the unstructured data, are then input together into the answer model to obtain multiple answer data items. The multiple answer data items are input into the answer scoring model, and after N rounds the high-quality answers screened out are used to train and optimize the answer model. The n answers are integrated into one best answer, which is input into the QA library. The QA library therefore stores QA data in which each best question corresponds to one best answer.
The QA scoring model is then used to score each QA data item in the QA library as a quality check. The QA data can also be cleaned to remove low-quality data and sensitive private data, and the cleaned QA data is then expanded. Finally, all models are fine-tuned on the expanded and cleaned QA data.
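The question-mining loop in the flow above (generate n questions, score them, repeat for N rounds, output one best question) can be sketched with the model calls abstracted away. The `generate` and `score` callables stand in for the questioning model and the question scoring model and are assumptions of this sketch, as are the function name and the `keep` parameter; how the retained questions are used to optimize the questioning model is not shown.

```python
from typing import Callable


def mine_best_question(
    block: str,
    generate: Callable[[str], list[str]],
    score: Callable[[str], float],
    rounds: int = 3,
    keep: int = 2,
) -> tuple[str, list[str]]:
    """Run `rounds` generate-and-score passes over one document block.

    Returns the best question and the list of high-scoring questions
    that could later serve as training data for the questioning model.
    """
    kept: list[str] = []
    for _ in range(rounds):
        # Rank this round's candidates from highest to lowest score.
        ranked = sorted(generate(block), key=score, reverse=True)
        kept.extend(ranked[:keep])
    if not kept:
        raise ValueError("the generator produced no questions")
    return max(kept, key=score), kept
```

The same loop shape applies to the answer side, with the answer model and the answer scoring model substituted for the two callables.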
In the above embodiments, all data processing is completed locally, so enterprise data is not exposed to third parties, which greatly improves data security. Through multiple rounds of training of the initial questioning model, the initial answer model, the question scoring model, and the answer scoring model, these models continuously adapt to the mining strategy and style of a specific enterprise or industry, achieving knowledge transfer and building an enterprise-specific mining model. In addition, by weighing hardware resources against performance requirements and selecting an appropriate model architecture and parameter settings, the demand for computing resources can be reduced without affecting the quality of the synthesized data.
This embodiment further provides an apparatus for generating question-answer data. The apparatus is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides an apparatus for generating question-answer data which, as shown in FIG. 3, includes:
a first acquisition module 301, configured to acquire a plurality of cut document blocks, where each cut document block contains a first preset number of text data items;
a first obtaining module 302, configured to obtain target question data corresponding to the cut document block according to the text data in the cut document block and a target questioning model;
a second acquisition module 303, configured to acquire full data, where the full data is the content information associated with the target question data that is contained in the complete document composed of the plurality of cut document blocks;
a second obtaining module 304, configured to obtain target answer data corresponding to the target question data according to the cut document block, the target question data, the full data, and a target answer model;
a third obtaining module 305, configured to generate question-answer data according to the target question data and the target answer data.
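The way modules 301 to 305 hand data to one another can be summarized in a short sketch. The callables `ask`, `retrieve`, and `answer` are placeholders for the target questioning model, the full-data lookup, and the target answer model; their names and signatures, and the `QARecord` type, are assumptions made only to show the data flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QARecord:
    question: str
    answer: str


def generate_qa(
    blocks: list[str],  # output of module 301: the cut document blocks
    ask: Callable[[str], str],
    retrieve: Callable[[str, list[str]], str],
    answer: Callable[[str, str, str], str],
) -> list[QARecord]:
    """Produce one question-answer record per cut document block."""
    records = []
    for block in blocks:
        question = ask(block)                       # module 302
        full_data = retrieve(question, blocks)      # module 303
        reply = answer(block, question, full_data)  # module 304
        records.append(QARecord(question, reply))   # module 305
    return records
```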
In some optional implementations, the apparatus further includes:
a first determination module, configured to determine a questioning strategy according to the text data in the cut document block, before the target question data corresponding to the cut document block is obtained according to the text data in the cut document block and the target questioning model;
an extraction module, configured to process the text data according to the questioning strategy and extract a plurality of questions;
a third acquisition module, configured to acquire first scores for the plurality of questions according to a question scoring model;
a second determination module, configured to determine a plurality of candidate questions from the plurality of questions according to the first scores;
a fourth obtaining module, configured to optimize an initial questioning model according to the candidate questions to obtain the target questioning model.
In some optional implementations, the first determination module includes:
a first determination unit, configured to determine the text type of the text data;
a second determination unit, configured to determine the corresponding questioning strategy according to the text type.
In some optional implementations, the apparatus further includes:
a third determination module, configured to determine an answer strategy according to the text data in the cut document block and the target question data, before the target answer data corresponding to the target question data is obtained according to the cut document block, the target question data, the full data, and the target answer model;
a fifth obtaining module, configured to process the text data according to the answer strategy to obtain a plurality of answer data items;
a fourth acquisition module, configured to acquire second scores for the plurality of answer data items according to an answer scoring model;
a fourth determination module, configured to determine a plurality of candidate answer data items from the plurality of answer data items according to the second scores;
a sixth obtaining module, configured to optimize an initial answer model according to the candidate answer data to obtain the target answer model.
In some optional implementations, the third determination module includes:
a third determination unit, configured to determine the text type of the text data;
a fourth determination unit, configured to determine the answer strategy according to the text type and the target question data.
In some optional implementations, the second obtaining module 304 includes:
an obtaining unit, configured to obtain a plurality of answer data items according to the cut document block, the target question data, the full data, and the target answer model;
a splitting unit, configured to split the answer data into minimum units to obtain a plurality of target fields;
a comparison unit, configured to compare the description content of the target fields within every second preset number of answer data items;
a retaining unit, configured to retain the description content corresponding to the answer data with the highest second score;
an integration unit, configured to integrate the retained description content to obtain the target answer data.
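The split, compare, retain, and integrate steps performed by these units can be sketched as a field-wise merge. The sketch assumes that each answer has already been split into named target fields and carries its second score; the `ScoredAnswer` type, the field names, and the dictionary representation are hypothetical, and grouping answers by the second preset number is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    fields: dict[str, str]  # target field name -> description content
    score: float            # the "second score" from the answer scoring model


def merge_answers(answers: list[ScoredAnswer]) -> dict[str, str]:
    """For each target field, keep the description from the highest-scoring answer."""
    best: dict[str, tuple[float, str]] = {}
    for candidate in answers:
        for name, description in candidate.fields.items():
            if name not in best or candidate.score > best[name][0]:
                best[name] = (candidate.score, description)
    # Integrate the retained descriptions into a single target answer.
    return {name: description for name, (_, description) in best.items()}
```

A field that appears in only one answer is kept as it is, so the merged result covers the union of the fields seen across the compared answers.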
In some optional implementations, the apparatus further includes:
a scoring module, configured to score the question-answer data using a question-answer data scoring model after the question-answer data for the cut document block is obtained, to obtain a third score of the question-answer data;
a cleaning module, configured to clean the question-answer data according to the third score to obtain target question-answer data that meets preset requirements;
an expansion module, configured to expand the data samples according to the target question-answer data to obtain a third preset number of question-answer data items.
In some optional implementations, the cut document block includes text data obtained by image recognition.
The apparatus for generating question-answer data in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (application-specific integrated circuit), a processor and memory that execute one or more software or firmware programs, and/or other devices that can provide the above functions.
Further functional descriptions of the above modules and units are the same as in the corresponding embodiments above and are not repeated here.
The embodiments of the present disclosure further provide a computer device having the apparatus for generating question-answer data shown in FIG. 3.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a computer device provided by an optional embodiment of the present disclosure. As shown in FIG. 4, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are communicatively connected to each other through different buses and may be mounted on a common motherboard or in other ways as needed. The processor may process instructions executed within the computer device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In some optional implementations, multiple processors and/or multiple buses may be used together with multiple memories, if needed. Likewise, multiple computer devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). FIG. 4 takes one processor 10 as an example.
The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the method shown in the above embodiments.
The memory 20 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the computer device for presenting a mini-program landing page, and the like. In addition, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some optional implementations, the memory 20 optionally includes memory located remotely from the processor 10, and such remote memory may be connected to the computer device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may include volatile memory, such as random access memory; it may also include non-volatile memory, such as flash memory, a hard disk, or a solid-state drive; the memory 20 may also include a combination of the above types of memory.
The computer device further includes a communication interface 30, through which the computer device communicates with other devices or communication networks.
The embodiments of the present disclosure further provide a computer-readable storage medium. The methods according to the embodiments of the present disclosure may be implemented in hardware or firmware, or implemented as computer code that can be recorded on a storage medium, or as computer code that is downloaded over a network, originally stored on a remote storage medium or a non-transitory machine-readable storage medium, and then stored on a local storage medium, so that the methods described herein can be processed by such software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disc, read-only memory, random access memory, flash memory, a hard disk, a solid-state drive, or the like; the storage medium may also include a combination of the above types of memory. It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component that can store or receive software or computer code, and that when the software or computer code is accessed and executed by the computer, processor, or hardware, the methods shown in the above embodiments are implemented.
Although the embodiments of the present disclosure have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present disclosure, and all such modifications and variations fall within the scope defined by the appended claims.
Claims (11)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311434678.6A CN117493508A (en) | 2023-10-31 | 2023-10-31 | Question-answer data generation method and device, computer equipment and storage medium |
| CN202311434678.6 | 2023-10-31 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025092056A1 true WO2025092056A1 (en) | 2025-05-08 |
Family
ID=89680816
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/107859 Pending WO2025092056A1 (en) | 2023-10-31 | 2024-07-26 | Question-and-answer data generation method and apparatus, and computer device and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN117493508A (en) |
| WO (1) | WO2025092056A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120744069A (en) * | 2025-08-28 | 2025-10-03 | 宁波市大数据投资发展有限公司 | Policy knowledge question-answering method and device, electronic equipment and storage medium |
| CN120975247A (en) * | 2025-10-20 | 2025-11-18 | 苏州元脑智能科技有限公司 | A method for constructing synthetic datasets and an electronic device |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117493508A (en) * | 2023-10-31 | 2024-02-02 | 抖音视界有限公司 | Question-answer data generation method and device, computer equipment and storage medium |
| CN118939766B (en) * | 2024-07-18 | 2025-05-23 | 北京深势科技有限公司 | Processing method and device for adjusting answer text based on self-adaptive retrieval enhancement mechanism |
| CN119673350B (en) * | 2024-11-22 | 2025-12-16 | 清华大学 | Methods, devices, equipment, storage media, and program products for generating medical record reports. |
| CN119474278B (en) * | 2025-01-15 | 2025-04-01 | 杭州华策影视科技有限公司 | Question-answering method, system, computer equipment and storage medium based on large model |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110532369A (en) * | 2019-09-04 | 2019-12-03 | 腾讯科技(深圳)有限公司 | A kind of generation method of question and answer pair, device and server |
| CN114817478A (en) * | 2022-05-13 | 2022-07-29 | 润联软件系统(深圳)有限公司 | Text-based question answering method, device, computer equipment and storage medium |
| CN115114416A (en) * | 2021-03-23 | 2022-09-27 | 阿里巴巴新加坡控股有限公司 | A question-answer pair generation method, device, electronic device and computer storage medium |
| CN117493508A (en) * | 2023-10-31 | 2024-02-02 | 抖音视界有限公司 | Question-answer data generation method and device, computer equipment and storage medium |
-
2023
- 2023-10-31 CN CN202311434678.6A patent/CN117493508A/en active Pending
-
2024
- 2024-07-26 WO PCT/CN2024/107859 patent/WO2025092056A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN117493508A (en) | 2024-02-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| AU2020321751B2 (en) | Neural network system for text classification | |
| WO2025092056A1 (en) | Question-and-answer data generation method and apparatus, and computer device and storage medium | |
| CN119398092B (en) | Construction method and device of multi-group data intelligent agent | |
| US9535980B2 (en) | NLP duration and duration range comparison methodology using similarity weighting | |
| EP3968245A1 (en) | Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus | |
| EP3965024B1 (en) | Automatically labeling functional blocks in pipelines of existing machine learning projects in a corpus adaptable for use in new machine learning projects | |
| EP4575822A1 (en) | Data source mapper for enhanced data retrieval | |
| Li et al. | Machine learning for requirements engineering (ML4RE): A systematic literature review complemented by practitioners’ voices from Stack Overflow | |
| CN118626575A (en) | Weather query system and weather query method | |
| US11403304B2 (en) | Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects | |
| CN113934450B (en) | Method, apparatus, computer device and medium for generating annotation information | |
| CN113326348A (en) | Blog quality evaluation method and tool | |
| CN119917671A (en) | Interview simulation method, system, device and storage medium based on knowledge graph | |
| CN116028620B (en) | Method and system for generating patent abstract based on multi-task feature cooperation | |
| CN118966343A (en) | Question and answer knowledge base construction method, device, equipment and storage medium | |
| Li et al. | IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis | |
| CN117056545A (en) | Question data generation method, content acquisition method and device | |
| CN114997330A (en) | Method, device, terminal device and storage medium for constructing defect checking model | |
| Alreshedy et al. | Predicting the programming language of questions and snippets of StackOverflow using natural language processing | |
| KR102841252B1 (en) | Server and Method for Operating Major-Specific AI Tutor Based on Adaptive Data Retraining and Multi-Level Embedding | |
| CN120067682B (en) | Fine tuning method, device, terminal equipment and medium of government affair question-answering system | |
| CN117874231B (en) | Text classification method, device, and electronic device | |
| CN119003782B (en) | A method and system for checking duplicate questions in a computer-based examination question bank | |
| US20250307297A1 (en) | Document Correlation Optimizer | |
| JP2025098974A (en) | System and method for idealization of research plan using large-scale language model agent oriented architecture |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24884063 Country of ref document: EP Kind code of ref document: A1 |