
CN120670565A - Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization - Google Patents

Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization

Info

Publication number
CN120670565A
CN120670565A
Authority
CN
China
Prior art keywords
search
retrieval
semantic
sentence
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511140546.1A
Other languages
Chinese (zh)
Other versions
CN120670565B (en)
Inventor
陈家树
严晨
于海军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Anshuo Enterprise Credit Reporting Service Co ltd
Original Assignee
Shanghai Anshuo Enterprise Credit Reporting Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Anshuo Enterprise Credit Reporting Service Co ltd filed Critical Shanghai Anshuo Enterprise Credit Reporting Service Co ltd
Priority to CN202511140546.1A priority Critical patent/CN120670565B/en
Publication of CN120670565A publication Critical patent/CN120670565A/en
Application granted granted Critical
Publication of CN120670565B publication Critical patent/CN120670565B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge retrieval candidate library generation method and system based on incremental pre-training optimization, relating to the technical field of information retrieval. The method comprises: collecting a first retrieval sentence and analyzing it to obtain a retrieval behavior feature vector; calculating a normalized co-occurrence frequency; constructing a mapping function relation to obtain the matching incremental pre-training stage number; performing multi-stage expansion of the first retrieval sentence with a sentence incremental optimization model to output an expanded retrieval sentence; and generating the knowledge retrieval candidate library from the expanded retrieval sentence. The invention solves the technical problems in electric digital data processing that conventional knowledge retrieval candidate library generation methods cannot dynamically adjust the incremental pre-training stage number based on user retrieval behavior characteristics, causing insufficient semantic coverage, wasted computing resources and inaccurate candidate library generation, and achieves the technical effects of expanded semantic coverage, optimal allocation of computing resources and accurate candidate library generation.

Description

Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization
Technical Field
The invention relates to the technical field of information retrieval, in particular to a knowledge retrieval candidate library generation method and system based on incremental pre-training optimization.
Background
In electric digital data processing, knowledge retrieval candidate library generation is important for accurate information acquisition. In the prior art, a candidate library is generated by expanding retrieval sentences with a fixed pre-training model, which works in stable retrieval scenarios. However, as retrieval requirements diversify and accuracy demands rise, the conventional method reveals its limitations in electric digital data processing applications. Because it cannot dynamically adjust the incremental pre-training stage number according to user retrieval behavior characteristics, the candidate library suffers from insufficient semantic coverage and wasted computing resources, the generated candidate library is inaccurate, and the requirements of accurate knowledge retrieval evaluation and efficient application are difficult to meet.
Disclosure of Invention
The application provides a knowledge retrieval candidate library generation method and system based on incremental pre-training optimization, to solve the technical problems that, in electric digital data processing, conventional knowledge retrieval candidate library generation methods cannot dynamically adjust the incremental pre-training stage number based on user retrieval behavior characteristics, causing insufficient semantic coverage, wasted computing resources and inaccurate candidate library generation.
A first aspect of the application provides a knowledge retrieval candidate library generation method based on incremental pre-training optimization, comprising: collecting a first retrieval sentence of the current user and invoking a historical retrieval sentence behavior library to analyze the first retrieval sentence and obtain a retrieval behavior feature vector; calculating a normalized co-occurrence frequency from the retrieval behavior feature vector; constructing a mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage-number samples, and using the mapping function relation to obtain the matching incremental pre-training stage number corresponding to the normalized co-occurrence frequency; and pre-training a sentence incremental optimization model that performs multi-stage expansion of the first retrieval sentence with the matching incremental pre-training stage number, outputs an expanded retrieval sentence, and generates a knowledge retrieval candidate library from the expanded retrieval sentence.
A second aspect of the application provides a knowledge retrieval candidate library generation system based on incremental pre-training optimization, comprising a retrieval behavior feature vector acquisition module, a normalized co-occurrence frequency acquisition module, a mapping function relation construction module and a knowledge retrieval candidate library construction module. The retrieval behavior feature vector acquisition module collects a first retrieval sentence of the current user and invokes a historical retrieval sentence behavior library to analyze the first retrieval sentence and obtain a retrieval behavior feature vector; the normalized co-occurrence frequency acquisition module calculates a normalized co-occurrence frequency from the retrieval behavior feature vector; the mapping function relation construction module constructs a mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage-number samples and uses it to obtain the matching incremental pre-training stage number corresponding to the normalized co-occurrence frequency; the knowledge retrieval candidate library construction module pre-trains a sentence incremental optimization model, which performs multi-stage expansion of the first retrieval sentence with the matching incremental pre-training stage number and outputs an expanded retrieval sentence, from which the knowledge retrieval candidate library is generated.
One or more technical schemes provided by the application have at least the following technical effects or advantages:
According to the application, by collecting the first retrieval sentence of the current user, invoking the historical retrieval sentence behavior library to obtain the retrieval behavior feature vector, calculating the normalized co-occurrence frequency, constructing the mapping function relation between normalized co-occurrence frequency and incremental pre-training stage number, performing multi-stage expansion with the sentence incremental optimization model using the matching incremental pre-training stage number, outputting the expanded retrieval sentence and generating the knowledge retrieval candidate library, accurate expansion of retrieval semantics and efficient candidate library generation are realized, the accuracy and efficiency of knowledge retrieval are improved, and the technical effects of expanded semantic coverage, optimal allocation of computing resources and accurate candidate library generation are achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a knowledge search candidate library generation method based on incremental pre-training optimization according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a knowledge retrieval candidate library generation system based on incremental pre-training optimization according to an embodiment of the present application.
Reference numerals: retrieval behavior feature vector acquisition module 1; normalized co-occurrence frequency acquisition module 2; mapping function relation construction module 3; knowledge retrieval candidate library construction module 4.
Detailed Description
The application provides a knowledge retrieval candidate library generation method and system based on incremental pre-training optimization, to solve the technical problems that, in electric digital data processing, conventional knowledge retrieval candidate library generation methods cannot dynamically adjust the incremental pre-training stage number based on user retrieval behavior characteristics, causing insufficient semantic coverage, wasted computing resources and inaccurate candidate library generation.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, the terms "first," "second," and the like in the description of the present application and the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
In a first embodiment, as shown in fig. 1, a method for generating a knowledge search candidate base based on incremental pre-training optimization, where the method includes:
Step A100: collect a first retrieval sentence of the current user, and invoke a historical retrieval sentence behavior library to analyze the first retrieval sentence and obtain a retrieval behavior feature vector.
In the embodiment of the application, the first retrieval sentence is the pending retrieval sentence input by the current user, used to trigger the knowledge retrieval candidate library generation flow. The historical retrieval sentence behavior library is a database storing the user's historical retrieval behavior data, comprising the valid historical retrieval sentences and their related features within a preset period, used to analyze the retrieval behavior characteristics of the first retrieval sentence.
Specifically, first, the original sentence currently input by the user to trigger knowledge retrieval is acquired. This sentence is the input starting point of the whole knowledge retrieval candidate library generation flow; it is actively entered by the user through an interactive interface (such as a search box), and its content may comprise keywords, phrases or complete sentences, for example "applications of artificial intelligence in a technical field". The sentence is captured in real time through the interface, providing the raw data basis for subsequently invoking the historical retrieval behavior library for feature analysis, generating expanded retrieval sentences and constructing the candidate library.
Next, a preset period is obtained, the valid historical retrieval sentence behaviors within that period are extracted from the historical retrieval sentence behavior library, and the first retrieval sentence is analyzed to obtain the retrieval behavior feature vector; the specific steps are described in detail in A110-A120.
By collecting the user's first retrieval sentence and deeply analyzing the historical retrieval behavior characteristics, an accurate depiction of the user's retrieval requirements is realized, laying a data foundation for subsequently matching the incremental pre-training stage number dynamically and generating the knowledge retrieval candidate library efficiently.
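As a concrete illustration of step A100, the following sketch filters a user's history down to the valid records inside the preset period and aggregates the three raw features consumed later in step A210. The `SearchRecord` layout is a hypothetical schema for illustration only; the application does not specify one.

```python
from dataclasses import dataclass

# Hypothetical record layout; the application does not specify a schema.
@dataclass
class SearchRecord:
    sentence: str
    timestamp: float           # Unix seconds
    generation_seconds: float  # time taken to generate the candidate library
    feedback_count: int        # valid feedback events for this retrieval

def valid_history(records, now, window_days=30):
    """Keep only the records that fall inside the preset period."""
    cutoff = now - window_days * 86400
    return [r for r in records if r.timestamp >= cutoff]

def behavior_feature_vector(records):
    """Aggregate the three raw features used later by step A210:
    retrieval frequency, mean generation duration, total valid feedback."""
    if not records:
        return (0, 0.0, 0)
    freq = len(records)
    mean_duration = sum(r.generation_seconds for r in records) / freq
    feedback = sum(r.feedback_count for r in records)
    return (freq, mean_duration, feedback)
```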
Step A200: calculate the normalized co-occurrence frequency from the retrieval behavior feature vector.
In the embodiment of the application, the normalized co-occurrence frequency is a quantized value obtained by mapping the feature fusion index to the target interval.
Optionally, weight calculation is performed on the retrieval behavior feature vector (comprising a retrieval frequency feature vector representing the historical retrieval frequency, a generation duration feature vector representing the time required to generate the retrieval candidate library, and a retrieval feedback index vector representing the amount of feedback on the retrieval candidate library) to obtain a feature fusion index, which is mapped to the target interval to obtain the normalized co-occurrence frequency; the specific steps are described in detail in A210-A220.
Step A300: construct a mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage-number samples, and use the mapping function relation to obtain the matching incremental pre-training stage number corresponding to the normalized co-occurrence frequency.
In the embodiment of the application, the incremental pre-training stage-number samples are a preset set of different stage numbers used to construct the mapping relation with the normalized co-occurrence frequency samples; the optimal stage number corresponding to each co-occurrence frequency is obtained by calculating score values, so that the optimal incremental pre-training stage number is matched dynamically according to the user's retrieval behavior characteristics.
In one embodiment of the application, first, the score value of each combination of a normalized co-occurrence frequency sample and an incremental pre-training stage-number sample is calculated (the information entropy mean of the candidate library generation quality score and the computing resource score), the optimal incremental pre-training stage number for each co-occurrence frequency sample is obtained, and the functional relationship is fitted to construct the mapping relation between the two; the specific steps are described in detail in A310-A330.
Obtaining the matching incremental pre-training stage number through the mapping function relation proceeds as follows. First, based on the constructed normalized co-occurrence frequency samples, such as {0.1, 0.3, 0.5, 0.7, 0.9}, and the incremental pre-training stage-number samples, such as {1-5 stages}, the optimal stage number corresponding to each frequency is determined by calculating the score value of each frequency and stage-number combination, namely the information entropy mean of the candidate library generation quality score and the computing resource score.
Next, a constructed exponentially decreasing function model (e.g., y = α·e^(-β·x) + γ) fits the mapping relation between frequency and stage number, and its parameters are optimized by the least squares method so that the mean square error between the model output and the optimal solutions converges below 0.1. When a user's normalized co-occurrence frequency of 0.6 is input, the model calculates y = 5·e^(-2.3×0.6) + 0.5 ≈ 5×0.252 + 0.5 ≈ 1.76, rounded to 2 stages; compared with the conventional fixed 5 stages, this reduces computation time while keeping the semantic query hit rate above 85%.
By constructing a dynamic mapping function based on user retrieval behavior characteristics, accurate matching between the normalized co-occurrence frequency and the incremental pre-training stage number is realized, solving the problems of insufficient semantic coverage and resource waste caused by fixed stage numbers in the prior art, and providing a quantified stage-number matching scheme for efficiently generating the knowledge retrieval candidate library.
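The stage-number lookup described above can be sketched as follows, using the fitted parameters α = 5, β = 2.3, γ = 0.5 from the worked example. Rounding to an integer is stated in the text; clamping to the 1-5 sample range is an added assumption.

```python
import math

# Fitted parameters from the worked example: y = 5·e^(-2.3x) + 0.5.
ALPHA, BETA, GAMMA = 5.0, 2.3, 0.5
MIN_STAGES, MAX_STAGES = 1, 5  # sample range of stage numbers (assumed bounds)

def matched_stages(freq):
    """Map a normalized co-occurrence frequency in [0, 1] to an
    incremental pre-training stage number: evaluate the exponentially
    decreasing mapping, round to an integer, clamp to the sample range."""
    y = ALPHA * math.exp(-BETA * freq) + GAMMA
    return max(MIN_STAGES, min(MAX_STAGES, round(y)))
```

For input 0.6 this gives y ≈ 1.76 and therefore 2 stages, matching the worked example.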
Step A400: pre-train a sentence incremental optimization model; the sentence incremental optimization model performs multi-stage expansion of the first retrieval sentence with the matching incremental pre-training stage number, outputs an expanded retrieval sentence, and generates the knowledge retrieval candidate library from the expanded retrieval sentence.
In the embodiment of the application, the expanded retrieval sentence refers to the related retrieval sentences output after the first retrieval sentence is input into the sentence incremental optimization model, its retrieval semantic vector is extracted, and L rounds of expansion iteration and semantic similarity calculation are performed according to the matching incremental pre-training stage number.
Specifically, the input of the pre-trained sentence incremental optimization model is the initial retrieval sentence and the matching incremental pre-training stage number L, and the output is the expanded retrieval sentence set generated after L rounds of semantic expansion iteration, used to construct the knowledge retrieval candidate library. The model is built on a pre-trained language model (such as BERT or T5). The encoder first extracts the semantic vector of the initial retrieval sentence; a semantic expansion module and a multi-task learning mechanism are then integrated. The semantic expansion module performs L rounds of iterative computation: each round generates a semantic neighborhood and a synonymous semantic neighborhood via cosine similarity or contrastive learning, selects related entities to generate retrieval sentences, and merges redundancy. The multi-task learning module integrates the quality score value and the resource score value, dynamically adjusting weights through a gating network such as MMoE to balance retrieval quality against computing resource consumption. The training process adopts an incremental pre-training strategy: new and old data are combined while the learning rate is adjusted (e.g., cosine annealing), original data are mixed in to prevent forgetting, and contrastive learning (e.g., SimCSE, DiffCSE) is introduced to optimize the semantic vector representation, pulling similar semantics together and pushing dissimilar semantics apart through the InfoNCE loss function. In addition, the model integrates the information entropy mean through dynamic weight optimization (e.g., by gradient or training progress), precisely controlling candidate library quality while improving retrieval coverage, and finally outputs an expanded retrieval sentence set rich in semantic dimensions.
Then, after the first retrieval sentence is input into the sentence incremental optimization model and its retrieval semantic vector is extracted, L rounds of expansion iteration are performed according to the matching incremental pre-training stage number to obtain related retrieval sentences; the semantic vectors of the related retrieval sentences are extracted for semantic similarity calculation, and sentences whose similarity exceeds a preset value are merged and output as the expanded retrieval sentence. The specific steps are described in detail in A410-A430.
Finally, based on the expanded retrieval sentences, a knowledge retrieval candidate library containing rich semantic dimensions is constructed by integrating semantically related multi-element retrieval expressions; through dynamic expansion and semantic optimization, this process realizes systematic expansion of retrieval coverage and accurate improvement of candidate library quality.
By dynamically matching the incremental pre-training stage number based on user retrieval behavior, combining a multi-stage expansion mechanism over semantic and synonymous neighborhoods, and merging and optimizing by semantic similarity, accurate expansion of the retrieval sentences is realized, the problems of insufficient semantic coverage and resource waste caused by fixed stage numbers in the prior art are solved, and a dynamically optimized model scheme is provided for efficiently generating the knowledge retrieval candidate library.
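A minimal sketch of the L-round expansion loop described above, with a toy embedding table and neighborhood lookup standing in for the pre-trained encoder and the per-round semantic-neighborhood generation (both stand-ins are hypothetical; the application's model is BERT/T5-based):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy stand-ins (hypothetical) for the encoder's semantic vectors and
# the per-round semantic-neighborhood generation.
EMBED = {
    "ai": (1.0, 0.0),
    "machine learning": (0.9, 0.1),
    "cooking": (0.0, 1.0),
}
NEIGHBORS = {"ai": ["machine learning", "cooking"]}

def expand(seed, embed, neighbors, stages, threshold=0.7):
    """Run `stages` rounds of expansion: each round proposes candidates
    from the neighborhood lookup and keeps those whose embedding is
    similar enough to the seed; the set merges duplicate sentences."""
    kept = {seed}
    frontier = [seed]
    for _ in range(stages):
        nxt = []
        for s in frontier:
            for cand in neighbors.get(s, []):
                if cand not in kept and cosine(embed[seed], embed[cand]) >= threshold:
                    kept.add(cand)
                    nxt.append(cand)
        frontier = nxt
    return kept
```

With the toy data, "machine learning" survives the similarity threshold while "cooking" is discarded, mirroring the merge-by-similarity behavior of A410-A430.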
Further, step A200 in the method provided by the embodiment of the present application includes:
A210: perform weight calculation on the retrieval behavior feature vector to obtain a feature fusion index, and map the feature fusion index to the target interval to obtain the normalized co-occurrence frequency.
A220: the retrieval behavior feature vector comprises a retrieval frequency feature vector representing the historical retrieval frequency, a generation duration feature vector representing the time required to generate the retrieval candidate library, and a retrieval feedback index vector representing the amount of feedback on the generated retrieval candidate library.
Specifically, first, the three-dimensional feature vector comprising the retrieval frequency feature vector, generation duration feature vector and retrieval feedback index vector is extracted; for example, 12 retrievals in roughly the past 30 days, an average generation time of 3.5 seconds, and 9 valid feedbacks. Second, linear weighted calculation is performed with a preset weight system (e.g., retrieval frequency 0.4, generation duration 0.3, retrieval feedback 0.3; the specific values are determined by those skilled in the art according to actual conditions). The feature fusion index is calculated as: fusion index = 0.4 × frequency normalized value + 0.3 × duration normalized value + 0.3 × feedback normalized value.
The normalized values of retrieval frequency, generation duration and retrieval feedback index are obtained as follows. First, for the retrieval frequency feature vector, the actual retrieval count within the preset period (e.g., the past 30 days) is mapped to the [0,1] interval by min-max normalization; for example, when the minimum retrieval count over 30 days is 0 and the maximum is 15, the normalized value of 12 retrievals is (12-0)/(15-0) = 0.8. Second, for the generation duration feature vector, since shorter times indicate higher efficiency, the duration is first converted into an efficiency index (e.g., 1/duration) and then normalized against the minimum and maximum durations within the preset period. If the minimum duration is 2 seconds, the maximum 5 seconds and the average 3.5 seconds, the corresponding efficiency value is 1/3.5 ≈ 0.286, and the normalized value is (0.286 - 1/5)/(1/2 - 1/5) = (0.286 - 0.2)/0.3 ≈ 0.086/0.3 ≈ 0.29; the example data may be adjusted according to actual conditions while ensuring normalized values fall in [0,1]. Finally, for the retrieval feedback index vector, min-max normalization is applied directly to the valid feedback count within the preset period; if the minimum feedback count is 0 and the maximum is 10, the normalized value of 9 valid feedbacks is 9/10 = 0.9. The normalization takes the extreme values within the preset period as references, unifying the numerical ranges of the feature vectors and providing normalized input for the subsequent weight calculation.
Taking the above data as an example, with the frequency normalized to 0.8, the duration normalized to 0.29 and the feedback normalized to 0.9, the feature fusion index is calculated to be 0.677. Since the weights sum to 1 and each normalized value is at most 1, the fusion index naturally lies within the [0,1] target interval, so the linear mapping is the identity and the value 0.677 is used directly as the normalized co-occurrence frequency, representing the comprehensive frequency characteristic of the user's retrieval behavior.
By performing multidimensional weight calculation and target interval mapping on the retrieval behavior feature vector, accurate quantitative characterization of the user's retrieval behavior is realized, providing a scientific numerical basis for subsequently constructing the dynamic matching relation between behavior indices and the incremental pre-training stage number.
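The normalization and weighted fusion of A210-A220 can be reproduced directly; the sketch below uses the worked numbers from the text (12 retrievals in a 0-15 range, 3.5 s average duration in a 2-5 s range converted to an efficiency value, 9 feedbacks in a 0-10 range):

```python
def min_max(value, lo, hi):
    """Min-max normalization onto [0, 1]."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def fusion_index(freq_n, dur_n, fb_n, w=(0.4, 0.3, 0.3)):
    """Linear weighted fusion; the weights sum to 1, so the result
    already lies in [0, 1] and is used directly as the normalized
    co-occurrence frequency."""
    return w[0] * freq_n + w[1] * dur_n + w[2] * fb_n

# Worked numbers from the text.
freq_n = min_max(12, 0, 15)                # 12 retrievals in a 0-15 range -> 0.8
dur_n = min_max(1 / 3.5, 1 / 5, 1 / 2)     # efficiency 1/t in [1/5, 1/2] -> ~0.29
fb_n = min_max(9, 0, 10)                   # 9 feedbacks in a 0-10 range -> 0.9
index = fusion_index(freq_n, dur_n, fb_n)  # ~0.676, matching the text's ~0.677
```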
Further, step A300 in the method provided by the embodiment of the present application includes:
A310: calculate the score value of each co-occurrence frequency sample in the normalized co-occurrence frequency samples combined with each incremental pre-training stage number in the stage-number samples, and output score value samples.
A320: the score value is the information entropy mean of the candidate library generation quality score value and the required computing resource score value.
A330: obtain the optimal incremental pre-training stage number of each co-occurrence frequency sample in the normalized co-occurrence frequency samples according to the score value samples, fit the functional relation between each co-occurrence frequency sample and its corresponding optimal incremental pre-training stage number, and output the mapping function relation.
In the embodiment of the application, the co-occurrence frequency samples are a set of normalized co-occurrence frequency values used to construct the mapping function relation with the incremental pre-training stage-number samples. The candidate library is the knowledge retrieval candidate library generated from the expanded retrieval sentences, which the sentence incremental optimization model outputs after performing multi-stage expansion of the first retrieval sentence with the matching incremental pre-training stage number.
Optionally, in knowledge retrieval, the prior art generally adopts a fixed incremental pre-training stage number when constructing the retrieval candidate library, for example uniformly set to 3 or 5 stages, without dynamic adjustment based on user retrieval behavior characteristics. As a result, in low-frequency scenarios (normalized co-occurrence frequency below, e.g., 0.3) an insufficient stage number increases the semantic coverage loss rate, while in high-frequency scenarios (above, e.g., 0.7) an excessive stage number wastes computing resources.
To address these problems, first, a normalized co-occurrence frequency sample set, such as {0.1, 0.3, 0.5, 0.7, 0.9}, and an incremental pre-training stage-number sample set, such as {1, 2, 3, 4, 5} stages, are established. Taking normalized co-occurrence frequency 0.1 and stage number 5 as an example, when calculating the score value, the candidate library generation quality score value is first obtained (the specific steps are described in detail in A321-A322): for instance, a semantic query hit rate of 85%, a consistency annotation score of 4.2/5 and a weighted semantic relevance of 0.7 give a composite quality score of 0.85×0.4 + 0.84×0.3 + 0.7×0.3 = 0.802. The computing resource score value, e.g., for 60% GPU occupancy and 12 seconds elapsed, is normalized to 0.4. The information entropies of the two are H(quality) = -0.802·ln 0.802 - 0.198·ln 0.198 ≈ 0.50 and H(resource) = -0.4·ln 0.4 - 0.6·ln 0.6 ≈ 0.67, so the information entropy mean is (0.50 + 0.67)/2 ≈ 0.58, i.e., the score of this combination is about 0.58. In general, for a probability distribution {p_i}, the information entropy is H = -Σ_i p_i·ln p_i. In the binary distribution scenario here, where p is the proportion corresponding to the quality score value or the computing resource score value in its distribution, the formula reduces to H(p) = -p·ln p - (1-p)·ln(1-p).
Then, all combinations of normalized co-occurrence frequency and incremental pre-training stage number are traversed to obtain a score value sample matrix. For example, when the co-occurrence frequency is 0.7, the combination with stage number 2 (quality score 0.92, resource score 0.85) obtains the highest score value at that frequency, so stage number 2 is determined to be the optimal solution.
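The entropy-mean scoring above can be sketched directly, treating each score value as the parameter of a two-outcome distribution as the text describes:

```python
import math

def binary_entropy(p):
    """Shannon entropy (natural log) of a two-outcome distribution
    with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def combination_score(quality_score, resource_score):
    """Score of one (frequency, stage-number) combination: the mean of
    the information entropies of the two score values."""
    return (binary_entropy(quality_score) + binary_entropy(resource_score)) / 2
```

For example, `binary_entropy(0.4)` ≈ 0.67, reproducing the resource-score entropy in the worked example.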
Based on all the optimal solution data, an exponentially decreasing function model is fitted, giving a mapping in which frequency 0.1 corresponds to 5 stages and 0.9 corresponds to 1 stage, as shown in Table 1; the construction of the exponentially decreasing function model is described in detail in A331.
By quantitatively analyzing the combination scores of the normalized co-occurrence frequency and the incremental pre-training stage number, and combining the information entropy mean evaluation mechanism with the exponential fitting model, dynamic matching between retrieval behavior characteristics and pre-training resources is realized, solving the problems of insufficient semantic coverage and resource waste caused by fixed stage numbers in the prior art and providing a scientific mapping basis for accurately generating the knowledge retrieval candidate library.
TABLE 1 Normalized co-occurrence frequency sample and incremental pre-training stage mapping table
Normalized co-occurrence frequency: 0.1, 0.3, 0.5, 0.7, 0.9
Matching incremental pre-training stages: 5, 4, 3, 2, 1
Further, step a330 in the method provided by the embodiment of the present application includes:
And A331, fitting a functional relation between each co-occurrence frequency sample in the normalized co-occurrence frequency samples and a corresponding incremental pre-training series optimal solution through a fitting model, wherein the fitting model is an exponential decreasing function model.
Specifically, the mapping function relation between the co-occurrence frequency samples and the corresponding incremental pre-training stage optimal solutions is fitted by adopting an exponentially decreasing function model of the form y = α·e^(-β·x) + γ, wherein α, β and γ are fitting parameters, x is the normalized co-occurrence frequency, and y is the incremental pre-training stage number. Firstly, based on the constructed score value samples, the stage optimal solutions corresponding to different normalized co-occurrence frequencies are obtained, and the sample points (0.1, 5), (0.3, 4), (0.5, 3), (0.7, 2) and (0.9, 1) are collected. Secondly, parameter optimization of the exponentially decreasing function is carried out by the least square method, iteratively adjusting the values of α, β and γ to minimize the mean square error over the sample points. Assuming initial parameters α=4, β=2, γ=1, the error between the predicted stage number and the actual optimal solution is calculated: for example, at x=0.1 the prediction is y = 4·e^(-2×0.1) + 1 ≈ 4×0.8187 + 1 ≈ 4.275, an error of 0.725 against the actual optimal solution 5. The parameters are then updated by gradient descent to α=5, β=2.3, γ=0.5, where at x=0.1, y = 5·e^(-2.3×0.1) + 0.5 ≈ 5×0.7945 + 0.5 ≈ 4.472, reducing the error to 0.528; finally the overall mean square error converges to below 0.1.
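A hedged sketch of the fit follows. Instead of the gradient descent described in the text, it scans β over a grid and, for each β, solves for α and γ in closed form (the model y = α·e^(-β·x) + γ is linear in α and γ, so ordinary least squares applies); the grid bounds are an assumption of this sketch.

```python
import math

xs = [0.1, 0.3, 0.5, 0.7, 0.9]   # normalized co-occurrence frequency samples
ys = [5, 4, 3, 2, 1]             # incremental pre-training stage optimal solutions

def fit_exp(xs, ys, b_grid):
    """Fit y = a*exp(-b*x) + c: scan b, solve a and c by normal equations."""
    n, best = len(xs), None
    for b in b_grid:
        u = [math.exp(-b * x) for x in xs]
        su, suu = sum(u), sum(t * t for t in u)
        sy, suy = sum(ys), sum(t * y for t, y in zip(u, ys))
        det = n * suu - su * su
        if abs(det) < 1e-12:
            continue
        a = (n * suy - su * sy) / det    # least-squares solution for a, c
        c = (suu * sy - su * suy) / det
        mse = sum((a * t + c - y) ** 2 for t, y in zip(u, ys)) / n
        if best is None or mse < best[0]:
            best = (mse, a, b, c)
    return best

mse, a, b, c = fit_exp(xs, ys, [i / 10 for i in range(1, 51)])
stage = round(a * math.exp(-b * 0.9) + c)   # query the fitted mapping at x = 0.9
```

Because the sample points happen to lie on a line, the scan settles on a small β and the mean square error ends well below the 0.1 convergence target quoted above.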
The input of the model is the normalized co-occurrence frequency in the range [0, 1]; the output is the matching incremental pre-training stage number, a positive integer, and the exponentially decreasing characteristic realizes the dynamic mapping of higher frequency to fewer stages. For example, when x=0.2 is input, the model outputs y = 5·e^(-2.3×0.2) + 0.5 ≈ 5×0.6313 + 0.5 ≈ 3.656, rounded to 4 stages, which can improve the semantic query hit rate compared with the fixed 3 stages of the prior art; when x=0.8, it outputs y = 5·e^(-2.3×0.8) + 0.5 ≈ 5×0.1588 + 0.5 ≈ 1.294, rounded to 1 stage, which can reduce computing resource consumption compared with fixed 5 stages.
By constructing an exponential decreasing function model and carrying out parameter fitting based on a scoring optimal solution sample, nonlinear dynamic matching of normalized co-occurrence frequency and incremental pre-training series is realized, the problem of insufficient semantic expansion or resource waste caused by fixed series in the prior art is solved, and a quantized mapping basis is provided for efficient generation of a knowledge retrieval candidate library.
Further, step a400 in the method provided by the embodiment of the present application includes:
And A410, inputting the first search sentence to the sentence increment optimization model, and extracting a search semantic vector.
And A420, carrying out L rounds of expansion iteration on the search semantic vector according to the matching increment pre-training series to obtain L rounds of related search sentences, wherein L is the number of the matching increment pre-training series.
And A430, extracting semantic vectors of the L rounds of related search sentences to perform semantic similarity calculation, combining related search sentences with the semantic similarity being larger than a preset semantic similarity, and outputting the processed related search sentences as expanded search sentences.
Specifically, first, a first search sentence input by a user, such as "application of artificial intelligence in science and technology images", is input into the model, and a search semantic vector containing the keywords is extracted through a semantic analysis module. The process may be implemented based on a Transformer-architecture encoder such as BERT, which captures the semantic associations in the sentence through a multi-layer self-attention mechanism and generates a 768-dimensional semantic vector.
Secondly, carrying out L rounds of expansion iteration on the search semantic vector according to the matching increment pre-training level, including calculating a round of semantic neighborhood of the search semantic vector, selecting a round of semantic related entities from the round of semantic neighborhood to generate a round of related search sentences, and similarly outputting L rounds of related search sentences, wherein the specific steps are described in detail in A421-A422.
Furthermore, the expanded iteration method further comprises the steps of obtaining a synonymous semantic neighborhood of the search semantic vector, selecting synonymous semantic related entities from the synonymous semantic neighborhood to generate synonymous related search sentences, and updating the synonymous semantic related entities into L rounds of related search sentences, wherein the specific steps are described in detail in A423-A425.
Each iteration generates a relevant search statement. For example, after 4 rounds of expansion, multiple related sentences such as "application of deep learning in scientific and technological image AI judgment", "combination of scientific image processing and AI technology" can be generated, and semantic dimensions such as synonymous substitution, field subdivision and the like of the original sentences are covered.
Then, the semantic vectors of all L rounds of expanded sentences are extracted, and the semantic similarity between every two sentences is calculated by a cosine similarity algorithm. The similarity threshold is preset to 0.7, and redundant sentences with similarity greater than the threshold are merged. For example, 5 of 20 expanded sentences may be merged due to highly similar semantics (e.g., "AI technology" and "artificial intelligence technology"), and finally 15 expanded search sentences are output. The merging mechanism can improve the semantic query hit rate in a low-frequency search scene, reduce computing resource consumption in a high-frequency search scene (such as a normalized co-occurrence frequency of 0.8), and, by dynamically adjusting the expansion stage number and the semantic similarity merging strategy, remarkably improve search efficiency in a specific field while maintaining the generalization capability of the model.
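The similarity-merging step can be sketched as a greedy de-duplication pass. The toy two-dimensional vectors below stand in for the 768-dimensional BERT embeddings; only the 0.7 threshold comes from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_similar(sentences, vectors, threshold=0.7):
    """Greedily keep a sentence only if its similarity to every already-kept
    sentence stays at or below the threshold; near-duplicates are merged."""
    kept, kept_vecs = [], []
    for s, v in zip(sentences, vectors):
        if all(cosine(v, kv) <= threshold for kv in kept_vecs):
            kept.append(s)
            kept_vecs.append(v)
    return kept

sentences = ["AI technology in imagery",
             "artificial intelligence technology in imagery",
             "cooking"]
vectors = [(1.0, 0.0), (0.95, 0.05), (0.0, 1.0)]
kept = merge_similar(sentences, vectors)  # the near-duplicate second sentence is merged away
```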
By constructing a dynamic expansion mechanism based on the user retrieval behavior characteristics and combining semantic vector extraction, multistage iteration expansion and similarity merging optimization, the accurate expansion of the retrieval sentences is realized. Compared with a static expansion mode with a fixed progression in the prior art, the current scheme is improved in semantic coverage comprehensiveness and resource utilization efficiency, and a dynamic optimized model scheme is provided for efficiently generating a knowledge retrieval candidate library.
Further, step a420 in the method provided by the embodiment of the present application includes:
a421, calculating a round of semantic neighborhood of the search semantic vector.
A422, selecting a round of semantic related entities from the round of semantic neighborhood of the search semantic vector to generate a round of related search sentences, and likewise outputting the L rounds of related search sentences obtained by performing L rounds of expansion iteration on the search semantic vector.
In the embodiment of the application, the semantic neighborhood is the adjacent area of the search semantic vector in the semantic space and comprises an entity set related to the semantic vector. The semantic related entity is an entity which is selected from semantic neighbors of the search semantic vector and has semantic relevance with the original search statement and is used for generating a related search statement.
Specifically, first, a first search sentence input by a user, such as "application of artificial intelligence in science and technology image", is input into a sentence increment optimization model, and a search semantic vector with a dimension of 768 is extracted through a BERT-based encoder. The process captures semantic associations in sentences using a multi-layer self-attention mechanism, such as identifying domain associations between "artificial intelligence" and "technical imagery". Subsequently, the search semantic vector is iterated for L rounds of expansion according to the number of matching incremental pre-training stages L determined as described above, e.g., 5 stages determined by the mapping function.
In each iteration, a round of semantic neighborhood of the current semantic vector is first computed. This process is based on cosine similarity algorithm, searching the pre-trained semantic space for neighboring vectors with similarity to the current vector greater than 0.6. For example, the first round of expansion may retrieve neighboring entities such as "deep learning" and "scientific image processing" from the knowledge-graph, forming a semantic neighborhood containing 20 nodes.
And then, selecting a round of semantic related entities from the neighborhood, and generating a round of related search sentences through template filling or sequence generating models. For example, a plurality of candidate sentences such as "application of deep learning in technology image AI judgment" are generated by combining "deep learning" and "technology image AI diagnosis". Repeating the steps for L times to finally generate an L-round expansion statement set.
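Steps A421-A422 can be sketched as the loop below. The `space` dictionary is a hypothetical stand-in for the pre-trained semantic space / knowledge graph, the vectors are toy two-dimensional embeddings, and the sentence template is illustrative; only the 0.6 neighborhood threshold and the one-entity-per-round structure come from the text.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand(query_vec, space, rounds, threshold=0.6):
    """Per round: form the neighborhood above the threshold, take its nearest
    entity, fill a template, then iterate from the selected entity's vector."""
    current, sentences = query_vec, []
    space = dict(space)
    for _ in range(rounds):
        hood = {n: v for n, v in space.items() if cosine(current, v) > threshold}
        if not hood:
            break
        name = max(hood, key=lambda n: cosine(current, hood[n]))
        sentences.append(f"application of {name} in technology imagery")
        current = space.pop(name)   # next round expands from the chosen entity
    return sentences

space = {"deep learning": (0.9, 0.1),
         "scientific image processing": (0.8, 0.3),
         "cooking": (0.0, 1.0)}
out = expand((1.0, 0.0), space, rounds=2)
```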
By constructing a dynamic neighborhood expansion mechanism based on semantic vector space and combining cosine similarity calculation, knowledge graph entity retrieval and iterative sentence generation strategies, multidimensional expansion of retrieval semantics is realized, and an intelligent dynamic optimization model is provided for efficiently generating a knowledge retrieval candidate library.
Further, step a420 in the method provided by the embodiment of the present application includes:
A423, obtaining the synonymous semantic neighborhood of the retrieval semantic vector.
And A424, selecting a synonym semantic related entity from the synonym semantic neighborhood of the retrieval semantic vector to generate a synonym related retrieval statement of the retrieval semantic vector.
And A425, updating the synonym related search statement to the L rounds of related search statements.
In one embodiment, first, a pre-trained statement delta optimization model is utilized to obtain synonymous semantic neighborhood of a search semantic vector. Taking the application of the input sentence 'artificial intelligence in the technical image' as an example, the model calculates the semantic vector of the 'technical image' through the BERT word vector model, and searches the synonymous entities with cosine similarity greater than 0.7 in the semantic space, such as 'scientific image', 'image judgment', and the like, so as to form a synonymous semantic neighborhood set comprising a plurality of synonymous entities.
And secondly, 2-3 entities with the strongest semantic relevance to the original sentence (such as "scientific images" and "image analysis") are selected from the neighborhood, and synonymous related search sentences are constructed by a template generation method, such as "application of artificial intelligence in scientific images" and "technological application of AI in image analysis".
And finally, updating the newly generated synonymous statement into a related search statement set obtained by L rounds of expansion iteration, thereby realizing the dynamic expansion of semantic coverage.
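Steps A423-A425 can be sketched as below. The synonym vectors and the sentence template are illustrative assumptions; only the 0.7 similarity threshold and the update-without-duplicates behaviour come from the text.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def add_synonym_expansions(related, qvec, synonym_space, threshold=0.7,
                           template="application of artificial intelligence in {}"):
    """Pick synonyms above the threshold, fill the template, and merge the
    resulting sentences (without duplicates) into the L-round related set."""
    for name, vec in synonym_space.items():
        if cosine(qvec, vec) > threshold:
            sentence = template.format(name)
            if sentence not in related:
                related.append(sentence)
    return related

synonyms = {"scientific imagery": (0.92, 0.08),
            "image analysis": (0.85, 0.2),
            "cooking": (0.0, 1.0)}
related = add_synonym_expansions(["base sentence"], (1.0, 0.0), synonyms)
```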
By acquiring the synonymous semantic neighborhood, generating the context-related statement and updating the expansion set, the problems of low synonymous expansion coverage rate and poor adaptability in the prior art are solved, and an intelligent expansion scheme is provided for improving the comprehensiveness and accuracy of the knowledge retrieval candidate library.
Further, step a320 in the method provided by the embodiment of the present application includes:
A321, acquiring semantic query hit rate, consistency score labeling information and semantic relevance weight of the generated candidate library.
A322, outputting the candidate library generation quality score value according to the calculation results of the semantic query hit rate, the consistency score labeling information and the semantic relevance weight.
Optionally, firstly, the semantic query hit rate is obtained, that is, the candidate library is searched with the query sentences of a preset test set and the ratio of correctly hit queries to total queries is counted; for example, 85 hits among 100 test queries give a hit rate of 85%. Secondly, the consistency score labeling information is collected: a person skilled in the art scores (for example, on a 5-point scale) the semantic consistency between the search results in the candidate library and the query sentences; if the scores of 10 samples are 4, 5, 4, 3, 5, 4, 5, 5, 3 and 4 respectively, the average consistency score is 4.2. Finally, the semantic relevance weight is determined; it is trained based on field expert experience or historical data and characterizes the importance of semantic relevance in the quality assessment, for example, set to 0.3.
In the calculation step, the above indexes are integrated quantitatively. Assuming the semantic query hit rate normalizes to 0.85 (full score 1) and the consistency score normalizes to 0.84 (4.2/5), a weighted sum formula is used in combination with the semantic relevance weight of 0.3: quality score value = hit rate×0.5 + consistency score×0.2 + relevance weight×0.3, i.e., 0.85×0.5 + 0.84×0.2 + 0.3×0.3 = 0.425 + 0.168 + 0.09 = 0.683. The calculation comprehensively considers retrieval accuracy, result consistency and semantic relevance, and compared with the single-index evaluation in the prior art, improves the reliability of the quality score.
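The weighted sum of steps A321-A322 is a one-liner; the weights (0.5, 0.2, 0.3) follow the worked example above and would in practice be tuned per deployment.

```python
def quality_score(hit_rate: float, consistency_5pt: float,
                  relevance_weight: float, w=(0.5, 0.2, 0.3)) -> float:
    """Candidate library generation quality score as a weighted sum of the
    hit rate, the normalized consistency score, and the relevance weight."""
    consistency = consistency_5pt / 5.0      # normalize the 5-point scale
    return hit_rate * w[0] + consistency * w[1] + relevance_weight * w[2]

score = quality_score(0.85, 4.2, 0.3)        # 0.425 + 0.168 + 0.09 = 0.683
```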
The semantic query hit rate, consistency score marking information and semantic relevance weight are acquired through multiple dimensions, and evaluation indexes are integrated based on a scientific calculation model, so that comprehensive quantitative analysis of candidate library generation quality is realized, the problems of single evaluation dimension and inaccurate result in the prior art are solved, and a reliable quality reference basis is provided for optimizing incremental pre-training series matching and improving the retrieval effect.
Further, step a100 in the method provided by the embodiment of the present application includes:
A110, acquiring a preset period.
And A120, extracting effective historical retrieval statement behaviors in the preset period in the historical retrieval statement behavior library, and analyzing the first retrieval statement based on the effective historical retrieval statement behaviors to obtain retrieval behavior feature vectors.
In the embodiment of the application, the preset period is a preset time range and is used for limiting the time span for extracting the historical retrieval behavior data, and the period can be adjusted according to the change characteristic of the retrieval habit of the user. The effective historical search sentence behavior refers to a historical search record which completes a search process and generates effective feedback (such as clicking and collecting search results) in a preset period, and does not contain ineffective search behaviors caused by network faults, operation anomalies and the like.
In one embodiment, a preset period is first acquired; the specific value is adjusted by a person skilled in the art according to the variation characteristics of the user's retrieval habits, here assumed to be about 30 days, a setting based on the short-term continuity of user behavior. Taking the user input "application of machine learning in a recommendation system" as an example, effective historical retrieval behaviors from May 25, 2025 to June 24, 2025 are extracted from the historical retrieval sentence behavior library according to the preset period. The criterion for an effective behavior is that the search flow is completed and a feedback record (such as clicking or collecting) is generated, and invalid requests caused by network faults and the like are excluded.
After the above extraction, the user has 12 retrieval records on similar topics in the past 30 days, the average candidate library generation time is 3.8 seconds, and related results received feedback 10 times. Based on the effective data, the system analyzes the first search sentence and converts the search frequency, generation duration and feedback quantity into the feature vector [12, 3.8, 10]; after standardization, this vector can accurately represent the user's current retrieval behavior pattern.
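The standardization and weighted fusion that turn the raw feature vector into a normalized co-occurrence frequency can be sketched as follows. The reference ranges and fusion weights here are illustrative assumptions, not values given in the text; only the feature vector [12, 3.8, 10] and the mapping to [0, 1] come from the example.

```python
def normalized_cooccurrence(freq, duration, feedback,
                            ranges=((0, 30), (0, 10), (0, 20)),
                            weights=(0.5, 0.2, 0.3)):
    """Clip each feature into an assumed reference range, rescale to [0, 1],
    then fuse with weights summing to 1 so the result stays in [0, 1]."""
    feats = (freq, duration, feedback)
    norm = [min(max((f - lo) / (hi - lo), 0.0), 1.0)
            for f, (lo, hi) in zip(feats, ranges)]
    return sum(w * n for w, n in zip(weights, norm))

x = normalized_cooccurrence(12, 3.8, 10)   # fuse the example vector [12, 3.8, 10]
```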
By setting a preset period and extracting effective historical retrieval behaviors in the period, the accurate capture of the current retrieval behavior characteristics of the user is realized, and the data support with strong timeliness and high reliability is provided for the subsequent normalization co-occurrence frequency calculation and incremental pre-training series matching.
In summary, the knowledge retrieval candidate library generation method based on incremental pre-training optimization provided by the embodiment of the application has the following technical effects:
According to the application, a first search sentence of the current user is collected, and a historical search sentence behavior library is called to analyze the first search sentence to obtain a search behavior feature vector; a normalized co-occurrence frequency is obtained through weight calculation and mapping processing; a mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage samples is constructed to obtain the matching incremental pre-training stage number; and the pre-trained sentence increment optimization model is utilized to carry out multi-stage expansion on the first search sentence to generate expanded search sentences, so that the knowledge search candidate library is accurately generated. The generation result of the knowledge search candidate library is thereby more accurate and efficient, achieving the technical effects of semantic coverage expansion, optimized configuration of computing resources and accurate generation of the candidate library.
In a second embodiment, as shown in fig. 2, based on the same inventive concept as the previous embodiment, the embodiment of the present application provides a knowledge retrieval candidate base generating system based on incremental pretraining optimization, where the system includes:
The retrieval behavior feature vector acquisition module 1 is used for collecting a first retrieval statement of a current user, and calling a historical retrieval statement behavior library to analyze the first retrieval statement to acquire a retrieval behavior feature vector.
The normalized co-occurrence frequency acquisition module 2 is used for calculating the retrieval behavior feature vector to acquire normalized co-occurrence frequency.
The mapping function relation construction module 3 is used for constructing a mapping function relation between the normalized co-occurrence frequency sample and the increment pre-training series sample, and the mapping function relation is utilized to obtain a matching increment pre-training series corresponding to the normalized co-occurrence frequency.
The knowledge search candidate base construction module 4 is used for pre-training a sentence increment optimization model, the sentence increment optimization model adopts the matching increment pre-training stage number to carry out multistage expansion on the first search sentence, an expanded search sentence is output, and a knowledge search candidate base is generated according to the expanded search sentence.
Further, the normalized co-occurrence frequency acquisition module 2 is configured to perform the following steps:
And carrying out weight calculation on the retrieval behavior feature vector to obtain a feature fusion index, mapping the feature fusion index to a target interval, and obtaining normalized co-occurrence frequency, wherein the retrieval behavior feature vector comprises a retrieval frequency feature vector representing historical retrieval frequency, a generated time length feature vector representing the time length required by generating a retrieval candidate library and a retrieval feedback index vector representing the feedback quantity of the generated retrieval candidate library.
Further, the mapping function relationship construction module 3 is configured to execute the following steps:
Calculating the score value of each co-occurrence frequency sample in the normalized co-occurrence frequency samples and each increment pre-training level in the increment pre-training level samples, outputting a score value sample, wherein the score value is an information entropy mean value of a quality score value generated by a candidate library and a calculation resource score value required by the candidate library, acquiring an increment pre-training level optimal solution of each co-occurrence frequency sample in the normalized co-occurrence frequency samples according to the score value sample, fitting a functional relation between each co-occurrence frequency sample in the normalized co-occurrence frequency samples and a corresponding increment pre-training level optimal solution, and outputting the mapping functional relation.
Further, the mapping function relationship construction module 3 is configured to execute the following steps:
and fitting a functional relation between each co-occurrence frequency sample in the normalized co-occurrence frequency samples and a corresponding incremental pre-training series optimal solution through a fitting model, wherein the fitting model is an exponential decreasing function model.
Further, the knowledge search candidate base construction module 4 is configured to execute the following steps:
The method comprises the steps of inputting a first search sentence to a sentence increment optimization model, extracting a search semantic vector, carrying out L rounds of expansion iteration on the search semantic vector according to the matching increment pre-training level number to obtain L rounds of related search sentences, extracting the semantic vector of the L rounds of related search sentences to carry out semantic similarity calculation, combining related search sentences with the semantic similarity larger than a preset semantic similarity, and outputting the processed related search sentences as expansion search sentences.
Further, the knowledge search candidate base construction module 4 is configured to execute the following steps:
Calculating a round of semantic neighborhood of the search semantic vector, selecting a round of semantic related entities from the round of semantic neighborhood to generate a round of related search sentences, and likewise outputting the L rounds of related search sentences obtained by performing L rounds of expansion iteration on the search semantic vector.
Further, the knowledge search candidate base construction module 4 is configured to execute the following steps:
The method comprises the steps of obtaining a synonym semantic neighborhood of a search semantic vector, selecting a synonym semantic related entity from the synonym semantic neighborhood of the search semantic vector to generate a synonym related search statement of the search semantic vector, and updating the synonym related search statement to the L-round related search statement.
Further, the mapping function relationship construction module 3 is configured to execute the following steps:
Acquiring the semantic query hit rate, consistency score labeling information and semantic relevance weight of the generated candidate library, and outputting the candidate library generation quality score value according to the calculation results of the semantic query hit rate, the consistency score labeling information and the semantic relevance weight.
Further, the retrieval behavior feature vector obtaining module 1 is configured to execute the following steps:
Acquiring a preset period, extracting the effective historical search sentence behaviors in the preset period from the historical search sentence behavior library, and analyzing the first search sentence based on the effective historical search sentence behaviors to obtain the search behavior feature vector.
The knowledge retrieval candidate library generation system based on the incremental pre-training optimization provided by the embodiment of the invention can execute the knowledge retrieval candidate library generation method based on the incremental pre-training optimization provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Although the present application makes various references to certain modules in a system according to an embodiment of the present application, any number of different modules may be used and run on a user terminal and/or a server, and each unit and module included are merely divided according to functional logic, but are not limited to the above-described division, so long as the corresponding functions can be implemented, and in addition, specific names of each functional unit are only for convenience of distinguishing from each other, and are not intended to limit the scope of protection of the present application.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application. In some cases, the acts or steps recited in the present application may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Claims (10)

1. The method for generating the knowledge retrieval candidate base based on incremental pre-training optimization is characterized by comprising the following steps:
Collecting a first search sentence of a current user, and calling a historical search sentence behavior library to analyze the first search sentence to obtain a search behavior feature vector;
Calculating the retrieval behavior feature vector to obtain normalized co-occurrence frequency;
constructing a mapping function relation between a normalized co-occurrence frequency sample and an incremental pre-training series sample, and acquiring a matched incremental pre-training series corresponding to the normalized co-occurrence frequency by using the mapping function relation;
Pre-training a sentence increment optimization model, wherein the sentence increment optimization model adopts the matching incremental pre-training series to carry out multi-stage expansion on the first search sentence and outputs an expanded search sentence, and a knowledge search candidate base is generated according to the expanded search sentence.
2. The method of claim 1, wherein the search behavior feature vector is weighted to obtain a feature fusion index, and the feature fusion index is mapped to a target interval to obtain a normalized co-occurrence frequency;
the search behavior feature vector comprises a search frequency feature vector representing historical search frequency, a generated time length feature vector representing time length required by generating a search candidate library and a search feedback index vector representing feedback quantity of the generated search candidate library.
3. The method of claim 1, wherein constructing a mapping of normalized co-occurrence frequency samples to incremental pre-training series samples, the method comprising:
Calculating the scoring values of each co-occurrence frequency sample in the normalized co-occurrence frequency samples and each incremental pre-training series in the incremental pre-training series samples, and outputting scoring value samples;
The scoring values are information entropy average values of quality scoring values generated for the candidate library and computing resource scoring values required by the candidate library;
And according to the scoring value samples, obtaining the increment pre-training level optimal solutions of all the co-occurrence frequency samples in the normalized co-occurrence frequency samples, fitting the functional relation between all the co-occurrence frequency samples in the normalized co-occurrence frequency samples and the corresponding increment pre-training level optimal solutions, and outputting the mapping functional relation.
4. The method of claim 3, wherein each of the normalized co-occurrence frequency samples is fitted to a function of a corresponding incremental pre-training series of optimal solutions by a fitting model that is an exponentially decreasing function model.
5. The method of claim 1, wherein the sentence increment optimization model performs multi-level expansion on the first search sentence using the matching increment pre-training stage number, and outputs an expanded search sentence, the method comprising:
inputting the first search sentence to the sentence increment optimization model, and extracting a search semantic vector;
Performing L rounds of expansion iteration on the search semantic vector according to the matching increment pre-training series to obtain L rounds of related search sentences, wherein L is the number of the matching increment pre-training series;
Extracting semantic vectors of the L rounds of related search sentences to perform semantic similarity calculation, combining related search sentences with the semantic similarity being larger than a preset semantic similarity, and outputting the processed related search sentences as expanded search sentences.
6. The method of claim 5, wherein L rounds of expansion iteration are performed on the search semantic vector according to the matching incremental pre-training series, the method comprising:
calculating a round of semantic neighborhood of the search semantic vector;
Selecting a round of semantic related entities from the round of semantic neighborhood of the search semantic vector to generate a round of related search sentences, and likewise outputting the L rounds of related search sentences obtained by performing L rounds of expansion iteration on the search semantic vector.
7. The method of claim 5, wherein L rounds of expansion iteration are performed on the search semantic vector according to the matching incremental pre-training series, the method further comprising:
Acquiring a synonymous semantic neighborhood of the retrieval semantic vector;
selecting a synonym semantic related entity from the synonym semantic neighborhood of the retrieval semantic vector to generate a synonym related retrieval statement of the retrieval semantic vector;
and updating the synonym related search statement to the L rounds of related search statements.
8. The method of claim 3, wherein the step of obtaining the candidate library generation quality score value comprises:
Acquiring semantic query hit rate, consistency score labeling information and semantic relevance weight of the generated candidate library;
And outputting a candidate library to generate a quality score value according to the semantic query hit rate, the consistency score labeling information and the calculation result of the semantic relevance weight.
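The claim names three inputs to the quality score but not the combining formula. A weighted sum is one plausible reading; the weights below are purely illustrative:

```python
def candidate_library_quality(hit_rate, consistency_score, relevance_weight,
                              weights=(0.5, 0.3, 0.2)):
    """Combine the semantic query hit rate, the consistency score
    annotation, and the semantic relevance weight into one quality score.
    The linear form and the (0.5, 0.3, 0.2) weights are assumptions,
    not disclosed by the patent."""
    w_hit, w_cons, w_rel = weights
    return w_hit * hit_rate + w_cons * consistency_score + w_rel * relevance_weight
```

For example, a candidate library with hit rate 0.8, consistency 1.0, and relevance weight 0.5 would score 0.8 under these weights.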
9. The method of claim 1, wherein invoking the historical retrieval statement behavior library to analyze the first retrieval statement comprises:
acquiring a preset period;
extracting valid historical retrieval statement behaviors within the preset period from the historical retrieval statement behavior library, and analyzing the first retrieval statement based on the valid historical retrieval statement behaviors to obtain a retrieval behavior feature vector.
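The claim specifies a time-window filter followed by feature extraction, without fixing the feature set. A minimal sketch, assuming a 30-day preset period and a two-dimensional feature vector (count of matching historical queries, and their share of the valid history) — all assumed details:

```python
from datetime import datetime, timedelta

def behaviour_features(query, history, now, period_days=30):
    """Filter the behavior library to the preset period, then derive a
    simple retrieval behavior feature vector for the incoming query:
    [number of matching historical queries, their share of valid history].
    `history` is a list of {"ts": datetime, "query": str} records."""
    cutoff = now - timedelta(days=period_days)
    valid = [h for h in history if h["ts"] >= cutoff]          # period filter
    matches = [h for h in valid if query in h["query"]]        # naive match
    share = len(matches) / len(valid) if valid else 0.0
    return [float(len(matches)), share]
```

A production system would replace the substring match with the semantic matching the patent describes; the sketch only shows the period-filter-then-featurize shape of the step.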
10. A knowledge retrieval candidate library generation system based on incremental pre-training optimization, for implementing the knowledge retrieval candidate library generation method based on incremental pre-training optimization of any one of claims 1-9, the system comprising:
a retrieval behavior feature vector acquisition module, configured to collect a first retrieval statement of a current user, and invoke the historical retrieval statement behavior library to analyze the first retrieval statement to obtain a retrieval behavior feature vector;
a normalized co-occurrence frequency acquisition module, configured to compute over the retrieval behavior feature vector to obtain a normalized co-occurrence frequency;
a mapping function relation construction module, configured to construct a mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage number samples, and to obtain, using the mapping function relation, the matched incremental pre-training stage number corresponding to the normalized co-occurrence frequency;
a knowledge retrieval candidate library construction module, configured to pre-train a sentence incremental optimization model, wherein the sentence incremental optimization model performs multi-level expansion on the first retrieval statement using the matched incremental pre-training stage number, outputs an expanded retrieval statement, and generates a knowledge retrieval candidate library according to the expanded retrieval statement.
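The mapping function relation between normalized co-occurrence frequency samples and incremental pre-training stage number samples is left abstract in the claims. One simple realization, assuming a linear least-squares fit and rounding to a whole stage count (both assumptions for illustration only):

```python
def fit_mapping(freq_samples, series_samples):
    """Least-squares fit of series = a * freq + b from paired samples of
    normalized co-occurrence frequency and pre-training stage number."""
    n = len(freq_samples)
    mean_x = sum(freq_samples) / n
    mean_y = sum(series_samples) / n
    sxx = sum((x - mean_x) ** 2 for x in freq_samples)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(freq_samples, series_samples))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def matching_stage_number(freq, a, b, max_stages=8):
    """Apply the fitted mapping and clamp to a valid whole stage count.
    The cap of 8 stages is a hypothetical bound."""
    return max(1, min(max_stages, round(a * freq + b)))
```

Given sample pairs (0.0, 1), (0.5, 3), (1.0, 5), the fit yields a = 4, b = 1, so a normalized co-occurrence frequency of 0.75 maps to 4 expansion stages.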
CN202511140546.1A 2025-08-15 2025-08-15 Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization Active CN120670565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511140546.1A CN120670565B (en) 2025-08-15 2025-08-15 Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization


Publications (2)

Publication Number Publication Date
CN120670565A true CN120670565A (en) 2025-09-19
CN120670565B CN120670565B (en) 2025-12-23

Family

ID=97049275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511140546.1A Active CN120670565B (en) 2025-08-15 2025-08-15 Knowledge retrieval candidate library generation method and system based on incremental pre-training optimization

Country Status (1)

Country Link
CN (1) CN120670565B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120849594A (en) * 2025-09-23 2025-10-28 江苏新华报业传媒集团有限公司 Keyword retrieval optimization system and method based on AI intelligent analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256267A (en) * 2017-06-19 2017-10-17 北京百度网讯科技有限公司 Querying method and device
CN114706841A (en) * 2022-04-18 2022-07-05 上海喜马拉雅科技有限公司 Query content library construction method, device, electronic device and readable storage medium
CN119646243A (en) * 2025-02-12 2025-03-18 北京科杰科技有限公司 Intelligent retrieval method and system for unstructured asset content based on large models
CN120277223A (en) * 2025-06-10 2025-07-08 南京迅集科技有限公司 Dynamic vector knowledge base construction and retrieval method based on multi-mode large model
CN120407752A (en) * 2025-07-03 2025-08-01 安徽省交通规划设计研究总院股份有限公司 Intelligent question answering system and method for large-scale traffic engineering models based on knowledge graph
CN120407703A (en) * 2025-03-24 2025-08-01 广东电网有限责任公司 A method and system for constructing and retrieving archive knowledge base based on multimodal data fusion



Also Published As

Publication number Publication date
CN120670565B (en) 2025-12-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant