CN116501837A - Retrieval method, system, device and storage medium based on twin-tower recall - Google Patents
- Publication number
- CN116501837A, CN202310469306.0A
- Authority
- CN
- China
- Prior art keywords
- pseudo
- query text
- query
- text
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application provides a retrieval method, system, device and storage medium based on twin-tower recall. It belongs to the fields of data processing technology and digital healthcare, and uses an artificial intelligence model to query the medical information a user needs from massive electronic medical records. For a given query text in the medical domain, positively relevant articles and irrelevant articles are determined; these are input into a trained pseudo-query model to obtain a first pseudo-query text for the positively relevant articles and a second pseudo-query text for the irrelevant articles; positive samples and negative samples are constructed; the positive and negative samples are input into a twin-tower recall model for iterative training; and the query to be searched is input into the trained twin-tower recall model to obtain a retrieval result. By using pseudo-queries, the application can construct multiple strongly correlated negative samples for model learning, which greatly improves the retrieval efficiency and accuracy of the model.
Description
Technical Field
The application belongs to the technical field of data processing and the field of digital healthcare, and in particular relates to a retrieval method, system, device and storage medium based on twin-tower recall.
Background
Information query has become a channel through which users quickly obtain the information they need in many scenarios. For example, in the medical field, an artificial intelligence model can query the medical record information a user needs from massive electronic medical records, which helps provide medical record references for the user. Information retrieval has traditionally been keyword-based, and semantic retrieval is now also used. Current semantic retrieval mainly uses neural network models, including interaction-based single-tower models and representation-based twin-tower models.
In the current mainstream dense vector recall methods based on the twin-tower architecture, the negative samples required for training are either fixed or too easily distinguished from the positive samples, which makes training too simple. In model design, the modules added to model the information interaction increase the latency of the model, which significantly degrades its final retrieval performance.
Summary of the Invention
The retrieval method, system, device and storage medium based on twin-tower recall proposed by the present invention can construct multiple strongly correlated negative samples through pseudo-queries for model learning, greatly improving the retrieval efficiency and accuracy of the model.
According to a first aspect of the embodiments of the present application, a retrieval method based on twin-tower recall is provided, including:
determining, according to a given query text, positively relevant articles and irrelevant articles for the given query text;
inputting the positively relevant articles into a trained pseudo-query model to obtain first pseudo-query texts of the positively relevant articles, and inputting the irrelevant articles into the trained pseudo-query model to obtain second pseudo-query texts of the irrelevant articles;
constructing positive samples from the given query text, the positively relevant articles and the first pseudo-query texts, and constructing negative samples from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts;
inputting the positive samples and the negative samples into a twin-tower recall model for iterative training, and inputting the query to be searched into the trained twin-tower recall model to obtain a retrieval result.
In some embodiments of the present application, determining the positively relevant articles and irrelevant articles for the given query text includes:
selecting at least one public dataset;
matching the given query text against the at least one public dataset to obtain at least one positively relevant article that is related to the given query text and at least one irrelevant article that is not related to it.
In some embodiments of the present application, inputting the positive samples and the negative samples into the twin-tower recall model for iterative training includes:
determining, from the query text and a pseudo-query text generated from a given article, a similarity score between the query text and the pseudo-query text;
computing the total loss of the twin-tower recall model from the vector similarity scores, so that the output distribution of the twin-tower recall model is fitted.
In some embodiments of the present application, determining the similarity score between the query text and the pseudo-query text includes:
concatenating the query text object with special characters to obtain a final query text;
inputting the final query text into a query encoder to obtain a first feature vector representation corresponding to all characters in the query text;
concatenating the query article object with special characters, and inputting the concatenated query article into the pseudo-query model to obtain at least one pseudo-query text;
sampling one pseudo-query text and concatenating it with the concatenated query article again to obtain a final article text;
inputting the final article text into an article encoder to obtain a second feature vector representation corresponding to all characters in the final article text;
determining the similarity score between the query text and the sampled pseudo-query text as the vector inner product of the first feature vector representation and the second feature vector representation.
In some embodiments of the present application, the pseudo-query model is trained using multiple article texts and the multiple query texts corresponding to those articles as training samples; inputting a given article into the pseudo-query model yields at least one pseudo-query text corresponding to that article.
In some embodiments of the present application, constructing the positive samples from the given query text, the positively relevant articles and the first pseudo-query texts includes:
combining the given query text, one positively relevant article and one first pseudo-query text to obtain a positive sample.
In some embodiments of the present application, constructing the negative samples from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts includes:
combining the given query text, one positively relevant article and one second pseudo-query text to obtain one type of negative sample;
and combining the given query text, one irrelevant article and one second pseudo-query text to obtain another type of negative sample;
and combining the given query text, one irrelevant article and one first pseudo-query text to obtain a further type of negative sample.
According to a second aspect of the embodiments of the present application, a retrieval system based on twin-tower recall is provided, specifically including:
an article sample unit, configured to determine, according to a given query text, positively relevant articles and irrelevant articles for the given query text;
a pseudo-query unit, configured to input the positively relevant articles into a trained pseudo-query model to obtain first pseudo-query texts of the positively relevant articles, and to input the irrelevant articles into the trained pseudo-query model to obtain second pseudo-query texts of the irrelevant articles;
a training sample unit, configured to construct positive samples from the given query text, the positively relevant articles and the first pseudo-query texts, and to construct negative samples from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts;
a model retrieval unit, configured to input the positive samples and the negative samples into a twin-tower recall model for iterative training, and to input the query to be searched into the trained twin-tower recall model to obtain a retrieval result.
According to a third aspect of the embodiments of the present application, a retrieval device based on twin-tower recall is provided, including:
a memory, configured to store executable instructions; and
a processor, configured to connect to the memory and execute the executable instructions so as to carry out the retrieval method based on twin-tower recall.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored; the computer program is executed by a processor to implement the retrieval method based on twin-tower recall.
With the retrieval method, system, device and storage medium based on twin-tower recall of the present application, positively relevant articles and irrelevant articles are determined for a given query text; the positively relevant articles are input into a trained pseudo-query model to obtain first pseudo-query texts; the irrelevant articles are input into the trained pseudo-query model to obtain second pseudo-query texts; positive samples are constructed from the given query text, the positively relevant articles and the first pseudo-query texts; negative samples are constructed from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts; the positive and negative samples are input into a twin-tower recall model for iterative training; and the query to be searched is input into the trained twin-tower recall model to obtain a retrieval result. By using pseudo-queries, the application can construct multiple strongly correlated negative samples for model learning, which greatly improves the retrieval efficiency and accuracy of the model.
Brief Description of the Drawings
The drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 shows a schematic diagram of the steps of the retrieval method based on twin-tower recall according to an embodiment of the present application;
Fig. 2 shows a schematic diagram of the steps of iteratively training the twin-tower recall model according to an embodiment of the present application;
Fig. 3 shows a schematic diagram of the steps for computing the similarity score according to an embodiment of the present application;
Fig. 4 shows a flowchart of the principle of computing the similarity score according to an embodiment of the present application;
Fig. 5 shows a schematic structural diagram of the retrieval system based on twin-tower recall according to an embodiment of the present application;
Fig. 6 shows a schematic structural diagram of the retrieval device based on twin-tower recall according to an embodiment of the present application.
Detailed Description
In the course of implementing this application, the inventors found that, in the medical field, an artificial intelligence model can query the medical record information a user needs from massive electronic medical records, which helps provide medical record references for the user. To meet performance requirements, the mainstream dense vector recall methods based on the twin-tower architecture generally explore two directions:
a) In model training, the traditional BM25 method is usually used to select harder negative samples. For a given query, the selected negative samples are always fixed and cannot be learned. As a result, the negative samples used for training are either unrepresentative or too easily distinguished from the positive samples, which makes them too simple and harms the final performance of the model.
b) In model design, because the basic architecture is a twin tower, late-stage information interaction in the model becomes especially important. Relatively complex modules are usually introduced to strengthen the modeling of information between the query and the article (passage). On the one hand, these modules increase the latency of the model and place a heavy burden on deployment and inference. On the other hand, the complex modules cannot be adapted to existing large-scale dense vector retrieval frameworks such as Faiss and Milvus, which weakens the original advantage of the twin-tower architecture.
To summarize the problems of the existing technical solutions: the negative samples required for training are fixed or too easily distinguished from the positive samples, which makes training too simple; and the modules added in model design to model the information interaction increase the latency of the model, which significantly degrades its final retrieval performance.
When training samples are obtained, for example in a medical application scenario, the sample images are medical images, and the objects contained in the sample images are lesions, that is, the parts of the body where pathological changes occur. Medical images are images of internal tissue, such as the stomach, abdomen, heart, knee or brain, acquired non-invasively for medical treatment or medical research. Examples include images generated by medical instruments such as CT (Computed Tomography), MRI (Magnetic Resonance Imaging), US (ultrasound), X-ray images, electroencephalograms and optical photography.
The data used in this application is medical data, such as personal health records, prescriptions and examination reports.
On this basis, the application proposes a pseudo-query-based multi-view twin-tower dense vector recall method, in which: a) pseudo-queries are used to dynamically construct multiple strongly correlated negative samples for model learning; b) for a given article, multiple different views describing it are generated in advance, so the method no longer depends on complex modules introduced for late-stage information interaction in the model.
Specifically, with the retrieval method, system, device and storage medium based on twin-tower recall of the present application, positively relevant articles and irrelevant articles are determined for a given query text; the positively relevant articles are input into a trained pseudo-query model to obtain first pseudo-query texts; the irrelevant articles are input into the trained pseudo-query model to obtain second pseudo-query texts; positive samples are constructed from the given query text, the positively relevant articles and the first pseudo-query texts; negative samples are constructed from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts; the positive and negative samples are input into a twin-tower recall model for iterative training; and the query to be searched is input into the trained twin-tower recall model to obtain a retrieval result. By using pseudo-queries, the application can construct multiple strongly correlated negative samples for model learning, which greatly improves the retrieval efficiency and accuracy of the model.
The different pseudo-queries obtained by sampling give the article representation descriptions from multiple views, so that the core information in the article representation is strengthened and made more prominent.
At the same time, the complex late-stage information interaction is moved to the early stage of the model: the pseudo-queries of all articles can be precomputed offline, which greatly reduces the inference latency of the model while maintaining performance and preserves the original advantage of the twin-tower architecture.
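The offline/online split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `encode` is a hypothetical stand-in for a trained encoder (here a normalized bag-of-words vector), `generate_pseudo_queries` is a hypothetical placeholder for the trained pseudo-query model, and the brute-force matrix product stands in for a dense vector index. Article vectors are built once offline; at query time only the query is encoded and scored by inner product.

```python
import numpy as np


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign one vector dimension to every distinct lowercase token."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Hypothetical encoder stand-in: L2-normalized token counts."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def generate_pseudo_queries(article: str) -> list[str]:
    """Hypothetical placeholder for the trained pseudo-query model."""
    return [" ".join(article.split()[:4])]


def build_index(articles: list[str], vocab: dict[str, int]) -> np.ndarray:
    """Offline step: append a pseudo-query to each article, encode once."""
    rows = []
    for article in articles:
        pseudo = generate_pseudo_queries(article)[0]
        rows.append(encode(article + " " + pseudo, vocab))
    return np.stack(rows)


def search(query: str, index: np.ndarray, vocab: dict[str, int],
           top_k: int = 2) -> list[int]:
    """Online step: encode only the query, rank articles by inner product."""
    scores = index @ encode(query, vocab)
    return [int(i) for i in np.argsort(-scores)[:top_k]]


articles = [
    "chest pain and shortness of breath after exercise",
    "prescription for seasonal allergy with antihistamine",
    "knee MRI report showing a meniscus tear",
]
vocab = build_vocab(articles)
index = build_index(articles, vocab)   # computed once, offline
top = search("knee MRI meniscus", index, vocab)
```

Because the article side is a plain matrix of fixed vectors, it could in principle be loaded into a dense vector retrieval framework; the source names Faiss and Milvus as examples of such frameworks.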
Because of the randomness introduced by the pseudo-queries, the composition of the negative samples is no longer fixed, so the model can keep learning new information. In addition, the negative sample design introduces the negative samples that are "closest" to the given query, which greatly strengthens the model's ability to distinguish positive samples from negative samples.
To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the application are described in further detail below with reference to the drawings. The described embodiments are only some of the embodiments of the application, not an exhaustive list of all embodiments. Where no conflict arises, the embodiments in the application and the features in the embodiments may be combined with one another.
Embodiment 1
Fig. 1 shows a schematic diagram of the steps of the retrieval method based on twin-tower recall according to an embodiment of the present application. As shown in Fig. 1, the retrieval method based on twin-tower recall of this embodiment includes the following steps:
S1: According to a given query text, determine positively relevant articles and irrelevant articles for the given query text.
S2: Input the positively relevant articles into a trained pseudo-query model to obtain first pseudo-query texts of the positively relevant articles; input the irrelevant articles into the trained pseudo-query model to obtain second pseudo-query texts of the irrelevant articles.
S3: Construct positive samples from the given query text, the positively relevant articles and the first pseudo-query texts; construct negative samples from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts.
S4: Input the positive samples and the negative samples into a twin-tower recall model for iterative training; input the query to be searched into the trained twin-tower recall model to obtain a retrieval result.
By using pseudo-queries, the application can construct multiple strongly correlated negative samples for model learning, which greatly improves the retrieval efficiency and accuracy of the model.
In a specific implementation, determining in S1 the positively relevant articles and irrelevant articles for the given query text includes:
First, select at least one public dataset, for example the public TREC 2022 passage ranking dataset.
Then, match the given query text against the at least one public dataset to obtain at least one positively relevant article that is related to the given query text and at least one irrelevant article that is not related to it.
In a digital healthcare scenario, the dataset is medical data and the query text is medical text. Medical data includes personal health records, prescriptions, examination reports and similar data. Medical text may be an Electronic Healthcare Record, that is, an electronic personal health record comprising medical records, electrocardiograms, medical images and other electronic records worth keeping for future reference.
Fig. 2 shows a schematic diagram of the steps of iteratively training the twin-tower recall model according to an embodiment of the present application.
As shown in Fig. 2, inputting the positive samples and the negative samples into the twin-tower recall model for iterative training in step S4 includes:
S41: From the query text and a pseudo-query text generated from a given article, determine the similarity score between the query text and the pseudo-query text;
S42: Compute the total loss of the twin-tower recall model from the vector similarity scores, so that the output distribution of the twin-tower recall model is fitted.
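The text above does not state the form of the total loss. As one possible reading, the sketch below assumes a softmax cross-entropy over the similarity scores of one positive sample and several negative samples, a common choice for training dense retrievers. The function name `contrastive_loss` and the example scores are illustrative assumptions, not values from the application.

```python
import numpy as np


def contrastive_loss(pos_score: float, neg_scores: list[float]) -> float:
    """Softmax cross-entropy with the positive sample at index 0.

    Assumed loss form; the source only says that a total loss is
    computed from the similarity scores.
    """
    logits = np.array([pos_score] + list(neg_scores), dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])


# Negatives that score far below the positive give a small loss ...
easy = contrastive_loss(0.9, [0.1, 0.0, -0.2])
# ... while negatives that score close to the positive give a larger one.
hard = contrastive_loss(0.9, [0.8, 0.85, 0.7])
```

Under this assumed loss, negatives whose scores are close to the positive's contribute a stronger training signal, which is consistent with the motivation for constructing strongly correlated negative samples.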
Fig. 3 shows a schematic diagram of the steps for computing the similarity score according to an embodiment of the present application. Fig. 4 shows a flowchart of the principle of computing the similarity score according to an embodiment of the present application.
As shown in Fig. 3 and Fig. 4, determining in S41 the similarity score between the query text and the pseudo-query text specifically includes:
S411: Concatenate the query text object with special characters to obtain the final query text;
S412: Input the final query text into the query encoder to obtain the first feature vector representation corresponding to all characters in the query text;
S413: Concatenate the query article object with special characters, and input the concatenated query article into the pseudo-query model to obtain at least one pseudo-query text;
S414: Sample one pseudo-query text and concatenate it with the concatenated query article again to obtain the final article text;
S415: Input the final article text into the article encoder to obtain the second feature vector representation corresponding to all characters in the final article text;
S416: Determine the similarity score between the query text and the sampled pseudo-query text as the vector inner product of the first feature vector representation and the second feature vector representation.
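Steps S411 to S416 can be sketched as a small pipeline. This is a hedged illustration: the two encoders are passed in as plain functions, `toy_encoder` is a hypothetical keyword counter standing in for a trained model, and the exact layout of the final article text (article, then pseudo-query, each closed by `[SEP]`) is an assumption, since the steps above do not fix it.

```python
import random
from typing import Callable, Sequence

CLS, SEP = "[CLS]", "[SEP]"
Encoder = Callable[[str], Sequence[float]]


def wrap(text: str) -> str:
    """S411 / S413: add the special characters around a text object."""
    return f"{CLS} {text} {SEP}"


def final_article_text(article: str, pseudo_queries: Sequence[str],
                       rng: random.Random) -> str:
    """S414: sample one pseudo-query and splice it onto the wrapped article."""
    pseudo = rng.choice(list(pseudo_queries))
    return f"{wrap(article)} {pseudo} {SEP}"  # assumed layout


def similarity(query: str, article: str, pseudo_queries: Sequence[str],
               query_encoder: Encoder, article_encoder: Encoder,
               rng: random.Random) -> float:
    """S412, S415, S416: encode both sides and take the inner product."""
    q_vec = query_encoder(wrap(query))
    d_vec = article_encoder(final_article_text(article, pseudo_queries, rng))
    return float(sum(a * b for a, b in zip(q_vec, d_vec)))


def toy_encoder(text: str) -> list[float]:
    """Hypothetical encoder: counts of three fixed keywords."""
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in ("fever", "cough", "knee")]


score = similarity(
    query="fever cough",
    article="patient with fever",
    pseudo_queries=["what causes fever", "fever treatment"],
    query_encoder=toy_encoder,
    article_encoder=toy_encoder,
    rng=random.Random(0),
)
```

In this toy setup the score is the same whichever pseudo-query is sampled, because both candidates mention "fever" exactly once; with a real encoder the sampled pseudo-query would change the article vector, which is the source of the multi-view effect the text describes.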
The pseudo-query model in S2 is trained using multiple article texts and the multiple query texts corresponding to those articles as training samples; inputting a given article into the pseudo-query model yields at least one pseudo-query text corresponding to that article.
Specifically, constructing the positive samples in S3 from the given query text, the positively relevant articles and the first pseudo-query texts includes: combining the given query text, one positively relevant article and one first pseudo-query text to obtain a positive sample.
Specifically, constructing the negative samples in S3 from the given query text, the positively relevant articles, the irrelevant articles, the first pseudo-query texts and the second pseudo-query texts includes: combining the given query text, one positively relevant article and one second pseudo-query text to obtain one type of negative sample; combining the given query text, one irrelevant article and one second pseudo-query text to obtain another type of negative sample; and combining the given query text, one irrelevant article and one first pseudo-query text to obtain a further type of negative sample.
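The four combinations described above (one positive and three types of negative sample) can be written out directly. The sketch below is illustrative: the `Sample` record, the `build_samples` function and the example texts are hypothetical names and data introduced here, and drawing the pseudo-queries at random reflects the sampling the text describes rather than a specified procedure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    query: str
    article: str
    pseudo_query: str
    label: int  # 1 = positive sample, 0 = negative sample


def build_samples(query: str, pos_article: str, neg_article: str,
                  first_pseudo: list[str], second_pseudo: list[str],
                  rng: random.Random) -> list[Sample]:
    """Combine query, article and pseudo-query as described in S3."""
    p1 = rng.choice(first_pseudo)    # pseudo-query of the relevant article
    p2 = rng.choice(second_pseudo)   # pseudo-query of the irrelevant article
    return [
        Sample(query, pos_article, p1, 1),  # positive sample
        Sample(query, pos_article, p2, 0),  # relevant article, second pseudo-query
        Sample(query, neg_article, p2, 0),  # irrelevant article, second pseudo-query
        Sample(query, neg_article, p1, 0),  # irrelevant article, first pseudo-query
    ]


samples = build_samples(
    query="treatment for knee pain",
    pos_article="physiotherapy plan for a meniscus injury",
    neg_article="prescription for seasonal allergy",
    first_pseudo=["how to treat a meniscus injury", "knee rehab exercises"],
    second_pseudo=["allergy medication dosage"],
    rng=random.Random(0),
)
```

Calling `build_samples` again with a different random state can pick different pseudo-queries, so the negative samples for the same query need not be identical across training iterations.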
Referring to FIG. 4, to further illustrate the retrieval scheme based on twin-tower recall of the embodiments of the present application, a specific implementation scenario is described below.
Step 1) Let q denote the query text object, let q1, q2, q3, q4, q5, …, qi denote the characters in the sequence of q, and let i be the length of the query text object.
Step 2) Prepend the special token [CLS] to the query text object to mark the start of the query text, and append the special token [SEP] to mark its end;
The final query text has the form [CLS], q1, q2, q3, q4, q5, …, qi, [SEP].
Step 3) Feed the concatenated query text obtained in step 2) into the query encoder, for which the currently mainstream BERT model is chosen, to obtain the encoding vector representation Eq_ori for all positions;
Eq_ori = [E[CLS], Eq1, Eq2, Eq3, …, Eqi, E[SEP]];
where each position in Eq_ori is a 768-dimensional vector.
Step 4) Take the first vector E[CLS] from the Eq_ori obtained in step 3) and L2-normalize it so that its L2 norm equals 1; the result, denoted Eq, serves as the feature vector representation of the query.
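Steps 1) to 4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `toy_encoder` is a hypothetical stand-in for BERT that produces small deterministic vectors (8 dimensions instead of 768) so the example runs without any model weights, and all function names are assumptions made for this sketch.

```python
import hashlib
import math


def add_special_tokens(chars):
    """Wrap a character sequence as [CLS], q1, ..., qi, [SEP] (step 2)."""
    return ["[CLS]"] + list(chars) + ["[SEP]"]


def toy_encoder(tokens, dim=8):
    """Hypothetical stand-in for BERT: one deterministic vector per position.

    A real encoder returns 768-dimensional contextual vectors; these toy
    vectors depend only on the token and its position.
    """
    vectors = []
    for pos, tok in enumerate(tokens):
        digest = hashlib.sha256(f"{pos}:{tok}".encode("utf-8")).digest()
        vectors.append([b / 255.0 - 0.5 for b in digest[:dim]])
    return vectors


def l2_normalize(vec):
    """Scale a vector so that its L2 norm equals 1 (step 4)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def encode_query(query_text):
    """Return Eq: the L2-normalized vector at the [CLS] position."""
    tokens = add_special_tokens(query_text)
    eq_ori = toy_encoder(tokens)      # one vector per position (step 3)
    return l2_normalize(eq_ori[0])    # the first vector is E[CLS]


eq = encode_query("twin tower recall")
```

Normalizing to unit length means the inner product used later for scoring is bounded between -1 and 1.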
Step 5) Let d denote the article text object, let d1, d2, d3, d4, d5, …, dn denote the characters in the sequence of d, and let n be the length of the article text object.
Step 6) Prepend the special token [cls] to the article text object to mark the start of the article text, and append the special token [SEP] to mark its end;
At this stage the article text has the form A = [cls], d1, d2, d3, d4, d5, …, dn, [SEP].
Step 7) Use the result of step 6) as the input to the docTTTTTquery model to obtain the pseudo-query list Pseudo_query_list:
Pseudo_query_list = docTTTTTquery(A);
Pseudo_query_list = [qq1, qq2, qq3, …, qq10];
The docTTTTTquery model is trained on a large number of articles (passages) paired with their corresponding queries; given an article as input, the trained model predicts a list of queries that the article might answer. For details of the docTTTTTquery model, refer to the existing model; they are not repeated here.
Here the top-k sampling strategy of the docTTTTTquery model itself is used with k = 10, yielding 10 different pseudo-query descriptions for the given article. Each pseudo-query description qqi can be regarded as one summary of, or one perspective on, the given input article A.
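For readers unfamiliar with top-k sampling, the following sketch shows the general technique for a single decoding step: only the k most probable next tokens are kept, and one of them is drawn in proportion to its probability. In this usual formulation k bounds the candidate tokens per step, and the number of generated sequences is a separate setting; the text does not detail how docTTTTTquery is configured beyond k = 10, so the function name and the toy distribution below are assumptions for illustration only.

```python
import random


def top_k_sample(token_probs, k, rng):
    """Sample one token from the k most probable candidates.

    token_probs: dict mapping token -> probability for one decoding step.
    rng: a random.Random instance, passed in for reproducibility.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Keep only the k highest-probability tokens.
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    # random.choices treats the weights as relative, so no renormalization
    # is needed after truncation.
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
step_probs = {"what": 0.4, "how": 0.3, "why": 0.2, "when": 0.07, "zebra": 0.03}
samples = {top_k_sample(step_probs, k=3, rng=rng) for _ in range(200)}
```

Because sampling is random, repeated generation from the same article can yield different pseudo-queries, which is the source of the diversity the text relies on.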
Step 8) Take each qqi from the pseudo-query list Pseudo_query_list and concatenate it with the article object from step 5), prepending the special token [cls] and appending the special token [SEP]; the final article text has the form [cls], qqi, d1, d2, d3, d4, d5, …, dn, [SEP].
Step 9) Feed the concatenated article text obtained in step 8) into the article encoder, for which the BERT model is also chosen, to obtain the encoding vector representation Ed_ori for all positions;
Ed_ori = [E[cls], Eqqi, Ed1, Ed2, Ed3, …, Edn, E[SEP]].
Step 10) Take the first vector E[cls] from the Ed_ori obtained in step 9) and L2-normalize it so that its L2 norm equals 1; denote the result Ed.
Step 11) Compute the similarity between the query vector representation obtained in step 4) and the article vector representation obtained in step 10). Similarity is measured by a simple vector inner product, and the resulting score is the similarity score between the given query and the combination of the given article with one of its pseudo-queries.
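Steps 8) to 11) can be sketched as below. The sketch assumes the encoders are available elsewhere and uses hand-written vectors in place of their outputs; the function names are illustrative, and each pseudo-query is treated as a single sequence element, mirroring the notation [cls], qqi, d1, …, dn, [SEP].

```python
import math


def build_article_inputs(pseudo_queries, article_chars):
    """One encoder input per pseudo-query: [cls], qqi, d1, ..., dn, [SEP]."""
    return [
        ["[cls]", qq] + list(article_chars) + ["[SEP]"]
        for qq in pseudo_queries
    ]


def l2_normalize(vec):
    """Scale a vector so that its L2 norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(u, v):
    """Similarity score of step 11: the plain vector inner product."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


inputs = build_article_inputs(["what is recall", "twin tower model"], "abc")

# Hand-written stand-ins for the first-position encoder outputs.
eq = l2_normalize([3.0, 4.0])   # query vector Eq
ed = l2_normalize([4.0, 3.0])   # article vector Ed for one pseudo-query
score = inner_product(eq, ed)
```

Since both vectors have unit norm, the inner product here equals their cosine similarity.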
The training stage of the twin-tower recall model is described in detail next.
First, the public TREC 2022 Passage ranking dataset is used, in which for a given query q there is a positively related article d+ and an irrelevant article d-, forming a triple (q, d+, d-). Using steps 5) to 7), the pseudo-query list qq+ for the positively related article d+ and the pseudo-query list qq- for the irrelevant article d- are obtained.
For a given q, a positive sample has the form (q, [qqi+, d+]), that is, the combination of each pseudo-query in the pseudo-query list qq+ of the positively related article d+ with d+ itself, combined as in step 8). For convenience in the formulas that follow, this is abbreviated as (q, d+).
A negative sample takes one of three forms: (q, [qqi-, d+]), (q, [qqi+, d-]), or (q, [qqi-, d-]). That is, each pseudo-query qqi- of the irrelevant article combined with the positively related article d+; each pseudo-query qqi+ of the positively related article combined with the irrelevant article d-; or each pseudo-query qqi- of the irrelevant article combined with the irrelevant article d-. Finally, the set of all negative samples is abbreviated as (q, d-). The first two kinds of negative sample in fact contain part of a positive sample, which further increases the difficulty of learning for the model.
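The sample construction described above can be sketched as follows. The tuple layout `(q, (pseudo_query, article))` and the function name are assumptions chosen for this illustration; the text only fixes which pseudo-query is paired with which article.

```python
def build_samples(q, d_pos, d_neg, qq_pos, qq_neg):
    """Build positive and negative samples for one (q, d+, d-) triple.

    qq_pos: pseudo-queries generated from the positively related article d+.
    qq_neg: pseudo-queries generated from the irrelevant article d-.
    """
    # Positive samples: (q, [qqi+, d+])
    positives = [(q, (qq, d_pos)) for qq in qq_pos]
    negatives = (
        [(q, (qq, d_pos)) for qq in qq_neg]      # (q, [qqi-, d+])
        + [(q, (qq, d_neg)) for qq in qq_pos]    # (q, [qqi+, d-])
        + [(q, (qq, d_neg)) for qq in qq_neg]    # (q, [qqi-, d-])
    )
    return positives, negatives


positives, negatives = build_samples(
    q="symptoms of flu",
    d_pos="article about influenza",
    d_neg="article about football",
    qq_pos=["what is influenza", "flu symptoms"],
    qq_neg=["football rules", "who won the cup"],
)
```

With m pseudo-queries per article, one triple yields m positive samples and 3m negative samples, two thirds of which share either the article or the pseudo-query with a positive sample.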
During iterative training, the model should score positive samples as high as possible and negative samples as low as possible, so the following loss function is designed:
where s(·) is the similarity score computed in step 11) above.
This drives the model's predicted score s(q, d+) for positive samples to be as high as possible, close to 1, and its predicted score s(q, d-) for negative samples to be as low as possible, close to 0.
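The loss formula itself does not appear in this text. As an assumption only, the sketch below uses a softmax cross-entropy (contrastive) loss over one positive score and several negative scores, a common choice in dense retrieval that matches the stated goal of raising s(q, d+) and lowering s(q, d-); it should not be read as the patent's actual formula. The function name is hypothetical.

```python
import math


def contrastive_loss(pos_score, neg_scores):
    """Negative log of the softmax probability assigned to the positive.

    loss = -log( exp(s+) / (exp(s+) + sum_j exp(s_j-)) )
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - pos_score


good = contrastive_loss(0.9, [0.1, 0.2])   # positive scored above negatives
bad = contrastive_loss(0.1, [0.9, 0.2])    # positive scored below a negative
```

Under this assumed loss, the softmax probability of the positive approaches 1 and that of each negative approaches 0 as training succeeds, and `good` is smaller than `bad`.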
Finally, stochastic gradient descent (SGD) and the PyTorch framework are used to build the model and update its parameters.
Finally, in the inference stage, for a given query q and article d, the similarity score for each pseudo-query in the pseudo-query list is computed according to steps 1) to 11), and the highest score is taken as the final similarity score between query q and article d.
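The inference rule can be sketched as a maximum over the per-pseudo-query article vectors. The sketch assumes those vectors have already been computed and stored in a dictionary keyed by a hypothetical article id; the vectors shown are hand-written unit vectors rather than encoder outputs.

```python
def inner_product(u, v):
    """Plain vector inner product."""
    return sum(a * b for a, b in zip(u, v))


def final_score(eq, article_vectors):
    """Highest score over all pseudo-query views of one article."""
    if not article_vectors:
        raise ValueError("an article needs at least one vector")
    return max(inner_product(eq, ed) for ed in article_vectors)


def rank_articles(eq, index):
    """Return (article_id, score) pairs sorted by descending final score."""
    scored = [(final_score(eq, vecs), aid) for aid, vecs in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(aid, score) for score, aid in scored]


eq = [1.0, 0.0]                               # query vector Eq
index = {
    "a": [[0.0, 1.0], [0.6, 0.8]],            # two pseudo-query views
    "b": [[0.8, 0.6]],                        # one pseudo-query view
}
ranking = rank_articles(eq, index)
```

Taking the maximum lets an article match the query through whichever pseudo-query view is closest to it.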
In summary, the retrieval method based on twin-tower recall of the present application determines, for a given query text, its positively related articles and irrelevant articles; inputs the positively related articles into the trained pseudo-query model to obtain the first pseudo-query texts; inputs the irrelevant articles into the trained pseudo-query model to obtain the second pseudo-query texts; constructs positive samples from the given query text, the positively related articles, and the first pseudo-query texts; constructs negative samples from the given query text, the positively related articles, the irrelevant articles, the first pseudo-query texts, and the second pseudo-query texts; inputs the positive and negative samples into the twin-tower recall model for iterative training; and inputs the query to be searched into the trained twin-tower recall model to obtain the retrieval result. By using pseudo-queries, the present application can construct multiple strongly related negative samples for model learning, greatly improving the retrieval efficiency and accuracy of the model.
Through the different pseudo-queries obtained by sampling, the present application gives the article representation descriptions from multiple perspectives, so that the core information in the article representation is reinforced and made prominent.
At the same time, the complex information interaction of the model's later stage is moved to its early stage, so the pseudo-queries of all articles can be precomputed offline; this greatly reduces the model's inference latency while maintaining performance, and preserves the original advantage of the twin-tower architecture.
Because of the randomness introduced by the pseudo-queries, the composition of the negative samples is no longer fixed, so the model can keep learning new information. In addition, the negative-sample design introduces the negative samples that are "closest" to the given query, which greatly strengthens the model's ability to distinguish positive samples from negative samples.
Embodiment 2
This embodiment provides a retrieval system based on twin-tower recall. For details not disclosed in this embodiment, refer to the specific implementation of the retrieval method based on twin-tower recall in the other embodiments.
FIG. 5 shows a schematic structural diagram of the retrieval system based on twin-tower recall according to an embodiment of the present application.
As shown in FIG. 5, the retrieval system based on twin-tower recall of the embodiment of the present application specifically includes an article sample unit 10, a pseudo-query unit 20, a training sample unit 30, and a model retrieval unit 40.
Specifically:
The article sample unit 10 is configured to determine, for a given query text, the positively related articles and the irrelevant articles of that query text;
The pseudo-query unit 20 is configured to input the positively related articles into the trained pseudo-query model to obtain the first pseudo-query texts of the positively related articles, and to input the irrelevant articles into the trained pseudo-query model to obtain the second pseudo-query texts of the irrelevant articles;
The training sample unit 30 is configured to construct positive samples from the given query text, the positively related articles, and the first pseudo-query texts, and to construct negative samples from the given query text, the positively related articles, the irrelevant articles, the first pseudo-query texts, and the second pseudo-query texts;
The model retrieval unit 40 is configured to input the positive samples and negative samples into the twin-tower recall model for iterative training, and to input the query to be searched into the trained twin-tower recall model to obtain the retrieval result.
With the retrieval system based on twin-tower recall of the present application, multiple strongly related negative samples can be constructed through pseudo-queries for model learning, greatly improving the retrieval efficiency and accuracy of the model.
Through the different pseudo-queries obtained by sampling, the present application gives the article representation descriptions from multiple perspectives, so that the core information in the article representation is reinforced and made prominent.
At the same time, the complex information interaction of the model's later stage is moved to its early stage, so the pseudo-queries of all articles can be precomputed offline; this greatly reduces the model's inference latency while maintaining performance, and preserves the original advantage of the twin-tower architecture.
Embodiment 3
This embodiment provides a retrieval device based on twin-tower recall. For details not disclosed in this embodiment, refer to the specific implementation of the retrieval method or system based on twin-tower recall in the other embodiments.
FIG. 6 shows a schematic structural diagram of the retrieval device 400 based on twin-tower recall according to an embodiment of the present application.
As shown in FIG. 6, the retrieval device 400 based on twin-tower recall includes:
a memory 402, configured to store executable instructions; and
a processor 401, configured to connect to the memory 402 and execute the executable instructions so as to implement the retrieval method based on twin-tower recall.
Those skilled in the art will understand that FIG. 6 is only an example of the retrieval device 400 based on twin-tower recall and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the retrieval device 400 may also include input and output devices, network access devices, buses, and the like.
The processor 401 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 401 is the control center of the retrieval device 400 based on twin-tower recall and connects the various parts of the entire device through various interfaces and lines.
The memory 402 can be used to store computer-readable instructions. The processor 401 implements the various functions of the retrieval device 400 based on twin-tower recall by running or executing the computer-readable instructions or modules stored in the memory 402 and by calling the data stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the retrieval device 400. In addition, the memory 402 may include a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, read-only memory (ROM), random access memory (RAM), or other non-volatile or volatile storage devices.
If the modules integrated in the retrieval device 400 based on twin-tower recall are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the above method embodiments may also be carried out by computer-readable instructions that direct the relevant hardware; the computer-readable instructions may be stored in a computer-readable storage medium and, when executed by a processor, implement the steps of the above method embodiments.
Embodiment 4
This embodiment provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to implement the retrieval method based on twin-tower recall of the other embodiments.
The retrieval device and storage medium based on twin-tower recall of the embodiments of the present application determine, for a given query text, its positively related articles and irrelevant articles; input the positively related articles into the trained pseudo-query model to obtain the first pseudo-query texts; input the irrelevant articles into the trained pseudo-query model to obtain the second pseudo-query texts; construct positive samples from the given query text, the positively related articles, and the first pseudo-query texts; construct negative samples from the given query text, the positively related articles, the irrelevant articles, the first pseudo-query texts, and the second pseudo-query texts; input the positive and negative samples into the twin-tower recall model for iterative training; and input the query to be searched into the trained twin-tower recall model to obtain the retrieval result. By using pseudo-queries, the present application can construct multiple strongly related negative samples for model learning, greatly improving the retrieval efficiency and accuracy of the model.
Through the different pseudo-queries obtained by sampling, the present application gives the article representation descriptions from multiple perspectives, so that the core information in the article representation is reinforced and made prominent.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by that processor produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the present invention to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be called second information and, similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310469306.0A CN116501837A (en) | 2023-04-19 | 2023-04-19 | Retrieval method, system, device and storage medium based on twin-tower recall |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116501837A (en) | 2023-07-28 |
Family
ID=87317827
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310469306.0A (Pending), CN116501837A (en) | Retrieval method, system, device and storage medium based on twin-tower recall | 2023-04-19 | 2023-04-19 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116501837A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118536607A (en) * | 2024-07-25 | 2024-08-23 | 安徽思高智能科技有限公司 | A method and device for generating large model API sequence based on retrieval enhancement |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113761105A (en) * | 2021-05-24 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Text data processing method, apparatus, device and medium |
| CN114416929A (en) * | 2022-01-27 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Sample generation method, device, equipment and storage medium of entity recall model |
| CN115329749A (en) * | 2022-10-14 | 2022-11-11 | 成都数之联科技股份有限公司 | Recall and ordering combined training method and system for semantic retrieval |
Non-Patent Citations (1)
| Title |
|---|
| ZEHAN LI ET AL: "Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval", 《ARXIV》, 8 August 2022 (2022-08-08), pages 1 - 5 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||