
CN118484527A - A data processing method and system based on multi-source heterogeneous matching - Google Patents

Info

Publication number
CN118484527A
Authority
CN
China
Prior art keywords
expert
label information
similarity
optimal
paper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202410618584.2A
Other languages
Chinese (zh)
Inventor
吴灏
周秋杏
陈飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhi Industry University Research Technology Achievement Transformation Shenzhen Co ltd
Original Assignee
Qizhi Industry University Research Technology Achievement Transformation Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhi Industry University Research Technology Achievement Transformation Shenzhen Co ltd
Priority to CN202410618584.2A
Publication of CN118484527A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method and system based on multi-source heterogeneous matching, relating to the field of data processing. The method comprises: performing text splitting on the demand information for product research and development (R&D) in a target company to obtain R&D demand label information; acquiring expert label information from an expert database, patent label information from a patent database, and paper label information from a paper database; calculating a first similarity between the R&D demand label information and the patent label information, and a second similarity between the optimal patent label information and the paper label information, and selecting the patent and paper label information with the largest similarity as the optimal matching result; and then selecting the patent, paper, and expert with the largest similarity as the optimal matching result. A user can thereby obtain the information most relevant to an R&D demand in the three dimensions of patents, papers, and experts at the same time, which facilitates comprehensive querying and global decision-making and improves the user experience.

Description

A data processing method and system based on multi-source heterogeneous matching

Technical Field

The present application relates to the field of data processing, and in particular to a data processing method and system based on multi-source heterogeneous matching.

Background Art

With the rapid development of science and technology, the number of patent applications rises year by year and competition across industries grows increasingly fierce. Enterprises must not only quickly discover and make use of relevant patented technologies, but also draw on cutting-edge academic results and cooperate with leading experts and scholars in their field.

Existing information service systems usually manage and retrieve different types of data, such as patents, papers, and experts, separately. For patents, retrieval is mainly based on keyword matching, with a keyword index providing fast lookup and relevance ranking of patent texts. For papers, retrieval mainly relies on academic literature databases, which support conditional search over structured information such as titles, abstracts, and keywords. For experts, users mainly browse and query manually compiled expert directories and profiles. These methods are independent of one another and lack integration, so they cannot provide comprehensive cross-domain, cross-resource retrieval or associated recommendations. As a result, it is difficult for users to discover the connections among patents, papers, and experts in massive heterogeneous data or to obtain targeted information services, which degrades the user experience.

Summary of the Invention

The present application provides a data processing method and system based on multi-source heterogeneous matching. Text splitting is performed on the demand information for product R&D within a target company to obtain R&D demand label information. Expert label information in an expert database, patent label information in a patent database, and paper label information in a paper database are then obtained. A first similarity between the R&D demand label information and the patent label information, and a second similarity between the optimal patent label information and the paper label information, are calculated, and the patent and paper label information with the greatest similarity are selected as the optimal matching result, making full use of the semantic association between label information from different sources. For each expert, a third similarity between the expert patent label and the optimal patent label and a fourth similarity between the expert paper label and the optimal paper label are calculated, and the third and fourth similarities are combined by weighted average. The patent, paper, and expert with the greatest similarity are selected as the optimal matching result, so that a user can simultaneously obtain the most relevant information for a given R&D demand in the three dimensions of patents, papers, and experts. This facilitates comprehensive querying and global decision-making and improves the user experience.

In a first aspect, the present application provides a data processing method based on multi-source heterogeneous matching, applied to a data processing system based on multi-source heterogeneous matching. The method includes:

performing text splitting on the demand information for product R&D within a target company to obtain the R&D demand label information for that product R&D;

obtaining expert label information from an expert database, patent label information from a patent database, and paper label information from a paper database, wherein the expert database contains, for each expert, the expert label information together with the corresponding expert patent label information and expert paper label information;

calculating a first similarity between the R&D demand label information and each piece of patent label information, and selecting the optimal patent label information corresponding to the maximum first similarity;

calculating a second similarity between the optimal patent label information and each piece of paper label information, and selecting the optimal paper label information corresponding to the maximum second similarity;

for each expert, calculating a third similarity between the expert patent label information and the optimal patent label information and a fourth similarity between the expert paper label information and the optimal paper label information, and taking the expert with the highest weighted average similarity as the optimal target expert, wherein the weighted average similarity is the weighted average of the third similarity and the fourth similarity; and

sending the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client.

In the above embodiment, text splitting is performed on the demand information for product R&D within the target company to obtain R&D demand label information, and the expert, patent, and paper label information is obtained from the respective databases. The first similarity between the R&D demand label information and the patent label information, and the second similarity between the optimal patent label information and the paper label information, are calculated, and the patent and paper label information with the greatest similarity are selected as the optimal matching result, making full use of the semantic association between label information from different sources. For each expert, the third similarity (expert patent label against optimal patent label) and the fourth similarity (expert paper label against optimal paper label) are calculated and combined by weighted average. The patent, paper, and expert with the greatest similarity are selected as the optimal matching result, so that a user can simultaneously obtain the most relevant information for a given R&D demand in the three dimensions of patents, papers, and experts, which facilitates comprehensive querying and global decision-making.
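The chained selection described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not fix a similarity measure at this point, so a Jaccard overlap between tag sets is used as a placeholder, and the names `Expert`, `match`, `jaccard`, and the weight `w_patent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    # Hypothetical record: one expert with tag sets for their patents and papers.
    name: str
    patent_tags: frozenset
    paper_tags: frozenset


def jaccard(a: frozenset, b: frozenset) -> float:
    """Placeholder similarity |a & b| / |a | b| (an assumption, not from the source)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(demand_tags, patents, papers, experts, sim=jaccard, w_patent=0.5):
    """Return (best patent id, best paper id, best expert name)."""
    # First similarity: demand tags against every patent; keep the maximum.
    best_patent = max(patents, key=lambda p: sim(demand_tags, patents[p]))
    # Second similarity: the chosen patent against every paper; keep the maximum.
    best_paper = max(papers, key=lambda q: sim(patents[best_patent], papers[q]))

    def weighted(expert: Expert) -> float:
        # Third and fourth similarities, combined by a weighted average.
        s3 = sim(expert.patent_tags, patents[best_patent])
        s4 = sim(expert.paper_tags, papers[best_paper])
        return w_patent * s3 + (1.0 - w_patent) * s4

    best_expert = max(experts, key=weighted)
    return best_patent, best_paper, best_expert.name
```

Note that each stage depends only on the winner of the previous stage, so an early mismatch propagates to the paper and expert selections; the sketch mirrors that ordering rather than scoring all three dimensions jointly.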

In combination with some embodiments of the first aspect, in some embodiments, the step of calculating the first similarity between the R&D demand label information and the patent label information and selecting the optimal patent label information corresponding to the maximum first similarity specifically includes:

performing text preprocessing on the R&D demand label information to obtain an R&D demand keyword set;

performing text preprocessing on the patent label information to obtain patent keyword sets;

calculating the first similarity between the R&D demand keyword set and each patent keyword set to obtain a first similarity collection; and

sorting the first similarity collection in descending order of first similarity value, and taking the patent label information with the largest first similarity value as the optimal patent label information.

In the above embodiment, text preprocessing is performed separately on the R&D demand label information and the patent label information to extract their keyword sets, and the similarity between the keyword sets is calculated to obtain a similarity collection. The collection is sorted in descending order of similarity, and the patent label information with the greatest similarity is selected as the optimal matching result. By extracting keywords and comparing their degree of similarity, this use of text mining and similarity calculation matches R&D demands to patents precisely, and the ranking identifies the most similar patents, providing high-quality data for subsequent patent analysis and use.
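A minimal sketch of these sub-steps is given below. The preprocessing and the set similarity are assumptions: the text does not name a tokenizer or a measure, so the example lowercases English text, splits on non-alphanumerics, drops a small illustrative stop-word list, and uses Jaccard overlap. Chinese label text would need a word segmenter instead. The function names are hypothetical.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "of", "the", "with"}  # illustrative only


def keywords(label_text: str) -> set[str]:
    """Text preprocessing: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.split(r"[^0-9a-z]+", label_text.lower())
    return {t for t in tokens if t and t not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    """Set overlap |a & b| / |a | b|, used here as the first similarity (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_patents(demand_label: str, patent_labels: dict[str, str]) -> list:
    """Return [(similarity, patent_id), ...] sorted from most to least similar."""
    demand_kw = keywords(demand_label)
    scored = [(jaccard(demand_kw, keywords(text)), pid)
              for pid, text in patent_labels.items()]
    # Descending by similarity; ties broken by patent id so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored
```

The first element of the returned list corresponds to the optimal patent label information; keeping the full ranking rather than only the maximum makes it easy to inspect near misses.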

In combination with some embodiments of the first aspect, in some embodiments, the step of calculating, for each expert, the third similarity between the expert patent label information and the optimal patent label information and the fourth similarity between the expert paper label information and the optimal paper label information, and taking the expert with the highest weighted average similarity as the optimal target expert, the weighted average similarity being the weighted average of the third similarity and the fourth similarity, specifically includes:

vectorizing the optimal patent label information, the optimal paper label information, and each expert's expert patent label information and expert paper label information to obtain an optimal patent feature vector, an optimal paper feature vector, expert patent feature vectors, and expert paper feature vectors;

calculating the third similarity between the optimal patent feature vector and each expert patent feature vector to obtain a third similarity set;

calculating the fourth similarity between the optimal paper feature vector and each expert paper feature vector to obtain a fourth similarity set;

performing a weighted summation of the third similarity and the fourth similarity corresponding to each expert to obtain a weighted average similarity set, wherein the weighted average similarity set contains a plurality of weighted average similarities, each being the weighted average of the third similarity and the fourth similarity of one expert; and

determining the expert with the highest weighted average similarity in the weighted average similarity set as the optimal target expert.

In the above embodiment, the optimal patent label, the optimal paper label, and each expert's expert patent label and expert paper label are vectorized to obtain their respective feature vectors. The similarity between the optimal patent feature vector and each expert patent feature vector, and the similarity between the optimal paper feature vector and each expert paper feature vector, are then calculated. The two similarities of each expert are combined by weighted summation, and the expert with the highest weighted average similarity is selected as the optimal target expert. Fusing the patent and paper dimensions with assigned weights evaluates each expert's overall match more comprehensively and in a more balanced way, improving the soundness and credibility of the expert recommendation.
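The vectorize-then-compare procedure can be illustrated as below. The text does not specify the vectorization or the vector similarity, so this sketch assumes bag-of-words term counts and cosine similarity; the weights `w3` and `w4` and all function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(label_text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for whatever embedding is actually used."""
    return Counter(label_text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def best_expert(best_patent: str, best_paper: str, experts: dict,
                w3: float = 0.6, w4: float = 0.4):
    """experts maps name -> (expert patent label text, expert paper label text)."""
    p_vec, q_vec = vectorize(best_patent), vectorize(best_paper)
    scores = {}
    for name, (patent_label, paper_label) in experts.items():
        s3 = cosine(p_vec, vectorize(patent_label))  # third similarity
        s4 = cosine(q_vec, vectorize(paper_label))   # fourth similarity
        # Weighted average of the two similarities (weights need not sum to 1).
        scores[name] = (w3 * s3 + w4 * s4) / (w3 + w4)
    return max(scores, key=scores.get), scores
```

Dividing by `w3 + w4` keeps the score in the same range as the individual similarities, which makes scores comparable if the weights are later tuned.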

In combination with some embodiments of the first aspect, in some embodiments, after the step of sending the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client, the method further includes:

receiving requirement document information sent by the client;

performing text mining on the requirement document information to obtain technical keywords, product feature keywords, and application scenario keywords;

searching the patent database with the technical keywords, the product feature keywords, and the application scenario keywords to obtain a patent document collection containing at least one patent document;

calculating a risk value from the patent document keywords of each patent document and the requirement document information to obtain a risk patent collection, wherein the risk patent collection contains at least one patent document whose risk value is greater than a preset risk threshold; and

sending the risk patent collection to the client.

In the above embodiment, the client uploads a requirement document, and text mining extracts technical keywords, product feature keywords, and application scenario keywords from it. These keywords are used to search the patent database and obtain a batch of relevant patent documents. A risk value is then calculated for each patent according to the degree of similarity between its keywords and the requirement document, and the high-risk patents are collected into a risk patent collection and returned to the client. Mining the multi-dimensional key information of the requirement document and basing the patent search and risk assessment on it allows high-risk patents related to the demand to be identified more accurately and comprehensively. This gives the client a clear picture of potential infringement risks, provides reference and guidance for formulating and implementing its R&D strategy, and reduces the enterprise's legal risk.
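As a rough illustration of the risk screening, the sketch below scores each retrieved patent by how much of its keyword set also appears in the requirement document. The formula is an assumption: the text only says the risk value follows from the similarity between patent keywords and the requirement document. The threshold value, the function names, and the patent identifiers in the usage note are hypothetical.

```python
def risk_value(patent_keywords: set, demand_keywords: set) -> float:
    """Share of the patent's keywords that also occur in the requirement document."""
    if not patent_keywords:
        return 0.0
    return len(patent_keywords & demand_keywords) / len(patent_keywords)


def risky_patents(retrieved: dict, demand_keywords: set,
                  threshold: float = 0.5) -> dict:
    """Return {patent_id: risk} for documents whose risk exceeds the threshold."""
    risks = {pid: risk_value(kw, demand_keywords) for pid, kw in retrieved.items()}
    # Strictly greater than the threshold, matching "greater than a preset risk threshold".
    return {pid: r for pid, r in risks.items() if r > threshold}
```

For example, with requirement keywords `{"lidar", "obstacle", "drone", "mapping"}`, a patent tagged `{"lidar", "drone", "mapping"}` scores 1.0 and is flagged, while one tagged `{"lidar", "camera", "tractor", "plough"}` scores 0.25 and is not.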

In combination with some embodiments of the first aspect, in some embodiments, after the step of sending the risk patent collection to the client, the method further includes:

storing the candidate patent documents whose risk value is not greater than the preset risk threshold in the alternative patents column of the preset patent database.

In the above embodiment, in addition to identifying high-risk patents, the solution stores the other patent documents, whose risk values do not exceed the preset threshold, in the alternative patents column. Although these alternative patents carry lower risk, they are still relevant to the demand and have reference value for the enterprise's R&D work. By building up an alternative patent library, the enterprise can avoid risk while studying and drawing on the relevant technical content of these patents in a targeted way to guide and optimize its own R&D direction and technical solutions.
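One way to picture the alternative patents column is a single table in which every retrieved patent is filed under either a risk section or an alternative section. The sketch below uses an in-memory SQLite database from the Python standard library; the table name `patent_sections`, its columns, and the threshold are assumptions made for illustration, not the storage layout of the described system.

```python
import sqlite3


def store_by_risk(conn: sqlite3.Connection, risks: dict,
                  threshold: float = 0.5) -> None:
    """File each patent under 'risk' (risk > threshold) or 'alternative' (otherwise)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patent_sections "
        "(patent_id TEXT PRIMARY KEY, risk REAL, section TEXT)"
    )
    rows = [(pid, r, "risk" if r > threshold else "alternative")
            for pid, r in risks.items()]
    conn.executemany("INSERT OR REPLACE INTO patent_sections VALUES (?, ?, ?)", rows)
    conn.commit()


def alternatives(conn: sqlite3.Connection) -> list:
    """Patent ids stored in the alternative section, in id order."""
    cur = conn.execute(
        "SELECT patent_id FROM patent_sections "
        "WHERE section = 'alternative' ORDER BY patent_id"
    )
    return [row[0] for row in cur]
```

A patent whose risk value equals the threshold lands in the alternative section, consistent with "not greater than the preset risk threshold".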

In combination with some embodiments of the first aspect, in some embodiments, after the step of obtaining the expert label information in the expert database, the patent label information in the patent database, and the paper label information in the paper database, the method specifically includes:

obtaining user satisfaction scores fed back by the client, the user satisfaction scores including a first satisfaction score for the optimal expert label information, a second satisfaction score for the optimal patent label information, and a third satisfaction score for the optimal paper label information;

if any of the user satisfaction scores is lower than a preset score threshold, obtaining feedback information sent by the client;

extracting label screening conditions from the feedback information using a large language model; and

screening, according to the label screening conditions, the initial expert label information in the expert database to obtain the expert label information, the initial patent label information in the patent database to obtain the patent label information, and the initial paper label information in the paper database to obtain the paper label information.

In the above embodiment, the user's satisfaction scores for the recommendation results are obtained and checked for any score below the threshold. If satisfaction is low, the user's feedback is obtained, label screening conditions are extracted from it using a large language model, and the initial labels of experts, patents, and papers are screened accordingly, yielding label information that is more refined and closer to the user's needs. The label screening strategy is thus adjusted dynamically according to the user's subjective evaluation and opinions, so that the labels better fit the user's actual needs and preferences. Using a large language model to extract the key information from user feedback improves personalized service capability, continuously refines the knowledge matching results, and increases user stickiness.
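The feedback loop can be sketched as follows. The large language model is deliberately left as a caller-supplied function, since the text names no model or API; the shape of the screening conditions (sets of tags to include and to exclude) and all function names are assumptions made for this illustration.

```python
from typing import Callable


def needs_refinement(scores: dict, threshold: float) -> bool:
    """True if any of the three satisfaction scores is below the threshold."""
    return any(s < threshold for s in scores.values())


def refine_labels(initial_labels: dict, feedback: str,
                  extract_conditions: Callable[[str], dict]) -> dict:
    """Screen initial labels with conditions extracted from free-text feedback.

    extract_conditions stands in for the LLM call and is assumed to return
    {"include": set_of_tags, "exclude": set_of_tags}; either key may be absent.
    """
    cond = extract_conditions(feedback)
    include = cond.get("include", set())
    exclude = cond.get("exclude", set())
    kept = {}
    for item_id, tags in initial_labels.items():
        if exclude & tags:
            continue  # item carries a tag the user rejected
        if include and not (include & tags):
            continue  # item lacks every tag the user asked for
        kept[item_id] = tags
    return kept
```

The same `refine_labels` call would be applied in turn to the initial expert, patent, and paper labels before the matching steps are rerun.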

In combination with some embodiments of the first aspect, in some embodiments, after the step of obtaining the user satisfaction scores fed back by the client, the user satisfaction scores including a first satisfaction score for the optimal expert label information, a second satisfaction score for the optimal patent label information, and a third satisfaction score for the optimal paper label information, the method further includes:

if none of the user satisfaction scores is lower than the preset score threshold, pushing to the client the detailed introduction of the optimal target expert, the patent details corresponding to the optimal patent label information, and the paper details corresponding to the optimal paper label information.

The above embodiment covers the other satisfaction case, in which all scores reach the preset threshold, indicating that the user is reasonably satisfied with the current recommendation results. In that case more detailed information is pushed to the user, including a detailed introduction of the optimal expert and the detailed content of the optimal patent and paper. With this additional detail, the user can understand the retrieved knowledge resources more fully and in greater depth, which helps the user interpret and use the recommendation results and improves the overall service experience.

In a second aspect, an embodiment of the present application provides a data processing system based on multi-source heterogeneous matching, comprising one or more processors and a memory. The memory is coupled to the one or more processors and is used to store computer program code comprising computer instructions. The one or more processors invoke the computer instructions to cause the data processing system based on multi-source heterogeneous matching to perform the method described in the first aspect or in any possible implementation of the first aspect.

In a third aspect, an embodiment of the present application provides a computer program product comprising instructions. When the computer program product runs on a data processing system based on multi-source heterogeneous matching, it causes that system to perform the method described in the first aspect or in any possible implementation of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium comprising instructions. When the instructions are executed on a data processing system based on multi-source heterogeneous matching, they cause that system to perform the method described in the first aspect or in any possible implementation of the first aspect.

It can be understood that the data processing system based on multi-source heterogeneous matching provided in the second aspect, the computer program product provided in the third aspect, and the computer-readable storage medium provided in the fourth aspect are all used to perform the method provided in the embodiments of the present application. For their beneficial effects, reference may therefore be made to the beneficial effects of the corresponding method, which are not repeated here.

本申请实施例中提供的一个或多个技术方案,至少具有如下技术效果或优点:One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:

1、本申请通过对目标公司内产品研发的需求信息进行文本拆分处理,得到研发需求标签信息,再获取专家数据库中的专家标签信息、专利数据库中的专利标签信息和论文数据库中的论文标签信息,通过分别计算研发需求标签信息和专利标签信息的第一相似度,以及最优专利标签信息与论文标签信息的第二相似度,选取相似度最大的专利和论文标签信息作为最优匹配结果,充分利用了不同来源标签信息之间的语义关联,计算每个专家的专家专利标签与最优专利标签的第三相似度、专家论文标签与最优论文标签的第四相似度,并将第三、第四相似度加权平均,选取相似度最大的专利、论文和专家作为最优匹配结果,使得用户可以同时获取某一研发需求在专利、论文、专家三个维度的最相关信息,方便了用户的综合查询和全局决策,提高了用户体验。1. This application obtains R&D demand label information by performing text splitting processing on the product R&D demand information within the target company, and then obtains expert label information in the expert database, patent label information in the patent database, and paper label information in the paper database. By respectively calculating the first similarity between the R&D demand label information and the patent label information, and the second similarity between the optimal patent label information and the paper label information, the patent and paper label information with the greatest similarity are selected as the optimal matching results. The semantic association between label information from different sources is fully utilized to calculate the third similarity between each expert's expert patent label and the optimal patent label, and the fourth similarity between the expert paper label and the optimal paper label. The third and fourth similarities are weighted averaged, and the patents, papers, and experts with the greatest similarity are selected as the optimal matching results, so that users can simultaneously obtain the most relevant information in the three dimensions of patents, papers, and experts for a certain R&D demand, which facilitates users' comprehensive query and global decision-making and improves user experience.

2. In the present application, the client uploads a demand document; text mining is performed on the document to extract technical keywords, product feature keywords, and application scenario keywords; and these keywords are used to search the patent database for a batch of relevant patent documents. A risk value is then calculated for each patent according to the similarity between its keywords and the demand document, and the high-risk patents are screened out to form a risk patent collection that is fed back to the customer. By mining the multi-dimensional key information of the demand document and carrying out the patent search and risk assessment on that basis, high-risk patents related to the demand can be found more accurately and comprehensively. This gives the customer a clear understanding of potential infringement risks, provides reference and guidance for formulating and implementing its R&D strategy, and reduces the legal risk to the enterprise.

3. The present application obtains the user's satisfaction scores for the recommendation results and determines whether any score is below a threshold. If satisfaction is low, it further obtains the user's feedback, uses a large language model to extract label screening conditions from that feedback, and filters the initial expert, patent, and paper labels accordingly to obtain label information that is more refined and closer to the user's needs. Dynamically adjusting the label screening strategy according to the user's subjective evaluation and opinions brings the labels more in line with the user's actual needs and preferences. At the same time, using the large language model to extract key information from user feedback can improve the personalized service capability of the system, continuously optimize the knowledge matching results, and enhance user stickiness.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a data processing method based on multi-source heterogeneous matching in an embodiment of the present application;

FIG. 2 is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of the present application;

FIG. 3 is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of the present application;

FIG. 4 is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a physical device of a data processing system based on multi-source heterogeneous matching in an embodiment of the present application.

DETAILED DESCRIPTION

The terms used in the following embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. As used in the specification of the present application, the singular expressions "a", "an", "the above", "the", and "this" are intended to include the plural expressions as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in the present application refers to and encompasses any and all possible combinations of one or more of the listed items.

In the following, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, "a plurality of" means two or more.

For ease of understanding, the application scenarios of the embodiments of the present application are introduced below.

In the related art, the required technical information can be found in databases of patents, papers, and the like by means of keyword retrieval and manual screening. The following describes a scenario in which the related-art data processing approach is used:

Company A, an electronic equipment manufacturer, has in recent years been developing a new generation of smart watch products. To improve the innovation and competitiveness of its products, Company A urgently needs to understand the latest technological trends and research results in this field, and hopes to find relevant patents, papers, and industry experts to guide and support its R&D work. Company A tried to solve this problem with existing information service systems. It first used a patent search system to look up relevant patents by entering keywords. However, because the technical points involved in smart watches are scattered, simple keyword matching often misses patents that are important but worded differently. The search results are also mixed with a large number of irrelevant or low-quality patents that must be screened and judged manually. These limitations prevented Company A from using the various information resources efficiently and coherently, which significantly affected the progress and quality of its R&D innovation.

By contrast, the data processing method based on multi-source heterogeneous matching in the embodiments of the present application extracts structured label information from the patent database, the paper database, and the expert database, and calculates its semantic similarity to the R&D demand, thereby achieving cross-domain, cross-object association matching. This not only greatly improves the efficiency of information retrieval and screening, but also uncovers implicit knowledge that is worded differently yet intrinsically related. The following describes a scenario in which the data processing method based on multi-source heterogeneous matching of the present application is used:

Company A decides to adopt the data processing solution based on multi-source heterogeneous matching to solve the above problems. First, it distills the key technical requirements of the smart watch R&D project and converts them into structured label information. The corresponding patent labels, paper labels, and expert labels are then extracted automatically from the patent database, the paper database, and the expert database. By calculating the semantic similarity between the demand labels and the patent and paper labels, the most relevant patents and papers are found quickly. Next, the degree of association between each expert and that expert's papers and patents is considered together, and the target expert who best matches the demand is determined by a weighted sum. Finally, the matched optimal patent, paper, and expert information is fed back to Company A's R&D team. With this solution, R&D personnel can quickly obtain the most critical and authoritative technical references and gain a full picture of the principles, applications, and expert opinions of the relevant technologies. Because this information is highly correlated and aggregated, R&D personnel gain a clear understanding of the technical context and development trends of smart watches, which makes the R&D work more targeted and forward-looking. In addition, through the intelligently matched industry experts, Company A establishes an in-depth industry-university-research cooperation relationship that provides intellectual support for subsequent continuous innovation.

It can be seen that the data processing method based on multi-source heterogeneous matching in the embodiments of the present application not only enables rapid acquisition of relevant knowledge resources across fields and objects, but also addresses the information fragmentation and low correlation found in traditional keyword retrieval and manual screening. It thereby achieves deep integration and efficient use of knowledge resources and provides all-round support for enterprise R&D innovation.

For ease of understanding, the flow of the method provided by this embodiment is described below in combination with the above scenario. Please refer to FIG. 1, which is a schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of the present application.

S101. Perform text splitting on the product R&D demand information of the target company to obtain the R&D demand label information for the product R&D of the target company.

Here, the target company is an enterprise that needs to carry out product R&D and seeks support from relevant knowledge resources. The product R&D demand information is the statement of technical requirements that the target company puts forward for a specific product R&D project; it is usually presented as text and describes the R&D goals, key technical points, application scenarios, and so on. Text splitting is the process of using natural language processing techniques to convert the unstructured demand text into structured information units. The R&D demand label information consists of keywords or phrases extracted from the demand text; they characterize the core content of the R&D demand and facilitate subsequent label matching.

This step is usually performed when the target company puts forward a new product R&D demand, and it is the starting point of the entire multi-source heterogeneous matching process. Specifically, the target company first provides a detailed description of the product R&D demand in the form of a document. The document is then preprocessed, for example by word segmentation, part-of-speech tagging, and named entity recognition. On this basis, text mining techniques such as keyword extraction and topic models are used to automatically identify the keywords, phrases, or entities that represent the core R&D demand, and these are stored as labels.

S102. Obtain the expert label information in the expert database, the patent label information in the patent database, and the paper label information in the paper database, where the expert database contains, for each expert, the expert label information and the corresponding expert patent label information and expert paper label information.

Here, the expert database is a structured data collection that stores information about experts in various fields, including their basic attributes, research directions, and achievements. The patent database is a data collection of patent documents and their metadata. The paper database is a data collection of academic papers and their metadata. The expert label information is a keyword description of an expert's research field, technical direction, and other characteristics. The patent label information and the paper label information are keyword descriptions of the core content of patents and papers, respectively. The expert patent label information and the expert paper label information are the label information corresponding to each expert's patents and papers.

This step prepares the data for the subsequent multi-source heterogeneous matching after the R&D demand labels have been obtained. Specifically, connections are first made to the expert, patent, and paper databases, and the structured representation of each type of data is extracted according to a predefined database schema. For expert data, the attribute values to obtain include the expert's basic information, research field, and main achievements, together with the label words that reflect the expert's research characteristics. For patent and paper data, the metadata of each record is obtained, such as title, abstract, keywords, and classification number, and the label words that best represent the core content are extracted. The correspondence between each expert and the patents and papers that expert has published is also obtained, in order to establish a mapping between experts and their achievements.

In some embodiments, the expert label information in the expert database, the patent label information in the patent database, and the paper label information in the paper database can be obtained in a variety of ways. Optionally, for the patent database, the text of key parts of each patent, such as the title, abstract, and claims, is obtained, and a text mining method similar to that used for demand label extraction is then applied to identify the keywords that represent the core technical points of the patent; these keywords form the patent labels. Optionally, taking the expert as the center, each type of attribute information is labeled separately, for example by matching the research field name against a predefined domain vocabulary to generate domain labels and by generating research direction labels from paper keywords; the labels of all dimensions are then combined to form the complete expert label information. It can be understood that other knowledge engineering and natural language processing techniques, such as ontology construction and semantic analysis, can also be used to represent and organize expert, patent, and paper data as labels, which is not limited here.

S103. Calculate the first similarity between the R&D demand label information and each item of patent label information, and select the optimal patent label information corresponding to the maximum first similarity.

Here, the first similarity is the degree of semantic similarity between the R&D demand labels and the patent labels, and measures how well the two match in terms of technical subject, functional application, and so on. The optimal patent label information refers to the patent labels with the highest first similarity to the R&D demand labels, and reflects the patent content that best fits the R&D demand.

This step is performed after the R&D demand labels and patent labels have been obtained, with the aim of finding the patent knowledge most relevant to the R&D demand. Specifically, a suitable similarity measure is first selected, such as cosine similarity or Jaccard similarity, and the labels are represented as term-frequency vectors, TF-IDF vectors, or the like. The similarity score between each R&D demand label and each patent label is then calculated with this measure, yielding a similarity matrix. Next, for each R&D demand label, the Top-N patent labels with the highest similarity are selected as candidate optimal patent labels. Finally, the candidate optimal patent labels of all R&D demand labels are combined, and the optimal matching patent label set for the R&D demand as a whole is decided by a strategy such as weighted averaging or voting.
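
The similarity matrix and Top-N selection described above can be sketched as follows. This is a minimal illustration that assumes Jaccard similarity over term sets; the function names and the sample labels (`demand_labels`, `patent_labels`, `P1` to `P3`) are hypothetical and do not come from the application.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two term sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_n_patents(
    demand_labels: Dict[str, Set[str]],
    patent_labels: Dict[str, Set[str]],
    n: int = 2,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each demand label, return the n patent labels with the highest similarity."""
    result = {}
    for demand_name, demand_terms in demand_labels.items():
        # One row of the similarity matrix: this demand label against every patent label.
        row = [(name, jaccard(demand_terms, terms)) for name, terms in patent_labels.items()]
        # Sort by descending score; break ties by name so the order is deterministic.
        row.sort(key=lambda item: (-item[1], item[0]))
        result[demand_name] = row[:n]
    return result


# Hypothetical sample data for illustration only.
demand_labels = {
    "battery life": {"low", "power", "battery"},
    "heart rate": {"heart", "rate", "sensor"},
}
patent_labels = {
    "P1": {"low", "power", "display"},
    "P2": {"heart", "rate", "sensor", "optical"},
    "P3": {"battery", "charging"},
}

candidates = top_n_patents(demand_labels, patent_labels, n=2)
for demand_name, ranked in candidates.items():
    print(demand_name, ranked)
```

The final aggregation across demand labels (weighted averaging or voting) is left out here, since the text does not fix a particular strategy.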

In some embodiments, the similarity between R&D demand labels and patent labels can be calculated, and the optimal patent labels selected, in a variety of ways. Optionally, the label words are embedded with a pre-trained word vector model such as Word2Vec or GloVe, the mean of the label word vectors is taken as the embedding vector of the whole label, and the similarity of two labels is then calculated by vector dot product, cosine similarity, or a similar method. Optionally, a semantic relationship network between label words is built from an external knowledge base such as WordNet, and the semantic similarity between labels is calculated from shortest paths, random walks, or similar operations on the graph. When selecting the optimal patent labels, the patent labels can first be sorted by similarity and those below a similarity threshold filtered out; the top N of the remaining patent labels, in descending order of similarity, are then taken as the optimal matching result. It can be understood that other semantic matching and ranking algorithms, such as deep learning and topic models, can also be used to calculate the similarity between R&D demand labels and patent labels and to select the best ones, which is not limited here.
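
The embedding-based option with threshold filtering can be sketched as below. This is an illustration under stated assumptions: the two-dimensional `embeddings` table is a toy stand-in for vectors that would come from a pre-trained model such as Word2Vec or GloVe, and the names `select_optimal`, `threshold`, and `n` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def mean_vector(words: List[str], embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of the known words; None if no word is in the table."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_optimal(
    demand_words: List[str],
    patent_labels: Dict[str, List[str]],
    embeddings: Dict[str, Vector],
    threshold: float = 0.5,
    n: int = 2,
) -> List[Tuple[str, float]]:
    """Drop patent labels below the threshold, then keep the top n by similarity."""
    demand_vec = mean_vector(demand_words, embeddings)
    if demand_vec is None:
        return []
    scored = []
    for name, words in patent_labels.items():
        vec = mean_vector(words, embeddings)
        if vec is None:
            continue
        score = cosine(demand_vec, vec)
        if score >= threshold:
            scored.append((name, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]


# Toy embedding table (an assumption, not real model output).
embeddings = {
    "battery": [1.0, 0.0],
    "power": [1.0, 0.0],
    "charging": [1.0, 1.0],
    "sensor": [0.0, 1.0],
    "optical": [0.0, 1.0],
}
patent_labels = {
    "PA": ["power", "charging"],
    "PB": ["sensor", "optical"],
    "PC": ["battery"],
}
optimal = select_optimal(["battery", "power"], patent_labels, embeddings)
print(optimal)
```

Filtering before truncation means fewer than `n` labels may be returned when few candidates clear the threshold, which matches the order of operations in the text.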

S104. Calculate the second similarity between the optimal patent label information and each item of paper label information, and select the optimal paper label information corresponding to the maximum second similarity.

Here, the second similarity is the degree of semantic similarity between the optimal patent labels and the paper labels, and measures how consistent the two are in terms of technical subject, points of innovation, and so on. The optimal paper label information refers to the paper labels with the highest second similarity to the optimal patent labels, and reflects the academic research results that best fit the patent content.

This step is performed after the optimal patent labels have been obtained, with the aim of finding the research paper knowledge most relevant to the patented technology. Specifically, using a calculation method similar to that for the first similarity, the second similarity between each optimal patent label and each paper label is calculated, yielding a new similarity matrix. Then, for each optimal patent label, the Top-M paper labels with the highest second similarity are selected as candidate optimal paper labels. Finally, the candidate optimal paper labels of all optimal patent labels are combined, and the optimal matching paper label set for the R&D demand as a whole is decided by a strategy such as weighted averaging or voting.

S105. For each expert, calculate the third similarity between the expert patent label information and the optimal patent label information, and the fourth similarity between the expert paper label information and the optimal paper label information, and take the expert with the highest weighted average similarity as the optimal target expert, where the weighted average similarity is the weighted average of the third similarity and the fourth similarity.

Here, the expert patent label information is the set of label information of the patents published by each expert, and reflects that expert's patent R&D characteristics. The expert paper label information is the set of label information of the papers published by each expert, and reflects that expert's academic research characteristics. The third similarity measures how well the expert's patent labels match the optimal patent labels in terms of technical subject, points of innovation, and so on. The fourth similarity measures how consistent the expert's paper labels are with the optimal paper labels in terms of research field, key technologies, and so on. The optimal target expert is the expert who, taking both patent and paper characteristics into account, best matches the R&D demand. The weighted average similarity is the weighted average of the third similarity and the fourth similarity, and is used to balance the contributions of the patent labels and the paper labels to the expert match.

In some embodiments, the similarity between the expert patent labels, the expert paper labels, and the optimal labels can be calculated, and the optimal target expert selected, in a variety of ways. Optionally, the third and fourth similarities are calculated with the same method used for the first and second similarities, such as word vector similarity or topic similarity, and a combined matching score is then obtained as a weighted average using preset weights, for example 0.6 for patents and 0.4 for papers. Optionally, the patent labels and paper labels are each embedded with Doc2Vec document embedding, and an end-to-end expert matching model, such as a two-tower DNN model, is then jointly trained to learn the matching relationship between expert features and demand features; the optimal expert is finally obtained by ranking the matching probabilities output by the model. When determining the optimal target expert, the Top-1 expert with the highest weighted average similarity (or matching probability) can be selected directly; alternatively, the Top-K experts can be taken and combined with other factors, such as the experts' influence and willingness to cooperate, to determine the final optimal expert through a multi-criteria decision-making method. It can be understood that other algorithms, such as collaborative filtering and graph-network-based representation learning, can also be used to calculate and select the match between experts and demands, which is not limited here.
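
The weighted-average option can be written out as a short sketch. It uses the example weights of 0.6 for patents and 0.4 for papers mentioned above; the expert names and similarity values are invented for illustration, and `weighted_score` and `pick_expert` are hypothetical helper names.

```python
from typing import Dict, Tuple


def weighted_score(third: float, fourth: float,
                   w_patent: float = 0.6, w_paper: float = 0.4) -> float:
    """Weighted average of the third (patent) and fourth (paper) similarities."""
    return (w_patent * third + w_paper * fourth) / (w_patent + w_paper)


def pick_expert(similarities: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return the expert with the highest weighted average similarity (Top-1)."""
    best = max(similarities, key=lambda name: weighted_score(*similarities[name]))
    return best, weighted_score(*similarities[best])


# Hypothetical (third similarity, fourth similarity) pairs per expert.
similarities = {
    "Expert A": (0.9, 0.3),  # 0.6 * 0.9 + 0.4 * 0.3 = 0.66
    "Expert B": (0.6, 0.8),  # 0.6 * 0.6 + 0.4 * 0.8 = 0.68
    "Expert C": (0.7, 0.5),  # 0.6 * 0.7 + 0.4 * 0.5 = 0.62
}
best_expert, best_score = pick_expert(similarities)
print(best_expert, round(best_score, 2))
```

Expert A has the strongest patent match, yet Expert B ranks first once the paper similarity is weighed in, which is the balancing effect the weighted average is meant to provide.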

S106. Send the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client.

Here, the client is the terminal device that the enterprise user uses to submit R&D demands and receive matching results, such as a PC or a mobile app. The optimal expert label information consists of the characteristic labels related to the optimal target expert, such as research field and technical direction.

This step is performed after the optimal target expert and the related patents and papers have been determined, with the purpose of feeding the matching results back to the enterprise user. Specifically, the back end packages the information of the optimal target expert into a structured data object; this includes basic information such as name and affiliation, as well as the characteristic labels that match the R&D demand, such as the optimal patent labels and the optimal paper labels. The bibliographic information of the related patents and papers, such as titles, keywords, and abstracts, can also be included as a supplement to the matching results. The result data is then sent over the network to the enterprise user's client, which renders the matching results for the user according to a predefined visualization template.
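
One possible shape for the structured data object is sketched below, serialized as JSON for transmission. The application does not specify a schema or a wire format, so every field name here (`expert_name`, `expert_affiliation`, and so on) and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MatchResult:
    """Hypothetical container for the result sent to the client in S106."""
    expert_name: str
    expert_affiliation: str
    expert_labels: List[str] = field(default_factory=list)
    optimal_patent_labels: List[str] = field(default_factory=list)
    optimal_paper_labels: List[str] = field(default_factory=list)


def to_payload(result: MatchResult) -> str:
    """Serialize the result; ensure_ascii=False keeps non-ASCII names readable."""
    return json.dumps(asdict(result), ensure_ascii=False)


# Invented sample values.
result = MatchResult(
    expert_name="Expert B",
    expert_affiliation="Example University",
    expert_labels=["wearable sensing"],
    optimal_patent_labels=["low-power display"],
    optimal_paper_labels=["optical heart rate sensing"],
)
payload = to_payload(result)
print(payload)
```

The client would parse this payload and feed it to its visualization template; the bibliographic details of the patents and papers could be added as further fields.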

In some embodiments, the label information corresponding to the optimal target expert can be sent to the client in a variety of ways. Optionally, static web pages or documents containing the matching results, in formats such as HTML or PDF, are generated in advance on the server side, and the web page URL or document download link is then sent to the client, which opens the online page or downloads the document for viewing. The client interface can use a variety of layouts, such as lists, cards, and graphs, and can provide visual elements such as expert profiles and technology roadmaps to make the results easier for enterprise users to browse and understand. It can be understood that other means of information exchange, such as mobile push notifications and email, can also be used to distribute the matching results to enterprise users, which is not limited here.

The flow of the method provided by this embodiment is described in more detail below. Please refer to FIG. 2, which is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of the present application.

S201. Perform text splitting on the product R&D demand information of the target company to obtain the R&D demand label information for the product R&D of the target company.

It can be understood that this step is similar to step S101 and is not described again here.

S202. Obtain the expert label information in the expert database, the patent label information in the patent database, and the paper label information in the paper database, where the expert database contains, for each expert, the expert label information and the corresponding expert patent label information and expert paper label information.

It can be understood that this step is similar to step S102 and is not described again here.

S203. Perform text preprocessing on the R&D demand label information to obtain an R&D demand keyword set.

Specifically, the text preprocessing of the R&D demand label information can be divided into the following steps. The first is text cleaning: noise such as meaningless characters, punctuation marks, and HTML tags is removed from the R&D demand description, and the text format is unified. The second is word segmentation and part-of-speech tagging: a natural language processing tool splits the R&D demand text into individual words and tags each word with its part of speech, such as noun, verb, or adjective. The third is stop-word filtering: common stop words in the R&D demand text, such as "的", "是", and "了", are removed to improve the precision of keyword extraction. The last is stemming and lemmatization: the words in the R&D demand text are restored to their base forms to eliminate the effect of inflection on keyword extraction.

For example, suppose the label information of an R&D demand is "We need to develop a new type of lithium battery cathode material and hope to improve the energy density and cycle life of the material". After text preprocessing, it is converted into the following structured information: "lithium battery/n cathode/n material/n new/a develop/v improve/v energy density/n cycle life/n". Keywords such as "lithium battery", "cathode material", "energy density", and "cycle life" are then extracted to build a normalized R&D demand keyword set for subsequent analysis and application.
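
The cleaning and stop-word filtering steps of S203 can be sketched on the English rendering of this example. This is a simplified illustration using only the standard library: it tokenizes on letters and digits rather than running a real Chinese word segmenter, it omits part-of-speech tagging and lemmatization, and the `STOP_WORDS` list is a small hypothetical sample.

```python
import re
from typing import List

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"we", "need", "to", "a", "an", "the", "and", "of", "is"}


def preprocess(text: str) -> List[str]:
    """Clean the text, tokenize it, and drop stop words and repeated tokens."""
    # Text cleaning: strip HTML tags, then lowercase to unify the format.
    text = re.sub(r"<[^>]+>", " ", text).lower()
    # Tokenization: keep runs of letters and digits, which also drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    keywords: List[str] = []
    seen = set()
    for token in tokens:
        # Stop-word filtering, plus de-duplication that preserves first-seen order.
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        keywords.append(token)
    return keywords


demand = (
    "We need to develop a new <b>lithium battery</b> cathode material, "
    "and improve the energy density and cycle life of the material."
)
keywords = preprocess(demand)
print(keywords)
```

Merging adjacent tokens into phrases such as "lithium battery" or "energy density" would need a phrase dictionary or a segmenter, which is outside this sketch.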

S204: Perform text preprocessing on the patent label information to obtain a patent keyword set.

As with the R&D demand label information, the original patent label information also needs text preprocessing to extract its keywords and build a standardized patent keyword set. Because patent text is usually more specialized and more formally written, the preprocessing pipeline for patent label information differs slightly from that for R&D demand label information and needs to be adjusted accordingly.

Specifically, the text preprocessing of the patent label information can be divided into the following steps. First, patent field parsing: extract the content of key fields such as the title of the invention, abstract, claims, and description from the original patent text and store it in structured form. Second, technical term recognition: use patent-domain terminology dictionaries and named entity recognition tools to identify technical keywords, product names, company and institution names, and other specialized terms in the patent text. Third, keyword ranking: compute an importance weight for each term from features such as its frequency, position, and co-occurrence relationships in the patent text, and sort the terms in descending order of weight. Finally, keyword indexing: with reference to standards such as patent classification codes and domain vocabularies, semantically index the ranked terms and map them to standardized technical keywords, building a standardized patent keyword set.

For example, suppose the label information of a patent is "A positive electrode material for lithium-ion batteries and a preparation method thereof", and it contains several fields such as the patent title, abstract, and claims. After text preprocessing, keywords such as "lithium-ion battery positive electrode material", "precursor", "coating", "doping", and "sintering" are extracted from the abstract, and specialized terms such as "spinel lithium manganese oxide", "spherical particles", "coating layer", "doping element", and "sintering temperature" are extracted from the claims. Weights are calculated from term frequency and co-occurrence relationships, the top 10 terms are selected as the keyword labels of the patent, and these are mapped to a standard technical keyword vocabulary, for example "lithium-ion battery", "positive electrode material", "spinel lithium manganese oxide", "coating modification", "element doping", and "sintering process", to form a standardized patent keyword set.
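The ranking and indexing steps of S204 might look like the sketch below. The field weights, the term-to-vocabulary mapping, and the function names are hypothetical, and the weight here reflects only frequency and field position; co-occurrence, which the description also mentions, is left out for brevity.

```python
from collections import Counter

# Hypothetical field weights: a term found in the title or claims counts more
# than one found only in the abstract.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 1.0, "claims": 2.0}

# Hypothetical mapping from raw terms to a standardized technical vocabulary.
STANDARD_TERMS = {
    "coating": "coating modification",
    "coating layer": "coating modification",
    "doping": "element doping",
    "doping element": "element doping",
    "sintering": "sintering process",
    "sintering temperature": "sintering process",
}


def rank_keywords(fields, top_n=10):
    """Score each term by weighted frequency and return the top_n (term, score) pairs."""
    scores = Counter()
    for field, terms in fields.items():
        for term in terms:
            scores[term] += FIELD_WEIGHTS.get(field, 1.0)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def standardize(ranked):
    """Map ranked terms to the standard vocabulary, dropping duplicates."""
    seen, result = set(), []
    for term, _ in ranked:
        standard = STANDARD_TERMS.get(term, term)
        if standard not in seen:
            seen.add(standard)
            result.append(standard)
    return result


patent_fields = {
    "title": ["lithium-ion battery", "positive electrode material"],
    "abstract": ["positive electrode material", "precursor", "coating",
                 "doping", "sintering"],
    "claims": ["spinel lithium manganese oxide", "spherical particles",
               "coating layer", "doping element", "sintering temperature"],
}
patent_keywords = standardize(rank_keywords(patent_fields, top_n=10))
print(patent_keywords)
```

Mapping after ranking means that several raw terms can collapse into one standard keyword, so the final set may contain fewer than `top_n` entries.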

S205: Calculate the first similarity between the R&D demand keyword set and each patent keyword set to obtain a first similarity set.

Specifically, several similarity measures can be used, such as cosine similarity, Jaccard similarity, or edit distance; the most suitable one is chosen according to the characteristics of the keyword sets.

Taking cosine similarity as an example, both the R&D demand keyword set and the patent keyword set are first represented as vectors: each keyword corresponds to one dimension, and the keyword's importance weight is the value of that component. The cosine of the angle between the two vectors is then computed as their similarity score. The smaller the angle, the closer the cosine is to 1 and the more similar the two keyword sets; conversely, the larger the angle, the closer the cosine is to 0 and the less similar the two keyword sets.

For example, suppose the keyword set of an R&D demand is {"lithium battery": 0.6, "positive electrode material": 0.8, "energy density": 0.5, "cycle life": 0.4}, and the keyword set of a patent is {"lithium-ion battery": 0.7, "positive electrode material": 0.9, "spinel lithium manganese oxide": 0.6, "coating modification": 0.5}. The two keyword sets are represented as vectors and their cosine similarity is calculated. The resulting score depends on how keywords are mapped to dimensions: under exact term matching only "positive electrode material" is shared and the score is about 0.44, while treating "lithium battery" and "lithium-ion battery" as the same term raises it to about 0.69. The following steps use 0.85 as an illustrative first similarity value.
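The cosine similarity of two weighted keyword sets can be checked with a short sketch. The function name and the synonym table are assumptions introduced for illustration. With the example weights, exact term matching gives about 0.44 and merging the two battery terms gives about 0.69, so the score is sensitive to how keywords are normalized before comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {keyword: weight} dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


demand = {"lithium battery": 0.6, "positive electrode material": 0.8,
          "energy density": 0.5, "cycle life": 0.4}
patent = {"lithium-ion battery": 0.7, "positive electrode material": 0.9,
          "spinel lithium manganese oxide": 0.6, "coating modification": 0.5}

# Exact term matching: only "positive electrode material" is shared.
exact = cosine_similarity(demand, patent)
print(round(exact, 2))  # 0.44

# Hypothetical synonym table that merges the two battery terms.
SYNONYMS = {"lithium battery": "lithium-ion battery"}
demand_normalized = {SYNONYMS.get(k, k): w for k, w in demand.items()}
merged = cosine_similarity(demand_normalized, patent)
print(round(merged, 2))  # 0.69
```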

The above process is repeated to calculate the first similarity between the R&D demand keyword set and the keyword set of every patent, yielding a first similarity set that reflects how relevant each patent is to the R&D demand.

S206: Sort the first similarity set in descending order of first similarity value, and take the patent label information with the largest first similarity value as the optimal patent label information.

After the first similarity set between the R&D demand and the patents is obtained, the similarity values are sorted and filtered to quickly find the patent label information that is most relevant to and best matches the R&D demand. Specifically, all patents in the first similarity set are sorted in descending order of first similarity value, so that the most similar patent comes first and the least similar patent comes last.

For example, suppose the first similarity values between an R&D demand and patents A, B, C, D, and E are 0.85, 0.76, 0.92, 0.68, and 0.81 respectively, giving the first similarity set {(A, 0.85), (B, 0.76), (C, 0.92), (D, 0.68), (E, 0.81)}. Sorting this set in descending order of first similarity value gives {(C, 0.92), (A, 0.85), (E, 0.81), (B, 0.76), (D, 0.68)}. Patent C is the most relevant to the R&D demand, with a first similarity value of 0.92, higher than that of every other patent. The label information of patent C can therefore be taken as the optimal patent label information and recommended to the R&D personnel first.
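The sorting and selection in S206 reduce to a few lines. The variable names below are illustrative only; the values are those of the example above.

```python
# First similarity set from the example: patent id -> first similarity value.
first_similarities = {"A": 0.85, "B": 0.76, "C": 0.92, "D": 0.68, "E": 0.81}

# Sort in descending order of similarity; the first entry is the optimal patent.
ranked = sorted(first_similarities.items(), key=lambda item: item[1], reverse=True)
best_patent, best_score = ranked[0]

print(ranked)       # [('C', 0.92), ('A', 0.85), ('E', 0.81), ('B', 0.76), ('D', 0.68)]
print(best_patent)  # C
```

When only the single best patent is needed, `max(first_similarities, key=first_similarities.get)` avoids the full sort; the sorted list is useful when the ranking itself is shown to the user.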

S207: Calculate the second similarity between the optimal patent label information and each piece of paper label information, and select the optimal paper label information corresponding to the largest second similarity.

It can be understood that this step is similar to step S105 and is not described again here.

S208: Vectorize the optimal patent label information, the optimal paper label information, and each expert's expert patent label information and expert paper label information to obtain the optimal patent feature vector, the optimal paper feature vector, the expert patent feature vectors, and the expert paper feature vectors.

Specifically, the optimal patent label information, the optimal paper label information, and each expert's patent label information and paper label information can all be represented as feature vectors in a high-dimensional vector space.

This vectorization typically uses methods such as One-Hot encoding or TF-IDF weighting. Taking One-Hot encoding as an example, a dictionary containing all label keywords is built first, with each keyword in the dictionary corresponding to one dimension of the vector space. Then, for each piece of label information, the corresponding dimension is set to 1 if the label information contains that keyword and to 0 otherwise. In the end, each piece of label information is represented as a high-dimensional sparse vector of 0s and 1s whose dimension equals the number of keywords in the dictionary.

For example, suppose the dictionary contains five keywords, ["lithium battery", "positive electrode material", "solid electrolyte", "energy density", "cycle life"]. The label information of the optimal patent is "lithium battery, positive electrode material, energy density", the label information of the optimal paper is "solid electrolyte, lithium battery, cycle life", the patent label information of expert A is "lithium battery, positive electrode material", and the paper label information of expert A is "positive electrode material, solid electrolyte, cycle life". One-Hot encoding gives the following feature vectors: optimal patent feature vector [1, 1, 0, 1, 0]; optimal paper feature vector [1, 0, 1, 0, 1]; expert A patent feature vector [1, 1, 0, 0, 0]; expert A paper feature vector [0, 1, 1, 0, 1].

These feature vectors capture the similarities and differences between pieces of label information in an intuitive way and provide the mathematical basis for the subsequent similarity calculation and expert recommendation. Besides One-Hot encoding, more elaborate vectorization methods such as TF-IDF weighting can also be used; no limitation is imposed here.
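The encoding described for S208 can be reproduced with the sketch below, which uses the five-keyword dictionary from the example. The function name is an assumption. Because one piece of label information may contain several keywords, the result is a binary presence vector (sometimes called multi-hot) rather than a vector with a single 1.

```python
# Dictionary of label keywords from the example; the order fixes the dimensions.
VOCABULARY = ["lithium battery", "positive electrode material",
              "solid electrolyte", "energy density", "cycle life"]


def encode(labels, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary keywords occur in labels."""
    label_set = set(labels)
    return [1 if keyword in label_set else 0 for keyword in vocabulary]


optimal_patent = encode(["lithium battery", "positive electrode material", "energy density"])
optimal_paper = encode(["solid electrolyte", "lithium battery", "cycle life"])
expert_a_patent = encode(["lithium battery", "positive electrode material"])
expert_a_paper = encode(["positive electrode material", "solid electrolyte", "cycle life"])

print(optimal_patent)   # [1, 1, 0, 1, 0]
print(optimal_paper)    # [1, 0, 1, 0, 1]
print(expert_a_patent)  # [1, 1, 0, 0, 0]
print(expert_a_paper)   # [0, 1, 1, 0, 1]
```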

S209: Calculate the third similarity between the optimal patent feature vector and each expert patent feature vector to obtain a third similarity set.

Once the optimal patent feature vector and the expert patent feature vectors are available, how well each expert matches the optimal patent can be evaluated by calculating the similarity between them.

Taking cosine similarity as an example, it measures how similar two vectors are by the cosine of the angle between them. The smaller the angle, the closer the cosine is to 1 and the more similar the two vectors; conversely, the larger the angle, the closer the cosine is to 0 and the less similar the two vectors. The formula is: cos(θ) = (A·B) / (||A|| ||B||)

Here A and B are the two feature vectors, A·B is their dot product, and ||A|| and ||B|| are their norms (the square root of the sum of the squares of the vector's components).

For example, suppose the optimal patent feature vector is [1, 1, 0, 1, 0], the patent feature vector of expert A is [1, 1, 0, 0, 0], that of expert B is [0, 1, 1, 0, 1], and that of expert C is [1, 0, 0, 1, 1]. Calculating the cosine similarity between the optimal patent feature vector and each of the three expert patent feature vectors gives:

Third similarity with expert A: cos(θ_A) = (1×1 + 1×1 + 0×0 + 1×0 + 0×0) / (sqrt(3) × sqrt(2)) ≈ 0.82

Third similarity with expert B: cos(θ_B) = (1×0 + 1×1 + 0×1 + 1×0 + 0×1) / (sqrt(3) × sqrt(3)) ≈ 0.33

Third similarity with expert C: cos(θ_C) = (1×1 + 1×0 + 0×0 + 1×1 + 0×1) / (sqrt(3) × sqrt(3)) ≈ 0.67

Expert A's patent feature vector has the highest cosine similarity with the optimal patent feature vector, 0.82, indicating that expert A's patent research direction is closest to the optimal patent and that this expert may be an authority in the field. Expert C's patent feature vector comes second at 0.67 and is also relevant to some degree. Expert B's patent feature vector has the lowest cosine similarity, only 0.33, indicating that expert B's patent research direction differs considerably from the optimal patent and the match is poor.
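The three third-similarity values can be verified numerically. The helper below is a plain implementation of the cosine formula given above; the function and variable names are illustrative.

```python
import math


def cosine(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


optimal_patent = [1, 1, 0, 1, 0]
expert_patents = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 1, 1, 0, 1],
    "C": [1, 0, 0, 1, 1],
}

# Third similarity set: expert -> cosine similarity with the optimal patent.
third_similarities = {
    name: round(cosine(optimal_patent, vector), 2)
    for name, vector in expert_patents.items()
}
print(third_similarities)  # {'A': 0.82, 'B': 0.33, 'C': 0.67}
```

The same helper applies unchanged to the fourth similarity in S210, with the optimal paper feature vector and the expert paper feature vectors as inputs.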

S210: Calculate the fourth similarity between the optimal paper feature vector and each expert paper feature vector to obtain a fourth similarity set.

Specifically, the same method as for the third similarity, such as cosine similarity, can be used to measure the cosine of the angle between the optimal paper feature vector and each expert paper feature vector. The smaller the angle, the closer the cosine is to 1, the more similar the two vectors, and the closer the expert's paper research direction is to the optimal paper; conversely, the larger the angle, the closer the cosine is to 0, the less similar the two vectors, and the more the expert's paper research direction differs from the optimal paper.

For example, suppose the optimal paper feature vector is [1, 0, 1, 0, 1], the paper feature vector of expert A is [0, 1, 1, 0, 1], that of expert B is [1, 1, 0, 1, 0], and that of expert C is [0, 0, 1, 1, 1]. Calculating the cosine similarity between the optimal paper feature vector and each of the three expert paper feature vectors gives:

Fourth similarity with expert A: cos(θ_A) = (1×0 + 0×1 + 1×1 + 0×0 + 1×1) / (sqrt(3) × sqrt(3)) ≈ 0.67

Fourth similarity with expert B: cos(θ_B) = (1×1 + 0×1 + 1×0 + 0×1 + 1×0) / (sqrt(3) × sqrt(3)) ≈ 0.33

Fourth similarity with expert C: cos(θ_C) = (1×0 + 0×0 + 1×1 + 0×1 + 1×1) / (sqrt(3) × sqrt(3)) ≈ 0.67

The paper feature vectors of experts A and C both have a cosine similarity of 0.67 with the optimal paper feature vector, indicating that their paper research directions are fairly close to the optimal paper. Expert B's paper feature vector has a lower cosine similarity, only 0.33, indicating that expert B's paper research direction differs considerably from the optimal paper.

In this way, the fourth similarity between the optimal paper feature vector and each expert paper feature vector is calculated, giving a fourth similarity set such as {(A, 0.67), (B, 0.33), (C, 0.67)}.

S211: Perform a weighted summation of the third similarity and the fourth similarity corresponding to each expert to obtain a weighted average similarity set, where the weighted average similarity set contains several weighted average similarities and each weighted average similarity is the weighted average of one expert's third similarity and fourth similarity.

After each expert's third similarity (match with the optimal patent) and fourth similarity (match with the optimal paper) are obtained, the two are combined to assess each expert's overall relevance to the R&D demand. A common approach is to take a weighted sum of the third and fourth similarities, giving a weighted average similarity that serves as the expert's final matching score.

Specifically, the weight coefficients of the third and fourth similarities can be set according to the emphasis of the R&D demand. For example, if the R&D demand places more value on the novelty and practicality of patented technology, the third similarity can be given a higher weight; if it places more value on theoretical foundations and academic merit, the fourth similarity can be given a higher weight. The third and fourth similarities of each expert are then combined by weighted summation to obtain that expert's weighted average similarity.

For example, suppose the weight of the third similarity is 0.6 and the weight of the fourth similarity is 0.4. Expert A has a third similarity of 0.82 and a fourth similarity of 0.67; expert B has a third similarity of 0.33 and a fourth similarity of 0.33; expert C has a third similarity of 0.67 and a fourth similarity of 0.67. The weighted average similarity of each expert is calculated with the weighted summation formula:

Weighted average similarity of expert A = 0.6 × 0.82 + 0.4 × 0.67 = 0.76

Weighted average similarity of expert B = 0.6 × 0.33 + 0.4 × 0.33 = 0.33

Weighted average similarity of expert C = 0.6 × 0.67 + 0.4 × 0.67 = 0.67

Expert A has the highest weighted average similarity, 0.76, meaning that expert A is the best overall match for the R&D demand when patents and papers are considered together. Expert C, at 0.67, is also a fairly good match. Expert B has the lowest weighted average similarity, only 0.33, meaning that this expert's overall relevance to the R&D demand is weak.
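The weighted summation of S211 can be sketched as follows, using the example weights 0.6 and 0.4. The function name and the check that the weights sum to 1 are assumptions added for illustration.

```python
def weighted_average_similarity(third, fourth, w_patent=0.6, w_paper=0.4):
    """Combine per-expert third and fourth similarities into one score each."""
    if abs(w_patent + w_paper - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return {
        name: round(w_patent * third[name] + w_paper * fourth[name], 2)
        for name in third
    }


third_similarities = {"A": 0.82, "B": 0.33, "C": 0.67}
fourth_similarities = {"A": 0.67, "B": 0.33, "C": 0.67}

weighted = weighted_average_similarity(third_similarities, fourth_similarities)
print(weighted)  # {'A': 0.76, 'B': 0.33, 'C': 0.67}
```

Swapping the weights to 0.4 and 0.6 would favor paper similarity instead, which corresponds to an R&D demand that emphasizes academic foundations.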

S212: Determine the expert with the highest weighted average similarity in the weighted average similarity set as the optimal target expert.

Specifically, all experts in the weighted average similarity set are sorted in descending order of weighted average similarity, and the expert with the highest value is selected as the optimal target expert. If several experts share the same highest weighted average similarity, they can be prioritized by other factors as appropriate, such as academic reputation or collaboration history, or several optimal target experts can be recommended at once for the R&D personnel to choose from.
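Selecting the optimal target expert, including the tie case, might be handled as in the sketch below. Returning every tied expert is one of the options the description mentions; the function name, the tolerance, and the tied scores in the second call are hypothetical.

```python
import math


def select_best_experts(scores, tol=1e-9):
    """Return all experts whose score equals the maximum, in name order."""
    if not scores:
        return []
    best = max(scores.values())
    return sorted(
        name for name, score in scores.items()
        if math.isclose(score, best, abs_tol=tol)
    )


print(select_best_experts({"A": 0.76, "B": 0.33, "C": 0.67}))  # ['A']

# Hypothetical tie: both A and C are returned for the R&D personnel to choose from.
print(select_best_experts({"A": 0.76, "B": 0.33, "C": 0.76}))  # ['A', 'C']
```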

S213: Send the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client.

It can be understood that this step is similar to step S106 and is not described again here.

The method provided by this embodiment is described below in further detail. Refer to Figure 3, which is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of this application.

S301: Send the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client.

It can be understood that this step is similar to step S106 and is not described again here.

S302: Receive the requirement document information sent by the client.

After an enterprise user has finished editing and entering the R&D requirements on the client, the resulting requirement document can be submitted to the server. The requirement document is usually a structured electronic document whose content typically covers the background of the R&D project, the technical requirements, a list of product functions, an analysis of application scenarios, and other aspects.

S303: Perform text mining on the requirement document information to obtain technical keywords, product feature keywords, and application scenario keywords.

After the R&D requirement document submitted by the enterprise user is obtained, keyword information that precisely characterizes the requirement must be extracted automatically from the unstructured text description to provide data support for the subsequent matching of requirements and resources. Text mining is an analysis approach that uses natural language processing, machine learning, and related techniques to discover implicit knowledge and patterns in large-scale text data. In this scenario, the goal of text mining is to identify three types of keywords that reflect the core semantics of the requirement document: technical keywords, product feature keywords, and application scenario keywords. Technical keywords reflect the key technical points involved in the requirement, such as algorithms, materials, and processes; product feature keywords reflect what the requirement demands of the product's functions, performance, appearance, and other characteristics; application scenario keywords describe the product's target application fields, operating environment, user groups, and so on.
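As a rough illustration of S303, the sketch below sorts keywords into the three categories by matching the document against per-category lexicons. The lexicons, the sample document, and the function name are all hypothetical, and lexicon matching is only a stand-in: the description leaves the mining technique open, and a real system could use trained classifiers or sequence labeling instead.

```python
# Hypothetical per-category lexicons; a real system would learn or curate these.
CATEGORY_LEXICONS = {
    "technical": {"solid electrolyte", "sintering", "coating"},
    "product_feature": {"energy density", "cycle life", "fast charging"},
    "application_scenario": {"electric vehicle", "energy storage station"},
}


def mine_keywords(document):
    """Return, per category, the lexicon terms found in the document in order of appearance."""
    text = document.lower()
    result = {}
    for category, terms in CATEGORY_LEXICONS.items():
        hits = [term for term in terms if term in text]
        result[category] = sorted(hits, key=text.find)
    return result


requirement = (
    "We plan a solid electrolyte battery for electric vehicle use. "
    "The coating step must improve energy density and cycle life."
)
keywords = mine_keywords(requirement)
print(keywords["technical"])             # ['solid electrolyte', 'coating']
print(keywords["product_feature"])       # ['energy density', 'cycle life']
print(keywords["application_scenario"])  # ['electric vehicle']
```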

S304: Search the patent database using the technical keywords, the product feature keywords, and the application scenario keywords to obtain a patent document set, where the patent document set contains at least one patent document.

Once the keyword representation of the R&D requirement document is available, information retrieval is performed in the patent database to discover existing patented technology related to the requirement. The patent database is a structured database holding the full text and metadata of a large number of patent documents, covering patent information across technical fields and regions. In this step, the technical keywords, product feature keywords, and application scenario keywords are each used as search terms for full-text search over fields of the patent database such as title, abstract, claims, and description.

Specifically, classic information retrieval models such as Boolean retrieval and vector space retrieval can be used: keywords are combined into complex queries with logical operators such as AND, OR, and NOT, and patent documents containing the search terms are then located quickly with the support of data structures such as inverted indexes. Because matching a single keyword may not fully capture the relevance of a patent to the requirement, recall-improving techniques such as term expansion, synonym expansion, and hypernym and hyponym expansion can also be used to broaden the semantic scope of the original search terms and match more potentially relevant patents.
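Boolean retrieval over an inverted index can be sketched as follows. The three patent texts, the whitespace tokenization, and the function names are hypothetical; the `any_of` argument doubles as a simple form of synonym expansion, since a document matching any one of the listed alternatives is kept.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, all_of=(), any_of=(), none_of=()):
    """Boolean query: AND over all_of, OR over any_of, NOT over none_of."""
    result = set().union(*index.values()) if index else set()
    for term in all_of:
        result &= index.get(term, set())
    if any_of:
        result &= set().union(*(index.get(term, set()) for term in any_of))
    for term in none_of:
        result -= index.get(term, set())
    return result


# Hypothetical patent texts.
patents = {
    "P1": "spinel cathode coating for lithium battery",
    "P2": "solid electrolyte for lithium battery",
    "P3": "graphite anode coating process",
}
index = build_inverted_index(patents)

# lithium AND battery AND (coating OR electrolyte) AND NOT solid
hits = search(index, all_of=["lithium", "battery"],
              any_of=["coating", "electrolyte"], none_of=["solid"])
print(hits)  # {'P1'}
```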

S305: Calculate a risk value for each patent document from its patent document keywords and the requirement document information to obtain a risk patent set, where the risk patent set contains at least one patent document whose risk value is greater than a preset risk threshold.

After the patent document set is generated, the degree of risk that each patent document poses to the R&D requirement is assessed further, in order to identify high-risk patents that could lead to intellectual property infringement or hinder the project. The core of the risk assessment is calculating how similar a patent document is to the requirement document: the higher the similarity, the greater the overlap between the technical solution protected by the patent and the R&D content, and the higher the potential risk.

Specifically, for each patent in the patent document set, its key technical information is first extracted to form a structured representation of patent elements, such as the purpose of the invention, technical field, key problem, solution, and claims. The extraction can use natural language processing techniques such as rule-based template matching and conditional random field sequence labeling, or the elements can be indexed manually using domain knowledge from patent examination. The patent element representation is then converted into vector form, usually a high-dimensional sparse vector such as a TF-IDF weight vector or a topic distribution vector. The R&D requirement document can be subjected to element extraction and vector representation in the same way.

Once the vector representations of the patents and the requirement are available, a vector measure such as cosine similarity or KL divergence is used to calculate a similarity score between each patent and the requirement, which serves as that patent's risk value. The higher the risk value, the closer the patent's technical solution is to the requirement and the greater the potential obstacle the patent poses to the R&D work. By setting an empirical risk threshold, such as 0.75, the subset of patents with higher risk values can be filtered out quickly to form a risk patent set. These patents are likely to cover key technical points of the R&D requirement, so the enterprise should pay close attention to them and work around them.
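The risk scoring and thresholding can be sketched as below. To keep the example short, raw term counts stand in for the TF-IDF or topic vectors mentioned above, and the patent texts and names are hypothetical. Patents whose risk value is greater than the threshold go into the risk set; all others become candidates, which matches the split between S305 and S307.

```python
import math
from collections import Counter

RISK_THRESHOLD = 0.75  # empirical threshold from the example


def to_vector(text):
    """Term-count vector; a simplification of the TF-IDF weighting in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two Counters (missing terms count as 0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def partition_by_risk(requirement, patents, threshold=RISK_THRESHOLD):
    """Split patents into (risk set, candidate set) by risk value."""
    requirement_vector = to_vector(requirement)
    risky, candidates = {}, {}
    for patent_id, text in patents.items():
        risk = cosine(requirement_vector, to_vector(text))
        (risky if risk > threshold else candidates)[patent_id] = risk
    return risky, candidates


requirement = "spinel lithium manganese oxide cathode coating"
patents = {
    "P1": "spinel lithium manganese oxide cathode coating doping",
    "P2": "graphite anode coating",
    "P3": "lithium cathode coating process",
}
risky, candidates = partition_by_risk(requirement, patents)
print(sorted(risky))       # ['P1']
print(sorted(candidates))  # ['P2', 'P3']
```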

S306: Send the risk patent set to the client.

After the risk patent set is identified, information about these high-risk patents is fed back to the enterprise user promptly to support intellectual property decisions and R&D planning.

Specifically, a structured patent summary report is first generated automatically for each patent document in the risk patent set. Based on the key paragraphs and claims of the patent specification, the report extracts and condenses the core content elements of the patent, such as the technical field, problem background, purpose of the invention, technical solution, and scope of the claims, and can also include metadata such as the patent's bibliographic information, legal status, and citation relationships. The patent summary report is generated in a rich text format such as HTML or PDF and supports rich visual elements such as mixed text and images, tables, and hyperlinks.
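A minimal HTML rendering of such a summary report might look like the sketch below. The record fields, their values, and the function name are hypothetical; the only point illustrated is that extracted text should be escaped before it is placed into markup.

```python
from html import escape


def render_report(patent):
    """Render one patent record (a dict of field -> value) as a small HTML page."""
    rows = "".join(
        f"<tr><th>{escape(field)}</th><td>{escape(str(value))}</td></tr>"
        for field, value in patent.items()
    )
    title = escape(str(patent.get("title", "Untitled patent")))
    return f"<html><body><h1>{title}</h1><table>{rows}</table></body></html>"


# Hypothetical record for one patent in the risk patent set.
record = {
    "title": "Cathode <material> and preparation method",
    "technical_field": "lithium-ion batteries",
    "legal_status": "granted",
    "risk_value": 0.93,
}
report = render_report(record)
print(report)
```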

After the patent summary reports are generated, all reports are packaged and compressed and sent as an attachment to the designated enterprise users by email, instant messaging, or similar means. The risk patent information can also be presented more interactively in the client's graphical interface. For example, the titles and abstracts of the risk patents are listed in a tree or list on the left side of the interface, and clicking a patent opens the corresponding patent summary report on the right for the user to read. The interface can additionally provide auxiliary functions such as sorting, filtering, annotating, and sharing patents, making it easier for users to manage and collaborate on risk patents.

S307: Store the candidate patent documents whose risk values are not greater than the preset risk threshold in the candidate patent section of the preset patent database.

In addition to the patent documents that pose a higher risk to the R&D requirement, the patent search may also turn up patent documents that pose a lower risk but are relevant to the required technical direction. Although such patents do not currently constitute a significant risk, they still have reference value for understanding the latest technical developments in the industry, inspiring R&D ideas, and steering clear of existing patent portfolios. These candidate patent documents are therefore stored separately for enterprise users to consult and analyze later.

Specifically, patents whose risk values do not exceed the preset threshold (for example, 0.75) are first selected from the patent document collection to form a candidate patent subset. Because these patents are only weakly related to the requirement, there may be many of them, and presenting all of them to the user directly would introduce considerable noise. The candidate patent subset therefore needs to be organized and managed to make it easier to search and read.

In some embodiments, the candidate patent subset may be classified along dimensions such as technical field and functional application to build a hierarchical classification system. For example, each candidate patent can be automatically classified and indexed with reference to existing classification standards such as the International Patent Classification (IPC) and the Cooperative Patent Classification (CPC). The candidate patent subset is then written to the preset patent database, and a corresponding candidate patent section is created in the database. The section is organized as a directory tree: the top level consists of the major technical fields, each technical field contains several subclasses, and each subclass corresponds to one or more candidate patents.
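A minimal sketch of step S307, under stated assumptions: each patent is a dictionary with hypothetical `id`, `risk`, and `ipc` fields, the IPC section letter serves as the top-level technical field, and the four-character IPC subclass serves as the second level. The text does not prescribe this particular two-level split; it is chosen only to illustrate the directory tree.

```python
from collections import defaultdict

RISK_THRESHOLD = 0.75  # example threshold value mentioned in the text


def build_candidate_section(patents, threshold=RISK_THRESHOLD):
    """Group patents whose risk does not exceed the threshold into a two-level tree."""
    tree = defaultdict(lambda: defaultdict(list))
    for patent in patents:
        if patent["risk"] <= threshold:   # "not greater than", as in S307
            ipc = patent["ipc"]           # e.g. "H01M 50/40"
            section = ipc[0]              # top level: IPC section, "H"
            subclass = ipc[:4]            # second level: IPC subclass, "H01M"
            tree[section][subclass].append(patent["id"])
    # Convert to plain dicts so the result is easy to store or serialize.
    return {section: dict(subclasses) for section, subclasses in tree.items()}


# Hypothetical patents, for illustration only.
patents = [
    {"id": "P1", "risk": 0.40, "ipc": "H01M 50/40"},
    {"id": "P2", "risk": 0.90, "ipc": "H01M 10/05"},  # above threshold: excluded
    {"id": "P3", "risk": 0.75, "ipc": "C08J 5/22"},   # equal to threshold: kept
]
candidate_section = build_candidate_section(patents)
```

A patent whose risk value equals the threshold is kept as a candidate here, which follows the "not greater than" wording of S307.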

The method provided by this embodiment is described in more detail below. Refer to FIG. 4, which is another schematic flowchart of the data processing method based on multi-source heterogeneous matching in an embodiment of this application.

S401. Perform text splitting on the requirement information for product R&D within the target company to obtain the R&D demand label information for product R&D within the target company.

It can be understood that this step is similar to step S101; no limitation is imposed here.

S402. Obtain the user satisfaction scores fed back by the client, where the user satisfaction scores include a first satisfaction score for the optimal expert label information, a second satisfaction score for the optimal patent label information, and a third satisfaction score for the optimal paper label information.

After information about experts, patents, and papers related to the R&D requirement has been pushed to enterprise users, the users' satisfaction with the pushed results also needs to be collected. Satisfaction is usually assessed through subjective scoring: in the client interface, the user scores each type of pushed result separately.

After the user completes a round of satisfaction scoring, the scores uploaded by the client are obtained automatically. They cover three dimensions: satisfaction with the expert label information, satisfaction with the patent label information, and satisfaction with the paper label information. The three scores are recorded in the back-end database and stored in association with the corresponding enterprise user account, R&D requirement number, push batch, and similar information for later satisfaction analysis.

S403. If any of the user satisfaction scores is lower than the preset score threshold, obtain the feedback information sent by the client.

Beyond collecting the user satisfaction scores, attention must be paid to abnormally low scores, and detailed feedback should be actively solicited from the user so that the recommendations can be improved in a targeted way. Because a satisfaction score alone cannot reveal why the user is dissatisfied, it needs to be supplemented with an open feedback channel that encourages users to offer constructive comments.

Specifically, a satisfaction score threshold is first set, for example 3 points (out of 5) or 6 points (out of 10), as the criterion for whether user satisfaction is acceptable. Once the scores submitted by the user are obtained, the score in each dimension is automatically checked against the preset threshold. If the satisfaction score for any of the three categories (experts, patents, papers) falls short, the user is considered to be dissatisfied with this round of pushed results, and the reasons need to be investigated further.

In this case, a satisfaction reminder is sent to the client, informing the user that one or more satisfaction scores are low and asking for further feedback. On receiving the reminder, the client automatically displays a feedback input box and prompts the user to enter comments and suggestions on the pushed results, for example that the experts' research directions are not a close enough match, that too few patents were recommended, or that the papers are not innovative enough. After the user enters and submits the feedback, the client packages the content and sends it to the back end.
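The threshold check that decides between requesting feedback (S403) and pushing detailed information (S406) can be sketched as below. The record type, its field names, and the action strings are hypothetical and introduced only for illustration; the default threshold of 3 on a 5-point scale is one of the example values given above.

```python
from dataclasses import dataclass


@dataclass
class SatisfactionRecord:
    """One round of scores, stored with the identifiers named in the text."""
    user_account: str
    demand_id: str
    push_batch: str
    expert_score: float
    patent_score: float
    paper_score: float


def below_threshold(record, threshold=3.0):
    """Return the dimensions whose score is lower than the threshold."""
    scores = {
        "expert": record.expert_score,
        "patent": record.patent_score,
        "paper": record.paper_score,
    }
    return [name for name, value in scores.items() if value < threshold]


def next_action(record, threshold=3.0):
    """Ask for feedback if any dimension falls short, otherwise push details."""
    return "request_feedback" if below_threshold(record, threshold) else "push_details"


# Hypothetical record: only the patent recommendations scored poorly.
record = SatisfactionRecord("user-1", "demand-7", "batch-2", 4.0, 2.5, 3.5)
low_dimensions = below_threshold(record)
action = next_action(record)
```

Returning the list of low-scoring dimensions, rather than a single flag, lets the reminder tell the user which category of result the feedback should address.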

S404. Extract the label screening conditions from the feedback information based on a large language model.

After the user's feedback on the recommendation results is obtained, the feedback content needs to be understood automatically and the label screening conditions implicit in it extracted for use in the subsequent optimization of label information. Because user feedback is usually provided as natural language text, natural language processing techniques, and large language models in particular, are needed to interpret the feedback semantically and extract its key information.

Specifically, the text of the user feedback is first fed into a pre-trained large language model, such as BERT or GPT. A large language model is a language representation model trained on massive text corpora with deep learning algorithms; it learns the contextual semantics of text and produces high-quality embedding representations. The encoder part of the large language model converts the feedback text into a high-dimensional semantic vector that captures the key semantic information of the feedback.

After the semantic vector of the feedback text is obtained, techniques such as named entity recognition and keyword extraction are further used to identify, from the semantic vector, the keywords and phrases relevant to label screening.

For example, for the R&D requirement "new lithium battery separator materials", a user writes in the feedback: "Too few patents were recommended, and most of them are in English, which makes them hard to read. I would like more relevant Chinese patents to be recommended, along with translations of the key technical points of the patents. In addition, most of the recommended experts seem to work in polymer materials, which is not closely related to lithium batteries. I would like experts with more research experience in the lithium battery field to be recommended."

The feedback text is fed into the pre-trained large language model, whose encoder part converts it into a semantic vector. Named entity recognition and keyword extraction are then used to identify, from the semantic vector, the keywords and phrases relevant to label screening, such as "Chinese", "patent", "lithium battery", and "research experience".

Next, the extracted keywords and phrases are matched against a predefined label system to identify the corresponding label types and values. For example, the keyword "Chinese" is mapped to the label "patent language" with the value "Chinese"; the keyword "lithium battery" is mapped to the labels "patent field" and "expert field", both with the value "lithium battery"; and the keyword "research experience" is mapped to the label "expert title", with a senior title such as "professor" or "researcher" as the value.

Through the above steps, a set of structured label screening conditions is extracted from the user feedback, mainly including: {"patent language": "Chinese"}, {"patent field": "lithium battery"}, {"expert field": "lithium battery"}, and {"expert title": "professor/researcher"}.
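The mapping from extracted keywords to structured conditions can be sketched with a simple lookup table. The table `KEYWORD_TO_LABELS` and the function `to_screening_conditions` are hypothetical names; the table merely reproduces the example mapping described above and stands in for the predefined label system, whose actual form the text does not specify. The keyword extraction itself (the language model step) is assumed to have already happened.

```python
# Hypothetical lookup table reproducing the example mapping from the text.
KEYWORD_TO_LABELS = {
    "Chinese": [("patent language", "Chinese")],
    "lithium battery": [
        ("patent field", "lithium battery"),
        ("expert field", "lithium battery"),
    ],
    "research experience": [("expert title", "professor/researcher")],
}


def to_screening_conditions(keywords):
    """Map extracted keywords to label screening conditions, skipping unknown ones."""
    conditions = []
    for keyword in keywords:
        for label, value in KEYWORD_TO_LABELS.get(keyword, []):
            condition = {label: value}
            if condition not in conditions:  # avoid duplicate conditions
                conditions.append(condition)
    return conditions


# Keywords from the example feedback; "patent" has no entry and is ignored.
conditions = to_screening_conditions(
    ["Chinese", "patent", "lithium battery", "research experience"]
)
```

Keywords without an entry in the table are dropped silently in this sketch; a real system would more plausibly log them for the manual verification step.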

These label screening conditions clearly reflect what the user wants from the recommendations: more Chinese-language patents and more senior experts related to the lithium battery field. The conditions can be used to perform a second round of screening on the original patent label library and expert label library to obtain more precisely matched recommendations.

Of course, because user feedback may not be expressed in a standardized or complete way, the extracted label screening conditions may contain omissions or ambiguities. For example, although the user mentions "translations of the key technical points of the patents", it is not stated whether machine translation or human translation is meant. Therefore, after the label conditions are extracted, manual verification and completion are still required, for example refining "translation of key technical points" into a more explicit label value such as "patent machine translation: yes" or "patent human translation: yes".

S405. According to the label screening conditions, screen the initial expert label information in the expert database to obtain the expert label information, screen the initial patent label information in the patent database to obtain the patent label information, and screen the initial paper label information in the paper database to obtain the paper label information.

After the label screening conditions implicit in the user feedback are extracted, the original expert, patent, and paper label information is screened a second time according to these conditions to obtain label subsets that are more precise and better match the user's needs. Label screening filters out, from the large volume of initial label information, the labels that are most relevant to the user's R&D requirement and of the highest quality, making the recommendations more targeted and more effective.
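A minimal sketch of the secondary screening in S405, assuming label records are dictionaries and each condition's label name starts with the name of the database it applies to ("expert", "patent", or "paper"). The prefix convention, the record fields, and the sample experts are assumptions for illustration; a "/" in a condition value is read as a list of alternatives, following the "professor/researcher" example.

```python
def apply_conditions(records, conditions, prefix):
    """Keep the records that satisfy every condition addressed to `prefix`."""
    relevant = {}
    for condition in conditions:
        for label, value in condition.items():
            if label.startswith(prefix + " "):
                field = label[len(prefix) + 1:]        # "expert field" -> "field"
                relevant[field] = set(value.split("/"))  # alternatives
    return [
        record
        for record in records
        if all(record.get(field) in allowed for field, allowed in relevant.items())
    ]


conditions = [
    {"patent language": "Chinese"},
    {"patent field": "lithium battery"},
    {"expert field": "lithium battery"},
    {"expert title": "professor/researcher"},
]

# Hypothetical initial expert label information.
experts = [
    {"name": "A", "field": "lithium battery", "title": "professor"},
    {"name": "B", "field": "polymer materials", "title": "professor"},
    {"name": "C", "field": "lithium battery", "title": "lecturer"},
]
screened_experts = apply_conditions(experts, conditions, "expert")
```

When no condition targets a database (for example, the paper database in the feedback above), the function returns all records unchanged, so the initial label information passes through as is.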

S406. If none of the user satisfaction scores is lower than the preset score threshold, push to the client the detailed introduction of the optimal target expert, the patent details corresponding to the optimal patent label information, and the paper details corresponding to the optimal paper label information.

Once all user satisfaction scores are found to meet the threshold, the push of detailed information is triggered automatically. The detailed introductions and key content corresponding to the optimal label information are selected along the three dimensions of experts, patents, and papers, assembled into a personalized and precise knowledge package, and pushed to the user through the client.

For example, suppose the user rates the previously recommended lithium battery expert A at 4.5 points, the lithium battery separator patent B at 4.2 points, and the lithium battery materials review paper C at 4.7 points, all above the preset threshold of 4 points. In this round of recommendation, the user is then automatically sent a detailed introduction to expert A (educational background, research results, contact information, and so on), the key information of patent B (abstract, claims, patent family members, and so on), and the key content of paper C (core ideas, research methods, references, and so on).

The data processing system based on multi-source heterogeneous matching in the embodiments of this application is described below from a hardware perspective. Refer to FIG. 5, which is a schematic structural diagram of a physical device of the data processing system based on multi-source heterogeneous matching in an embodiment of this application.

It should be noted that the structure of the data processing system based on multi-source heterogeneous matching shown in FIG. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in FIG. 5, the data processing system based on multi-source heterogeneous matching includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing, such as executing the method described in the above embodiments, according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for system operation. The CPU 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

The following components are connected to the I/O interface 505: an input section 506 including an audio input device, push-button switches, and the like; an output section 507 including a liquid crystal display (LCD), an audio output device, indicator lights, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a local area network (LAN) card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.

In particular, according to the embodiments of the present invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 509 and/or installed from the removable medium 511. When the computer program is executed by the central processing unit (CPU) 501, the various functions defined in the present invention are performed.

It should be noted that specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. Each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings.

Specifically, the data processing system based on multi-source heterogeneous matching of this embodiment includes a processor and a memory. The memory stores a computer program which, when executed by the processor, implements the data processing method based on multi-source heterogeneous matching provided in the above embodiments.

In another aspect, the present invention further provides a computer-readable storage medium. The storage medium may be included in the data processing system based on multi-source heterogeneous matching described in the above embodiments, or it may exist separately without being assembled into that system. The storage medium carries one or more computer programs which, when executed by a processor of the data processing system based on multi-source heterogeneous matching, cause the system to implement the data processing method based on multi-source heterogeneous matching provided in the above embodiments.

The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

As used in the above embodiments, depending on the context, the term "when..." may be interpreted to mean "if..." or "after..." or "in response to determining..." or "in response to detecting...". Similarly, depending on the context, the phrase "upon determining..." or "if (the stated condition or event) is detected" may be interpreted to mean "if it is determined..." or "in response to determining..." or "upon detecting (the stated condition or event)" or "in response to detecting (the stated condition or event)".

A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (10)

1. A data processing method based on multi-source heterogeneous matching, characterized in that it is applied to a patent recommendation system, the method comprising:
performing text splitting on requirement information for product R&D within a target company to obtain R&D demand label information for product R&D within the target company;
obtaining expert label information in an expert database, patent label information in a patent database, and paper label information in a paper database, wherein the expert database contains the expert label information of each expert and the expert patent label information and expert paper label information corresponding to that expert;
separately calculating first similarities between the R&D demand label information and the patent label information, and selecting the optimal patent label information corresponding to the maximum first similarity;
separately calculating second similarities between the optimal patent label information and the paper label information, and selecting the optimal paper label information corresponding to the maximum second similarity;
separately calculating, for each expert, a third similarity between the expert patent label information and the optimal patent label information and a fourth similarity between the expert paper label information and the optimal paper label information, and taking the expert with the highest weighted average similarity as the optimal target expert, wherein the weighted average similarity is the weighted average of the third similarity and the fourth similarity;
sending the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to a client.

2. The method according to claim 1, characterized in that the step of separately calculating first similarities between the R&D demand label information and the patent label information and selecting the optimal patent label information corresponding to the maximum first similarity specifically comprises:
performing text preprocessing on the R&D demand label information to obtain an R&D demand keyword set;
performing text preprocessing on the patent label information to obtain patent keyword sets;
separately calculating the first similarities between the R&D demand keyword set and the patent keyword sets to obtain a first similarity collection;
sorting the first similarity collection in descending order of first similarity value, and taking the patent label information with the largest first similarity value as the optimal patent label information.

3. The method according to claim 1, characterized in that the step of separately calculating, for each expert, a third similarity between the expert patent label information and the optimal patent label information and a fourth similarity between the expert paper label information and the optimal paper label information, and taking the expert with the highest weighted average similarity as the optimal target expert, wherein the weighted average similarity is the weighted average of the third similarity and the fourth similarity, specifically comprises:
vectorizing the optimal patent label information, the optimal paper label information, and the expert patent label information and expert paper label information of each expert, to obtain an optimal patent feature vector, an optimal paper feature vector, expert patent feature vectors, and expert paper feature vectors;
separately calculating the third similarity between the optimal patent feature vector and each expert patent feature vector to obtain a third similarity set;
separately calculating the fourth similarity between the optimal paper feature vector and each expert paper feature vector to obtain a fourth similarity set;
performing a weighted summation of the third similarity and the fourth similarity corresponding to each expert to obtain a weighted average similarity set over the experts, wherein the weighted average similarity set contains a plurality of weighted average similarities, and each weighted average similarity is the weighted average of the third similarity and the fourth similarity;
determining the expert with the highest weighted average similarity in the weighted average similarity set as the optimal target expert.

4. The method according to claim 1, characterized in that, after the step of sending the optimal expert label information corresponding to the optimal target expert, the optimal patent label information, and the optimal paper label information to the client, the method further comprises:
receiving requirement document information sent by the client;
performing text mining on the requirement document information to obtain technical keywords, product feature keywords, and application scenario keywords;
searching the patent database with the technical keywords, the product feature keywords, and the application scenario keywords to obtain a patent document collection, wherein the patent document collection contains at least one patent document;
calculating a risk value from the patent document keywords of each patent document and the requirement document information to obtain a risk patent collection, wherein the risk patent collection contains at least one patent document whose risk value is greater than a preset risk threshold;
sending the risk patent collection to the client.

5. The method according to claim 4, characterized in that, after the step of sending the risk patent collection to the client, the method further comprises:
storing the candidate patent documents whose risk values are not greater than the preset risk threshold in the candidate patent section of the preset patent database.

6. The method according to claim 1, characterized in that the step of obtaining expert label information in the expert database, patent label information in the patent database, and paper label information in the paper database specifically comprises:
obtaining user satisfaction scores fed back by the client, wherein the user satisfaction scores include a first satisfaction score for the optimal expert label information, a second satisfaction score for the optimal patent label information, and a third satisfaction score for the optimal paper label information;
if any of the user satisfaction scores is lower than a preset score threshold, obtaining feedback information sent by the client;
extracting label screening conditions from the feedback information based on a large language model;
according to the label screening conditions, screening the initial expert label information in the expert database to obtain the expert label information, screening the initial patent label information in the patent database to obtain the patent label information, and screening the initial paper label information in the paper database to obtain the paper label information.

7. The method according to claim 6, characterized in that, after the step of obtaining the user satisfaction scores fed back by the client, wherein the user satisfaction scores include a first satisfaction score for the optimal expert label information, a second satisfaction score for the optimal patent label information, and a third satisfaction score for the optimal paper label information, the method further comprises:
if none of the user satisfaction scores is lower than the preset score threshold, pushing to the client the detailed introduction of the optimal target expert, the patent details corresponding to the optimal patent label information, and the paper details corresponding to the optimal paper label information.

8. A data processing system based on multi-source heterogeneous matching, characterized in that the data processing system based on multi-source heterogeneous matching comprises: one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code, the computer program code comprises computer instructions, and the one or more processors invoke the computer instructions to cause the data processing system based on multi-source heterogeneous matching to execute the method according to any one of claims 1-7.

9. A computer-readable storage medium comprising instructions, characterized in that, when the instructions are run on a data processing system based on multi-source heterogeneous matching, the data processing system based on multi-source heterogeneous matching is caused to execute the method according to any one of claims 1-7.
10.一种计算机程序产品,其特征在于,当所述计算机程序产品在基于多源异构匹配的数据处理系统上运行时,使得所述基于多源异构匹配的数据处理系统执行如权利要求1-7中任一项所述的方法。10. A computer program product, characterized in that when the computer program product is run on a data processing system based on multi-source heterogeneous matching, the data processing system based on multi-source heterogeneous matching executes the method according to any one of claims 1 to 7.
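The expert-selection steps claimed above (vectorization, third and fourth similarity, weighted average, selection of the highest score) can be illustrated with a minimal Python sketch. The claims do not specify the vectorization method, the similarity measure, or the weights, so the bag-of-words encoding, the cosine similarity, the equal default weights, and all function and parameter names below are assumptions made for illustration only.

```python
import math
from collections import Counter


def vectorize(labels):
    """Encode a list of label strings as a bag-of-words count vector.
    Assumed encoding: the claims only say the labels are 'vectorized'."""
    return Counter(labels)


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure).
    Returns 0.0 when either vector is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_optimal_expert(optimal_patent_labels, optimal_paper_labels,
                          experts, w_patent=0.5, w_paper=0.5):
    """Return (best_expert, scores).

    experts maps an expert name to (patent_labels, paper_labels).
    w_patent and w_paper are hypothetical weights; the claims do not fix them.
    """
    if not experts:
        raise ValueError("at least one expert is required")
    patent_vec = vectorize(optimal_patent_labels)
    paper_vec = vectorize(optimal_paper_labels)
    scores = {}
    for name, (patent_labels, paper_labels) in experts.items():
        third = cosine(patent_vec, vectorize(patent_labels))   # third similarity
        fourth = cosine(paper_vec, vectorize(paper_labels))    # fourth similarity
        # Weighted average of the third and fourth similarities.
        scores[name] = (w_patent * third + w_paper * fourth) / (w_patent + w_paper)
    best = max(scores, key=scores.get)
    return best, scores


experts = {
    "A": (["battery", "anode"], ["battery", "electrolyte"]),
    "B": (["vision", "camera"], ["battery", "anode"]),
}
best, scores = select_optimal_expert(["battery", "anode"],
                                     ["battery", "anode"], experts)
print(best, scores)
```

In this toy input, expert A matches the optimal patent labels exactly and the optimal paper labels partially, while expert B matches only the paper labels, so A receives the higher weighted average. A production system would likely replace the count vectors with learned embeddings; the selection logic would stay the same.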
CN202410618584.2A 2024-05-17 2024-05-17 A data processing method and system based on multi-source heterogeneous matching Withdrawn CN118484527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410618584.2A CN118484527A (en) 2024-05-17 2024-05-17 A data processing method and system based on multi-source heterogeneous matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410618584.2A CN118484527A (en) 2024-05-17 2024-05-17 A data processing method and system based on multi-source heterogeneous matching

Publications (1)

Publication Number Publication Date
CN118484527A true CN118484527A (en) 2024-08-13

Family

Family ID: 92194556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410618584.2A Withdrawn CN118484527A (en) 2024-05-17 2024-05-17 A data processing method and system based on multi-source heterogeneous matching

Country Status (1)

Country Link
CN (1) CN118484527A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119719177A (en) * 2025-02-25 2025-03-28 北京亦庄科技创新有限公司 Expert matching and management method and system based on modern database technology
CN120045904A (en) * 2025-04-23 2025-05-27 浙江大学 Multi-source data fusion-based technological achievement transformation potential prediction method and system

Similar Documents

Publication Publication Date Title
Gambhir et al. Recent automatic text summarization techniques: a survey
US9715493B2 (en) Method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
Ceri et al. Web information retrieval
US6732090B2 (en) Meta-document management system with user definable personalities
US6778979B2 (en) System for automatically generating queries
US6820075B2 (en) Document-centric system with auto-completion
Raza et al. A taxonomy and survey of semantic approaches for query expansion
US20030061200A1 (en) System with user directed enrichment and import/export control
US20060248076A1 (en) Automatic expert identification, ranking and literature search based on authorship in large document collections
US20050022114A1 (en) Meta-document management system with personality identifiers
US20030061201A1 (en) System for propagating enrichment between documents
CN118484527A (en) A data processing method and system based on multi-source heterogeneous matching
Al-Ghuribi et al. A comprehensive overview of recommender system and sentiment analysis
Ather The fusion of multilingual semantic search and large language models: A new paradigm for enhanced topic exploration and contextual search
Sarkar et al. NLP algorithm based question and answering system
Hu et al. Embracing information explosion without choking: Clustering and labeling in microblogging
Rogushina Use of Semantic Similarity Estimates for Unstructured Data Analysis.
Timonen Term weighting in short documents for document categorization, keyword extraction and query expansion
Allahim et al. Semantic approaches for query expansion: taxonomy, challenges, and future research directions
Moya et al. Integrating web feed opinions into a corporate data warehouse
CN110688559A (en) Retrieval method and device
Werner et al. Precision difference management using a common sub-vector to extend the extended VSM method
Zhang et al. A semantics-based method for clustering of Chinese web search results
Ma et al. Api prober–a tool for analyzing web api features and clustering web apis
Kamath et al. Semantic similarity based context-aware web service discovery using nlp techniques

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20240813