[go: up one dir, main page]

CN116108181A - Customer information processing method, device and electronic equipment - Google Patents

Customer information processing method, device and electronic equipment Download PDF

Info

Publication number
CN116108181A
CN116108181A CN202310114425.4A CN202310114425A CN116108181A CN 116108181 A CN116108181 A CN 116108181A CN 202310114425 A CN202310114425 A CN 202310114425A CN 116108181 A CN116108181 A CN 116108181A
Authority
CN
China
Prior art keywords
document
keyword
documents
word
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310114425.4A
Other languages
Chinese (zh)
Inventor
周晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310114425.4A priority Critical patent/CN116108181A/en
Publication of CN116108181A publication Critical patent/CN116108181A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a processing method and device of customer information and electronic equipment, wherein the method is applied to the field of big data and comprises the following steps: coding the client information in a plurality of incoming call lists of a plurality of clients to obtain a plurality of documents corresponding to the incoming call lists; acquiring a weighting matrix model of each document in a plurality of documents; acquiring a subject word set of each document in a plurality of documents; calculating the target similarity between each document and other documents in the plurality of documents according to the subject word set of each document and the weighted matrix model of each document; clustering a plurality of documents based on a PAM clustering algorithm according to the target similarity to obtain a clustered document set; and acquiring target information of a plurality of clients from the clustered document set. According to the method and the device, the problem that the information extracted from the electrician bill is inaccurate due to the fact that the value information of the electrician bill is extracted by adopting manual labeling or natural language processing technology in the related technology is solved.

Description

客户信息的处理方法、装置及电子设备Customer information processing method, device and electronic equipment

技术领域technical field

本申请涉及大数据领域,具体而言,涉及一种客户信息的处理方法、装置及电子设备。The present application relates to the field of big data, and in particular, relates to a method, device and electronic equipment for processing customer information.

背景技术Background technique

在提取多个客户的多个来电工单中的潜在价值信息时,常常采取人工分类和人工标注的方式提取信息,但由于人工标注多个来电工单的方法效率低下,并且受不同的标注者的主观影响较大,导致标注结果的质量和准确性较差,无法客观反映多个来电工单中的潜在价值信息。When extracting the potential value information in multiple incoming work orders of multiple customers, manual classification and manual labeling are often used to extract information. The subjective influence of , resulting in poor quality and accuracy of labeling results, cannot objectively reflect the potential value information in multiple call work orders.

目前,也会使用自然语言处理技术计算文本相似度,提取多个来电工单中的潜在价值信息,将非结构化的文本转化为便于计算机识别处理的结构化信息,以实现对文本形式信息的挖掘。传统的文本相似度计算模型主要可以分为三类:向量空间模型(VectorSpace Model,VSM)、广义向量空间模型(Generalized Vector Space Model,GVSM)以及隐性语义索引模型(Latent Semantic Indexing,LSI)或称为潜在语义分析(LatentSemantic Analysis,LSA)。这三种模型在提取特征时一般采用特征选择函数提取关键词作为文本特征,比如单词贡献度TC、TFIDF、信息熵/信息增益、互信息、CHI统计等方法。然而,传统文本相似度模型需要大规模语料库,并且经常忽略文本中的语法和组织结构以及语义信息。例如,向量空间模型利用词袋模型(Bag-of-words)来构建特征空间,而这种模型在特征匹配中通常采用“硬匹配”方法,无法解决“一义多词”和“一词多义”问题。At present, natural language processing technology is also used to calculate text similarity, extract potential value information in multiple call orders, and convert unstructured text into structured information that is easy for computer recognition and processing, so as to realize the recognition of text information. dig. Traditional text similarity calculation models can be divided into three categories: Vector Space Model (VectorSpace Model, VSM), Generalized Vector Space Model (Generalized Vector Space Model, GVSM), and Latent Semantic Indexing (LSI) or It is called Latent Semantic Analysis (LSA). These three models generally use feature selection functions to extract keywords as text features when extracting features, such as word contribution TC, TFIDF, information entropy/information gain, mutual information, CHI statistics and other methods. However, traditional text similarity models require large-scale corpora and often ignore the grammatical and organizational structure as well as semantic information in the text. For example, the vector space model uses the bag-of-words model (Bag-of-words) to construct the feature space, and this model usually uses the "hard matching" method in feature matching, which cannot solve the problems of "multiple words with one meaning" and "multiple words with one word". righteousness" issue.

针对相关技术中采取人工标注或自然语言处理技术提取来电工单的价值信息,导致从来电工单中提取的信息不准确的问题,目前尚未提出有效的解决方案。Aiming at the problem of inaccurate information extracted from the incoming work order by using manual annotation or natural language processing technology to extract the value information of the incoming work order in related technologies, no effective solution has been proposed so far.

发明内容Contents of the invention

本申请的主要目的在于提供一种客户信息的处理方法、装置及电子设备,以解决相关技术中采取人工标注或自然语言处理技术提取来电工单的价值信息,导致从来电工单中提取的信息不准确的问题。The main purpose of this application is to provide a customer information processing method, device, and electronic equipment to solve the problem of extracting the value information of the incoming work order by manual labeling or natural language processing technology in related technologies, resulting in the inaccurate information extracted from the incoming work order. exact question.

为了实现上述目的,根据本申请的一个方面,提供了一种客户信息的处理方法,该方法包括:对多个客户的多个来电工单中的客户信息进行编码处理,得到所述多个来电工单对应的多个文档,其中,所述多个文档的编码格式相同;获取所述多个文档中每个文档的加权矩阵模型;获取所述多个文档中每个文档的主题词集合;依据所述每个文档的主题词集合和所述每个文档的加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度;依据所述目标相似度,基于PAM聚类算法对所述多个文档进行聚类,得到聚类后的文档集合;从所述聚类后的文档集合中获取多个客户的目标信息。In order to achieve the above object, according to one aspect of the present application, a method for processing customer information is provided. The method includes: encoding the customer information in multiple incoming work orders of multiple customers, and obtaining the multiple incoming work orders. A plurality of documents corresponding to the electrical work order, wherein the encoding formats of the plurality of documents are the same; obtaining a weighted matrix model of each of the plurality of documents; obtaining a set of subject words of each of the plurality of documents; According to the keyword set of each document and the weighted matrix model of each document, calculate the target similarity between each document in the plurality of documents and each other document; according to the target similarity, Clustering the multiple documents based on the PAM clustering algorithm to obtain a clustered document set; acquiring target information of multiple customers from the clustered document set.

进一步地,获取所述多个文档中每个文档的加权矩阵模型包括:删除所述多个文档中的预设字符串,并对所述多个文档中每个文档进行分词、过滤处理,得到第一语料库;将所述第一语料库代入word2vec模型中进行计算,得到每个文档的词向量;将所述多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合;依据每个文档的关键词集合与其它每个文档的关键词集合的相似度,使用贪婪选择算法将每个文档的关键词集合中的关键词进行排序,获取每个文档的关键词向量;使用所述词向量替换所述关键词向量中的关键词,并使用TF-IDF特征选择函数计算所述关键词向量中的词向量的权值,得到替换后的关键词向量以及所述替换后的关键词向量中的词向量对应的词向量权值;依据所述替换后的关键词向量和所述词向量权值,获取每个文档的加权矩阵模型。Further, obtaining the weighted matrix model of each of the plurality of documents includes: deleting the preset character strings in the plurality of documents, and performing word segmentation and filtering on each of the plurality of documents, to obtain The first corpus; the first corpus is substituted into the word2vec model for calculation to obtain the word vector of each document; the plurality of documents are substituted into the TextRank model for calculation to obtain the keyword set of each document; according to each The similarity between the keyword set of the document and the keyword set of each other document, use the greedy selection algorithm to sort the keywords in the keyword set of each document, and obtain the keyword vector of each document; use the word The vector replaces the keywords in the keyword vector, and uses the TF-IDF feature selection function to calculate the weight of the word vector in the keyword vector to obtain the replaced keyword vector and the replaced keyword vector The word vector weight corresponding to the word vector in ; according to the replaced keyword vector and the word vector weight, the weighted matrix model of each document is obtained.

进一步地,将所述多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合包括:对所述多个文档中的每个文档进行第一处理,获取预设词性的词作为每个文档的候选关键词,其中,所述第一处理至少包括以下处理:分句处理、分词处理、过滤处理和标注词性处理;依据所述每个文档以及所述每个文档的候选关键词,获取第二语料库;使用TextRank模型将所述每个文档的候选关键词转换为关键词有向图,并计算所述候选关键词在每个文档中的权值,得到每个文档的加权关键词有向图;从所述加权关键词有向图中,获取所述权值高于预设权值的候选关键词,得到所述每个文档的关键词集合。Further, substituting the multiple documents into the TextRank model for calculation, and obtaining the keyword set of each document includes: performing a first process on each of the multiple documents, and obtaining a word with a preset part of speech as each Candidate keywords of documents, wherein the first processing includes at least the following processing: sentence segmentation processing, word segmentation processing, filtering processing and part-of-speech tagging processing; according to each document and the candidate keywords of each document, Obtain the second corpus; use the TextRank model to convert the candidate keywords of each document into a keyword directed graph, and calculate the weight of the candidate keywords in each document to obtain the weighted keywords of each document A directed graph: from the weighted keyword directed graph, obtain candidate keywords whose weight is higher than a preset weight, and obtain the keyword set of each document.

进一步地,获取所述多个文档中每个文档的主题词集合包括:依据TextRank模型配置词向量的词影响力得分的计算公式,计算公式如下所示:

Figure BDA0004078018520000021
Figure BDA0004078018520000022
其中,d代表预设的阻尼系数,a表示在所述加权关键词有向图中的关键词a,Out(b)表示在所述加权关键词有向图中关键词a指向的所有关键词的集合,In(a)表示在所述加权关键词有向图中所有指向关键词a的关键词的集合,b表示指向关键词a的关键词b,c表示关键词a指向的关键词c,S(b)表示关键词b的词影响力,tfidfa表示关键词a的词频和逆文档频率的乘积,tfidfc表示关键词c的词频和逆文档频率的乘积,wba表示在所述加权关键词有向图中关键词a和关键词b的权值,wbc表示在所述加权关键词有向图中关键词b和关键词c的权值;对所述多个文档中的每个文档进行第二处理,获取第三语料库以及所述第三语料库中的主题词,其中,所述第二处理至少包括以下处理:分词处理、过滤处理、提取初始主题词处理和统计词频处理;依据所述词影响力得分的计算公式,迭代计算所述第三语料库中的主题词的词影响力得分,当第N次迭代计算中与第N-1次迭代的差值小于预设阈值时停止迭代计算,从第N次迭代计算的计算结果中,获取所述词影响力得分高于预设词影响力得分的所述主题词,得到所述每个文档的主题词集合。Further, obtaining the subject word set of each document in the plurality of documents includes: configuring the calculation formula of the word influence score of the word vector according to the TextRank model, and the calculation formula is as follows:
Figure BDA0004078018520000021
Figure BDA0004078018520000022
Wherein, d represents the preset damping coefficient, a represents keyword a in the weighted keyword directed graph, and Out(b) represents all keywords pointed to by keyword a in the weighted keyword directed graph In(a) represents the set of all keywords pointing to keyword a in the weighted keyword directed graph, b represents keyword b pointing to keyword a, and c represents keyword c pointing to keyword a , S(b) represents the word influence of keyword b, tfidf a represents the product of word frequency and inverse document frequency of keyword a, tfidf c represents the product of word frequency and inverse document frequency of keyword c, w ba represents The weight value of keyword a and keyword b in the weighted keyword directed graph, w bc represents the weight value of keyword b and keyword c in the weighted keyword directed graph; Each document is subjected to the second processing to obtain the third corpus and the subject words in the third corpus, wherein the second process includes at least the following processing: word segmentation processing, filtering processing, extraction of initial subject words processing and statistical word frequency processing ; According to the calculation formula of the word influence score, iteratively calculate the word influence score of the subject words in the third corpus, when the difference between the Nth iteration and the N-1 iteration is less than the preset threshold stop the iterative calculation, and obtain the subject words whose word influence score is higher than the preset word influence score from the calculation result of the Nth iterative calculation, and obtain the subject subject set of each document.

进一步地,依据所述每个文档的主题词集合和所述每个文档的加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度包括:依据所述每个文档的主题词集合构造二部图模型,以计算所述每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度;依据矩阵的最小二乘距离公式,计算所述每个文档的加权矩阵模型与其他每个文档的加权矩阵模型之间的第二相似度;依据所述第一相似度和所述第二相似度,计算每个文档与其它每个文档之间的目标相似度。Further, according to the keyword set of each document and the weighted matrix model of each document, calculating the target similarity between each document in the plurality of documents and each other document includes: according to the The subject term set of each document constructs a bipartite graph model, to calculate the first similarity between the subject term set of each document and the subject term set of each other document; according to the least squares distance formula of the matrix, Calculate the second similarity between the weighted matrix model of each document and the weighted matrix model of each other document; calculate each document and each other according to the first similarity and the second similarity Target similarity between documents.

进一步地,依据所述每个文档的主题词集合构造二部图模型,以计算所述每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度包括:依据所述每个文档的主题词集合,构建每个文档和其它每个文档的所述二部图模型;使用匈牙利算法计算每个文档和其它每个文档之间的二部匹配最大权值,将所述二部匹配最大权值作为每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度。Further, constructing a bipartite graph model according to the subject term set of each document, so as to calculate the first similarity between the subject term set of each document and each other document subject term set includes: according to the Describe the subject word set of each document, construct the bipartite graph model of each document and each other document; use the Hungarian algorithm to calculate the bipartite matching maximum weight between each document and each other document, and combine all The maximum weight of the bipartite matching is used as the first similarity between the keyword set of each document and the keyword set of each other document.

进一步地,从所述聚类后的文档集合中获取多个客户的目标信息包括:依据获取主题词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的主题词集合;依据获取关键词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的关键词集合;依据每个聚类后的文档集合的主题词集合和每个聚类后的文档集合的关键词集合,确定所述多个客户的目标信息。Further, obtaining the target information of multiple customers from the clustered document collection includes: processing each clustered document collection according to the method of obtaining the subject word collection, and obtaining each clustered document collection The keyword set of the set; according to the method of obtaining the keyword set, process each clustered document set to obtain the keyword set of each clustered document set; according to the method of each clustered document set The keyword set and the keyword set of each clustered document set determine the target information of the plurality of customers.

进一步地,在依据所述每个文档的主题词集合和所述每个文档的加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度之后,所述方法还包括:将所述多个文档中每个文档与其他每个文档之间的所述目标相似度相加,得到所述每个文档的第三相似度;从所述多个文档中,获取所述第三相似度高于预设相似度的目标文档,得到目标文档集合;将所述目标文档集合推送至目标对象。Further, after calculating the target similarity between each document in the multiple documents and each other document according to the keyword set of each document and the weighted matrix model of each document, the The method further includes: adding the target similarity between each document in the plurality of documents and each other document to obtain a third similarity of each document; from the plurality of documents, Acquiring the target documents whose third similarity is higher than the preset similarity to obtain a target document set; pushing the target document set to the target object.

为了实现上述目的,根据本申请的另一方面,提供了一种客户信息的处理装置,该装置包括:第一获取单元,用于对多个客户的多个来电工单中的客户信息进行编码处理,得到所述多个来电工单对应的多个文档,其中,所述多个文档的编码格式相同;第二获取单元,用于获取所述多个文档中每个文档的加权矩阵模型;第三获取单元,用于获取所述多个文档中每个文档的主题词集合;计算单元,用于依据所述每个文档的主题词集合和所述每个文档的加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度;第四获取单元,用于依据所述目标相似度,基于PAM聚类算法对所述多个文档进行聚类,得到聚类后的文档集合;第五获取单元,用于从所述聚类后的文档集合中获取多个客户的目标信息。In order to achieve the above object, according to another aspect of the present application, a customer information processing device is provided, the device includes: a first acquisition unit, used to encode the customer information in multiple incoming work orders of multiple customers Processing to obtain a plurality of documents corresponding to the plurality of incoming work orders, wherein the encoding formats of the plurality of documents are the same; the second acquisition unit is configured to acquire a weighted matrix model of each document in the plurality of documents; The third obtaining unit is used to obtain the subject heading set of each document in the plurality of documents; the calculation unit is used to calculate the subject heading set of each document according to the weighting matrix model of each document. The target similarity between each document in the plurality of documents and each other document; the fourth acquisition unit is used to perform clustering on the plurality of documents based on the PAM clustering algorithm according to the target similarity, to obtain A clustered document set; a fifth acquiring unit, configured to acquire multiple customer target information from the clustered document set.

进一步地,所述第二获取单元包括:第一处理子单元,用于删除所述多个文档中的预设字符串,并对所述多个文档中每个文档进行分词、过滤处理,得到第一语料库;第一计算子单元,用于将所述第一语料库代入word2vec模型中进行计算,得到每个文档的词向量;第二计算子单元,用于将所述多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合;第一获取子单元,用于依据每个文档的关键词集合与其它每个文档的关键词集合的相似度,使用贪婪选择算法将每个文档的关键词集合中的关键词进行排序,获取每个文档的关键词向量;第三计算子单元,用于使用所述词向量替换所述关键词向量中的关键词,并使用TF-IDF特征选择函数计算所述关键词向量中的词向量的权值,得到替换后的关键词向量以及所述替换后的关键词向量中的词向量对应的词向量权值;第二获取子单元,用于依据所述替换后的关键词向量和所述词向量权值,获取每个文档的加权矩阵模型。Further, the second acquisition unit includes: a first processing subunit, configured to delete preset character strings in the plurality of documents, and perform word segmentation and filtering processing on each of the plurality of documents, to obtain The first corpus; the first calculation subunit is used to substitute the first corpus into the word2vec model for calculation to obtain the word vector of each document; the second calculation subunit is used to substitute the multiple documents into the TextRank model Calculate in to obtain the keyword set of each document; the first acquisition subunit is used to use the greedy selection algorithm to divide each document according to the similarity between the keyword set of each document and each other Sort the keywords in the keyword set to obtain the keyword vector of each document; the third calculation subunit is used to replace the keywords in the keyword vector with the word vector, and use the TF-IDF feature The selection function calculates the weight of the word vector in the keyword vector, obtains the word vector weight corresponding to the word vector in the keyword vector after the replacement and the word vector in the keyword vector after the replacement; the second acquisition subunit uses and obtaining a weighted matrix model of each document according to the replaced keyword vector and the word vector weight.

进一步地,所述第二计算子单元包括:处理模块,用于对所述多个文档中的每个文档进行第一处理,获取预设词性的词作为每个文档的候选关键词,其中,所述第一处理至少包括以下处理:分句处理、分词处理、过滤处理和标注词性处理;第一获取模块,用于依据所述每个文档以及所述每个文档的候选关键词,获取第二语料库;第一计算模块,用于使用TextRank模型将所述每个文档的候选关键词转换为关键词有向图,并计算所述候选关键词在每个文档中的权值,得到每个文档的加权关键词有向图;第二获取模块,用于从所述加权关键词有向图中,获取所述权值高于预设权值的候选关键词,得到所述每个文档的关键词集合。Further, the second calculation subunit includes: a processing module, configured to perform first processing on each of the plurality of documents, and obtain a word with a preset part of speech as a candidate keyword for each document, wherein, The first processing includes at least the following processing: sentence segmentation processing, word segmentation processing, filtering processing, and part-of-speech tagging processing; the first acquisition module is used to acquire the second document according to each document and the candidate keywords of each document. The second corpus; the first calculation module, for using the TextRank model to convert the candidate keywords of each document into a keyword directed graph, and calculate the weight of the candidate keywords in each document, and obtain each The weighted keyword directed graph of the document; the second acquisition module is used to obtain the candidate keywords whose weight is higher than the preset weight from the weighted keyword directed graph, and obtain the weight of each document collection of keywords.

进一步地,所述第三获取单元包括:配置子单元,用于依据TextRank模型配置词向量的词影响力得分的计算公式,计算公式如下所示:

Figure BDA0004078018520000041
Figure BDA0004078018520000042
其中,d代表预设的阻尼系数,a表示在所述加权关键词有向图中的关键词a,Out(b)表示在所述加权关键词有向图中关键词a指向的所有关键词的集合,In(a)表示在所述加权关键词有向图中所有指向关键词a的关键词的集合,b表示指向关键词a的关键词b,c表示关键词a指向的关键词c,S(b)表示关键词b的词影响力,tfidfa表示关键词a的词频和逆文档频率的乘积,tfidfc表示关键词c的词频和逆文档频率的乘积,wba表示在所述加权关键词有向图中关键词a和关键词b的权值,wbc表示在所述加权关键词有向图中关键词b和关键词c的权值;第二处理子单元,用于对所述多个文档中的每个文档进行第二处理,获取第三语料库以及所述第三语料库中的主题词,其中,所述第二处理至少包括以下处理:分词处理、过滤处理、提取初始主题词处理和统计词频处理;第四计算子单元,用于依据所述词影响力得分的计算公式,迭代计算所述第三语料库中的主题词的词影响力得分,当第N次迭代计算中与第N-1次迭代的差值小于预设阈值时停止迭代计算,从第N次迭代计算的计算结果中,获取所述词影响力得分高于预设词影响力得分的所述主题词,得到所述每个文档的主题词集合。Further, the third acquisition unit includes: a configuration subunit, configured to configure the calculation formula of the word influence score of the word vector according to the TextRank model, and the calculation formula is as follows:
Figure BDA0004078018520000041
Figure BDA0004078018520000042
Wherein, d represents the preset damping coefficient, a represents keyword a in the weighted keyword directed graph, and Out(b) represents all keywords pointed to by keyword a in the weighted keyword directed graph In(a) represents the set of all keywords pointing to keyword a in the weighted keyword directed graph, b represents keyword b pointing to keyword a, and c represents keyword c pointing to keyword a , S(b) represents the word influence of keyword b, tfidf a represents the product of word frequency and inverse document frequency of keyword a, tfidf c represents the product of word frequency and inverse document frequency of keyword c, w ba represents The weight value of keyword a and keyword b in the weighted keyword directed graph, w bc represents the weight value of keyword b and keyword c in the weighted keyword directed graph; the second processing subunit is used for Performing a second process on each of the plurality of documents to obtain a third corpus and subject words in the third corpus, wherein the second process at least includes the following processes: word segmentation processing, filtering processing, extraction Initial topic word processing and statistical word frequency processing; the fourth calculation subunit is used to iteratively calculate the word influence score of the topic words in the third corpus according to the calculation formula of the word influence score, when the Nth iteration When the difference between the calculation and the N-1th iteration is less than the preset threshold, the iterative calculation is stopped, and the word influence score higher than the preset word influence score is obtained from the calculation result of the Nth iteration calculation. subject words, to obtain a set of subject words of each document.

进一步地,所述计算单元包括:构造子单元,用于依据所述每个文档的主题词集合构造二部图模型,以计算所述每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度;第五计算子单元,用于依据矩阵的最小二乘距离公式,计算所述每个文档的加权矩阵模型与其他每个文档的加权矩阵模型之间的第二相似度;第六计算子单元,用于依据所述第一相似度和所述第二相似度,计算每个文档与其它每个文档之间的目标相似度。Further, the calculation unit includes: a construction subunit, configured to construct a bipartite graph model according to the keyword set of each document, so as to calculate the difference between the keyword set of each document and the keyword set of each other document. The first similarity between the collections; the fifth calculation subunit is used to calculate the second similarity between the weighted matrix model of each document and the weighted matrix model of each other document according to the least squares distance formula of the matrix Similarity: a sixth calculation subunit, configured to calculate a target similarity between each document and each other document according to the first similarity and the second similarity.

进一步地,所述第一构造子单元包括:构造模块,用于依据所述每个文档的主题词集合,构建每个文档和其它每个文档的所述二部图模型;第二计算模块,用于使用匈牙利算法计算每个文档和其它每个文档之间的二部匹配最大权值,将所述二部匹配最大权值作为每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度。Further, the first construction subunit includes: a construction module, configured to construct the bipartite graph model of each document and each other document according to the keyword set of each document; a second calculation module, It is used to calculate the maximum weight of bipartite matching between each document and each other document using the Hungarian algorithm, and the maximum weight of bipartite matching is used as the set of subject words of each document and the set of subject words of each other document The first similarity between.

进一步地,所述第五获取单元包括:第三处理子单元,用于依据获取主题词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的主题词集合;第四处理子单元,用于依据获取关键词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的关键词集合;确定子单元,用于依据每个聚类后的文档集合的主题词集合和每个聚类后的文档集合的关键词集合,确定所述多个客户的目标信息。Further, the fifth acquisition unit includes: a third processing subunit, configured to process each clustered document set according to the method of acquiring the subject word set, and acquire the subject of each clustered document set word set; the fourth processing subunit is used to process each clustered document set according to the method for obtaining the keyword set, and obtain the keyword set of each clustered document set; determine the subunit, use The target information of the plurality of customers is determined according to the keyword set of each clustered document set and the keyword set of each clustered document set.

进一步地,所述装置还包括:第六获取单元,用于在依据所述主题词集合和所述加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度之后,将所述多个文档中每个文档与其他每个文档之间的所述目标相似度相加,得到所述每个文档的第三相似度;第七获取单元,用于从所述多个文档中,获取所述第三相似度高于预设相似度的目标文档,得到目标文档集合;推送单元,用于将所述目标文档集合推送至目标对象Further, the device further includes: a sixth acquisition unit, configured to calculate the target similarity between each document in the plurality of documents and each other document according to the subject word set and the weighted matrix model After the degree of similarity, add the target similarity between each of the multiple documents and each of the other documents to obtain the third similarity of each document; the seventh acquisition unit is used to obtain from the Among the plurality of documents, the target document whose third similarity is higher than the preset similarity is obtained to obtain a target document set; a push unit is used to push the target document set to the target object

为了实现上述目的,根据本申请的一个方面,提供了一种电子设备,包括一个或多个处理器和存储器,存储器用于存储一个或多个程序,其中,当一个或多个程序被一个或多个处理器执行时,使得一个或多个处理器实现上述任意一项所述客户信息的处理方法。In order to achieve the above object, according to one aspect of the present application, an electronic device is provided, including one or more processors and a memory, and the memory is used to store one or more programs, wherein, when one or more programs are used by one or more When executed by multiple processors, one or more processors can implement any one of the above-mentioned customer information processing methods.

通过本申请,采用以下步骤:对多个客户的多个来电工单中的客户信息进行编码处理,得到所述多个来电工单对应的多个文档,其中,所述多个文档的编码格式相同;获取所述多个文档中每个文档的加权矩阵模型;获取所述多个文档中每个文档的主题词集合;依据所述每个文档的主题词集合和所述每个文档的加权矩阵模型,计算所述多个文档中每个文档与其它每个文档之间的目标相似度;依据所述目标相似度,基于PAM聚类算法对所述多个文档进行聚类,得到聚类后的文档集合;从所述聚类后的文档集合中获取多个客户的目标信息,解决了采取人工标注或自然语言处理技术提取来电工单的价值信息,导致从来电工单中提取的信息不准确的问题。通过计算多个来电工单中每个文档的加权矩阵模型和每个文档的主题词集合,以计算多个文档之间的目标相似度,从而在多个文档中提取目标信息,实现了通过算法自动从多个文档中提取目标信息,避免了人工标注多个文档对目标信息的信息质量产生的影响,达到了从多个来电工单中提取更加准确的目标信息的效果。Through this application, the following steps are adopted: encode the customer information in multiple incoming work orders of multiple customers, and obtain multiple documents corresponding to the multiple incoming work orders, wherein the encoding format of the multiple documents Same; Obtain the weighted matrix model of each document in the plurality of documents; Obtain the set of subject terms of each document in the plurality of documents; According to the set of subject terms of each document and the weighting of each document A matrix model, calculating the target similarity between each document in the plurality of documents and each other document; according to the target similarity, clustering the plurality of documents based on the PAM clustering algorithm to obtain a cluster After the collection of documents; the target information of multiple customers is obtained from the document collection after the clustering, which solves the problem of extracting the value information of the call work order by manual labeling or natural language processing technology, which leads to the inconsistency of the information extracted from the call work order. exact question. 
By calculating the weighted matrix model of each document and the subject word set of each document in multiple incoming call tickets, to calculate the target similarity between multiple documents, so as to extract target information in multiple documents, the algorithm is implemented Automatically extract target information from multiple documents, avoiding the impact of manual labeling of multiple documents on the information quality of target information, and achieve the effect of extracting more accurate target information from multiple call work orders.

附图说明Description of drawings

构成本申请的一部分的附图用来提供对本申请的进一步理解,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings constituting a part of the application are used to provide further understanding of the application, and the schematic embodiments and descriptions of the application are used to explain the application, and do not constitute an improper limitation to the application. In the attached picture:

FIG. 1 is a flow chart of a customer information processing method provided according to an embodiment of the present application;

FIG. 2 is a first schematic diagram of an optional customer information processing method provided according to an embodiment of the present application;

FIG. 3 is a second schematic diagram of an optional customer information processing method provided according to an embodiment of the present application;

FIG. 4 is a third schematic diagram of an optional customer information processing method provided according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a customer information processing apparatus provided according to an embodiment of the present application;

FIG. 6 is a schematic diagram of an electronic device for processing customer information provided according to an embodiment of the present application.

Detailed Description of Embodiments

It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in combination with the embodiments.

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented in orders other than those illustrated. Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, displayed data, and customers' incoming-call work orders) involved in this application are all information and data authorized by the users or fully authorized by all parties. The collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for users to choose to authorize or refuse.

The invention is described below in combination with the preferred implementation steps. FIG. 1 is a flow chart of a customer information processing method provided according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:

Step S101: encode the customer information in multiple incoming-call work orders of multiple customers to obtain multiple documents corresponding to the work orders, where the documents share the same encoding format.

At present, the incoming-call work orders of multiple customers are generally text files that record the communication between the customer and the staff. In this embodiment, in order to further process the customer information later, these text files need to be converted into multiple documents with the same encoding format.
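The encoding-normalization step (S101) can be sketched as follows. This is a minimal sketch, not the application's implementation: the candidate-encoding list and the sample string are illustrative assumptions.

```python
# Sketch: normalize raw work-order text files to one encoding (UTF-8).
# The candidate-encoding list below is an illustrative assumption.

def normalize_to_utf8(raw: bytes, candidates=("utf-8", "gb18030", "big5")) -> str:
    """Try each candidate encoding in turn and return the decoded text."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Fall back to a lossy decode so no work order is silently dropped.
    return raw.decode("utf-8", errors="replace")

# A GBK-encoded work-order fragment becomes a plain Python string,
# which can then be written back out uniformly as UTF-8.
raw = "客户来电咨询".encode("gb18030")
text = normalize_to_utf8(raw)
print(text)
```

Once every work order round-trips through a common string representation, all downstream steps (segmentation, word2vec, TextRank) see documents in one encoding format.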

Step S102: obtain a weighted matrix model for each of the multiple documents.

Step S103: obtain a topic-word set for each of the multiple documents.

Step S104: compute, from each document's topic-word set and each document's weighted matrix model, a target similarity between each document and every other document.

In this embodiment, in order to compute the target similarity between each document and every other document, after obtaining the documents in the same encoding format, the documents are processed to obtain a weighted matrix model and a topic-word set for each document.

Step S105: cluster the multiple documents with the PAM clustering algorithm according to the target similarity to obtain clustered document sets.

Step S106: obtain target information about the multiple customers from the clustered document sets.

In the prior art, when the incoming-call work orders of multiple customers are analyzed to extract the information staff are interested in, the work orders are generally labeled manually. However, manual labeling of many work orders is inefficient and cannot meet the demands of current big-data analysis; moreover, subjective differences between annotators mean there is no uniform labeling standard, so the information extracted from the work orders is of poor quality.

In this embodiment, to improve both the efficiency of labeling many work orders and the quality of the information extracted from them, the documents are clustered with the PAM clustering algorithm to obtain clustered document sets, and the target information of the multiple customers is then obtained from those sets. The target information includes the customers' feedback on the company's products and services, as well as the customers' potential business needs; extracting this information from the work orders of multiple customers provides useful support for improving customer service and precision marketing.
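The PAM (k-medoids) pass of step S105 can be sketched over a precomputed distance matrix, where the distance may be taken as, e.g., one minus the target similarity. This is a minimal sketch under stated assumptions: the naive first-k initialisation, the greedy swap loop, and the tiny hand-made distance matrix are all illustrative, not the application's implementation.

```python
def total_cost(dist, medoids):
    """Sum of each item's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Minimal PAM: greedily swap medoids with non-medoids while the
    total assignment cost keeps decreasing."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: first k items
    improved = True
    while improved:
        improved = False
        for m_idx in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:m_idx] + [h] + medoids[m_idx + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids = trial
                    improved = True
    medoids = sorted(medoids)
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]])
              for i in range(n)]
    return medoids, labels

# Two obvious groups: items 0-1 are close, items 2-3 are close.
dist = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0]]
medoids, labels = pam(dist, k=2)
print(labels)  # items 0,1 share one label; items 2,3 share the other
```

Unlike k-means, the cluster centers here are actual documents (medoids), so each cluster can be summarized by a representative work order.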

In summary, the customer information processing method provided by the embodiments of this application encodes the customer information in multiple incoming-call work orders of multiple customers to obtain multiple documents in the same encoding format; obtains a weighted matrix model for each document; obtains a topic-word set for each document; computes, from each document's topic-word set and weighted matrix model, a target similarity between each document and every other document; clusters the documents with the PAM clustering algorithm according to the target similarity to obtain clustered document sets; and obtains target information about the multiple customers from the clustered document sets. This solves the problem that extracting the valuable information of incoming-call work orders by manual labeling or by natural language processing techniques yields inaccurate information. By computing a weighted matrix model and a topic-word set for each document and using them to compute the target similarity between documents, target information is extracted from the documents automatically by the algorithm; this avoids the impact of manual labeling on the quality of the target information and achieves more accurate extraction of target information from the work orders.

Optionally, in the customer information processing method provided by the embodiments of this application, obtaining the weighted matrix model of each document includes: deleting preset character strings from the documents, and performing word segmentation and filtering on each document to obtain a first corpus; feeding the first corpus into a word2vec model to compute the word vectors of each document; feeding the documents into a TextRank model to compute the keyword set of each document; sorting the keywords in each document's keyword set with a greedy selection algorithm according to the similarity between that document's keyword set and each other document's keyword set, to obtain each document's keyword vector; replacing the keywords in the keyword vector with word vectors and computing the weights of the word vectors in the keyword vector with the TF-IDF feature selection function, to obtain the replaced keyword vector and the word-vector weights corresponding to the word vectors in it; and obtaining each document's weighted matrix model from the replaced keyword vector and the word-vector weights.
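The final assembly step above (keywords replaced by word vectors, each scaled by its TF-IDF weight) can be sketched as follows. The 3-dimensional vectors and the weights are illustrative stand-ins for word2vec and TF-IDF outputs, not trained values.

```python
# Sketch: build a document's weighted matrix from its ordered keyword
# vector. All vectors and weights below are illustrative assumptions.

word_vectors = {            # stand-in for word2vec output
    "转账": [0.2, 0.1, 0.7],
    "手续费": [0.6, 0.3, 0.1],
}
tfidf = {"转账": 0.8, "手续费": 0.5}   # stand-in for TF-IDF weights

keyword_vector = ["转账", "手续费"]    # ordered keywords of one document

# Each keyword is replaced by its word vector, scaled by its weight;
# the rows together form the document's weighted matrix.
weighted_matrix = [
    [x * tfidf[kw] for x in word_vectors[kw]] for kw in keyword_vector
]
print(weighted_matrix)
```

The resulting matrix is what the later matrix-distance step compares across documents.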

At present, when core content is extracted from text by natural language parsing techniques, either a small set of discrete categories is extracted from the text or a word2vec model is used to compute the word vectors corresponding to the text. Processing text with a small set of discrete categories means representing the text by its noun phrases and verb phrases; however, this approach cannot fully capture the richness of the many phrases in the text, and it occupies a huge feature space. Comparing the word2vec approach with the best earlier techniques based on various types of neural networks shows that word2vec achieves a substantial improvement in accuracy at a lower computational cost. However, an inherent limitation of representing text with word vectors is that they ignore the order of words in the text, and word vectors cannot represent idiomatic phrases; for example, a word vector cannot accurately express the meaning of an idiom. In addition, existing models that represent text with word vectors, such as the word-vector mean model, the word-vector clustering model, and the doc2vec model, do not consider the influence of a word within the text.

In this embodiment, in order to represent the meaning of the text more accurately, the documents are first segmented and filtered to obtain a first corpus for subsequent computation; the first corpus is then fed into a word2vec model to obtain each document's word vectors; the documents are fed into a TextRank model to obtain each document's keyword set; the similarity between every keyword in each document's keyword set and every keyword in each other document's keyword set is then computed, and according to the magnitudes of these similarity values a greedy selection algorithm sorts the keywords in each document's keyword set to obtain the document's keyword vector; the keywords in the keyword vector are replaced with the word vectors computed by the word2vec model, and the weights of the word vectors in the keyword vector are computed with the TF-IDF feature selection function, giving the replaced keyword vector and the corresponding word-vector weights; finally, each document's weighted matrix model is obtained from the replaced keyword vector and the word-vector weights.

Common text feature selection functions in the prior art mainly include word contribution (TC), information entropy / information gain, mutual information, and the χ² statistic. Information entropy / information gain considers only a feature's contribution to the whole and cannot be specific to individual categories, so it is suitable only for selecting "global" features, not "local" ones; it cannot incorporate the context adjacent to a word into the computation and is therefore ill-suited to extracting keywords from text. Mutual information is sensitive only to borderline feature words and therefore tends to overweight the influence of rare words on text classification; the words in incoming-call work orders are generally everyday words rather than rare ones, so mutual information is also unsuitable for computing the keywords in this embodiment. The χ² statistic gives accurate results only when the correlation between the text's feature words and the categories follows a χ² distribution; when this assumption is not satisfied, the result deviates greatly from reality, and the words in incoming-call work orders cannot be guaranteed to satisfy a χ² distribution, so the χ² statistic is likewise unsuitable for extracting keywords in this embodiment. Based on the above analysis, this embodiment uses the TF-IDF value from the word contribution TC as the feature selection function to compute the weights of the word vectors in the keyword vector. In the TF-IDF value, TF (term frequency) measures a word's local importance, namely its frequency in the document: the larger the TF value, the greater the word's contribution to the document. IDF (inverse document frequency) is given by log(D/D_w), where D is the total number of texts and D_w is the number of texts in which word w appears; IDF is the cross-entropy (Kullback-Leibler divergence) of the keyword's probability distribution under a given condition. The TF-IDF value thus measures a word's importance from both its frequency in the text and its distribution across the corpus.
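A minimal sketch of the TF-IDF value as defined above, with TF as the in-document frequency and IDF = log(D/D_w); the toy corpus is an illustrative assumption.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF as described above: term frequency within the document
    times log(D / D_w), where D is the number of documents in the
    corpus and D_w the number of documents containing the term."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    d_w = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / d_w)
    return tf * idf

# Toy corpus of segmented work-order fragments (illustrative).
corpus = [["转账", "失败", "转账"], ["查询", "余额"], ["转账", "限额"]]
score = tf_idf("转账", corpus[0], corpus)
print(round(score, 4))
```

Note that a word appearing in every document gets IDF = log(1) = 0, i.e. zero weight, which matches the intuition that corpus-wide words carry no discriminative information.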

By replacing the keywords in the keyword vector computed by the TextRank model with the word vectors computed by the word2vec model, and computing the word-vector weights with the TF-IDF feature selection function, a text weighted matrix model that represents the text more accurately is obtained. This not only solves the problem of polysemy but also incorporates contextual semantic information, computing the weight of each word vector from both the word's frequency in the document and its distribution in the corpus. It improves the accuracy with which word vectors represent text, achieves the effect of representing documents with a text weighted matrix model, and further improves the accuracy of the similarity between documents.

Optionally, in the customer information processing method provided by the embodiments of this application, feeding the documents into the TextRank model to obtain each document's keyword set includes: performing first processing on each document to obtain words of preset parts of speech as the document's candidate keywords, where the first processing includes at least sentence segmentation, word segmentation, filtering, and part-of-speech tagging; obtaining a second corpus from the documents and their candidate keywords; using the TextRank model to convert each document's candidate keywords into a keyword directed graph, and computing the weights of the candidate keywords within each document, to obtain each document's weighted keyword directed graph; and obtaining, from the weighted keyword directed graph, the candidate keywords whose weights exceed a preset weight, yielding each document's keyword set.

In general, keywords convey the key information or core idea of the entire text. Because neither the graph-based TextRank model nor the LDA model requires the corpus to be trained in advance, existing natural language parsing techniques often use these two algorithms to extract keywords from text. However, when new text is added to the computation, the poor extensibility of the LDA model means the model must be retrained, whereas the TextRank model remains simpler to compute; therefore, this embodiment uses the TextRank model to extract keywords from the documents.

In this embodiment, to obtain the keyword sets of the documents, the documents in the original corpus are first subjected to sentence segmentation, word segmentation, filtering, and part-of-speech tagging to obtain each document's candidate keywords; the candidate keywords of each document together form the second corpus. The TextRank model is then used to build a candidate keyword graph G = (V, E), with the candidate keywords as the nodes V, and edges E constructed between any two nodes from their co-occurrence relation. The weight of each node is computed iteratively until the error rate between any two nodes in the keyword graph is below a preset threshold, at which point the iteration stops and each document's weighted keyword directed graph is obtained. The nodes of the weighted keyword directed graph are sorted by their weights, and the c candidate keywords whose node weights exceed the preset weight are taken as the document's keyword set; if any of these c keywords are adjacent phrases in the text of the work order, the adjacent keywords are combined into multi-word keywords.

By using the TextRank model to build a weighted keyword directed graph for each of the documents, the keyword sets of the documents are obtained. This improves the ability of keywords to represent textual information, achieves a more accurate representation of each document's meaning, and improves the accuracy of the similarity between documents.

Optionally, in the customer information processing method provided by the embodiments of this application, obtaining the topic-word set of each document includes: configuring, based on the TextRank model, the following formula for the word influence score of a word vector:

S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / ( Σ_{c∈Out(b)} (w_bc × tfidf_c) ) ] × S(b)

where d is a preset damping coefficient; a is a keyword in the weighted keyword directed graph; In(a) is the set of all keywords pointing to keyword a in the weighted keyword directed graph; b is a keyword pointing to keyword a; Out(b) is the set of all keywords that keyword b points to in the weighted keyword directed graph; c is a keyword that keyword b points to; S(b) is the word influence of keyword b; tfidf_a is the product of keyword a's term frequency and inverse document frequency; tfidf_c is the product of keyword c's term frequency and inverse document frequency; w_ba is the weight of the edge between keywords b and a in the weighted keyword directed graph; and w_bc is the weight of the edge between keywords b and c in the weighted keyword directed graph. Second processing is performed on each document to obtain a third corpus and the topic words in the third corpus, where the second processing includes at least word segmentation, filtering, initial topic-word extraction, and word-frequency statistics. The word influence scores of the topic words in the third corpus are computed iteratively according to the formula; the iteration stops when the difference between the N-th and (N-1)-th iterations is below a preset threshold, and from the result of the N-th iteration the topic words whose word influence scores exceed a preset word influence score are taken, giving each document's topic-word set.

Nowadays, topic models are commonly used to identify the topic information of large-scale documents, but when faced with dynamically growing text a topic model has difficulty finding a suitable number of topic dimensions to project onto, so it cannot obtain topic words that accurately represent the documents. Moreover, topic models currently rely on the WordNet and HowNet semantic dictionaries; these are semantic dictionaries for the general English and Chinese domains, and in highly specialized text-processing domains a large amount of professional vocabulary is not yet included in them, so the similarity of some words cannot be computed and polysemy cannot be handled well. During the training of word vectors, by contrast, the context of each word is captured; this context not only binds the relations between words but also fills the semantic gaps caused by professional vocabulary missing from dictionaries. Therefore, this embodiment uses word vectors to measure the semantic relations between words in order to obtain each document's topic-word set.

In this embodiment, because word vectors and keywords are introduced in the process of obtaining each document's topic-word set, the original word influence formula must be modified when computing the influence between words. The improved word influence score formula is:

S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / ( Σ_{c∈Out(b)} (w_bc × tfidf_c) ) ] × S(b)

where d is a preset damping coefficient; a is a keyword in the weighted keyword directed graph; In(a) is the set of all keywords pointing to keyword a in the weighted keyword directed graph; b is a keyword pointing to keyword a; Out(b) is the set of all keywords that keyword b points to in the weighted keyword directed graph; c is a keyword that keyword b points to; S(b) is the word influence of keyword b; tfidf_a is the product of keyword a's term frequency and inverse document frequency; tfidf_c is the product of keyword c's term frequency and inverse document frequency; w_ba is the weight of the edge between keywords b and a; and w_bc is the weight of the edge between keywords b and c. After the word influence formula is determined, the documents undergo word segmentation, filtering, initial topic-word extraction, and word-frequency statistics, yielding the third corpus and its topic words. Finally, the word influence score of each topic word in the third corpus is computed iteratively according to the formula until either the difference between the N-th and (N-1)-th iterations is below a preset threshold or the maximum number of iterations is reached; the iteration then stops, and from the result of the N-th iteration the topic words in each document whose word influence scores exceed the preset word influence score are taken, giving each document's topic-word set.
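The improved influence iteration can be sketched directly from the formula above. The small symmetric graph, its edge weights, and the tfidf values are illustrative assumptions, not trained quantities.

```python
# Sketch: iterate S(a) = (1-d) + d * sum_b [w_ba*tfidf_a /
# sum_c(w_bc*tfidf_c)] * S(b). Edge weights are scaled by TF-IDF, so a
# high-tfidf, well-connected word accumulates more influence.

d = 0.85
w = {"a": {"b": 2, "c": 1},       # symmetric co-occurrence weights
     "b": {"a": 2, "c": 1},
     "c": {"a": 1, "b": 1}}
tfidf = {"a": 0.9, "b": 0.5, "c": 0.2}   # illustrative TF-IDF values

s = {v: 1.0 for v in w}
for _ in range(100):
    new = {}
    for a in w:
        new[a] = (1 - d) + d * sum(
            (w[b][a] * tfidf[a])
            / sum(w[b][c] * tfidf[c] for c in w[b])
            * s[b]
            for b in w if a in w[b]
        )
    converged = max(abs(new[v] - s[v]) for v in w) < 1e-6
    s = new
    if converged:
        break

print(max(s, key=s.get))  # "a": highest tfidf and heaviest edges
```

Thresholding the converged scores would then select the topic-word set, as described above.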

By introducing word vectors and keywords into the computation of topic-word influence, using them to measure the relations between topic words, and improving the word influence formula used to compute the topic words' weights, each document's topic-word set is obtained. The computed topic words incorporate contextual semantic information, so the topic words of the documents are extracted more accurately, which in turn improves the accuracy of the similarity between documents.

Optionally, in the customer information processing method provided by the embodiments of this application, computing the target similarity between each document and every other document from the topic-word sets and the weighted matrix models includes: constructing a bipartite graph model from each document's topic-word set to compute a first similarity between each document's topic-word set and every other document's topic-word set; computing, with the least-squares distance formula for matrices, a second similarity between each document's weighted matrix model and every other document's weighted matrix model; and computing the target similarity between each document and every other document from the first similarity and the second similarity.

To calculate the target similarity between each document and every other document, a bipartite graph model is first constructed from the subject-word set of each document, and the first similarity between the subject-word sets of each pair of documents is calculated. The least-squares distance formula for matrices is then used to calculate the second similarity between the weighted matrix models of each pair of documents. For example, the specific steps for calculating the similarity between the text matrices doc1 (an N x m matrix, where N is the word-vector dimension) and doc2 (an N x n matrix) are: perform an orthogonal-triangular (QR) decomposition of matrix doc1 to obtain an orthogonal matrix X; compute X^T doc2 and perform an orthogonal-triangular decomposition to obtain the difference matrix D of matrices doc1 and doc2; and compute the distance between doc1 and doc2 (i.e., the second similarity) using the following formula:

Figure BDA0004078018520000121

where aij is the element in row i and column j of the difference matrix D. Finally, the target similarity between each document and every other document is calculated from the first similarity and the second similarity, using the following formula:

similarity(doc1, doc2) = α·DSM(doc1, doc2) + β·TSM(doc1, doc2)

where doc1 and doc2 are any two documents, DSM(doc1, doc2) denotes the first similarity between matrices doc1 and doc2, TSM(doc1, doc2) denotes the second similarity between matrices doc1 and doc2, and α and β are the weights of the first similarity and the second similarity, respectively.
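The two similarity components and their weighted combination can be sketched as follows. The patent gives the matrix-distance formula only as a figure, so the residual-based difference matrix D and the Frobenius-norm distance used here are assumptions; the 0.5/0.5 default weights are likewise assumed.

```python
import numpy as np

def second_similarity(doc1: np.ndarray, doc2: np.ndarray) -> float:
    # QR (orthogonal-triangular) decomposition of doc1 yields an
    # orthogonal basis X of its column space
    X, _ = np.linalg.qr(doc1)
    # Assumed difference matrix D: the residual of doc2 outside the
    # column space of doc1 (the patent defines D only via a figure)
    D = doc2 - X @ (X.T @ doc2)
    # Assumed distance: square root of the summed squares of the
    # elements a_ij of D (Frobenius norm)
    return float(np.linalg.norm(D))

def target_similarity(dsm: float, tsm: float,
                      alpha: float = 0.5, beta: float = 0.5) -> float:
    # similarity(doc1, doc2) = alpha * DSM(doc1, doc2) + beta * TSM(doc1, doc2)
    return alpha * dsm + beta * tsm
```

When doc2 lies entirely in the column space of doc1 (for instance, doc2 = doc1), the residual D vanishes and the distance is zero, matching the intuition that identical text matrices are maximally similar.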

By combining the subject-word set and the weighted matrix model of each document to calculate the target similarity between each pair of documents, the influence of the semantic gap is avoided and the contextual semantic information in the documents is taken into account, making the target similarity between each pair of documents more accurate and improving the accuracy of the similarity between the multiple documents.

Optionally, in the customer-information processing method provided in the embodiments of the present application, constructing a bipartite graph model from the subject-word set of each document to calculate the first similarity between the subject-word set of each document and that of every other document includes: constructing the bipartite graph model between each document and every other document from their subject-word sets; and using the Hungarian algorithm to calculate the maximum weight of the bipartite matching between each document and every other document, and taking this maximum matching weight as the first similarity between the subject-word sets of the two documents.

In this embodiment, to calculate the first similarity between the subject-word sets of each pair of documents, after the subject-word set of each document is obtained, a bipartite graph model is constructed between each document and every other document from their subject-word sets; that is, the bipartite graph model between two documents represents the correlation between them. The specific process of constructing the bipartite graph model between two documents is as follows. Given the two topic sets T_d1 and T_d2 of two documents, construct a bipartite graph B(T_d1, T_d2), where |V(T_d1)| denotes the nodes of topic set T_d1, |V(T_d2)| denotes the nodes of topic set T_d2, and b(u) denotes the node corresponding to u in B(T_d1, T_d2). For each node u ∈ V(T_d1), select the node v ∈ V(T_d2) with the greatest similarity to u to form an edge; if multiple nodes v maximize the similarity w(b(u), b(v)) between u and v, multiple edges are formed. For each node v ∈ V(T_d2), likewise select the node u ∈ V(T_d1) with the greatest similarity to v to form an edge; if the edge already exists, it is not added again. The Hungarian algorithm is then used to find the maximum-similarity matching among the connected edges: starting from an unmatched node, augmenting paths are repeatedly sought; when an augmenting path is found, the matching edges and nodes on it are matched; when no augmenting path can be found, the bipartite graph has reached a maximum matching. After all nodes of the bipartite graph have been traversed once, the unmatched nodes are collected, and a new bipartite graph is constructed and matched again, until no unmatched nodes remain; the maximum similarity w(b(u), b(v)) (i.e., the maximum weight) is then obtained from the completed matching. Finally, the maximum weight after the bipartite matching is taken as the first similarity between the subject-word sets of the two documents.
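The maximum-weight matching step can be sketched as follows. The patent uses the Hungarian (augmenting-path) algorithm; for brevity this sketch finds the same optimum by brute force over assignments, which is practical only for the small topic sets assumed here, and it assumes the first subject-word set is no larger than the second.

```python
from itertools import permutations

def first_similarity(sim):
    # sim[i][j]: similarity w(b(u), b(v)) between subject word i of T_d1
    # and subject word j of T_d2; assumes len(sim) <= len(sim[0])
    n, m = len(sim), len(sim[0])
    best = 0.0
    # enumerate every assignment of T_d1 words to distinct T_d2 words;
    # the optimum equals the Hungarian algorithm's maximum matching weight
    for cols in permutations(range(m), n):
        best = max(best, sum(sim[i][j] for i, j in zip(range(n), cols)))
    return best
```

For example, with sim = [[0.9, 0.1], [0.4, 0.8]], the best matching pairs word 0 with word 0 and word 1 with word 1, for a first similarity of 1.7.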

By introducing the bipartite graph model and the Hungarian algorithm to calculate the first similarity between the subject-word sets of the multiple documents, the first similarity between the subject-word sets becomes more accurate, improving the accuracy of the similarity between the multiple documents.

Optionally, in the customer-information processing method provided in the embodiments of the present application, obtaining the target information of the multiple customers from the clustered document sets includes: processing each clustered document set according to the method for obtaining subject-word sets, to obtain the subject-word set of each clustered document set; processing each clustered document set according to the method for obtaining keyword sets, to obtain the keyword set of each clustered document set; and determining the target information of the multiple customers according to the subject-word set and the keyword set of each clustered document set.

To mine customers' potential-value information (i.e., the target information) from the multiple documents, subject words are first extracted from each cluster of documents according to the method for obtaining subject-word sets from documents, and the subject-word set corresponding to each cluster is taken as the customer information of the business-subcategory class of incoming-call information. Keywords are then extracted from each cluster according to the method for obtaining keyword sets from documents, and the keyword set corresponding to each cluster is taken as the customer information of the problem-summary class of incoming-call information. Finally, the two classes of customer information are combined to obtain the target information extracted from the multiple incoming-call work orders of the multiple customers.
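The assembly of the two classes of customer information per cluster can be sketched as follows; the field names and the injected extractor callables are hypothetical placeholders for the subject-word and keyword extraction methods described above.

```python
def customer_target_info(clusters, extract_subject_words, extract_keywords):
    # clusters: {cluster_id: [documents]}; the two extractor callables
    # stand in for the subject-word and keyword extraction methods
    # (hypothetical names, not from the patent)
    info = {}
    for cid, docs in clusters.items():
        info[cid] = {
            "business_subcategory": extract_subject_words(docs),
            "problem_summary": extract_keywords(docs),
        }
    return info
```

The cluster-level subject words describe which business line the calls concern, while the keywords summarize the concrete problems raised, so the two fields together form the target information.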

By processing the clustered documents with the methods for obtaining subject-word sets and keyword sets from documents, the customer information of the business-subcategory class and of the problem-summary class of incoming-call information (i.e., the target information) is obtained. The contextual information in the work orders is thus fully considered when extracting the target information, so that the target information customers care most about is mined from the work orders, and more accurate customer potential-value information is extracted from them.

Optionally, in the customer-information processing method provided in the embodiments of the present application, after calculating the target similarity between each document and every other document according to the subject-word set and the weighted matrix model of each document, the method further includes: adding up the target similarities between each document and every other document to obtain the third similarity of each document; obtaining, from the multiple documents, the target documents whose third similarity is higher than a preset similarity, to obtain a target document set; and pushing the target document set to a target object.

After the target similarity between each pair of documents is calculated, the solution of this embodiment also obtains the most popular work order(s) from the multiple incoming-call work orders, i.e., the work order(s) whose core content appears most frequently among them. First, the sum of the target similarities between each document and every other document is calculated to obtain the third similarity of each document. Then, the target documents whose third similarity is higher than a preset similarity are obtained from the multiple documents, yielding the target document set, i.e., the most popular work order(s). Finally, the target document set is pushed to the relevant staff (i.e., the target object), to improve the quality of customer service and to provide customers with more precise marketing that meets their urgent needs.
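The hot-work-order selection can be sketched as follows; the threshold is the preset similarity from the text, and the example matrix below is illustrative only.

```python
def hottest_documents(target_sim, threshold):
    # target_sim[i][j]: target similarity between documents i and j
    n = len(target_sim)
    # third similarity: sum of a document's target similarities
    # to every other document
    third = [sum(target_sim[i][j] for j in range(n) if j != i)
             for i in range(n)]
    # keep the documents whose third similarity exceeds the threshold
    return [i for i, s in enumerate(third) if s > threshold]
```

A document that is highly similar to many other work orders accumulates a large third similarity, which is why it is treated as a popular (frequently recurring) work order.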

By calculating the third similarity of the multiple incoming-call work orders, the work orders customers care most about are obtained and pushed to the relevant staff, which helps improve the staff's service quality and supports precise marketing that meets customers' urgent needs. The work orders customers care most about are thus extracted more accurately, further improving customer satisfaction.

Optionally, in this embodiment, the process of calculating the similarity between multiple weighted matrix models may be as shown in Figure 2. First, the multiple documents in the corpus are preprocessed and segmented to obtain the first corpus. The TextRank model is then used to extract the keyword set of each document from the first corpus, and the Word2vec model is trained on the first corpus to obtain word vectors. Each keyword in the keyword set of each document is replaced with its word vector to obtain a keyword vector, and the weight of each keyword vector is calculated from its TF-IDF value, yielding the text matrix model of each document. The matrix least-squares distance is used to calculate the similarity between the text matrix models, and finally these similarities are substituted into the text-similarity formula to calculate the text similarity between the multiple documents.
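The construction of the weighted text-matrix model can be sketched as follows, assuming the word vectors from a trained Word2vec model and the precomputed TF-IDF weights are supplied as plain dicts (an assumption about the upstream interfaces, which the patent does not fix).

```python
import numpy as np

def weighted_matrix(keywords, word_vectors, tfidf):
    # replace each keyword with its word vector and scale it by its
    # TF-IDF weight; columns are keywords, rows are vector dimensions
    cols = [tfidf[w] * np.asarray(word_vectors[w], dtype=float)
            for w in keywords]
    return np.stack(cols, axis=1)  # shape (N, m): vector dim x keyword count
```

The resulting N x m matrix is the per-document text matrix model whose pairwise distances are then computed with the matrix least-squares distance.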

Optionally, in this embodiment, the main workflow of mining value information from the multiple incoming-call work orders of multiple customers may be as shown in Figure 3. Step 1: perform semantic retrieval on the preprocessed work orders using a Word2vec-based word-vector model. Step 2: extract the keywords of the multiple work orders using the TextRank model, and mine the subject words of the work orders based on the word influence of the keywords. Step 3: represent the text content of the multiple work orders with the text weighted matrix model, and represent it with the subject words based on the text graph model. Step 4: calculate the keyword-model similarity between the text weighted matrices using the matrix least-squares distance algorithm, and calculate the topic-model similarity between the subject-word sets using the Hungarian bipartite-graph matching algorithm. Step 5: mine customers' potential intentions using a cluster-analysis algorithm, and mine the popular work orders among the multiple work orders based on the calculated text similarity.

Optionally, in this embodiment, the workflow of mining value information from the multiple incoming-call work orders of multiple customers may be as shown in Figure 4. Step 1: perform text preprocessing on the original corpus, and compute the word vectors of the processed corpus using a Word2vec-based word-vector model. Step 2: extract the keywords of the processed corpus using the TextRank model to obtain the text keyword sets, and iteratively calculate the weighted scores of the keywords according to their term frequency/inverse document frequency to obtain the text topic sets. Step 3: construct the weighted text matrices from the text keyword sets, and construct the bipartite graph models and calculate the topic similarity from the text topic sets. Step 4: calculate the text semantic similarity between the weighted text matrices using the matrix least-squares distance algorithm, calculate the text topic similarity using text-topic similarity matching, and then calculate the similarity between texts from the text semantic similarity and the text topic similarity. Step 5: preprocess the multiple incoming-call work orders, analyze them with the PAM clustering algorithm, and extract subject words and keywords from them; finally, mine customers' calling intentions from the extracted subject words and keywords, and mine the top (i.e., popular) work orders among the multiple work orders according to the text similarity obtained in Step 4.
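The PAM clustering step can be sketched as a minimal k-medoids loop over a precomputed distance matrix. Converting the text similarity to a distance (e.g., 1 minus similarity) and seeding with the first k documents are assumptions of this sketch, not details fixed by the patent; a full PAM implementation also includes a BUILD phase and pairwise swap tests.

```python
def pam_cluster(dist, k, iters=20):
    # dist[i][j]: distance between documents i and j (e.g. 1 - similarity)
    n = len(dist)
    medoids = list(range(k))  # assumed seed: the first k documents
    for _ in range(iters):
        # assignment step: each document joins its nearest medoid
        assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        # update step: each medoid is replaced by the cluster member
        # minimizing the total intra-cluster distance
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assign
```

Because PAM works directly on a distance matrix, it accepts the text similarities computed in Step 4 without requiring the documents to be embedded in a metric vector space.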

It should be noted that the steps shown in the flowcharts of the accompanying drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described here.

An embodiment of the present application further provides an apparatus for processing customer information. It should be noted that the customer-information processing apparatus of this embodiment may be used to execute the customer-information processing method provided by the embodiments of the present application. The customer-information processing apparatus provided by the embodiments of the present application is introduced below.

Figure 5 is a schematic diagram of an apparatus for processing customer information according to an embodiment of the present application. As shown in Figure 5, the apparatus includes: a first acquisition unit 501, a second acquisition unit 502, a third acquisition unit 503, a calculation unit 504, a fourth acquisition unit 505, and a fifth acquisition unit 506.

Specifically, the first acquisition unit 501 is configured to encode the customer information in the multiple incoming-call work orders of multiple customers, to obtain multiple documents corresponding to the work orders, where the multiple documents share the same encoding format.

The second acquisition unit 502 is configured to obtain the weighted matrix model of each of the multiple documents.

The third acquisition unit 503 is configured to obtain the subject-word set of each of the multiple documents.

The calculation unit 504 is configured to calculate the target similarity between each document and every other document among the multiple documents according to the subject-word set and the weighted matrix model of each document.

The fourth acquisition unit 505 is configured to cluster the multiple documents with the PAM clustering algorithm according to the target similarity, to obtain the clustered document sets.

The fifth acquisition unit 506 is configured to obtain the target information of the multiple customers from the clustered document sets.

In the customer-information processing apparatus provided by the embodiments of the present application, the first acquisition unit 501 encodes the customer information in the multiple incoming-call work orders of multiple customers to obtain multiple documents with the same encoding format; the second acquisition unit 502 obtains the weighted matrix model of each document; the third acquisition unit 503 obtains the subject-word set of each document; the calculation unit 504 calculates the target similarity between each document and every other document according to the subject-word set and the weighted matrix model of each document; the fourth acquisition unit 505 clusters the multiple documents with the PAM clustering algorithm according to the target similarity, to obtain the clustered document sets; and the fifth acquisition unit 506 obtains the target information of the multiple customers from the clustered document sets. This solves the problem that value information extracted from incoming-call work orders by manual annotation or natural-language-processing techniques is inaccurate. By calculating the weighted matrix model and the subject-word set of each document to compute the target similarity between the documents and then extracting the target information from them, the target information is extracted from the multiple documents automatically by algorithm, the impact of manual annotation on the quality of the target information is avoided, and more accurate target information is extracted from the multiple incoming-call work orders.

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the second acquisition unit 502 includes: a first processing subunit, configured to delete preset character strings from the multiple documents and to perform word segmentation and filtering on each document, to obtain the first corpus; a first calculation subunit, configured to substitute the first corpus into the word2vec model for calculation, to obtain the word vectors of each document; a second calculation subunit, configured to substitute the multiple documents into the TextRank model for calculation, to obtain the keyword set of each document; a first acquisition subunit, configured to sort the keywords in the keyword set of each document with a greedy selection algorithm according to the similarity between the keyword set of each document and that of every other document, to obtain the keyword vector of each document; a third calculation subunit, configured to replace the keywords in the keyword vector with word vectors and to calculate the weights of the word vectors in the keyword vector with the TF-IDF feature-selection function, to obtain the replaced keyword vector and the word-vector weights corresponding to its word vectors; and a second acquisition subunit, configured to obtain the weighted matrix model of each document from the replaced keyword vector and the word-vector weights.

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the second calculation subunit includes: a processing module, configured to perform first processing on each of the multiple documents and to take words of preset parts of speech as the candidate keywords of each document, where the first processing includes at least sentence segmentation, word segmentation, filtering, and part-of-speech tagging; a first acquisition module, configured to obtain the second corpus from each document and its candidate keywords; a first calculation module, configured to convert the candidate keywords of each document into a keyword directed graph with the TextRank model and to calculate the weight of each candidate keyword in each document, to obtain the weighted keyword directed graph of each document; and a second acquisition module, configured to obtain, from the weighted keyword directed graph, the candidate keywords whose weights are higher than a preset weight, to obtain the keyword set of each document.

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the third acquisition unit 503 includes: a configuration subunit, configured to configure, according to the TextRank model, the formula for the word-influence score of a word vector, as shown below:

Figure BDA0004078018520000171

where d is a preset damping coefficient; a denotes keyword a in the weighted keyword directed graph; In(a) denotes the set of all keywords pointing to keyword a in the graph; b denotes a keyword pointing to keyword a; Out(b) denotes the set of all keywords pointed to by keyword b; c denotes a keyword in Out(b); S(b) denotes the word influence of keyword b; tfidfa denotes the product of the term frequency and the inverse document frequency of keyword a; tfidfc denotes the product of the term frequency and the inverse document frequency of keyword c; wba denotes the weight of the edge between keyword b and keyword a in the graph; and wbc denotes the weight of the edge between keyword b and keyword c. The third acquisition unit 503 further includes: a second processing subunit, configured to perform second processing on each of the multiple documents to obtain the third corpus and the subject words in it, where the second processing includes at least word segmentation, filtering, initial subject-word extraction, and word-frequency statistics; and a fourth calculation subunit, configured to iteratively calculate, according to the word-influence score formula, the word-influence scores of the subject words in the third corpus, to stop the iteration when the difference between the Nth iteration and the (N-1)th iteration is less than a preset threshold, and to obtain, from the results of the Nth iteration, the subject words whose word-influence scores are higher than a preset score, yielding the subject-word set of each document.
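The iterative word-influence calculation can be sketched as follows. The exact formula is given only as a figure in the source; this sketch implements the standard weighted-TextRank recurrence named in the text, S(a) = (1 - d) + d * sum over b in In(a) of (w_ba / sum over c in Out(b) of w_bc) * S(b), and omits the TF-IDF factors from the figure, so it is only an approximation of the patented formula.

```python
def word_influence(out_edges, d=0.85, iters=100, eps=1e-6):
    # out_edges: {node: {successor: edge weight}} describing the
    # weighted keyword directed graph
    nodes = set(out_edges) | {v for nb in out_edges.values() for v in nb}
    S = {v: 1.0 for v in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            # contribution of every predecessor b in In(a), normalized
            # by the total outgoing weight of b
            rank = sum(nb[a] / sum(nb.values()) * S[b]
                       for b, nb in out_edges.items() if a in nb)
            new[a] = (1 - d) + d * rank
        # stop when the Nth and (N-1)th iterations differ by less than eps
        converged = max(abs(new[v] - S[v]) for v in nodes) < eps
        S = new
        if converged:
            break
    return S
```

After convergence, the subject words whose scores exceed the preset word-influence score would be retained as the subject-word set of the document.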

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the calculation unit 504 includes: a construction subunit, configured to construct a bipartite graph model from the subject-word set of each document, to calculate the first similarity between the subject-word set of each document and that of every other document; a fifth calculation subunit, configured to calculate, according to the least-squares distance formula for matrices, the second similarity between the weighted matrix model of each document and that of every other document; and a sixth calculation subunit, configured to calculate the target similarity between each document and every other document according to the first similarity and the second similarity.

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the construction subunit includes: a construction module, configured to construct the bipartite graph model between each document and every other document from their subject-word sets; and a second calculation module, configured to calculate the maximum weight of the bipartite matching between each document and every other document with the Hungarian algorithm, and to take this maximum matching weight as the first similarity between the subject-word sets of the two documents.

Optionally, in the customer-information processing apparatus provided in the embodiments of the present application, the fifth acquisition unit 506 includes: a third processing subunit, configured to process each clustered document set according to the method for obtaining subject-word sets, to obtain the subject-word set of each clustered document set; a fourth processing subunit, configured to process each clustered document set according to the method for obtaining keyword sets, to obtain the keyword set of each clustered document set; and a determination subunit, configured to determine the target information of the multiple customers according to the subject-word set and the keyword set of each clustered document set.

可选地,在本申请实施例提供的客户信息的处理装置中,上述装置还包括:第六获取单元,用于在依据主题词集合和加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度之后,将多个文档中每个文档与其他每个文档之间的目标相似度相加,得到每个文档的第三相似度;第七获取单元,用于从多个文档中,获取第三相似度高于预设相似度的目标文档,得到目标文档集合;推送单元,用于将目标文档集合推送至目标对象。Optionally, in the customer information processing device provided in the embodiments of the present application, the device further includes: a sixth acquisition unit, configured to, after the target similarity between each of the multiple documents and every other document has been calculated from the subject term sets and the weighted matrix models, add up the target similarities between each document and every other document to obtain a third similarity for each document; a seventh acquisition unit, configured to obtain, from the multiple documents, the target documents whose third similarity is higher than a preset similarity, so as to obtain a target document set; and a pushing unit, configured to push the target document set to a target object.

所述客户信息的处理装置包括处理器和存储器,上述第一获取单元501、第二获取单元502、第三获取单元503、计算单元504、第四获取单元505和第五获取单元506等均作为程序单元存储在存储器中,由处理器执行存储在存储器中的上述程序单元来实现相应的功能。The customer information processing device includes a processor and a memory. The above first acquisition unit 501, second acquisition unit 502, third acquisition unit 503, calculation unit 504, fourth acquisition unit 505, and fifth acquisition unit 506 are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

处理器中包含内核,由内核去存储器中调取相应的程序单元。内核可以设置一个或以上,通过调整内核参数来提高多个文档之间的相似度的准确性。The processor includes a kernel, which fetches the corresponding program units from the memory. One or more kernels may be provided, and the accuracy of the similarity between the multiple documents can be improved by adjusting the kernel parameters.

存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM),存储器包括至少一个存储芯片。The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

本发明实施例提供了一种计算机可读存储介质,其上存储有程序,该程序被处理器执行时实现客户信息的处理方法。An embodiment of the present invention provides a computer-readable storage medium on which a program is stored, and when the program is executed by a processor, a method for processing customer information is realized.

本发明实施例提供了一种处理器,处理器用于运行程序,其中,程序运行时执行客户信息的处理方法。An embodiment of the present invention provides a processor, and the processor is used to run a program, wherein the method for processing customer information is executed when the program is running.

如图6所示,本发明实施例提供了一种电子设备,设备包括处理器、存储器及存储在存储器上并可在处理器上运行的程序,处理器执行程序时实现以下步骤:对多个客户的多个来电工单中的客户信息进行编码处理,得到多个来电工单对应的多个文档,其中,多个文档的编码格式相同;获取多个文档中每个文档的加权矩阵模型;获取多个文档中每个文档的主题词集合;依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度;依据目标相似度,基于PAM聚类算法对多个文档进行聚类,得到聚类后的文档集合;从聚类后的文档集合中获取多个客户的目标信息。As shown in Figure 6, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps: encoding the customer information in multiple incoming-call work orders of multiple customers to obtain multiple documents corresponding to the multiple work orders, where the multiple documents share the same encoding format; obtaining a weighted matrix model of each of the multiple documents; obtaining a subject term set of each of the multiple documents; calculating the target similarity between each of the multiple documents and every other document according to the subject term set and the weighted matrix model of each document; clustering the multiple documents based on the PAM clustering algorithm according to the target similarity, to obtain clustered document sets; and obtaining the target information of the multiple customers from the clustered document sets.
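The PAM (k-medoids) clustering step above can be sketched as follows. This is a minimal brute-force variant operating on a small hypothetical distance matrix (here taken as 1 minus the target similarity); a production PAM implementation would use the BUILD/SWAP heuristic rather than exhaustive search.

```python
import itertools

def pam_cluster(dist, k):
    """Exhaustive k-medoids: choose the k medoids minimising the total
    distance from every document to its nearest medoid.
    `dist` is a symmetric n x n distance matrix."""
    n = len(dist)
    best_cost, best_medoids = float("inf"), None
    for medoids in itertools.combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # assign each document to its closest medoid
    labels = [min(best_medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return best_medoids, labels

# four documents: 0/1 are mutually similar, as are 2/3 (toy distances)
dist = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0]]
medoids, labels = pam_cluster(dist, 2)
```

With the toy matrix above, documents 0/1 and 2/3 end up in separate clusters, matching the intuition that work orders about the same issue group together.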

处理器执行程序时还实现以下步骤:获取多个文档中每个文档的加权矩阵模型包括:删除多个文档中的预设字符串,并对多个文档中每个文档进行分词、过滤处理,得到第一语料库;将第一语料库代入word2vec模型中进行计算,得到每个文档的词向量;将多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合;依据每个文档的关键词集合与其它每个文档的关键词集合的相似度,使用贪婪选择算法将每个文档的关键词集合中的关键词进行排序,获取每个文档的关键词向量;使用词向量替换关键词向量中的关键词,并使用TF-IDF特征选择函数计算关键词向量中的词向量的权值,得到替换后的关键词向量以及替换后的关键词向量中的词向量对应的词向量权值;依据替换后的关键词向量和词向量权值,获取每个文档的加权矩阵模型。When executing the program, the processor further implements the following steps. Obtaining the weighted matrix model of each of the multiple documents includes: deleting preset character strings from the multiple documents, and performing word segmentation and filtering on each document to obtain a first corpus; feeding the first corpus into a word2vec model to obtain the word vectors of each document; feeding the multiple documents into a TextRank model to obtain the keyword set of each document; sorting the keywords in the keyword set of each document with a greedy selection algorithm, according to the similarity between the keyword set of each document and the keyword set of every other document, to obtain the keyword vector of each document; replacing the keywords in the keyword vector with the word vectors, and calculating the weights of the word vectors in the keyword vector with a TF-IDF feature selection function, to obtain the replaced keyword vector and the word-vector weights corresponding to the word vectors in the replaced keyword vector; and obtaining the weighted matrix model of each document from the replaced keyword vector and the word-vector weights.
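The TF-IDF weighting used to build the weighted matrix model can be illustrated as below. The corpus, the keyword list, and the two-dimensional "word vectors" are toy stand-ins for the real segmented work orders and word2vec output; only the TF-IDF arithmetic follows the standard definition.

```python
import math

def tf_idf(term, doc, docs):
    """Plain TF-IDF: term frequency within `doc` times the log inverse
    document frequency over `docs` (each doc is a list of tokens)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)  # df > 0 because `term` occurs in `doc`
    return tf * idf

# toy corpus standing in for the segmented/filtered work-order documents
docs = [["转账", "失败", "转账"], ["转账", "到账"], ["密码", "重置"]]
# hypothetical 2-d "word vectors" standing in for word2vec output
vec = {"转账": [1.0, 0.0], "失败": [0.0, 1.0]}

# weighted matrix model of document 0: one row per keyword,
# each row = TF-IDF weight * word vector
doc = docs[0]
keywords = ["转账", "失败"]
weighted = [[tf_idf(w, doc, docs) * x for x in vec[w]] for w in keywords]
```

The resulting matrix scales common-but-document-specific words up and corpus-wide words down, which is what the weighted matrix model relies on.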

处理器执行程序时还实现以下步骤:将多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合包括:对多个文档中的每个文档进行第一处理,获取预设词性的词作为每个文档的候选关键词,其中,第一处理至少包括以下处理:分句处理、分词处理、过滤处理和标注词性处理;依据每个文档以及每个文档的候选关键词,获取第二语料库;使用TextRank模型将每个文档的候选关键词转换为关键词有向图,并计算候选关键词在每个文档中的权值,得到每个文档的加权关键词有向图;从加权关键词有向图中,获取权值高于预设权值的候选关键词,得到每个文档的关键词集合。When executing the program, the processor further implements the following steps. Feeding the multiple documents into the TextRank model to obtain the keyword set of each document includes: performing first processing on each of the multiple documents and taking the words with preset parts of speech as the candidate keywords of each document, where the first processing includes at least sentence segmentation, word segmentation, filtering, and part-of-speech tagging; obtaining a second corpus from each document and its candidate keywords; converting the candidate keywords of each document into a keyword directed graph with the TextRank model, and calculating the weight of each candidate keyword in each document, to obtain a weighted keyword directed graph for each document; and obtaining, from the weighted keyword directed graph, the candidate keywords whose weights are higher than a preset weight, to obtain the keyword set of each document.
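A minimal sketch of the TextRank iteration over a keyword graph. The three-word graph below is hypothetical, and this is the plain unweighted PageRank-style update; the variant described later in this document additionally weights edges with TF-IDF.

```python
def textrank(graph, d=0.85, tol=1e-6):
    """Plain TextRank: iterate the PageRank-style update until no score
    changes by more than `tol`.
    `graph` maps each candidate keyword to the list of words it links to."""
    incoming = {w: [] for w in graph}
    for w, outs in graph.items():
        for o in outs:
            incoming[o].append(w)
    scores = {w: 1.0 for w in graph}
    while True:
        new = {w: (1 - d) + d * sum(scores[b] / len(graph[b])
                                    for b in incoming[w])
               for w in graph}
        if max(abs(new[w] - scores[w]) for w in graph) < tol:
            return new
        scores = new

# tiny co-occurrence graph over candidate keywords (edges in both directions)
graph = {"转账": ["失败", "到账"], "失败": ["转账"], "到账": ["转账"]}
scores = textrank(graph)
top = max(scores, key=scores.get)
```

The word with the most in-links ("转账") accumulates the highest score, which is the basis for selecting keywords above a preset weight.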

处理器执行程序时还实现以下步骤:获取多个文档中每个文档的主题词集合包括:依据TextRank模型配置词向量的词影响力得分的计算公式,计算公式如下所示:

S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / (Σ_{c∈Out(b)} w_bc × tfidf_c) ] × S(b)
其中,d代表预设的阻尼系数,a表示在加权关键词有向图中的关键词a,Out(b)表示在加权关键词有向图中关键词b指向的所有关键词的集合,In(a)表示在加权关键词有向图中所有指向关键词a的关键词的集合,b表示指向关键词a的关键词b,c表示关键词b指向的关键词c,S(b)表示关键词b的词影响力,tfidf_a表示关键词a的词频和逆文档频率的乘积,tfidf_c表示关键词c的词频和逆文档频率的乘积,w_ba表示在加权关键词有向图中关键词a和关键词b的权值,w_bc表示在加权关键词有向图中关键词b和关键词c的权值;对多个文档中的每个文档进行第二处理,获取第三语料库以及第三语料库中的主题词,其中,第二处理至少包括以下处理:分词处理、过滤处理、提取初始主题词处理和统计词频处理;依据词影响力得分的计算公式,迭代计算第三语料库中的主题词的词影响力得分,当第N次迭代计算中与第N-1次迭代的差值小于预设阈值时停止迭代计算,从第N次迭代计算的计算结果中,获取词影响力得分高于预设词影响力得分的主题词,得到每个文档的主题词集合。When executing the program, the processor further implements the following steps. Obtaining the subject term set of each of the multiple documents includes: configuring, based on the TextRank model, a calculation formula for the word influence score of the word vectors, as follows:
S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / (Σ_{c∈Out(b)} w_bc × tfidf_c) ] × S(b)
Here, d denotes a preset damping coefficient; a denotes keyword a in the weighted keyword directed graph; Out(b) denotes the set of all keywords that keyword b points to in the weighted keyword directed graph; In(a) denotes the set of all keywords pointing to keyword a in the weighted keyword directed graph; b denotes a keyword b pointing to keyword a; c denotes a keyword c that keyword b points to; S(b) denotes the word influence of keyword b; tfidf_a denotes the product of the term frequency and the inverse document frequency of keyword a; tfidf_c denotes the product of the term frequency and the inverse document frequency of keyword c; w_ba denotes the weight of the edge between keyword b and keyword a in the weighted keyword directed graph; and w_bc denotes the weight of the edge between keyword b and keyword c in the weighted keyword directed graph. Second processing is performed on each of the multiple documents to obtain a third corpus and the subject terms in the third corpus, where the second processing includes at least word segmentation, filtering, initial subject-term extraction, and word-frequency statistics. According to the calculation formula of the word influence score, the word influence scores of the subject terms in the third corpus are calculated iteratively; the iteration stops when the difference between the N-th and the (N-1)-th iteration is smaller than a preset threshold; from the results of the N-th iteration, the subject terms whose word influence scores are higher than a preset word influence score are taken as the subject term set of each document.
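The iterative calculation described above can be sketched directly from the formula. The edge weights and tfidf values below are made-up toy numbers; the update rule itself follows the formula term by term.

```python
def influence_scores(edges, tfidf, d=0.85, eps=1e-6):
    """Iterate the weighted influence formula until the largest change
    between two rounds falls below `eps` (the preset threshold).
    `edges[(b, a)]` is the edge weight w_ba from keyword b to keyword a."""
    words = sorted(tfidf)
    out = {w: [a for (b, a) in edges if b == w] for w in words}
    incoming = {w: [b for (b, a) in edges if a == w] for w in words}
    s = {w: 1.0 for w in words}
    while True:
        new = {}
        for a in words:
            total = 0.0
            for b in incoming[a]:
                denom = sum(edges[(b, c)] * tfidf[c] for c in out[b])
                total += edges[(b, a)] * tfidf[a] / denom * s[b]
            new[a] = (1 - d) + d * total
        if max(abs(new[w] - s[w]) for w in words) < eps:
            return new
        s = new

# toy weighted keyword directed graph and tfidf values
edges = {("b", "a"): 2.0, ("b", "c"): 1.0, ("a", "b"): 1.0, ("c", "b"): 1.0}
tfidf = {"a": 0.5, "b": 0.4, "c": 0.2}
scores = influence_scores(edges, tfidf)
```

Note how "a" and "c" both receive influence only from "b", but "a" ends up higher because its tfidf value claims a larger share of b's outgoing weight.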

处理器执行程序时还实现以下步骤:依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度包括:依据每个文档的主题词集合构造二部图模型,以计算每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度;依据矩阵的最小二乘距离公式,计算每个文档的加权矩阵模型与其他每个文档的加权矩阵模型之间的第二相似度;依据第一相似度和第二相似度,计算每个文档与其它每个文档之间的目标相似度。When executing the program, the processor further implements the following steps. Calculating the target similarity between each of the multiple documents and every other document according to the subject term set and the weighted matrix model of each document includes: constructing a bipartite graph model from the subject term set of each document, to calculate a first similarity between the subject term set of each document and the subject term set of every other document; calculating a second similarity between the weighted matrix model of each document and the weighted matrix model of every other document according to the least-squares distance formula for matrices; and calculating the target similarity between each document and every other document from the first similarity and the second similarity.

处理器执行程序时还实现以下步骤:依据每个文档的主题词集合构造二部图模型,以计算每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度包括:依据每个文档的主题词集合,构建每个文档和其它每个文档的二部图模型;使用匈牙利算法计算每个文档和其它每个文档之间的二部匹配最大权值,将二部匹配最大权值作为每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度。When executing the program, the processor further implements the following steps. Constructing a bipartite graph model from the subject term set of each document, to calculate the first similarity between the subject term set of each document and the subject term set of every other document, includes: constructing a bipartite graph model between each document and every other document according to the subject term set of each document; and calculating the maximum bipartite matching weight between each document and every other document using the Hungarian algorithm, and taking the maximum bipartite matching weight as the first similarity between the subject term set of each document and the subject term set of every other document.
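The first-similarity computation via maximum-weight bipartite matching can be sketched as follows. Permutation brute force stands in for the Hungarian algorithm here (fine for the small subject-term sets involved), and equal-sized term sets are assumed; the pairwise term similarities are hypothetical.

```python
import itertools

def max_bipartite_weight(w):
    """Maximum-weight bipartite matching between two subject-term sets.
    `w[i][j]` is the similarity of term i (doc A) to term j (doc B).
    Brute force over all assignments -- a stand-in for the Hungarian
    algorithm, which finds the same optimum in polynomial time."""
    n = len(w)
    return max(sum(w[i][p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

# hypothetical pairwise similarities between two documents' subject terms
w = [[0.9, 0.1, 0.0],
     [0.2, 0.8, 0.3],
     [0.1, 0.4, 0.7]]
first_similarity = max_bipartite_weight(w)  # diagonal matching: 0.9+0.8+0.7
```

The optimal matching pairs each subject term of one document with its closest counterpart in the other, so the first similarity rewards documents whose term sets line up well.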

处理器执行程序时还实现以下步骤:从聚类后的文档集合中获取多个客户的目标信息包括:依据获取主题词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的主题词集合;依据获取关键词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的关键词集合;依据每个聚类后的文档集合的主题词集合和每个聚类后的文档集合的关键词集合,确定多个客户的目标信息。When executing the program, the processor further implements the following steps. Obtaining the target information of the multiple customers from the clustered document sets includes: processing each clustered document set according to the method for obtaining a subject term set, to obtain the subject term set of each clustered document set; processing each clustered document set according to the method for obtaining a keyword set, to obtain the keyword set of each clustered document set; and determining the target information of the multiple customers according to the subject term set and the keyword set of each clustered document set.

处理器执行程序时还实现以下步骤:在依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度之后,上述方法还包括:将多个文档中每个文档与其他每个文档之间的目标相似度相加,得到每个文档的第三相似度;从多个文档中,获取第三相似度高于预设相似度的目标文档,得到目标文档集合;将目标文档集合推送至目标对象。When executing the program, the processor further implements the following steps. After the target similarity between each of the multiple documents and every other document is calculated from the subject term set and the weighted matrix model of each document, the method further includes: adding up the target similarities between each document and every other document to obtain a third similarity for each document; obtaining, from the multiple documents, the target documents whose third similarity is higher than a preset similarity, to obtain a target document set; and pushing the target document set to a target object.
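The third-similarity selection step above can be sketched as follows; the similarity matrix and the threshold are hypothetical values.

```python
def select_target_documents(sim, threshold):
    """Third similarity of a document = sum of its target similarities to
    every other document; keep the documents above `threshold`."""
    n = len(sim)
    third = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    return [i for i in range(n) if third[i] > threshold], third

# toy pairwise target similarities for three documents
sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.05],
       [0.2, 0.05, 1.0]]
targets, third = select_target_documents(sim, 1.0)
```

Documents with a high third similarity are similar to many other documents, i.e. they represent a common issue worth pushing to the target object.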

本文中的设备可以是服务器、PC、PAD、手机等。The devices in this article can be servers, PCs, PADs, mobile phones, etc.

本申请还提供了一种计算机程序产品,当在数据处理设备上执行时,适于执行初始化有如下方法步骤的程序:对多个客户的多个来电工单中的客户信息进行编码处理,得到多个来电工单对应的多个文档,其中,多个文档的编码格式相同;获取多个文档中每个文档的加权矩阵模型;获取多个文档中每个文档的主题词集合;依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度;依据目标相似度,基于PAM聚类算法对多个文档进行聚类,得到聚类后的文档集合;从聚类后的文档集合中获取多个客户的目标信息。The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: encoding the customer information in multiple incoming-call work orders of multiple customers to obtain multiple documents corresponding to the multiple work orders, where the multiple documents share the same encoding format; obtaining a weighted matrix model of each of the multiple documents; obtaining a subject term set of each of the multiple documents; calculating the target similarity between each of the multiple documents and every other document according to the subject term set and the weighted matrix model of each document; clustering the multiple documents based on the PAM clustering algorithm according to the target similarity, to obtain clustered document sets; and obtaining the target information of the multiple customers from the clustered document sets.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:获取多个文档中每个文档的加权矩阵模型包括:删除多个文档中的预设字符串,并对多个文档中每个文档进行分词、过滤处理,得到第一语料库;将第一语料库代入word2vec模型中进行计算,得到每个文档的词向量;将多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合;依据每个文档的关键词集合与其它每个文档的关键词集合的相似度,使用贪婪选择算法将每个文档的关键词集合中的关键词进行排序,获取每个文档的关键词向量;使用词向量替换关键词向量中的关键词,并使用TF-IDF特征选择函数计算关键词向量中的词向量的权值,得到替换后的关键词向量以及替换后的关键词向量中的词向量对应的词向量权值;依据替换后的关键词向量和词向量权值,获取每个文档的加权矩阵模型。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Obtaining the weighted matrix model of each of the multiple documents includes: deleting preset character strings from the multiple documents, and performing word segmentation and filtering on each document to obtain a first corpus; feeding the first corpus into a word2vec model to obtain the word vectors of each document; feeding the multiple documents into a TextRank model to obtain the keyword set of each document; sorting the keywords in the keyword set of each document with a greedy selection algorithm, according to the similarity between the keyword set of each document and the keyword set of every other document, to obtain the keyword vector of each document; replacing the keywords in the keyword vector with the word vectors, and calculating the weights of the word vectors in the keyword vector with a TF-IDF feature selection function, to obtain the replaced keyword vector and the word-vector weights corresponding to the word vectors in the replaced keyword vector; and obtaining the weighted matrix model of each document from the replaced keyword vector and the word-vector weights.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:将多个文档代入TextRank模型中进行计算,得到每个文档的关键词集合包括:对多个文档中的每个文档进行第一处理,获取预设词性的词作为每个文档的候选关键词,其中,第一处理至少包括以下处理:分句处理、分词处理、过滤处理和标注词性处理;依据每个文档以及每个文档的候选关键词,获取第二语料库;使用TextRank模型将每个文档的候选关键词转换为关键词有向图,并计算候选关键词在每个文档中的权值,得到每个文档的加权关键词有向图;从加权关键词有向图中,获取权值高于预设权值的候选关键词,得到每个文档的关键词集合。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Feeding the multiple documents into the TextRank model to obtain the keyword set of each document includes: performing first processing on each of the multiple documents and taking the words with preset parts of speech as the candidate keywords of each document, where the first processing includes at least sentence segmentation, word segmentation, filtering, and part-of-speech tagging; obtaining a second corpus from each document and its candidate keywords; converting the candidate keywords of each document into a keyword directed graph with the TextRank model, and calculating the weight of each candidate keyword in each document, to obtain a weighted keyword directed graph for each document; and obtaining, from the weighted keyword directed graph, the candidate keywords whose weights are higher than a preset weight, to obtain the keyword set of each document.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:获取多个文档中每个文档的主题词集合包括:依据TextRank模型配置词向量的词影响力得分的计算公式,计算公式如下所示:

S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / (Σ_{c∈Out(b)} w_bc × tfidf_c) ] × S(b)
其中,d代表预设的阻尼系数,a表示在加权关键词有向图中的关键词a,Out(b)表示在加权关键词有向图中关键词b指向的所有关键词的集合,In(a)表示在加权关键词有向图中所有指向关键词a的关键词的集合,b表示指向关键词a的关键词b,c表示关键词b指向的关键词c,S(b)表示关键词b的词影响力,tfidf_a表示关键词a的词频和逆文档频率的乘积,tfidf_c表示关键词c的词频和逆文档频率的乘积,w_ba表示在加权关键词有向图中关键词a和关键词b的权值,w_bc表示在加权关键词有向图中关键词b和关键词c的权值;对多个文档中的每个文档进行第二处理,获取第三语料库以及第三语料库中的主题词,其中,第二处理至少包括以下处理:分词处理、过滤处理、提取初始主题词处理和统计词频处理;依据词影响力得分的计算公式,迭代计算第三语料库中的主题词的词影响力得分,当第N次迭代计算中与第N-1次迭代的差值小于预设阈值时停止迭代计算,从第N次迭代计算的计算结果中,获取词影响力得分高于预设词影响力得分的主题词,得到每个文档的主题词集合。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Obtaining the subject term set of each of the multiple documents includes: configuring, based on the TextRank model, a calculation formula for the word influence score of the word vectors, as follows:
S(a) = (1 - d) + d × Σ_{b∈In(a)} [ (w_ba × tfidf_a) / (Σ_{c∈Out(b)} w_bc × tfidf_c) ] × S(b)
Here, d denotes a preset damping coefficient; a denotes keyword a in the weighted keyword directed graph; Out(b) denotes the set of all keywords that keyword b points to in the weighted keyword directed graph; In(a) denotes the set of all keywords pointing to keyword a in the weighted keyword directed graph; b denotes a keyword b pointing to keyword a; c denotes a keyword c that keyword b points to; S(b) denotes the word influence of keyword b; tfidf_a denotes the product of the term frequency and the inverse document frequency of keyword a; tfidf_c denotes the product of the term frequency and the inverse document frequency of keyword c; w_ba denotes the weight of the edge between keyword b and keyword a in the weighted keyword directed graph; and w_bc denotes the weight of the edge between keyword b and keyword c in the weighted keyword directed graph. Second processing is performed on each of the multiple documents to obtain a third corpus and the subject terms in the third corpus, where the second processing includes at least word segmentation, filtering, initial subject-term extraction, and word-frequency statistics. According to the calculation formula of the word influence score, the word influence scores of the subject terms in the third corpus are calculated iteratively; the iteration stops when the difference between the N-th and the (N-1)-th iteration is smaller than a preset threshold; from the results of the N-th iteration, the subject terms whose word influence scores are higher than a preset word influence score are taken as the subject term set of each document.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度包括:依据每个文档的主题词集合构造二部图模型,以计算每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度;依据矩阵的最小二乘距离公式,计算每个文档的加权矩阵模型与其他每个文档的加权矩阵模型之间的第二相似度;依据第一相似度和第二相似度,计算每个文档与其它每个文档之间的目标相似度。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Calculating the target similarity between each of the multiple documents and every other document according to the subject term set and the weighted matrix model of each document includes: constructing a bipartite graph model from the subject term set of each document, to calculate a first similarity between the subject term set of each document and the subject term set of every other document; calculating a second similarity between the weighted matrix model of each document and the weighted matrix model of every other document according to the least-squares distance formula for matrices; and calculating the target similarity between each document and every other document from the first similarity and the second similarity.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:依据每个文档的主题词集合构造二部图模型,以计算每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度包括:依据每个文档的主题词集合,构建每个文档和其它每个文档的二部图模型;使用匈牙利算法计算每个文档和其它每个文档之间的二部匹配最大权值,将二部匹配最大权值作为每个文档的主题词集合与其它每个文档的主题词集合之间的第一相似度。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Constructing a bipartite graph model from the subject term set of each document, to calculate the first similarity between the subject term set of each document and the subject term set of every other document, includes: constructing a bipartite graph model between each document and every other document according to the subject term set of each document; and calculating the maximum bipartite matching weight between each document and every other document using the Hungarian algorithm, and taking the maximum bipartite matching weight as the first similarity between the subject term set of each document and the subject term set of every other document.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:从聚类后的文档集合中获取多个客户的目标信息包括:依据获取主题词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的主题词集合;依据获取关键词集合的方法,对每个聚类后的文档集合进行处理,获取每个聚类后的文档集合的关键词集合;依据每个聚类后的文档集合的主题词集合和每个聚类后的文档集合的关键词集合,确定多个客户的目标信息。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. Obtaining the target information of the multiple customers from the clustered document sets includes: processing each clustered document set according to the method for obtaining a subject term set, to obtain the subject term set of each clustered document set; processing each clustered document set according to the method for obtaining a keyword set, to obtain the keyword set of each clustered document set; and determining the target information of the multiple customers according to the subject term set and the keyword set of each clustered document set.

当在数据处理设备上执行时,还适于执行初始化有如下方法步骤的程序:在依据每个文档的主题词集合和每个文档的加权矩阵模型,计算多个文档中每个文档与其它每个文档之间的目标相似度之后,上述方法还包括:将多个文档中每个文档与其他每个文档之间的目标相似度相加,得到每个文档的第三相似度;从多个文档中,获取第三相似度高于预设相似度的目标文档,得到目标文档集合;将目标文档集合推送至目标对象。When executed on the data processing device, the program product is further adapted to execute a program initialized with the following method steps. After the target similarity between each of the multiple documents and every other document is calculated from the subject term set and the weighted matrix model of each document, the method further includes: adding up the target similarities between each document and every other document to obtain a third similarity for each document; obtaining, from the multiple documents, the target documents whose third similarity is higher than a preset similarity, to obtain a target document set; and pushing the target document set to a target object.

本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。存储器是计算机可读介质的示例。Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read only memory (ROM) or flash RAM. The memory is an example of a computer readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can be implemented by any method or technology for storage of information. Information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.

还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.

本领域技术人员应明白,本申请的实施例可提供为方法、系统或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems or computer program products. Accordingly, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

以上仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。The above are only examples of the present application, and are not intended to limit the present application. For those skilled in the art, various modifications and changes may occur in this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A method for processing customer information, comprising:
encoding the customer information contained in a plurality of incoming-call work orders of a plurality of customers to obtain a plurality of documents corresponding to the work orders, wherein the plurality of documents share the same encoding format;
obtaining a weighted matrix model of each of the plurality of documents;
obtaining a subject-term set of each of the plurality of documents;
calculating, according to the subject-term set and the weighted matrix model of each document, a target similarity between each document and every other document;
clustering the plurality of documents according to the target similarity using the PAM clustering algorithm, to obtain clustered document sets;
obtaining target information of the plurality of customers from the clustered document sets.

2.
The method according to claim 1, wherein obtaining the weighted matrix model of each of the plurality of documents comprises:
deleting preset character strings from the plurality of documents, and performing word segmentation and filtering on each document, to obtain a first corpus;
feeding the first corpus into a word2vec model to obtain the word vectors of each document;
feeding the plurality of documents into a TextRank model to obtain the keyword set of each document;
sorting the keywords in the keyword set of each document with a greedy selection algorithm, according to the similarity between the keyword set of each document and the keyword sets of the other documents, to obtain the keyword vector of each document;
replacing the keywords in the keyword vector with the word vectors, and computing the weights of the word vectors in the keyword vector with a TF-IDF feature-selection function, to obtain the replaced keyword vector and the word-vector weights corresponding to the word vectors in it;
obtaining the weighted matrix model of each document from the replaced keyword vector and the word-vector weights.

3.
The method according to claim 2, wherein feeding the plurality of documents into the TextRank model to obtain the keyword set of each document comprises:
performing first processing on each of the plurality of documents and taking words of preset parts of speech as the candidate keywords of each document, wherein the first processing comprises at least: sentence segmentation, word segmentation, filtering, and part-of-speech tagging;
obtaining a second corpus from each document and its candidate keywords;
converting the candidate keywords of each document into a keyword directed graph using the TextRank model, and computing the weight of each candidate keyword in each document, to obtain a weighted keyword directed graph of each document;
obtaining, from the weighted keyword directed graph, the candidate keywords whose weights exceed a preset weight, to obtain the keyword set of each document.

4. The method according to claim 3, wherein obtaining the subject-term set of each of the plurality of documents comprises:
configuring, according to the TextRank model, a formula for the word-influence score of a word vector, as follows:
$$D(a) = (1-d) + d \cdot \sum_{b \in In(a)} \frac{tfidf_a \cdot w_{ba}}{\sum_{c \in Out(b)} tfidf_c \cdot w_{bc}} \cdot D(b)$$
where d denotes the preset damping coefficient; a denotes a keyword in the weighted keyword directed graph; In(a) denotes the set of all keywords that point to keyword a; Out(b) denotes the set of all keywords that keyword b points to; b denotes a keyword pointing to keyword a; c denotes a keyword pointed to by keyword b; D(b) denotes the word influence of keyword b; tfidf_a denotes the product of the term frequency and the inverse document frequency of keyword a, and tfidf_c the same product for keyword c; w_ba denotes the weight between keywords b and a, and w_bc the weight between keywords b and c, in the weighted keyword directed graph;
performing second processing on each of the plurality of documents to obtain a third corpus and the subject terms in the third corpus, wherein the second processing comprises at least: word segmentation, filtering, initial subject-term extraction, and word-frequency counting;
iteratively computing, by the above formula, the word-influence score of each subject term in the third corpus, stopping when the difference between the Nth and the (N-1)th iteration falls below a preset threshold, and obtaining, from the result of the Nth iteration, the subject terms whose word-influence scores exceed a preset score, to obtain the subject-term set of each document.
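The iterated word-influence computation of claim 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the toy graph in the usage example, and the defaults for the damping coefficient, convergence threshold, and iteration cap are all assumptions.

```python
# Hedged sketch of the word-influence iteration from claim 4.
# d (damping), eps (stopping threshold), and max_iter are illustrative defaults.

def word_influence(nodes, edges, weights, tfidf, d=0.85, eps=1e-6, max_iter=200):
    """Iterate D(a) = (1-d) + d * sum_{b in In(a)} tfidf_a*w_ba
    / (sum_{c in Out(b)} tfidf_c*w_bc) * D(b) until the largest
    per-keyword change between two iterations falls below eps."""
    # In(a): keywords whose outgoing edges point at a
    incoming = {a: {b for b in nodes if a in edges.get(b, set())} for a in nodes}
    scores = {a: 1.0 for a in nodes}  # initial word-influence scores
    for _ in range(max_iter):
        new_scores = {}
        for a in nodes:
            acc = 0.0
            for b in incoming[a]:
                denom = sum(tfidf[c] * weights[(b, c)] for c in edges[b])
                if denom > 0:
                    acc += tfidf[a] * weights[(b, a)] / denom * scores[b]
            new_scores[a] = (1 - d) + d * acc
        delta = max(abs(new_scores[a] - scores[a]) for a in nodes)
        scores = new_scores
        if delta < eps:  # claim 4's stopping rule: iteration change below a threshold
            break
    return scores
```

On a toy symmetric chain a–b–c with unit edge weights and unit tf-idf values, the middle keyword b, having two in-links, ends up with the highest influence score, which matches the intuition behind selecting high-scoring subject terms.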
5. The method according to claim 1, wherein calculating the target similarity between each of the plurality of documents and every other document according to the subject-term set and the weighted matrix model of each document comprises:
constructing a bipartite graph model from the subject-term set of each document, to compute a first similarity between the subject-term set of each document and that of every other document;
computing, by the least-squares distance formula for matrices, a second similarity between the weighted matrix model of each document and that of every other document;
computing the target similarity between each document and every other document from the first similarity and the second similarity.

6.
The method according to claim 5, wherein constructing a bipartite graph model from the subject-term set of each document, to compute the first similarity between the subject-term set of each document and that of every other document, comprises:
constructing the bipartite graph model between each document and every other document from their subject-term sets;
computing, with the Hungarian algorithm, the maximum bipartite-matching weight between each document and every other document, and taking that maximum matching weight as the first similarity between their subject-term sets.

7. The method according to claim 1, wherein obtaining the target information of the plurality of customers from the clustered document sets comprises:
processing each clustered document set by the method for obtaining a subject-term set, to obtain the subject-term set of each clustered document set;
processing each clustered document set by the method for obtaining a keyword set, to obtain the keyword set of each clustered document set;
determining the target information of the plurality of customers from the subject-term set and the keyword set of each clustered document set.

8.
The method according to claim 1, further comprising, after calculating the target similarity between each of the plurality of documents and every other document according to the subject-term set and the weighted matrix model of each document:
summing the target similarities between each document and every other document, to obtain a third similarity of each document;
obtaining, from the plurality of documents, the target documents whose third similarity exceeds a preset similarity, to obtain a target document set;
pushing the target document set to a target object.

9. An apparatus for processing customer information, comprising:
a first acquisition unit, configured to encode the customer information contained in a plurality of incoming-call work orders of a plurality of customers to obtain a plurality of documents corresponding to the work orders, wherein the plurality of documents share the same encoding format;
a second acquisition unit, configured to obtain the weighted matrix model of each of the plurality of documents;
a third acquisition unit, configured to obtain the subject-term set of each of the plurality of documents;
a calculation unit, configured to calculate, according to the subject-term set and the weighted matrix model of each document, the target similarity between each document and every other document;
a fourth acquisition unit, configured to cluster the plurality of documents based on the PAM clustering algorithm according to the target
similarity, to obtain clustered document sets;
a fifth acquisition unit, configured to obtain the target information of the plurality of customers from the clustered document sets.

10. An electronic device, comprising one or more processors and a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for processing customer information according to any one of claims 1 to 8.
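The clustering step shared by claims 1 and 9 — grouping documents by a precomputed target similarity with a PAM-style (k-medoids) algorithm — can be sketched as below. The similarity-to-distance conversion, the sample matrix, and all names are illustrative assumptions, not the patent's implementation.

```python
# Hedged sketch of PAM-style (k-medoids) clustering over a precomputed
# target-similarity matrix. Converting similarity to distance as 1 - sim
# is an assumption for illustration.
import random

def pam_cluster(sim, k, max_iter=50, seed=0):
    """sim: symmetric n x n matrix, higher values mean more similar documents."""
    n = len(sim)
    dist = [[1.0 - sim[i][j] for j in range(n)] for i in range(n)]
    medoids = random.Random(seed).sample(range(n), k)  # initial medoids
    for _ in range(max_iter):
        # assignment step: each document joins its nearest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # update step: each cluster's new medoid minimizes total in-cluster distance
        new_medoids = [min(members, key=lambda c: sum(dist[c][o] for o in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            break  # medoids stable: clustering has converged
        medoids = new_medoids
    return clusters
```

With a 4-document similarity matrix containing two obvious blocks (documents 0–1 alike, documents 2–3 alike) and k=2, the sketch recovers the two blocks regardless of which documents are sampled as initial medoids.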
CN202310114425.4A 2023-01-29 2023-01-29 Customer information processing method, device and electronic equipment Pending CN116108181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310114425.4A CN116108181A (en) 2023-01-29 2023-01-29 Customer information processing method, device and electronic equipment


Publications (1)

Publication Number Publication Date
CN116108181A true CN116108181A (en) 2023-05-12

Family

ID=86253966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310114425.4A Pending CN116108181A (en) 2023-01-29 2023-01-29 Customer information processing method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116108181A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117172792A (en) * 2023-11-02 2023-12-05 赞塔(杭州)科技有限公司 Customer information management method and device

Citations (9)

Publication number Priority date Publication date Assignee Title
CN105630751A (en) * 2015-12-28 2016-06-01 厦门优芽网络科技有限公司 Method and system for rapidly comparing text content
CN106227815A (en) * 2015-07-22 2016-12-14 Tcl集团股份有限公司 The personalized application program function of a kind of multi-modal clue recommends method and system thereof
CN106547739A (en) * 2016-11-03 2017-03-29 同济大学 A kind of text semantic similarity analysis method
CN108090077A (en) * 2016-11-23 2018-05-29 中国科学院沈阳计算技术研究所有限公司 A kind of comprehensive similarity computational methods based on natural language searching
CN108304382A (en) * 2018-01-25 2018-07-20 山东大学 Mass analysis method based on manufacturing process text data digging and system
CN110059311A (en) * 2019-03-27 2019-07-26 银江股份有限公司 A kind of keyword extracting method and system towards judicial style data
CN111104794A (en) * 2019-12-25 2020-05-05 同方知网(北京)技术有限公司 Text similarity matching method based on subject words
CN113158647A (en) * 2021-04-30 2021-07-23 中国工商银行股份有限公司 Processing method and device of customer service work order and server
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium


Non-Patent Citations (2)

Title
张莉婧, 李业丽, 曾庆涛, 雷嘉丽, 杨鹏: "基于改进TextRank的关键词抽取算法" (A keyword extraction algorithm based on improved TextRank), 北京印刷学院学报, no. 04, 30 August 2016 *
魏, 孙先朋: "融合统计学和TextRank的生物医学文献关键短语抽取" (Key-phrase extraction from biomedical literature fusing statistics and TextRank), 计算机应用与软件, no. 06, 15 June 2017 *


Similar Documents

Publication Publication Date Title
US10740545B2 (en) Information extraction from open-ended schema-less tables
Passos et al. Lexicon infused phrase embeddings for named entity resolution
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
US8577882B2 (en) Method and system for searching multilingual documents
CN113961685A (en) Information extraction method and device
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
US20130060769A1 (en) System and method for identifying social media interactions
CN108304378A (en) Text similarity computing method, apparatus, computer equipment and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN106407182A (en) A method for automatic abstracting for electronic official documents of enterprises
CN107885717B (en) Keyword extraction method and device
CN114003721A (en) Construction method, device and application of dispute event type classification model
CN114722137A (en) Security policy configuration method, device and electronic device based on sensitive data identification
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
CN110297880A (en) Recommended method, device, equipment and the storage medium of corpus product
CN108319583A (en) Method and system for extracting knowledge from Chinese language material library
CN112417845B (en) Text evaluation method, device, electronic device and storage medium
CN115934926A (en) Information extraction method, device, computer equipment, storage medium
CN115794995A (en) Target answer obtaining method and related device, electronic equipment and storage medium
CN117454220A (en) Data hierarchical classification method, device, equipment and storage medium
CN114255067A (en) Data pricing method and device, electronic equipment and storage medium
CN118747293A (en) Document writing intelligent recall method and device and document generation method and device
CN113836941B (en) Contract navigation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination