
WO2020114100A1 - Information processing method and apparatus, and computer storage medium - Google Patents

Information processing method and apparatus, and computer storage medium

Info

Publication number
WO2020114100A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
clustering
preset
similarity
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/111747
Other languages
French (fr)
Chinese (zh)
Inventor
李鹏
牛国扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Publication of WO2020114100A1 publication Critical patent/WO2020114100A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Definitions

  • the present invention relates to the field of computer technology, and in particular, to an information processing method, device, and computer storage medium.
  • Text clustering technology is a key core technology; it has become an important means of effectively organizing, summarizing, and navigating text information.
  • Unsupervised machine learning provides a number of clustering techniques, including partition-based methods, hierarchical clustering methods, density-based methods, grid-based methods, model-based methods, self-organizing map neural network methods, ant colony-based methods, and so on.
  • the complexity of these methods is relatively high, and it is difficult to deal with large-scale text clustering.
  • The current processing scheme is to first use a clustering algorithm for clustering, and then further process the residual text that failed to cluster in the previous pass.
  • This clustering approach is neither complementary nor progressive; it is simply a "chimney" process in which each clustering pass actually uses a different method or standard, making the final clustering results inconsistent.
  • embodiments of the present invention provide an information processing method, device, and computer storage medium.
  • An embodiment of the present invention provides an information processing method.
  • The method includes: performing clustering on the text in an original text set by using a preset clustering method to obtain a plurality of first cluster sets; and clustering the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets; the preset clustering method is the first preset clustering method or the second preset clustering method.
  • An embodiment of the present invention also provides an information processing apparatus. The apparatus includes: a first clusterer, configured to perform clustering processing on the text in the original text set by using a preset clustering method to obtain multiple first cluster sets; and a second clusterer, configured to cluster the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets; wherein the preset clustering mode is the first preset clustering mode or the second preset clustering mode.
  • An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method according to the embodiment of the present invention are implemented.
  • An embodiment of the present invention also provides an information processing apparatus, including: a processor and a memory for storing a computer program that can be run on the processor, where the processor is used to execute the present invention when the computer program is run.
  • An embodiment of the present invention discloses an information processing method.
  • The method includes: performing clustering processing on the text in an original text set by using a preset clustering method to obtain a plurality of first cluster sets; and clustering the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets; wherein the preset clustering method is the first preset clustering method or the second preset clustering method.
  • FIG. 1 is a first schematic flowchart of an information processing method according to an embodiment of the present invention
  • FIG. 2 is a second schematic flowchart of an information processing method according to an embodiment of the present invention.
  • FIG. 3 is a third schematic flowchart of an information processing method according to an embodiment of the present invention.
  • FIG. 4 is a fourth schematic flowchart of an information processing method according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present invention.
  • An embodiment of the present invention provides an information processing method. As shown in FIG. 1, the method includes:
  • Step 110 Perform clustering on the text in the original text set by using a preset clustering method to obtain multiple first cluster sets.
  • the text in the original text collection may be a massive amount of data acquired in different application systems such as a digital library, an information retrieval database, and so on.
  • The text may be divided according to a preset standard. As an example, depending on the application scenario, one sentence, ten sentences, or one paragraph may be treated as one text.
  • the text of the original text set is clustered to obtain multiple first cluster sets.
  • the first clustering set is the clustering set obtained after the original text set undergoes the first clustering process.
  • Step 120 Perform a clustering process on the text in each first clustering set by using a preset clustering method to obtain multiple second clustering sets.
  • the text in each first cluster set is clustered separately to obtain multiple second cluster sets.
  • the obtained multiple second clustering sets are used as the clustering result obtained after the information processing of the present invention.
  • the second clustering set is a clustering set obtained by the first clustering set after the second clustering process.
  • the preset clustering mode is a first preset clustering mode or a second preset clustering mode.
  • the first preset clustering method or the second preset clustering method may be used to perform two clustering processes on the text in the original text set, or the first preset clustering method may be used to Perform a clustering process on the text in the original text collection to obtain a first clustering set, and then use a second preset clustering method to cluster the text in the first clustering set, or use a second preset clustering method Perform a clustering process on the text in the original text set to obtain a first cluster set, and then perform a clustering process on the text in the first cluster set using a first preset clustering method.
  • the first preset clustering mode and the second preset clustering mode are determined based on the efficiency requirements and/or precision requirements of text clustering.
  • If text clustering is concerned with clustering efficiency, and the first preset clustering method has high clustering efficiency, both clustering passes may adopt the first preset clustering method, that is, the first preset clustering method is used to cluster the original text set and is also used to cluster the first cluster sets. If text clustering is concerned with clustering accuracy, and the second preset clustering method has high clustering accuracy, both clustering passes may adopt the second preset clustering method, that is, the second preset clustering method is used to cluster the original text set and is also used to cluster the first cluster sets.
  • The preset clustering methods may also mix the first preset clustering mode and the second preset clustering mode: the first preset clustering method is used to cluster the original text set and the second preset clustering method is used to cluster the first cluster sets; or the second preset clustering method is used to cluster the original text set and the first preset clustering method is used to cluster the first cluster sets. If text clustering places no particular requirements on the clustering results, the clustering methods may be chosen arbitrarily.
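The two-pass structure described above can be sketched as follows (the function and variable names are illustrative, not from the patent; the two toy clustering methods stand in for the first and second preset clustering methods):

```python
def two_pass_cluster(texts, first_method, second_method):
    """Two-pass clustering: the chosen first method splits the original
    text set into first cluster sets, and the chosen second method is then
    applied separately inside each first cluster set."""
    second_sets = []
    for first_set in first_method(texts):
        second_sets.extend(second_method(first_set))
    return second_sets

# Toy stand-ins for the two preset clustering methods:
# group by first character, then split each group into singletons.
def by_initial(texts):
    return [[t for t in texts if t[0] == c] for c in sorted({t[0] for t in texts})]

def singletons(texts):
    return [[t] for t in texts]

result = two_pass_cluster(["ax", "ay", "bz"], by_initial, singletons)
```

Any combination of the two preset methods fits this shape, since both passes consume and produce lists of text sets.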
  • the clustering process is performed using the preset clustering mode, including:
  • Step 310 Extract each text in the to-be-processed text set, and express each text as a signature vector.
  • The to-be-processed text set may be the original text set or a first cluster set; that is, the clustering processing in this embodiment applies to the clustering processing in step 110 and/or step 120. In other words, the first clustering process is performed on the original text set according to the first preset clustering method, or the second clustering process is performed on a first cluster set that has already undergone clustering.
  • the step of processing the to-be-processed text set to obtain a signature vector includes (not shown in the drawings of the specification):
  • Step 3101 Obtain the word sequence of the text, obtain the weight of each word in the word sequence, and obtain a weighted word sequence based on these weights.
  • Preprocessing the text to obtain the word sequence of the text includes: performing word segmentation and removing stop words on the text.
  • a preset weight algorithm the weight of each word in the word sequence is calculated to obtain a weighted word sequence.
  • a TF-IDF algorithm is used to perform weight calculation on the text to obtain a sequence of weighted words of the text.
  • As an example, text1 in the to-be-processed text set is "I want to apply for an in-app purchase of a ZTE mobile phone".
  • After word segmentation and stop-word removal, the resulting word sequence is [apply, …].
  • The TF-IDF algorithm is used to calculate the weight of each word in the obtained word sequence, and the weighted word sequence of text1 is [apply, 3.12, …].
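A minimal TF-IDF weighting sketch in pure Python (the token lists are illustrative stand-ins for the patent's segmented text, not its actual example):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Map each tokenized document (a list of words) to a dict of
    word -> TF-IDF weight, computed over the whole collection."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return weighted

docs = [["apply", "zte", "phone"], ["zte", "phone", "price"], ["weather", "today"]]
weights = tfidf_weights(docs)
# "apply" occurs in only one document, so it outweighs the more common "zte"
```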
  • Step 3102 Perform a hash operation on the weighted word sequence to obtain a weighted hash value sequence.
  • A hash algorithm is used to calculate the hash value of each word in the word sequence: each word is converted into an N-bit binary hash value according to a preset, and each bit of the hash value of each word is multiplied by the weight of the word (a 1 bit contributing +weight and a 0 bit contributing -weight) to obtain the weighted hash value sequence of the text.
  • As an example, N is set to 128, that is, each word is converted into 128 bits by the hash algorithm.
  • The resulting hash value sequence is [100101…010, 3.12, …].
  • Multiplying each bit of each word by the corresponding weight, the weighted hash value sequence of the text is [3.12, -3.12, -3.12, 3.12, -3.12, 3.12, …, -3.12, 3.12, -3.12].
  • Step 3103 Perform merge processing on each weighted hash value in the weighted hash value sequence to obtain a weighted hash value corresponding to the text.
  • each hash value in the weighted hash value sequence is added bit by bit to obtain a weighted hash value corresponding to the text.
  • As an example, the weighted hash value sequences of text1, such as [3.12, -3.12, -3.12, 3.12, -3.12, …, -3.12, 3.12, -3.12], are added bit by bit to obtain the weighted hash value corresponding to text1.
  • Step 3104 Perform binary processing on the weighted hash value to obtain a binary signature vector.
  • the weighted hash value obtained in the above step is binarized to obtain a binary signature vector of the text.
  • the weighted hash value is processed bit by bit. When the value of the bit is positive, the bit is 1, and when the value of the bit is negative, the bit is 0.
  • As an example, the weighted hash value of text1, [5.74, 3.91, -1.18, 2.31, -12.34, -7.71, …, -3.64, -0.11, 21.29], is binarized, and the 128-bit binary signature vector corresponding to text1 is [1 1 0 1 0 0 … 0 0 1].
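Steps 3101 through 3104 amount to a SimHash-style signature. A sketch using MD5 as the stand-in 128-bit word hash (the patent does not name a specific hash function, so that choice is an assumption):

```python
import hashlib

def simhash(weighted_words, n_bits=128):
    """Weighted per-bit voting: each word contributes +weight where its
    hash bit is 1 and -weight where it is 0; the sign of each accumulated
    bit yields the binary signature vector."""
    acc = [0.0] * n_bits
    for word, weight in weighted_words.items():
        # MD5 conveniently yields exactly 128 bits per word
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(n_bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    # binarize: positive -> 1, non-positive -> 0
    return [1 if v > 0 else 0 for v in acc]

sig = simhash({"apply": 3.12, "zte": 1.7, "phone": 0.9})
```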
  • Step 320 Perform segmentation processing on the signature vector of each text, obtain multiple signature vector segments, and perform clustering processing on each signature vector segment.
  • the signature vectors of the text are segmented according to preset parameters to obtain multiple signature vector segments, and each signature vector segment is clustered.
  • The preset parameters may be set according to calculation requirements.
  • Step 3201 Segment the binary signature vector of each text to obtain multiple binary signature vector segments.
  • Step 3202 Perform hash operation processing on each binary signature vector segment to obtain a hash value corresponding to the binary signature vector segment.
  • Each of the b binary signature vector segments is hashed separately to obtain the hash value of each segment of the binary signature vector.
  • Step 3203 Divide the corresponding text with the same hash value into the same clustering set.
  • the hash value of each segment of the binary signature vector obtained in the above step is classified, and the corresponding text with the same hash value is divided into the same clustering set.
  • In this embodiment, the hash algorithm reduces the dimensionality of the text and a weighting algorithm is applied, which not only reduces the difficulty of calculation and improves efficiency, but also ensures the accuracy of calculation.
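Steps 3201 through 3203 are essentially locality-sensitive banding: texts sharing any identical signature segment fall into the same cluster set. A sketch (function and variable names are illustrative):

```python
from collections import defaultdict

def cluster_by_segments(signatures, b):
    """Split each binary signature into b equal segments; texts whose
    signatures agree on the same segment fall into one bucket, and any
    bucket holding two or more texts becomes a cluster set."""
    buckets = defaultdict(set)
    for text_id, sig in signatures.items():
        seg_len = len(sig) // b
        for i in range(b):
            segment = tuple(sig[i * seg_len:(i + 1) * seg_len])
            buckets[(i, segment)].add(text_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

sigs = {"t1": [1, 0, 1, 1, 0, 0, 1, 0],
        "t2": [1, 0, 1, 1, 1, 1, 0, 0],
        "t3": [0, 1, 0, 0, 1, 1, 1, 1]}
clusters = cluster_by_segments(sigs, b=2)  # t1 and t2 share their first segment
```

With more segments (larger b), signatures need to agree on a shorter run of bits, so recall rises at the cost of precision.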
  • the clustering process using the preset clustering mode includes:
  • Step 410 Calculate the similarity between any two texts in the text set to be processed.
  • The to-be-processed text set may be the original text set or a first cluster set; that is, the clustering processing in this embodiment applies to the clustering processing in step 110 and/or step 120. In other words, the first clustering process is performed on the original text set according to the second preset clustering method, or the second clustering process is performed on a first cluster set that has already undergone clustering.
  • the similarity between any two texts in the to-be-processed text set is calculated.
  • Step 420 Perform clustering processing on the text in the to-be-processed text set based on the calculation result of the similarity.
  • the calculation result of the similarity obtained above is classified to obtain a clustering set conforming to the similarity classification algorithm.
  • The steps of the preset similarity algorithm specifically include a semantic similarity algorithm and a syntactic similarity algorithm, wherein the semantic similarity algorithm between any two texts specifically includes (not shown in the drawings of the specification):
  • Step A Use the preset corpus to train the text in the text set to be processed to obtain the word vector matrix of the text.
  • The Word2Vec method can be used for word vector training, with the vector length set to d_w (optionally, d_w is set to 400); the result of training is a word vector matrix in which each row is the d_w-dimensional vector of one vocabulary word.
  • Step B For any two texts, calculate the semantic similarity based on the semantic distance.
  • Step B1 Pre-process the text in the text set to be processed to obtain the word sequence of the text.
  • the preprocessing operations include: word segmentation of the text, removal of stop words, and so on.
  • The word sequence of t1 is [w1_1, …, w1_m] and the word sequence of t2 is [w2_1, …, w2_n], where w1_m is the m-th word of t1, m is a positive integer; w2_n is the n-th word of t2, n is a positive integer.
  • Step B2 Calculate the semantic similarity of the corresponding two words in any two texts in the text set.
  • The calculation formula (1) of the word sense similarity is: sim_cos(w1, w2) = v(w1) · v(w2) / (|v(w1)| × |v(w2)|)
  • where sim_cos(w1, w2) is the word meaning similarity of the word w1 and the word w2; v(w1) is the word vector of the word w1; v(w2) is the word vector of the word w2.
  • The semantic similarity score1 between t1 and t2 is then calculated according to formula (2).
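Formula (1) is the standard cosine similarity. The body of formula (2) does not survive in the text above, so the sentence-level aggregation below (best-match averaging, symmetrized) is only one plausible reading, and the helper names are illustrative:

```python
import math

def cos_sim(v1, v2):
    """Cosine similarity of two word vectors, i.e. formula (1)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def sentence_sim(t1, t2, vec):
    """Assumed aggregation: for each word, take its best cosine match in
    the other sentence, then average the two directions."""
    def directed(a, b):
        return sum(max(cos_sim(vec[w], vec[u]) for u in b) for w in a) / len(a)
    return (directed(t1, t2) + directed(t2, t1)) / 2

vec = {"apply": [1.0, 0.0], "request": [1.0, 0.0], "phone": [0.0, 1.0]}
score1 = sentence_sim(["apply"], ["request"], vec)
```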
  • the syntax similarity algorithm between any two texts specifically includes (not shown in the drawings of the specification):
  • Step A Pre-process the text in the text set to be processed to obtain the word sequence of the text.
  • the preprocessing operations include: word segmentation of the text, removal of stop words, and so on.
  • the texts text1 and text2 are subjected to word segmentation, stop word removal and other operations to obtain word sequences t1 and t2.
  • Step B Perform dependency syntax analysis on the word sequence corresponding to the text to obtain a syntactic similarity between any two texts.
  • Step B1 Use a preset syntax analysis tool to perform a dependency syntax analysis on the word sequence of the two texts to obtain the number of valid word collocation pairs between the two texts.
  • The natural language processing open source package of Stanford University or of Fudan University is used to perform dependency syntax analysis on t1 and t2, and the numbers of valid word collocation pairs in t1 and t2 are calculated, denoted p1 and p2, respectively.
  • Step B2 According to the number of valid word collocation pairs between the two texts, obtain the syntactic similarity between the texts.
  • p1 and p2 are substituted into the calculation formula (3) to obtain the syntactic similarity score2 of text1 and text2.
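The body of formula (3) does not survive in the text above; a Dice-style overlap of the dependency collocation pairs, with p1 and p2 as the pair counts, is one plausible form (an assumption, not the patent's definition):

```python
def syntactic_similarity(pairs1, pairs2):
    """Assumed formula (3): Dice-style overlap of valid word collocation
    pairs, 2 * |common pairs| / (p1 + p2)."""
    p1, p2 = len(pairs1), len(pairs2)
    common = len(set(pairs1) & set(pairs2))
    return 2 * common / (p1 + p2) if (p1 + p2) else 0.0

# two texts sharing one of three dependency pairs
score2 = syntactic_similarity([("I", "eat"), ("eat", "apple")], [("eat", "apple")])
```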
  • the steps of the preset similarity algorithm specifically include:
  • Step A Obtain the semantic similarity between any two texts according to the preset semantic similarity algorithm.
  • Step B According to the preset syntactic similarity algorithm, the syntactic similarity between any two texts is obtained.
  • Step C Based on the calculated semantic similarity and syntactic similarity, a similarity between any two texts is obtained.
  • the preset similarity classification algorithm includes:
  • Step A Calculate the first similarity between two first texts in the to-be-processed text set, and determine whether the first similarity exceeds a first preset threshold; a first text is any text in the to-be-processed text set.
  • the first similarity is the similarity between any two texts in the text set to be processed
  • the first preset threshold can be set according to calculation needs, and the setting range can be from 0 to 1.
  • the first preset threshold is set to 0.5.
  • step 410 based on the preset similarity algorithm, a similarity score between any two texts in the to-be-processed text set is calculated.
  • Step B When the first similarity exceeds the first preset threshold, divide the two first texts into the same clustering set.
  • the two first texts corresponding to the score are divided into the same clustering set.
  • Otherwise, the two texts corresponding to the score are not divided into the same clustering set. That is, within a clustering set, the similarity between any two texts exceeds the first preset threshold.
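Steps A and B above describe a first pass that groups texts whose pairwise similarity exceeds the first preset threshold. A greedy sketch (the `sim` callable and the threshold value are illustrative):

```python
def pairwise_cluster(texts, sim, threshold=0.5):
    """Place each text into the first cluster where its similarity to every
    member exceeds the threshold; otherwise start a new cluster, so that
    within any cluster all pairwise similarities exceed the threshold."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            if all(sim(text, other) > threshold for other in cluster):
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters

# toy similarity: 1.0 when two texts start with the same character
first_char_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
clusters = pairwise_cluster(["apple", "ant", "bee"], first_char_sim)
```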
  • the preset similarity classification algorithm may further include:
  • Step A Separately calculate the second similarities between a second text in the to-be-processed text set and the first texts in a clustering set, and determine whether the second similarities exceed the first preset threshold; the second text is any text in the to-be-processed text set other than the first texts.
  • the second similarity is the similarity between any two texts in the text set to be processed.
  • the similarity between any two texts in the clustering set is not less than the first preset threshold.
  • step 410 based on the preset similarity algorithm, the second text in the to-be-processed text set and any text in the clustering set are calculated. Similarity, and determine whether the multiple similarity scores are not less than a first preset threshold.
  • Step B When the second similarities all exceed the first preset threshold, obtain the second mean value of the similarities between the second text and the texts in the cluster set, and determine whether the second similarity mean value exceeds a second preset threshold; when the second similarity mean value exceeds the second preset threshold, the cluster set is determined as a candidate cluster set.
  • The multiple scores are added and the corresponding mean value of similarity is calculated; this mean value of similarity is the second similarity mean value.
  • the second preset threshold can be set according to calculation needs, and the setting range can be 0 to 1. As an example, the second preset threshold is set to 0.5.
  • the cluster set is determined as a candidate cluster set.
  • The multiple scores are added and the corresponding second average similarity value is calculated; the magnitude relationship between the second average similarity and the second preset threshold is judged, and when the second similarity average exceeds the second preset threshold, the cluster set C is determined as a candidate cluster set.
  • Step C Obtain all candidate cluster sets corresponding to the second text, use the candidate cluster set that satisfies the preset requirements in the candidate cluster set as the target cluster set, and divide the second text into the Target cluster collection.
  • steps A and B are used to obtain all candidate cluster sets corresponding to the second text. Obtain a target cluster set from all the candidate sets according to a preset requirement, and add the second text to the target cluster set.
  • the preset requirement may be to select the cluster set with the highest average similarity as the target cluster set.
  • Steps A and B are used to obtain all candidate cluster sets corresponding to the second text, denoted P1, P2, …, Pn, together with the corresponding second similarity mean values; among them, suppose the mean value of P1 is the largest. According to the preset requirement, the candidate cluster set P1 is determined as the target cluster set, and the second text t2 is added to the cluster set P1.
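Steps A through C above can be sketched as follows: a second text joins the candidate cluster set with the highest mean similarity, provided every pairwise similarity exceeds the first threshold and the mean exceeds the second (names, thresholds, and the toy `sim` are illustrative):

```python
def assign_to_cluster(text, clusters, sim, th1=0.5, th2=0.5):
    """Find all candidate cluster sets for `text` (every pairwise similarity
    above th1 and mean similarity above th2), then add the text to the
    candidate with the highest mean; return True if it was placed."""
    candidates = []
    for cluster in clusters:
        sims = [sim(text, member) for member in cluster]
        mean = sum(sims) / len(sims)
        if all(s > th1 for s in sims) and mean > th2:
            candidates.append((mean, cluster))
    if not candidates:
        return False
    max(candidates, key=lambda c: c[0])[1].append(text)
    return True

clusters = [["aa", "ab"], ["bb"]]
sim = lambda x, y: 1.0 if x[0] == y[0] else 0.2
placed = assign_to_cluster("ac", clusters, sim)  # joins the first cluster
```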
  • the semantic similarity and syntactic similarity between the texts are calculated, and the value of the parameter can be set to meet the needs of different application scenarios and improve the accuracy of clustering.
  • the clustering requirements are: high efficiency requirements and low precision requirements
  • the methods provided by the embodiments of the present invention include:
  • S101 Select the first preset clustering method to perform two clustering processes, and cluster the information text in the system into multiple second clustering sets.
  • S102 Match a second cluster set related to the user according to the user attributes and historical data; if the user's historical access information text exists in the second cluster set, or the label information of the second cluster set matches the user attributes, the texts in that second cluster set are recommended to the user.
  • the clustering requirements are: low efficiency requirements and high precision requirements
  • the methods provided by the embodiments of the present invention include:
  • S201 Select the second preset clustering method to perform two clustering processes, and cluster the question and answer library in the system into multiple second clustering sets.
  • S202 Analyze all samples in multiple second cluster sets, and select the recommended samples of the second cluster set.
  • the specific recommendation scheme is defined by the intelligent question and answer scenario.
  • S203 Configure a standard answer for the recommended samples in the multiple second cluster sets, so that each sample in the multiple second cluster sets and the standard answer form a question and answer pair, and place them in the question and answer database.
  • the above application can automatically and accurately expand the question and answer library of intelligent question and answer, avoiding a lot of manual operations.
  • the clustering requirements are: high efficiency requirements and high precision requirements
  • the methods provided by the embodiments of the present invention include:
  • S301 Select a first preset clustering method and a second preset clustering method for clustering processing, and cluster the index text in the system into multiple second clustering sets.
  • the clustering requirements are: low efficiency requirements and low precision requirements
  • the methods provided by the embodiments of the present invention include:
  • S401 The two clustering methods can be freely combined, for example, first selecting the first preset clustering method and then the second preset clustering method, to cluster all the data samples in the digital library and obtain multiple second cluster sets.
  • S402 Use a topic model or other methods to obtain the label information of each second cluster set.
  • S403 Assign each sample of a second cluster set the label information of that set. Since one sample may exist in multiple second cluster sets, each data sample may have multiple pieces of label information.
  • the above-mentioned applications can add tag information to large-scale material texts, such as adding category information to books, thereby facilitating the system management of the digital library.
  • An embodiment of the present invention also provides an information processing apparatus.
  • the information processing apparatus 500 includes:
  • the first clusterer 510 is configured to perform clustering processing on the text in the original text set in a preset clustering manner to obtain multiple first cluster sets.
  • the second clusterer 520 is configured to perform clustering processing on the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets.
  • the preset clustering mode is the first preset clustering mode or the second preset clustering mode.
  • the first preset clustering mode and the second preset clustering mode are determined based on the efficiency requirements and/or precision requirements of text clustering.
  • the first clusterer 510 or the second clusterer 520 is used to extract each text in the set of text to be processed, and represent each text as a signature vector; for each text, the signature vector Perform segmentation processing to obtain multiple signature vector segments, and perform clustering processing on each signature vector segment.
  • The first clusterer 510 or the second clusterer 520 is used to obtain a word sequence of the text and the weight of each word in the word sequence, and to obtain a weighted word sequence based on those weights; to hash the weighted word sequence to obtain a weighted hash value sequence; to merge each weighted hash value in the weighted hash value sequence to obtain a weighted hash value corresponding to the text; and to binarize the weighted hash value to obtain a binary signature vector.
  • the first clusterer 510 or the second clusterer 520 is used to segment the binary signature vector of each text to obtain multiple binary signature vector segments; for each binary The signature vector segment is hashed to obtain the hash value corresponding to the binary signature vector segment; the corresponding text with the same hash value is divided into the same clustering set.
  • the first clusterer 510 or the second clusterer 520 is used to calculate the similarity between any two texts in the set of text to be processed.
  • the text in the text collection to be processed is clustered.
  • the first clusterer 510 or the second clusterer 520 is used to calculate a first similarity between two first texts in the set of text to be processed, and determine the first similarity Whether the degree exceeds a first preset threshold; the first text is any text in the to-be-processed text set; when the first similarity exceeds the first preset threshold, the two A text is divided into the same clustering set.
  • The first clusterer 510 or the second clusterer 520 is used to calculate the second similarities between the second text in the to-be-processed text set and the first texts in a cluster set, and determine whether all the second similarities exceed the first preset threshold, the second text being any text in the to-be-processed text set other than the first texts; when all the second similarities exceed the first preset threshold, to obtain the second mean value of the similarities between the second text and the first texts in the cluster set, and determine whether the second similarity mean value exceeds a second preset threshold; when the second similarity mean value exceeds the second preset threshold, to determine the cluster set as a candidate cluster set; and to obtain all candidate cluster sets corresponding to the second text, take the candidate cluster set that satisfies the preset requirement as the target cluster set, and divide the second text into the target cluster set.
  • the to-be-processed text set is the original text set or the first cluster set.
  • The clustering process is performed twice by the preset clustering methods, with the second pass clustering the text within each first cluster set again; this hierarchical clustering approach avoids inconsistent clustering results on the one hand, and greatly improves clustering accuracy and clustering efficiency on the other.
  • different combinations of clustering methods can be selected according to clustering requirements, which makes clustering processing more flexible and effective, and can meet the needs of different application scenarios.
  • the first clusterer 510 and the second clusterer 520 in the information processing device 500 can be implemented by a CPU, DSP, MCU, or FPGA in practical applications.
  • When the information processing apparatus provided in the above embodiment performs information processing, the above division into program modules is only used as an example for illustration. In practical applications, the above processing may be allocated to different program modules as needed; that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above.
  • the information processing apparatus and the information processing method embodiment provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiment, and details are not described here.
  • An embodiment of the present invention also provides a computer-readable storage medium on which an executable program is stored, and when the executable program is executed by a processor, any of the foregoing information processing methods is implemented.
  • An embodiment of the present invention also provides an information processing apparatus including: a processor and a memory for storing a computer program that can be run on the processor, wherein the processor is configured, when running the computer program, to execute any information processing method of the embodiments of the present invention.
  • the memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof.
  • the non-volatile memory can be read-only memory (ROM, Read Only Memory), programmable read-only memory (PROM, Programmable Read-Only Memory), erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), magnetic random access memory (FRAM, Ferromagnetic Random Access Memory), flash memory (Flash Memory), magnetic surface memory, compact disc, or compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be a disk storage or a tape storage.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache.
  • RAM random access memory
  • SRAM static random access memory
  • SSRAM synchronous static random access memory
  • DRAM Dynamic Random Access Memory
  • SDRAM Synchronous Dynamic Random Access Memory
  • DDRSDRAM Double Data Rate Synchronous Dynamic Random Access Memory
  • ESDRAM Enhanced Synchronous Dynamic Random Access Memory
  • SLDRAM SyncLink Dynamic Random Access Memory
  • DRRAM Direct Rambus Random Access Memory
  • the method disclosed in the foregoing embodiments of the present invention may be applied to a processor, or implemented by a processor.
  • the processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the aforementioned processor may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the processor may implement or execute the disclosed methods, steps, and logical block diagrams in the embodiments of the present invention.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium.
  • the storage medium is located in a memory.
  • the processor reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
  • the information processing apparatus may be implemented by one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), FPGAs, general-purpose processors, controllers, MCUs, microprocessors (Microprocessor), or other electronic components, to implement the aforementioned method.
  • ASIC application specific integrated circuits
  • DSP digital signal processor
  • PLD programmable logic device
  • CPLD complex programmable logic device
  • FPGA field-programmable gate array
  • MCU microcontroller unit
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division of logical functions.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the functional units in the embodiments of the present invention may all be integrated into one processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • the foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a removable storage device, ROM, RAM, magnetic disk, or optical disk.
  • if the above integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention.
  • the foregoing storage media include various media that can store program codes, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information processing method, an information processing apparatus, and a computer storage medium. The method comprises: performing clustering processing on text in an original text set by using a preset clustering mode to obtain multiple first cluster sets (110); and performing clustering processing on text in each first cluster set by using a preset clustering mode to obtain multiple second cluster sets, the preset clustering mode being a first preset clustering mode or a second preset clustering mode (120).

Description

一种信息处理方法、装置和计算机存储介质Information processing method, device and computer storage medium 技术领域Technical field

本发明涉及计算机技术领域,尤其涉及一种信息处理方法、装置和计算机存储介质。The present invention relates to the field of computer technology, and in particular, to an information processing method, device, and computer storage medium.

背景技术Background technique

在信息爆炸的时代里,人们对海量信息进行快速准确整理的需求与日俱增。为实现这一需求,许多应用应运而生,如信息检索、文献查重、个性推荐、智能问答等。在这些应用中,文本聚类技术是关键的核心技术。文本聚类技术已经成为对文本信息进行有效组织、摘要和导航的重要手段。In the era of information explosion, there is an increasing demand for rapid and accurate collation of massive amounts of information. To meet this demand, many applications have emerged, such as information retrieval, literature review, personality recommendation, and intelligent question answering. In these applications, text clustering technology is the key core technology. Text clustering technology has become an important means of effective organization, summary and navigation of text information.

无监督机器学习提供了一些聚类技术,包括基于划分的方法、层次聚类的方法、基于密度的方法、基于网格的方法、基于模型的方法、自组织映射神经网络的方法、基于蚁群的方法等。这些方法复杂度相对较高,难以处理大规模文本的聚类。Unsupervised machine learning provides some clustering techniques, including partition-based methods, hierarchical clustering methods, density-based methods, grid-based methods, model-based methods, self-organizing mapping neural network methods, ant colony-based methods Method etc. The complexity of these methods is relatively high, and it is difficult to deal with large-scale text clustering.

若采用稍简单聚类算法进行聚类处理,目前的处理方案是先采用某种聚类算法进行聚类处理,进一步对上一次未成功聚类的残余文本进行处理。这种聚类方法没有性能上的互补或者递进,只是简单的“烟囱式”处理,且每次聚类实际采用了不同的方法或者标准,使得最终聚类结果存在不一致性。If a slightly simpler clustering algorithm is used for clustering, the current processing scheme is to first use a clustering algorithm for clustering, and further process the residual text of the last unsuccessful clustering. This clustering method has no complementary or progressive performance, but is simply a "chimney" process, and each clustering actually uses a different method or standard, making the final clustering results inconsistent.

发明内容Summary of the invention

为解决现有存在的技术问题,本发明实施例提供一种信息处理方法、装置和计算机存储介质。To solve the existing technical problems, embodiments of the present invention provide an information processing method, device, and computer storage medium.

为达到上述目的,本发明实施例的技术方案是这样实现的:To achieve the above objective, the technical solutions of the embodiments of the present invention are implemented as follows:

本发明实施例提供了一种信息处理方法,所述方法包括:对原始文本集合中的文本采用预设聚类方式进行聚类处理,获得多个第一聚类集合;对每个第一聚类集合中的文本采用预设聚类方式进行聚类处理,获得多个第二聚类集合;所述预设聚类方式为第一预设聚类方式或第二预设聚类方式。An embodiment of the present invention provides an information processing method. The method includes: performing clustering on a text in an original text set by using a preset clustering method to obtain a plurality of first cluster sets; for each first cluster The text in the class set is clustered using a preset clustering method to obtain multiple second cluster sets; the preset clustering method is the first preset clustering method or the second preset clustering method.

本发明实施例还提供了一种信息处理装置,所述装置包括:第一聚类器,用于对原始文本集合中的文本采用预设聚类方式进行聚类处理,获得多个第一聚类集合;第二聚类器,用于对每个第一聚类集合中的文本采用预设聚类方式进行聚类处理,获得多个第二聚类集合;其中,所述预设聚类方式为第一预设聚类方式或第二预设聚类方式。An embodiment of the present invention also provides an information processing apparatus, the apparatus includes: a first clusterer, configured to perform clustering processing on the text in the original text set by using a preset clustering method to obtain multiple first clusters Cluster set; a second clusterer, configured to perform clustering on the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets; wherein, the preset cluster The mode is the first preset clustering mode or the second preset clustering mode.

本发明实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现本发明实施例所述方法的步骤。An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method according to the embodiment of the present invention are implemented.

本发明实施例还提供了一种信息处理装置,包括:处理器和用于存储能够在处理器上运行的计算机程序的存储器,其中,所述处理器用于运行所述计算机程序时,执行本发明实施例所述方法的步骤。An embodiment of the present invention also provides an information processing apparatus, including: a processor and a memory for storing a computer program that can be run on the processor, where the processor is used to execute the present invention when the computer program is run The steps of the method described in the examples.

本发明实施例公开了一种信息处理方法,所述方法包括:对原始文本集合中的文本采用预设聚类方式进行聚类处理,获得多个第一聚类集合;对每个第一聚类集合中的文本采用预设聚类方式进行聚类处理,获得多个第二聚类集合;其中,所述预设聚类方式为第一预设聚类方式或第二预设聚类方式。An embodiment of the present invention discloses an information processing method. The method includes: performing clustering processing on a text in an original text set by using a preset clustering method to obtain a plurality of first cluster sets; for each first cluster The text in the cluster set is clustered using a preset clustering method to obtain multiple second cluster sets; wherein, the preset clustering method is the first preset clustering method or the second preset clustering method .

附图说明BRIEF DESCRIPTION

图1为本发明实施例信息处理方法的流程示意图一;FIG. 1 is a first schematic flowchart of an information processing method according to an embodiment of the present invention;

图2为本发明实施例信息处理方法的流程示意图二;2 is a second schematic flowchart of an information processing method according to an embodiment of the present invention;

图3为本发明实施例信息处理方法的流程示意图三;FIG. 3 is a third schematic flowchart of an information processing method according to an embodiment of the present invention;

图4为本发明实施例信息处理方法的流程示意图四;4 is a fourth schematic flowchart of an information processing method according to an embodiment of the present invention;

图5为本发明实施例信息处理装置的结构示意图。5 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present invention.

具体实施方式detailed description

以下结合附图及实施例,对本发明进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。The present invention will be described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, and are not intended to limit the present invention.

本发明实施例提供了一种信息处理方法,如图1所示,所述方法包括:An embodiment of the present invention provides an information processing method. As shown in FIG. 1, the method includes:

步骤110:对原始文本集合中的文本采用预设聚类方式进行聚类处理,获得多个第一聚类集合。Step 110: Perform clustering on the text in the original text set by using a preset clustering method to obtain multiple first cluster sets.

在本发明的可选实施例中,原始文本集合中的文本可以是在数字图书馆、信息检索的数据库等不同应用系统中获取到的海量的数据。所述文本可以根据预设的标准进行划分,作为一种示例,根据应用的场景,将一句话、10句话或者一段话划分为一个文本。In an alternative embodiment of the present invention, the text in the original text collection may be a massive amount of data acquired in different application systems such as a digital library, an information retrieval database, and so on. The text may be divided according to a preset standard. As an example, according to an application scenario, a sentence, ten sentences, or a paragraph are divided into one text.

作为一种可选的示例,对所述原始文本集合的文本进行聚类处理,获得多个第一聚类集合。第一聚类集合为原始文文本集合经过第一次聚类处理后得到的聚类集合。As an optional example, the text of the original text set is clustered to obtain multiple first cluster sets. The first clustering set is the clustering set obtained after the original text set undergoes the first clustering process.

步骤120:对每个第一聚类集合中的文本采用预设聚类方式进行聚类处理,获得多个第二聚类集合。Step 120: Perform a clustering process on the text in each first clustering set by using a preset clustering method to obtain multiple second clustering sets.

当经过两次聚类处理的流程如图2所示,作为一种可选的示例,分别对每个第一聚类集合中的文本进行聚类处理,得到多个第二聚类集合,将所述得到的多个第二聚类集合作为本发明信息处理后得到的聚类结果。第二聚类集合为第一聚类集合经过第二次聚类处理后得到的聚类集合。When the process of two clustering processes is shown in FIG. 2, as an optional example, the text in each first cluster set is clustered separately to obtain multiple second cluster sets. The obtained multiple second clustering sets are used as the clustering result obtained after the information processing of the present invention. The second clustering set is a clustering set obtained by the first clustering set after the second clustering process.

在本发明实施例中,所述预设聚类方式为第一预设聚类方式或第二预设聚类方式。也就是说,本发明实施例中可采用第一预设聚类方式或第二预设聚类方式对原始文本集合中的文本进行两次聚类处理,或者采用第一预设聚类方式对原始文本集合中的文本进行聚类处理获得第一聚类集合,再采用第二预设聚类方式对第一聚类集合中的文本进行聚类处理,或者,采用第二预设聚类方式对原始文本集合中的文本进行聚类处理获得第一聚类集合,再采用第一预设聚类方式对第一聚类集合中的文本进行聚类处理。In this embodiment of the present invention, the preset clustering mode is a first preset clustering mode or a second preset clustering mode. In other words, in this embodiment of the present invention, the first preset clustering method or the second preset clustering method may be used to perform two clustering processes on the text in the original text set, or the first preset clustering method may be used to Perform a clustering process on the text in the original text collection to obtain a first clustering set, and then use a second preset clustering method to cluster the text in the first clustering set, or use a second preset clustering method Perform a clustering process on the text in the original text set to obtain a first cluster set, and then perform a clustering process on the text in the first cluster set using a first preset clustering method.

本实施例中,所述第一预设聚类方式和所述第二预设聚类方式基于文本聚类的效率要求和/或精度要求确定。In this embodiment, the first preset clustering mode and the second preset clustering mode are determined based on the efficiency requirements and/or precision requirements of text clustering.

在本发明可选的实施例中,若文本聚类关注聚类效率,且第一预设聚类方式聚类效率高,则预设聚类方式可均采用第一预设聚类方式,即采用第一预设聚类方式对原始文本集合进行聚类处理,采用第一预设聚类方式对第一聚类集合进行聚类处理;若文本聚类关注聚类精度,且第二预设聚类方式聚类精度高,则预设聚类方式可均采用第二预设聚类方式,即采用第二预设聚类方式对原始文本集合进行聚类处理,采用第二预设聚类方式对第一聚类集合进行聚类处理;若文本聚类同时关注聚类效率和聚类精度,则预设聚类方式可采用第一预设聚类方式和第二预设聚类方式,即采用第一预设聚类方式对原始文本集合进行聚类处理,采用第二预设聚类方式对第一聚类集合进行聚类处理;或者,采用第二预设聚类方式对原始文本集合进行聚类处理,采用第一预设聚类方式对第一聚类集合进行聚类处理。或者,若文本聚类对聚类结果没有要求时,可以任意选择聚类方式进行聚类处理。In an optional embodiment of the present invention, if text clustering focuses on clustering efficiency, and the first preset clustering method has high clustering efficiency, then both passes may use the first preset clustering method, that is, the first preset clustering method is used to cluster the original text set and also to cluster the first cluster sets; if text clustering focuses on clustering accuracy, and the second preset clustering method has high clustering accuracy, then both passes may use the second preset clustering method, that is, the second preset clustering method is used to cluster the original text set and also to cluster the first cluster sets; if text clustering focuses on both clustering efficiency and clustering accuracy, the two passes may combine the first and second preset clustering methods, that is, the first preset clustering method is used to cluster the original text set and the second preset clustering method is used to cluster the first cluster sets, or the second preset clustering method is used to cluster the original text set and the first preset clustering method is used to cluster the first cluster sets. Alternatively, if text clustering imposes no requirement on the clustering result, any clustering method may be chosen for the clustering process.

如此,可以根据聚类的需求,选择不同的聚类方法的组合,使得聚类处理更加灵活有效,可以满足不同应用场景的需求。In this way, different combinations of clustering methods can be selected according to clustering requirements, so that clustering processing is more flexible and effective, and can meet the needs of different application scenarios.
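The flexible combination of the two passes described above can be sketched as follows. The function names and the trivial placeholder clustering methods below are illustrative assumptions, not the patent's actual algorithms; they only show how either preset method can be chosen for either pass:

```python
# Hedged sketch: a two-pass clustering pipeline where each pass can use
# either preset method. cluster_fast / cluster_precise stand in for the
# first and second preset clustering methods; here they are trivial
# placeholders (group by first character / group by length).

def cluster_fast(texts):
    # Placeholder for the first preset method (e.g. signature-based grouping).
    groups = {}
    for t in texts:
        groups.setdefault(t[:1], []).append(t)
    return list(groups.values())

def cluster_precise(texts):
    # Placeholder for the second preset method (e.g. pairwise similarity).
    groups = {}
    for t in texts:
        groups.setdefault(len(t), []).append(t)
    return list(groups.values())

METHODS = {"fast": cluster_fast, "precise": cluster_precise}

def two_pass_cluster(texts, first="fast", second="precise"):
    """First pass over the original set, second pass inside each first cluster."""
    first_sets = METHODS[first](texts)
    second_sets = []
    for cluster in first_sets:
        second_sets.extend(METHODS[second](cluster))
    return second_sets

result = two_pass_cluster(["apple", "ant", "axe", "bee", "bat"])
```

Swapping the `first` and `second` arguments selects any of the four combinations described above without changing the pipeline itself.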

在本发明可选的实施例中,在预设聚类方式为第一预设聚类方式时,如图3所示,采用预设聚类方式进行聚类处理,包括:In an alternative embodiment of the present invention, when the preset clustering mode is the first preset clustering mode, as shown in FIG. 3, the clustering process is performed using the preset clustering mode, including:

步骤310:提取待处理文本集合中每个文本,将每个文本表示为签名向量。Step 310: Extract each text in the to-be-processed text set, and express each text as a signature vector.

在本发明的可选实施例中,待处理文本集合可以是所述原始文本集合或所述第一聚类集合,即本实施例的聚类处理是针对步骤110和/或步骤120中的聚类处理;也即根据所述第一预设聚类方式对原始文本集合进行第一次聚类处理或者是针对已进行聚类处理后的第一聚类集合进行第二次聚类处理。In an optional embodiment of the present invention, the to-be-processed text set may be the original text set or the first cluster set, that is, the cluster processing in this embodiment is directed to the clustering in step 110 and/or step 120 Class processing; that is, perform the first clustering process on the original text set according to the first preset clustering method or perform the second clustering process on the first cluster set after the clustering process has been performed.

在本发明的可选实施例中,对所述待处理文本集合进行处理,得到签名向量的步骤包括(说明书附图中未示出):In an optional embodiment of the present invention, the step of processing the to-be-processed text set to obtain a signature vector includes (not shown in the drawings of the specification):

步骤3101:获得文本的词序列,以及获得所述词序列中每个词的权重,基于获得所述词序列中每个词的权重获得带权词序列。Step 3101: Obtain a word sequence of text, and obtain a weight of each word in the word sequence, and obtain a weighted word sequence based on obtaining the weight of each word in the word sequence.

对所述文本进行预处理,得到所述文本的词序列。所述预处理的步骤包括:对所述文本进行分词、去除停止词等操作。根据预设的权重算法,计算所述词序列中每个词的权重,得到带权词序列。作为一种可选的示例,采用TF-IDF算法,对所述文本进行权重计算,得到所述文本的带权词序列。Preprocessing the text to obtain the word sequence of the text. The pre-processing step includes: performing word segmentation and removing stop words on the text. According to a preset weight algorithm, the weight of each word in the word sequence is calculated to obtain a weighted word sequence. As an optional example, a TF-IDF algorithm is used to perform weight calculation on the text to obtain a sequence of weighted words of the text.

作为一种示例,获取待处理文本集合中的文本text1为“我想申请内购中兴手机了”,经过预处理后,得到的词序列为[申请|内购|中兴|手机]。采用TF-IDF算法,对所述得到的词序列中的每个词进行权重计算,得到text1的带权词序列为[申请,3.12|内购,8.90|中兴,5.54|手机,1.89]。As an example, the text1 in the text collection to be processed is "I want to apply for in-app purchase ZTE mobile phone". After preprocessing, the resulting word sequence is [apply|in-app purchase|ZTE|mobile phone]. The TF-IDF algorithm is used to calculate the weight of each word in the obtained word sequence, and the weighted word sequence of text1 is [application, 3.12|internal purchase, 8.90|ZTE, 5.54|mobile phone, 1.89].
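The weighting step of this example can be sketched with a plain TF-IDF computation. The smoothed-IDF variant and the toy corpus below are assumptions for illustration; the exact weight values of the patent's example are not reproduced:

```python
import math

def tfidf_weights(doc_words, corpus):
    # doc_words: one tokenized document; corpus: list of tokenized documents.
    n_docs = len(corpus)
    weights = {}
    for w in set(doc_words):
        tf = doc_words.count(w) / len(doc_words)           # term frequency
        df = sum(1 for d in corpus if w in d)              # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1        # smoothed IDF (assumption)
        weights[w] = tf * idf
    return weights

corpus = [["申请", "内购", "中兴", "手机"],
          ["申请", "流程"],
          ["手机", "价格"]]
w = tfidf_weights(corpus[0], corpus)
```

Words that appear in fewer documents (here 内购, 中兴) receive higher weights than words shared across the corpus (申请, 手机), matching the intent of the weighted word sequence.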

步骤3102:将所述带权词序列进行哈希运算获得带权哈希值序列。Step 3102: Perform a hash operation on the weighted word sequence to obtain a weighted hash value sequence.

采用哈希算法,计算所述词序列中的每个词的哈希值。根据预设的将词序列中的每个词转化为二进制的位数N,将所述每个词的哈希值的每一位,分别与所述词的权重进行相乘,得到所述文本的带权哈希值序列。A hash algorithm is used to calculate the hash value of each word in the word sequence. According to a preset number of bits N into which each word is converted, each bit of each word's hash value is multiplied by the word's weight to obtain the weighted hash value sequence of the text.

作为一种示例,设定N为128,即通过哈希算法将每个词转为128位,则text1对应的词序列中的每个词经过哈希算法后,得到的哈希值序列为[100101...010,3.12|...|000110...100,1.89],将每个词的每一位与对应的权重相乘,得到所述文本的带权哈希值序列为[3.12 -3.12 -3.12 3.12 -3.12 3.12...-3.12 3.12 -3.12|...|-1.89 -1.89 -1.89 1.89 1.89 -1.89...]。As an example, set N to 128, that is, each word is converted into 128 bits by the hash algorithm. After each word in the word sequence corresponding to text1 is hashed, the resulting hash value sequence is [100101...010, 3.12 | ... | 000110...100, 1.89]. Multiplying each bit of each word's hash value by the corresponding weight yields the weighted hash value sequence of the text: [3.12 -3.12 -3.12 3.12 -3.12 3.12 ... -3.12 3.12 -3.12 | ... | -1.89 -1.89 -1.89 1.89 1.89 -1.89 ...].

步骤3103:对所述带权哈希值序列中的每个带权哈希值进行合并处理,获得对应于所述文本的加权哈希值。Step 3103: Perform merge processing on each weighted hash value in the weighted hash value sequence to obtain a weighted hash value corresponding to the text.

在本发明可选的实施例中,将所述带权哈希值序列中的每个哈希值按位进行相加,获得所述文本对应的加权哈希值。In an optional embodiment of the present invention, each hash value in the weighted hash value sequence is added bit by bit to obtain a weighted hash value corresponding to the text.

作为一种示例,将上述得到text1的带权哈希值序列[3.12 -3.12 -3.12 3.12 -3.12 3.12...-3.12 3.12 -3.12|...|-1.89 -1.89 -1.89 1.89 1.89 -1.89...],将四个带权哈希值按位相加,得到[5.74 3.91 -1.18 2.31 -12.34 -7.71...-3.64 -0.11 21.29]。As an example, taking the weighted hash value sequence of text1 obtained above, [3.12 -3.12 -3.12 3.12 -3.12 3.12 ... -3.12 3.12 -3.12 | ... | -1.89 -1.89 -1.89 1.89 1.89 -1.89 ...], the four weighted hash values are added bit by bit to obtain [5.74 3.91 -1.18 2.31 -12.34 -7.71 ... -3.64 -0.11 21.29].

步骤3104:对所述加权哈希值进行二值化处理,获得二进制签名向量。Step 3104: Perform binary processing on the weighted hash value to obtain a binary signature vector.

在本发明可选的实施例中,对上述步骤得到的加权哈希值进行二值化处理,得到所述文本的二进制签名向量。作为一种可选的示例,对所述带权哈希值按位进行处理,当该位的值为正数,则该位为1,当该位的值为负数,则该位为0。In an optional embodiment of the present invention, the weighted hash value obtained in the above step is binarized to obtain a binary signature vector of the text. As an optional example, the weighted hash value is processed bit by bit. When the value of the bit is positive, the bit is 1, and when the value of the bit is negative, the bit is 0.

作为一种示例,将text1的加权的哈希值[5.74 3.91 -1.18 2.31 -12.34 -7.71...-3.64 -0.11 21.29]进行二值化处理,得到text1对应的128位的二进制签名向量为[1 1 0 1 0 0...0 0 1]。As an example, the weighted hash value of text1, [5.74 3.91 -1.18 2.31 -12.34 -7.71 ... -3.64 -0.11 21.29], is binarized to obtain the 128-bit binary signature vector corresponding to text1: [1 1 0 1 0 0 ... 0 0 1].
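Steps 3101-3104 together form a SimHash-style signature computation. A minimal sketch, assuming MD5 as the per-word hash function (the patent does not specify which hash function is used):

```python
import hashlib

def simhash(weighted_words, n_bits=128):
    # weighted_words: dict mapping word -> weight (e.g. TF-IDF).
    # Each word is hashed to n_bits bits; bit 1 contributes +weight and
    # bit 0 contributes -weight; the bitwise sums are then binarized by sign.
    acc = [0.0] * n_bits
    for word, weight in weighted_words.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()  # 128 bits (assumption)
        bits = int.from_bytes(digest, "big")
        for i in range(n_bits):
            bit = (bits >> (n_bits - 1 - i)) & 1
            acc[i] += weight if bit else -weight
    return [1 if v > 0 else 0 for v in acc]

sig = simhash({"申请": 3.12, "内购": 8.90, "中兴": 5.54, "手机": 1.89})
```

With a single word, scaling its weight does not change the sign pattern, so the signature depends on the hash bits; with several words, heavier words dominate the per-bit sums, which is the point of the weighting.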

步骤320:对每个文本的签名向量进行分段处理,获得多个签名向量分段,对每个签名向量分段进行聚类处理。Step 320: Perform segmentation processing on the signature vector of each text, obtain multiple signature vector segments, and perform clustering processing on each signature vector segment.

在本发明可选的实施例中,根据预设的参数将所述文本的签名向量进行分段处理,获得多个签名向量分段,对每个签名向量分段进行聚类处理。所述预设的参数可以根据计算需求进行设置。具体步骤(说明书附图中未示出)包括:In an optional embodiment of the present invention, the signature vector of the text is segmented according to preset parameters to obtain multiple signature vector segments, and each signature vector segment is clustered. The preset parameters may be set according to computational requirements. The specific steps (not shown in the drawings of the specification) include:

步骤3201:将每个文本的二进制签名向量进行分段处理,获得多个二进制签名向量分段。Step 3201: Segment the binary signature vector of each text to obtain multiple binary signature vector segments.

在本发明可选的实施例中,设定所述二进制签名向量划分为b段,则N位的签名向量的每段向量包含有r位,其中,N=b*r。In an optional embodiment of the present invention, if the binary signature vector is divided into b segments, then each segment of the N-bit signature vector contains r bits, where N=b*r.

步骤3202:对每个二进制签名向量分段进行哈希运算处理,获得二进制签名向量分段对应的哈希值。Step 3202: Perform hash operation processing on each binary signature vector segment to obtain a hash value corresponding to the binary signature vector segment.

在本发明可选的实施例中,分别对b段二进制签名向量进行哈希运算,获得每段二进制签名向量的哈希值。In an optional embodiment of the present invention, the b-stage binary signature vectors are respectively hashed to obtain the hash value of each stage of the binary signature vector.

步骤3203:将对应的哈希值相同的文本划分至相同的聚类集合中。Step 3203: Divide the corresponding text with the same hash value into the same clustering set.

在本发明可选的实施例中,对上述步骤得到的每段二进制签名向量的哈希值进行分类,将哈希值相同的对应文本划分至同一个聚类集合中。In an optional embodiment of the present invention, the hash value of each segment of the binary signature vector obtained in the above step is classified, and the corresponding text with the same hash value is divided into the same clustering set.
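Steps 3201-3203 can be sketched as a banding scheme: each signature is cut into b segments of r bits, each segment is hashed, and texts that share a segment hash land in the same cluster set. The toy 8-bit signatures below are illustrative assumptions:

```python
from collections import defaultdict

def band_buckets(signatures, b=4):
    # signatures: dict text_id -> list of N bits; N must be divisible by b.
    n = len(next(iter(signatures.values())))
    r = n // b  # each of the b segments has r bits (N = b * r)
    buckets = defaultdict(set)
    for text_id, sig in signatures.items():
        for seg in range(b):
            segment = tuple(sig[seg * r:(seg + 1) * r])
            # Step 3202/3203: hash the segment; equal hashes -> same set.
            buckets[(seg, hash(segment))].add(text_id)
    # Keep only buckets that actually group two or more texts.
    return [ids for ids in buckets.values() if len(ids) > 1]

sigs = {"t1": [1, 0, 1, 1, 0, 0, 1, 0],
        "t2": [1, 0, 1, 1, 1, 1, 0, 1],   # shares the first 4 bits with t1
        "t3": [0, 1, 0, 0, 1, 1, 0, 1]}   # shares the last 4 bits with t2
groups = band_buckets(sigs, b=2)
```

Two texts fall into a common cluster set as soon as any one of their b segments matches, which is what makes the scheme tolerant of small signature differences.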

如此,当采用第一预设聚类方式,通过采用哈希算法对文本进行降维处理,同时采用加权算法,不仅降低了计算的难度,提高了效率,同时也确保计算的精确性。In this way, when the first preset clustering method is adopted, the hash algorithm reduces the dimensionality of the text while the weighting algorithm is applied, which not only reduces the difficulty of calculation and improves efficiency, but also ensures the accuracy of the calculation.

在本发明可选的实施例中,在预设聚类方式为第二预设聚类方式时,如图4所示,采用预设聚类方式进行聚类处理,包括:In an alternative embodiment of the present invention, when the preset clustering mode is the second preset clustering mode, as shown in FIG. 4, the clustering process using the preset clustering mode includes:

步骤410:计算待处理文本集合中的任意两个文本之间的相似度。Step 410: Calculate the similarity between any two texts in the text set to be processed.

在本发明的可选实施例中,如上所述,待处理文本集合可以是所述原始文本集合或所述第一聚类集合,即本实施例的聚类处理是针对步骤110和/或步骤120中的聚类处理;也即根据所述第二预设聚类方式对原始文本集合进行第一次聚类处理或者是针对已进行聚类处理后的第一聚类集合进行第二次聚类处理中。In an optional embodiment of the present invention, as described above, the to-be-processed text set may be the original text set or the first cluster set, that is, the cluster processing in this embodiment is directed to step 110 and/or step The clustering process in 120; that is, the first clustering process is performed on the original text set according to the second preset clustering method or the second clustering is performed on the first clustering set after the clustering process has been performed Class processing.

基于预设的相似度算法,计算所述待处理文本集合中任意两个文本之间的相似度。Based on a preset similarity algorithm, the similarity between any two texts in the to-be-processed text set is calculated.

步骤420:基于相似度的计算结果对所述待处理文本集合中的文本进行聚类处理。Step 420: Perform clustering processing on the text in the to-be-processed text set based on the calculation result of the similarity.

在本发明的可选实施例中,根据预设的相似度分类算法,对上述得到的相似度的计算结果进行分类,得到符合相似度分类算法的聚类集合。In an optional embodiment of the present invention, according to a preset similarity classification algorithm, the calculation result of the similarity obtained above is classified to obtain a clustering set conforming to the similarity classification algorithm.
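Steps 410-420 can be sketched as threshold-based grouping over pairwise similarities, here using union-find for single-link merging. The character-set Jaccard measure below is a stand-in assumption; the patent's preset similarity algorithm (semantic plus syntactic similarity) is described later:

```python
def threshold_cluster(items, sim, threshold=0.8):
    # Single-link grouping: items whose pairwise similarity meets the
    # threshold end up in the same cluster (union-find with path halving).
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

def jaccard(a, b):
    # Stand-in similarity over character sets (illustrative assumption).
    return len(set(a) & set(b)) / len(set(a) | set(b))

out = threshold_cluster(["abcd", "abce", "xyz"], jaccard, threshold=0.5)
```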

在本发明的可选实施例中,预设的相似度算法具体包括语义相似度算法和句法相似度算法,其中,任意两个文本之间的语义相似度算法具体包括(说明书附图中未示出):In an optional embodiment of the present invention, the preset similarity algorithms specifically include a semantic similarity algorithm and a syntactic similarity algorithm, where the semantic similarity algorithm between any two texts specifically includes (not shown in the drawings of the specification):

步骤A:利用预设的语料集对待处理文本集合中的文本进行训练,得到所述文本的词向量矩阵。Step A: Use the preset corpus to train the text in the text set to be processed to obtain the word vector matrix of the text.

作为一种示例,可以采用Word2Vec方法进行词向量的训练,设定向量长度为d_w(可选的,将d_w设置为400),则经过训练得到的词向量模型为矩阵L ∈ R^(|V|×d_w),即具有|V|行d_w列的矩阵;V为语料集中所有词汇构成的词汇表,|V|为该词汇表中的词汇个数。若单词w在矩阵中的次序为m,则由该模型得到的词向量v(w)为矩阵的第m行的向量。As an example, the Word2Vec method can be used for word vector training, with the vector length set to d_w (optionally, d_w is set to 400). The model obtained by training is then a matrix L ∈ R^(|V|×d_w), that is, a matrix with |V| rows and d_w columns, where V is the vocabulary formed by all the words in the corpus and |V| is the number of words in that vocabulary. If the position of word w in the matrix is m, the word vector obtained from the model can be expressed as v(w), the vector in the m-th row of the matrix.

步骤B:针对任意两个文本,计算基于语义距离的语义相似度。Step B: For any two texts, calculate the semantic similarity based on the semantic distance.

步骤B1:对待处理文本集合中的文本进行预处理,得到所述文本的词序列。Step B1: Pre-process the text in the text set to be processed to obtain the word sequence of the text.

在本发明的可选实施例中,预处理的操作包括:对文本进行分词、去除停止词等。In an alternative embodiment of the present invention, the preprocessing operations include: word segmentation of the text, removal of stop words, and so on.

作为一种示例,选择任意两个文本t1和t2进行预处理,得到t1的词序列为[w¹_1, w¹_2, ..., w¹_m, ...],t2的词序列为[w²_1, w²_2, ..., w²_n, ...];其中,w¹_m为t1的第m个词,m为正整数;w²_n为t2的第n个词,n为正整数。As an example, select any two texts t1 and t2 for preprocessing; the word sequence of t1 is [w¹_1, w¹_2, ..., w¹_m, ...] and the word sequence of t2 is [w²_1, w²_2, ..., w²_n, ...], where w¹_m is the m-th word of t1 (m a positive integer) and w²_n is the n-th word of t2 (n a positive integer).

步骤B2:计算所述文本集合任意两个文本中对应的两个词的语义相似度。Step B2: Calculate the semantic similarity of the corresponding two words in any two texts in the text set.

在本发明的可选实施例中,词义相似度的计算公式(1)为:In an optional embodiment of the present invention, the calculation formula (1) of the word sense similarity is:

sim_cos(w_1, w_2) = (v(w_1) · v(w_2)) / (|v(w_1)| × |v(w_2)|)    (1)

其中,sim_cos(w_1, w_2)为词w_1和词w_2的词义相似度;v(w_1)为单词w_1的词向量;v(w_2)为单词w_2的词向量;|v(w_1)|为单词w_1的词向量的长度;|v(w_2)|为单词w_2的词向量的长度。Among them, sim_cos(w_1, w_2) is the word-sense similarity of the word w_1 and the word w_2; v(w_1) is the word vector of the word w_1; v(w_2) is the word vector of the word w_2; |v(w_1)| is the length of the word vector of the word w_1; |v(w_2)| is the length of the word vector of the word w_2.
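Formula (1) is the standard cosine similarity between two word vectors; a direct sketch:

```python
import math

def cosine_sim(v1, v2):
    # Formula (1): dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

s = cosine_sim([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```

Identical vectors give 1.0 and orthogonal vectors give 0.0, so the value ranges over [-1, 1] and is insensitive to vector length.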

在本发明可选的实施例中,t1和t2之间的语义相似度的计算公式(2)为:In an optional embodiment of the present invention, the formula (2) for calculating the semantic similarity between t1 and t2 is:

Figure PCTCN2019111747-appb-000010
Figure PCTCN2019111747-appb-000010

其中,

Figure PCTCN2019111747-appb-000011
为t1中词
Figure PCTCN2019111747-appb-000012
与t2的词义距离;
Figure PCTCN2019111747-appb-000013
为t2中词
Figure PCTCN2019111747-appb-000014
与t1的词义距离;根据公式(1)计算t1和t2之间任意两个词的词义相似度
Figure PCTCN2019111747-appb-000015
among them,
Figure PCTCN2019111747-appb-000011
Is the word in t1
Figure PCTCN2019111747-appb-000012
Meaning distance from t2;
Figure PCTCN2019111747-appb-000013
Is a word in t2
Figure PCTCN2019111747-appb-000014
The semantic distance from t1; calculate the semantic similarity of any two words between t1 and t2 according to formula (1)
Figure PCTCN2019111747-appb-000015

作为一种示例,根据公式(2),计算得到t1和t2之间的语义相似度score1。As an example, according to formula (2), the semantic similarity score1 between t1 and t2 is calculated.
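The image of formula (2) is not reproduced in this text, so the sketch below assumes one common reading of it: the word-to-text distance of each word is its best formula-(1) similarity against the other text's words, and score1 averages these values over both directions. The toy vectors are illustrative only:

```python
import math

def word_similarity(v1, v2):
    # formula (1): cosine similarity between two word vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def text_similarity(seq1, seq2, vec):
    # assumed reading of formula (2): average, over both directions, of each
    # word's best similarity against the words of the other text
    def word_to_text(w, other):
        return max(word_similarity(vec[w], vec[v]) for v in other)
    total = sum(word_to_text(w, seq2) for w in seq1) + \
            sum(word_to_text(w, seq1) for w in seq2)
    return total / (len(seq1) + len(seq2))

# Toy word vectors for two short word sequences t1 and t2
vec = {"network": [1.0, 0.0], "fault": [0.0, 1.0], "outage": [0.9, 0.1]}
score1 = text_similarity(["network", "fault"], ["outage"], vec)
```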

在本发明的可选实施例中,任意两个文本之间的句法相似度算法具体包括(说明书附图中未示出):In an alternative embodiment of the present invention, the syntax similarity algorithm between any two texts specifically includes (not shown in the drawings of the specification):

步骤A:对待处理文本集合中的文本进行预处理,得到所述文本的词序列。Step A: Pre-process the text in the text set to be processed to obtain the word sequence of the text.

在本发明的可选实施例中,预处理的操作包括:对文本进行分词、去除停止词等。In an alternative embodiment of the present invention, the preprocessing operations include: word segmentation of the text, removal of stop words, and so on.

作为一种示例,对文本text1和text2进行分词、去除停止词等操作,得到词序列t1和t2。As an example, the texts text1 and text2 are subjected to word segmentation, stop word removal and other operations to obtain word sequences t1 and t2.

步骤B:对所述文本对应的词序列进行依存句法分析,得到任意两个文本之间的句法相似度。Step B: Perform dependency syntax analysis on the word sequence corresponding to the text to obtain a syntactic similarity between any two texts.

步骤B1:采用预设的句法分析工具,对所述两个文本的词序列进行依存句法分析,得到所述两个文本之间的有效词搭配对的数量。Step B1: Use a preset syntax analysis tool to perform a dependency syntax analysis on the word sequence of the two texts to obtain the number of valid word collocation pairs between the two texts.

作为一种示例,采用斯坦福大学的自然语言处理开源包或复旦大学的自然语言处理开源包对t1和t2进行依存句法分析,计算得到t1和t2中有效词搭配对的数量,分别为p 1和p 2As an example, the natural language processing open source package of Stanford University or the natural language processing open source package of Fudan University is used to analyze the dependency syntax of t1 and t2, and the number of effective word collocation pairs in t1 and t2 is calculated, respectively p 1 and p 2 .

步骤B2:根据所述两个文本之间的有效词搭配对的数量,得到所述文本之间的句法相似度。Step B2: According to the number of valid word collocation pairs between the two texts, obtain the syntactic similarity between the texts.

在本发明的可选实施例中,根据计算公式(3)得到句法相似度为:In an optional embodiment of the present invention, the syntactic similarity according to the calculation formula (3) is:

score2=|p 1-p 2|     (3) score2=|p 1 -p 2 | (3)

作为一种示例,根据公式(3)对p 1和p 2进行计算,得到所述文本text1和text2的句法相似度score2。 As an example, p 1 and p 2 are calculated according to formula (3) to obtain the syntactic similarity score2 of the text1 and text2.

在本发明的可选实施例中,预设的相似度算法的步骤具体包括:In an optional embodiment of the present invention, the steps of the preset similarity algorithm specifically include:

步骤A:根据预设的语义相似度算法,得到任意两个文本之间的语义相似度。Step A: Obtain the semantic similarity between any two texts according to the preset semantic similarity algorithm.

步骤B:根据预设的句法相似度算法,得到任意两个文本之间的句法相似度。Step B: According to the preset syntactic similarity algorithm, the syntactic similarity between any two texts is obtained.

步骤C:基于计算得到的语义相似度和句法相似度,得到任意两个文本之间的相似度。Step C: Based on the calculated semantic similarity and syntactic similarity, a similarity between any two texts is obtained.

在本发明的可选实施例中，任意两个文本之间的相似度的计算公式(4)为：In an alternative embodiment of the present invention, the calculation formula (4) of the similarity between any two texts is:

score=α*score1+β*score2    (4)score=α*score1+β*score2 (4)

其中，score为任意两个文本之间的相似度；score1为任意两个文本之间的语义相似度；score2为任意两个文本之间的句法相似度；α为语义相似度的权重，0<=α<=1；β为句法相似度的权重，0<=β<=1；α+β=1。α和β的值可以根据计算需求进行设置，作为一种可选的示例，将α和β都设置为0.5。Among them, score is the similarity between any two texts; score1 is the semantic similarity between any two texts; score2 is the syntactic similarity between any two texts; α is the weight of the semantic similarity, 0<=α<=1; β is the weight of the syntactic similarity, 0<=β<=1; α+β=1. The values of α and β can be set according to calculation requirements; as an optional example, both α and β are set to 0.5.
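Formula (4) can be sketched directly; note that formula (3) defines score2 as an absolute difference of collocation-pair counts, so unlike score1 it is not bounded to [0, 1] — the weighting below simply follows the formulas as stated:

```python
def combined_similarity(score1, score2, alpha=0.5, beta=0.5):
    # formula (4): score = alpha * score1 + beta * score2,
    # with 0 <= alpha, beta <= 1 and alpha + beta = 1
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha + beta must equal 1")
    return alpha * score1 + beta * score2

score = combined_similarity(0.8, 0.4)  # alpha = beta = 0.5 -> 0.6
```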

在本发明的可选实施例中,预设的相似度分类算法包括:In an alternative embodiment of the present invention, the preset similarity classification algorithm includes:

步骤A:计算待处理文本集合中的两个第一文本之间的第一相似度,判断所述第一相似度是否超过第一预设阈值;所述第一文本为所述待处理文本集合中的任一文本。Step A: Calculate the first similarity between two first texts in the text set to be processed, and determine whether the first similarity exceeds a first preset threshold; the first text is the text set to be processed Any text in.

在本发明的可选实施例中,第一相似度即是待处理文本集合中任意两个文本之间的相似度,第一预设阈值可以根据计算需要进行设置,设置的范围可以为0至1。作为一种示例,将第一预设阈值设定为0.5。In an optional embodiment of the present invention, the first similarity is the similarity between any two texts in the text set to be processed, and the first preset threshold can be set according to calculation needs, and the setting range can be from 0 to 1. As an example, the first preset threshold is set to 0.5.

在本发明的可选实施例中,在步骤410中,基于上述预设的相似度算法,计算得到所述待处理文本集合中任意两个文本之间的相似度score。In an optional embodiment of the present invention, in step 410, based on the preset similarity algorithm, a similarity score between any two texts in the to-be-processed text set is calculated.

步骤B:当所述第一相似度超过所述第一预设阈值时,将所述两个第一文本划分至相同的聚类集合。Step B: When the first similarity exceeds the first preset threshold, divide the two first texts into the same clustering set.

在本发明的可选实施例中，当所述计算得到的score超过第一预设阈值时，将所述score对应的两个第一文本划分至同一个聚类集合中。当所述计算得到的score小于第一预设阈值时，则不将所述score对应的两个文本划分至同一个聚类集合中。即所述聚类集合中的任意两个文本之间的相似度超过第一预设阈值。In an optional embodiment of the present invention, when the calculated score exceeds the first preset threshold, the two first texts corresponding to the score are divided into the same clustering set. When the calculated score is less than the first preset threshold, the two texts corresponding to the score are not divided into the same clustering set. That is, the similarity between any two texts in the clustering set exceeds the first preset threshold.
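Steps A and B above can be sketched as follows; the requirement that every pair inside a cluster set exceeds the first preset threshold is enforced by checking a new text against all existing members. The similarity function and threshold are placeholders:

```python
def cluster_by_first_threshold(texts, similarity, threshold=0.5):
    # place two texts in the same cluster set only when their similarity
    # exceeds the first preset threshold; within each cluster set, every
    # pair therefore exceeds the threshold
    clusters = []
    for text in texts:
        for cluster in clusters:
            if all(similarity(text, member) > threshold for member in cluster):
                cluster.append(text)
                break
        else:
            clusters.append([text])  # no qualifying cluster: start a new one
    return clusters

# toy similarity: 1.0 when two strings share a first character, else 0.0
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
clusters = cluster_by_first_threshold(["apple", "avocado", "banana"], toy_sim)
```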

在本发明的可选实施例中,预设的相似度分类算法还可以包括:In an optional embodiment of the present invention, the preset similarity classification algorithm may further include:

步骤A:分别计算所述待处理文本集合中的第二文本与所述聚类集合中的第一文本之间的第二相似度,判断所述第二相似度是否超过所述第一预设阈值;所述第二文本为所述待处理文本中除所述第一文本外的其他文本。Step A: separately calculate a second similarity between the second text in the to-be-processed text set and the first text in the clustering set, and determine whether the second similarity exceeds the first preset Threshold; the second text is other text than the first text in the text to be processed.

在本发明的可选实施例中,第二相似度即是待处理文本集合中任意两个文本之间的相似度。所述聚类集合中的任意两个文本之间的相似度不小于第一预设阈值。In an optional embodiment of the present invention, the second similarity is the similarity between any two texts in the text set to be processed. The similarity between any two texts in the clustering set is not less than the first preset threshold.

在本发明的可选实施例中，在步骤410中，基于上述预设的相似度算法，计算得到所述待处理文本集合中第二文本与所述聚类集合中的任一文本之间的相似度，并判断所述多个相似度score是否都不小于第一预设阈值。In an optional embodiment of the present invention, in step 410, based on the preset similarity algorithm, the similarity between the second text in the to-be-processed text set and each text in the clustering set is calculated, and it is determined whether all of the multiple similarity scores are not less than the first preset threshold.

作为一种可选的示例,在步骤410中,基于上述预设的相似度算法,计算得到待处理文本集合中第二文本t 2与聚类集合P1中的文本

Figure PCTCN2019111747-appb-000016
(1<=i<=m,m,i皆为正整数)的对应的相似度score i 12,并判断所述相似度score i 12与第一预设阈值的大小关系。 As an optional example, in step 410, based on the preset similarity algorithm, the second text t 2 in the text set to be processed and the text in the cluster set P1 are calculated
Figure PCTCN2019111747-appb-000016
(1<=i<=m, m, i are all positive integers) corresponding similarity score i 12 , and determine the magnitude relationship between the similarity score i 12 and the first preset threshold.

步骤B：当所述第二相似度均超过所述第一预设阈值时，获得所述第二文本与所述聚类集合中的文本之间的第二相似度均值，判断所述第二相似度均值是否超过第二预设阈值；当所述第二相似度均值超过所述第二预设阈值时，将所述聚类集合确定为候选聚类集合。Step B: When the second similarities all exceed the first preset threshold, obtain the mean value of the second similarities between the second text and the texts in the cluster set, and determine whether the second similarity mean exceeds a second preset threshold; when the second similarity mean exceeds the second preset threshold, determine the cluster set as a candidate cluster set.

在本发明的可选实施例中,当所述多个相似度score都超过第一预设阈值时,将所述多个score相加并计算得到对应的相似度均值

Figure PCTCN2019111747-appb-000017
所述相似度均值即为第二相似度均值。判断所述第二相似度均值与第二预设阈值的大小关系,所述第二预设阈值可以根据计算需要进行设置,设置的范围可以为0至1。作为一种示例,将第二预设阈值设定为0.5。当所述第二相似度均值都超过第二预设阈值时,将所述聚类集合确定为候选聚类集合。 In an optional embodiment of the present invention, when the plurality of similarity scores all exceed the first preset threshold, the multiple scores are added and the corresponding average value of similarity is calculated
Figure PCTCN2019111747-appb-000017
The mean value of similarity is the second mean value of similarity. Judging the magnitude relationship between the second similarity average and the second preset threshold, the second preset threshold can be set according to calculation needs, and the setting range can be 0 to 1. As an example, the second preset threshold is set to 0.5. When all the average values of the second similarities exceed the second preset threshold, the cluster set is determined as a candidate cluster set.

作为一种示例,当相似度score i 12均超过第一预设阈值时,将所述多个score相加并计算得到对应的第二相似度均值

Figure PCTCN2019111747-appb-000018
判断所述第二相似度均值与第二预设阈值的大小关系,当所述第二相似度均值都超过第二预设阈值时,将所述聚类集合C确定为候选聚类集合。 As an example, when the similarity score i 12 all exceeds the first preset threshold, the multiple scores are added and the corresponding second average similarity value is calculated
Figure PCTCN2019111747-appb-000018
Judging the magnitude relationship between the second average similarity and the second preset threshold, and when all the second similarity averages exceed the second preset threshold, determine the cluster set C as a candidate cluster set.

步骤C:获得所述第二文本对应的所有候选聚类集合,将所述候选聚类集合中满足预设要求的候选聚类集合作为目标聚类集合,将所述第二文本划分至所述目标聚类集合。Step C: Obtain all candidate cluster sets corresponding to the second text, use the candidate cluster set that satisfies the preset requirements in the candidate cluster set as the target cluster set, and divide the second text into the Target cluster collection.

在本发明的可选实施例中,采用步骤A和B获得所述第二文本对应的所有候选聚类集合。根据预设要求在所述所有的候选集合中获得目标聚类集合,并将所述第二文本添加至所述目标聚类集合。In an optional embodiment of the present invention, steps A and B are used to obtain all candidate cluster sets corresponding to the second text. Obtain a target cluster set from all the candidate sets according to a preset requirement, and add the second text to the target cluster set.

在本发明的可选实施例中,预设要求可以是选择相似度均值最高的聚类集合作为目标聚类集合。In an alternative embodiment of the present invention, the preset requirement may be to select the cluster set with the highest average similarity as the target cluster set.

作为一种示例,采用步骤A和B获得所述第二文本对应的所有候选聚类集合分别为P1、P2…Pn,对应的第二相似度均值分别为

Figure PCTCN2019111747-appb-000019
其中,第二相似度均值最大的为
Figure PCTCN2019111747-appb-000020
根据预设要求,将候选聚类集合P1确定为目标聚类集合,将所述第二文本t 2加入至所述聚类集合P1中。 As an example, using steps A and B to obtain all candidate cluster sets corresponding to the second text are P1, P2...Pn, respectively, and the corresponding mean values of the second similarities are
Figure PCTCN2019111747-appb-000019
Among them, the largest mean value of the second similarity is
Figure PCTCN2019111747-appb-000020
According to a preset requirement, the candidate cluster set P1 is determined as the target cluster set, and the second text t 2 is added to the cluster set P1.
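Steps A to C can be sketched as follows: a cluster set qualifies as a candidate only when the second text's similarity to each member exceeds the first threshold and the mean exceeds the second threshold; among candidates, the highest mean (the preset requirement used in the example above) selects the target cluster set. The similarity function is a placeholder:

```python
def assign_to_target_cluster(text, clusters, similarity, th1=0.5, th2=0.5):
    best_cluster, best_mean = None, th2
    for cluster in clusters:
        scores = [similarity(text, member) for member in cluster]
        # steps A/B: every score must exceed the first threshold,
        # and the mean must exceed the second threshold
        if all(s > th1 for s in scores):
            mean = sum(scores) / len(scores)
            if mean > best_mean:            # step C: keep the highest mean
                best_cluster, best_mean = cluster, mean
    if best_cluster is not None:
        best_cluster.append(text)           # add the second text to the target
    return best_cluster

toy_sim = lambda a, b: 0.9 if a[0] == b[0] else 0.1
clusters = [["aa", "ab"], ["ba"]]
target = assign_to_target_cluster("ac", clusters, toy_sim)
```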

如此,当采用第二预设聚类方式,通过计算文本之间的语义相似度和句法相似度,并且可以设置参数的值,来满足不同的应用场景的需求,提高了聚类的精度。In this way, when the second preset clustering method is adopted, the semantic similarity and syntactic similarity between the texts are calculated, and the value of the parameter can be set to meet the needs of different application scenarios and improve the accuracy of clustering.

具体示例一Specific example one

当应用场景为个性推荐场景时,对聚类的要求为:效率要求高、精度要求低,则本发明的实施例提供的方法包括:When the application scenario is a personality recommendation scenario, the clustering requirements are: high efficiency requirements and low precision requirements, then the methods provided by the embodiments of the present invention include:

S101:选择第一预设聚类方法进行两次聚类处理,将系统中的信息文本聚类为多个第二聚类集合。S101: Select the first preset clustering method to perform two clustering processes, and cluster the information text in the system into multiple second clustering sets.

S102：根据用户属性和历史数据，匹配与该用户相关的第二聚类集合；如该第二聚类集合中存在用户的历史访问信息文本，或者该第二聚类集合的标签信息与该用户属性相匹配，具体方式由个性推荐场景定义。S102: Match the second cluster sets related to the user according to the user attributes and historical data; for example, the user's historical access information text exists in the second cluster set, or the label information of the second cluster set matches the user attributes. The specific method is defined by the personality recommendation scenario.

S103:将匹配的第二聚类集合中的所有信息当做推荐信息,返还给用户。S103: Treat all information in the matched second cluster set as recommendation information and return it to the user.

如此,使得上述应用可以准确快速实现个性化推荐,具有良好的可控性。In this way, the above-mentioned applications can accurately and quickly implement personalized recommendations with good controllability.

具体示例二Specific example two

当应用场景为智能问答场景时,对聚类的要求为:效率要求低、精度要求高,则本发明的实施例提供的方法包括:When the application scenario is an intelligent question and answer scenario, the clustering requirements are: low efficiency requirements and high precision requirements, then the methods provided by the embodiments of the present invention include:

S201:选择第一预设聚类方法进行两次聚类处理,将系统中的问答库聚类为多个第二聚类集合。S201: Select the first preset clustering method to perform two clustering processes, and cluster the question and answer library in the system into multiple second clustering sets.

S202:对多个第二聚类集合中所有样本进行分析,选择该第二聚类集合的推荐样本,具体推荐方案由智能问答场景定义。S202: Analyze all samples in multiple second cluster sets, and select the recommended samples of the second cluster set. The specific recommendation scheme is defined by the intelligent question and answer scenario.

S203:为多个第二聚类集合中的推荐样本配置一条标准答案,从而将该多个第二聚类集合中的每条样本与该标准答案组成问答对,置入问答库。S203: Configure a standard answer for the recommended samples in the multiple second cluster sets, so that each sample in the multiple second cluster sets and the standard answer form a question and answer pair, and place them in the question and answer database.

如此,使得上述应用可以自动准确地扩展智能问答的问答库,避免大量人工操作。In this way, the above application can automatically and accurately expand the question and answer library of intelligent question and answer, avoiding a lot of manual operations.

具体示例三Specific example three

当应用场景为信息检索场景时,对聚类的要求为:效率要求高、精度要求高,则本发明的实施例提供的方法包括:When the application scenario is an information retrieval scenario, the clustering requirements are: high efficiency requirements and high precision requirements, then the methods provided by the embodiments of the present invention include:

S301：选择第一预设聚类方法和第二预设聚类方法进行聚类处理，将系统中的索引文本聚类为多个第二聚类集合。S301: Select the first preset clustering method and the second preset clustering method for clustering processing, and cluster the index texts in the system into multiple second cluster sets.

S302：收到外部检索请求时，将请求信息逐一和第二聚类集合中的一个或多个样本进行匹配，得到相匹配的第二聚类集合；具体匹配方式由具体的信息检索场景定义。S302: When an external retrieval request is received, the request information is matched one by one with one or more samples in the second cluster sets to obtain the matching second cluster set; the specific matching method is defined by the specific information retrieval scenario.

S303：将匹配的第二聚类集合中所有样本与该外部检索请求进行逐一匹配，匹配成功的样本作为检索结果返回用户。S303: Match all samples in the matched second cluster set one by one against the external retrieval request, and return the successfully matched samples to the user as the retrieval result.

如此,使得上述应用可以准确快速实现信息检索,避免大规模索引信息逐一检索的计算开销。In this way, the above applications can accurately and quickly realize information retrieval, and avoid the computational overhead of large-scale index information retrieval one by one.

具体示例四Specific example four

当应用场景为数字图书馆场景时,对聚类的要求为:效率要求低、精度要求低,则本发明的实施例提供的方法包括:When the application scenario is a digital library scenario, the clustering requirements are: low efficiency requirements and low precision requirements, then the methods provided by the embodiments of the present invention include:

S401：可随意搭配两种聚类方式，如，先选择第一预设聚类方式，然后选择第二预设聚类方式，对数字图书馆中的所有资料样本进行聚类处理，得到多个第二聚类集合。S401: The two clustering methods can be combined freely; for example, first select the first preset clustering method and then the second preset clustering method, and cluster all the material samples in the digital library to obtain multiple second cluster sets.

S402：利用主题模型等方法，得到每个第二聚类集合的标签信息。S402: Use methods such as topic models to obtain the label information of each second cluster set.

S403:第二聚类集合的样本使用该集合的标签信息,由于一个样本可能存在于多个第二聚类集合中,从而每个资料样本拥有了多个标签信息。S403: The samples of the second cluster set use the label information of the set. Since one sample may exist in multiple second cluster sets, each data sample has multiple label information.

如此,使得上述应用可以为大规模资料文本添加标签信息,如为图书添加类别信息,从而方便数字图书馆的系统管理。In this way, the above-mentioned applications can add tag information to large-scale material texts, such as adding category information to books, thereby facilitating the system management of the digital library.

本发明实施例还提供了一种信息处理装置,如图5所示,所述信息处理装置500包括:An embodiment of the present invention also provides an information processing apparatus. As shown in FIG. 5, the information processing apparatus 500 includes:

第一聚类器510,用于对原始文本集合中的文本采用预设聚类方式进行聚类处理,获得多个第一聚类集合。The first clusterer 510 is configured to perform clustering processing on the text in the original text set in a preset clustering manner to obtain multiple first cluster sets.

第二聚类器520,用于对每个第一聚类集合中的文本采用预设聚类方式进行聚类处理,获得多个第二聚类集合。The second clusterer 520 is configured to perform clustering processing on the text in each first cluster set by using a preset clustering method to obtain multiple second cluster sets.

其中,所述预设聚类方式为第一预设聚类方式或第二预设聚类方式。Wherein, the preset clustering mode is the first preset clustering mode or the second preset clustering mode.

在一实施例中,所述第一预设聚类方式和所述第二预设聚类方式基于文本聚类的效率要求和/或精度要求确定。In an embodiment, the first preset clustering mode and the second preset clustering mode are determined based on the efficiency requirements and/or precision requirements of text clustering.

在一实施例中，所述第一聚类器510或者第二聚类器520，用于提取待处理文本集合中每个文本，将每个文本表示为签名向量；对每个文本的签名向量进行分段处理，获得多个签名向量分段，对每个签名向量分段进行聚类处理。In an embodiment, the first clusterer 510 or the second clusterer 520 is configured to extract each text in the to-be-processed text set and represent each text as a signature vector; segment the signature vector of each text to obtain multiple signature vector segments; and perform clustering processing on each signature vector segment.

在一实施例中，所述第一聚类器510或者第二聚类器520，用于获得文本的词序列，以及获得所述词序列中每个词的权重，基于获得所述词序列中每个词的权重获得带权词序列；将所述带权词序列进行哈希运算获得带权哈希值序列；对所述带权哈希值序列中的每个带权哈希值进行合并处理，获得对应于所述文本的加权哈希值；对所述加权哈希值进行二值化处理，获得二进制签名向量。In an embodiment, the first clusterer 510 or the second clusterer 520 is configured to obtain the word sequence of a text and the weight of each word in the word sequence, and obtain a weighted word sequence based on the weight of each word in the word sequence; perform a hash operation on the weighted word sequence to obtain a weighted hash value sequence; merge the weighted hash values in the weighted hash value sequence to obtain a weighted hash value corresponding to the text; and binarize the weighted hash value to obtain a binary signature vector.

在一实施例中，所述第一聚类器510或者第二聚类器520，用于将每个文本的二进制签名向量进行分段处理，获得多个二进制签名向量分段；对每个二进制签名向量分段进行哈希运算处理，获得二进制签名向量分段对应的哈希值；将对应的哈希值相同的文本划分至相同的聚类集合中。In an embodiment, the first clusterer 510 or the second clusterer 520 is configured to segment the binary signature vector of each text to obtain multiple binary signature vector segments; perform a hash operation on each binary signature vector segment to obtain the hash value corresponding to the binary signature vector segment; and divide texts with the same corresponding hash value into the same clustering set.
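The signature-vector pipeline of the first preset clustering method (weighted hashing, merging, binarization, then segment-wise matching) is essentially a simhash-plus-banding scheme. A sketch under assumed parameters — 64-bit signatures, 4 segments, MD5 as the word hash; none of these are fixed by the text:

```python
import hashlib
from collections import defaultdict

BITS = 64   # assumed signature width
BANDS = 4   # assumed number of signature segments

def simhash(weighted_words):
    # weighted word sequence -> merged weighted hash -> binary signature vector
    acc = [0.0] * BITS
    for word, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight  # merge step
    return sum(1 << i for i in range(BITS) if acc[i] > 0)  # binarization

def band_buckets(signatures):
    # hash each signature segment; texts whose segment hashes collide
    # fall into the same cluster bucket
    width = BITS // BANDS
    buckets = defaultdict(set)
    for text_id, sig in signatures.items():
        for b in range(BANDS):
            segment = (sig >> (b * width)) & ((1 << width) - 1)
            buckets[(b, segment)].add(text_id)
    return buckets

sigs = {"t1": simhash([("network", 2.0), ("fault", 1.0)]),
        "t2": simhash([("network", 2.0), ("fault", 1.0)])}
buckets = band_buckets(sigs)  # identical texts share every segment bucket
```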

在一实施例中,所述第一聚类器510或者第二聚类器520,用于计算待处理文本集合中的任意两个文本之间的相似度,基于相似度的计算结果对所述待处理文本集合中的文本进行聚类处理。In an embodiment, the first clusterer 510 or the second clusterer 520 is used to calculate the similarity between any two texts in the set of text to be processed. The text in the text collection to be processed is clustered.

在一实施例中，所述第一聚类器510或者第二聚类器520，用于计算待处理文本集合中的两个第一文本之间的第一相似度，判断所述第一相似度是否超过第一预设阈值；所述第一文本为所述待处理文本集合中的任一文本；当所述第一相似度超过所述第一预设阈值时，将所述两个第一文本划分至相同的聚类集合。In an embodiment, the first clusterer 510 or the second clusterer 520 is configured to calculate the first similarity between two first texts in the to-be-processed text set, and determine whether the first similarity exceeds a first preset threshold, the first text being any text in the to-be-processed text set; when the first similarity exceeds the first preset threshold, the two first texts are divided into the same clustering set.

在一实施例中，所述第一聚类器510或者第二聚类器520，用于分别计算所述待处理文本集合中的第二文本与所述聚类集合中的第一文本之间的第二相似度，判断所述第二相似度是否均超过所述第一预设阈值；所述第二文本为所述待处理文本中除所述第一文本外的其他文本；当所述第二相似度均超过所述第一预设阈值时，获得所述第二文本与所述聚类集合中的第一文本之间的第二相似度均值，判断所述第二相似度均值是否超过第二预设阈值；当所述第二相似度均值超过所述第二预设阈值时，将所述聚类集合确定为候选聚类集合；获得所述第二文本对应的所有候选聚类集合，将所述候选聚类集合中满足预设要求的候选聚类集合作为目标聚类集合，将所述第二文本划分至所述目标聚类集合。In an embodiment, the first clusterer 510 or the second clusterer 520 is configured to separately calculate the second similarities between the second text in the to-be-processed text set and the first texts in the cluster set, and determine whether the second similarities all exceed the first preset threshold, the second text being any text in the to-be-processed texts other than the first texts; when the second similarities all exceed the first preset threshold, obtain the mean value of the second similarities between the second text and the first texts in the cluster set, and determine whether the second similarity mean exceeds a second preset threshold; when the second similarity mean exceeds the second preset threshold, determine the cluster set as a candidate cluster set; obtain all candidate cluster sets corresponding to the second text, take the candidate cluster set that satisfies the preset requirement among the candidate cluster sets as the target cluster set, and divide the second text into the target cluster set.

在一实施例中,所述待处理文本集合为所述原始文本集合或所述第一聚类集合。In an embodiment, the to-be-processed text set is the original text set or the first cluster set.

如此，通过预设聚类方式进行两次聚类处理，并且在后的聚类处理是针对第一聚类集合中的文本再次进行聚类处理，采用层次化的聚类处理方式，一方面避免了出现聚类结果不一致的情况，另一方面大大提升了聚类精度和聚类效率。另外，可以根据聚类的需求，选择不同的聚类方法的组合，使得聚类处理更加灵活有效，可以满足不同应用场景的需求。In this way, clustering is performed twice using the preset clustering methods, and the latter clustering processes the texts in the first cluster sets again. This hierarchical clustering approach avoids inconsistent clustering results on the one hand, and greatly improves clustering accuracy and efficiency on the other. In addition, different combinations of clustering methods can be selected according to clustering requirements, making the clustering processing more flexible and effective and able to meet the needs of different application scenarios.

本发明的装置实施例参照上述本发明的方法实施例。For the device embodiments of the present invention, refer to the method embodiments of the present invention described above.

上述发明实施例中,所述信息处理装置500中的第一聚类器510和第二聚类器520在实际应用中均可由CPU、DSP、MCU或FPGA实现。In the above embodiments of the invention, the first clusterer 510 and the second clusterer 520 in the information processing device 500 can be implemented by a CPU, DSP, MCU, or FPGA in practical applications.

需要说明的是:上述实施例提供的信息处理装置在进行信息处理时,仅以上述各程序模块的划分进行举例说明,实际应用中,可以根据需要而将上述处理分配由不同的程序模块完成,即将装置的内部结构划分成不同的程序模块,以完成以上描述的全部或者部分处理。另外,上述实施例提供的信息处理装置与信息处理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that when the information processing apparatus provided in the above embodiment performs information processing, only the above-mentioned division of each program module is used as an example for illustration. In practical applications, the above-mentioned processing may be allocated by different program modules according to needs. That is, the internal structure of the device is divided into different program modules to complete all or part of the processing described above. In addition, the information processing apparatus and the information processing method embodiment provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiment, and details are not described here.

本发明实施例还提供了一种计算机可读存储介质,其上存储有可执行程序,所述可执行程序被处理器执行时实现上述任一信息处理方法。An embodiment of the present invention also provides a computer-readable storage medium on which an executable program is stored, and when the executable program is executed by a processor, any of the foregoing information processing methods is implemented.

本发明实施例还提供了一种信息处理装置，该装置包括：处理器和用于存储能够在处理器上运行的计算机程序的存储器，其中，所述处理器用于运行所述计算机程序时，执行本发明实施例实现的任一信息处理方法。An embodiment of the present invention also provides an information processing apparatus, including: a processor and a memory for storing a computer program that can be run on the processor, wherein, when running the computer program, the processor executes any of the information processing methods implemented by the embodiments of the present invention.

可以理解,存储器可以由任何类型的易失性或非易失性存储设备、或者它们的组合来实现。其中,非易失性存储器可以是只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read-Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read-Only Memory)、电可擦除可编程只读存储器(EEPROM,Electrically Erasable Programmable Read-Only Memory)、磁性随机存取存储器(FRAM,Ferromagnetic Random Access Memory)、快闪存储器(Flash Memory)、磁表面存储器、光盘、或只读光盘(CD-ROM,Compact Disc Read-Only Memory);磁表面存储器可以是磁盘存储器或磁带存储器。易失性存储器可以是随机存取存储器(RAM,Random Access Memory),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(SRAM,Static Random Access Memory)、同步静态随机存取存储器(SSRAM,Synchronous Static Random Access Memory)、动态随机存取存储器(DRAM,Dynamic Random Access Memory)、同步动态随机存取存储器(SDRAM,Synchronous Dynamic Random Access Memory)、双倍数据速率同步动态随机存取存储器(DDRSDRAM,Double Data Rate Synchronous Dynamic Random Access Memory)、增强型同步动态随机存取存储器(ESDRAM,Enhanced Synchronous Dynamic Random Access Memory)、同步连接动态随机存取存储器(SLDRAM,SyncLink Dynamic Random Access  Memory)、直接内存总线随机存取存储器(DRRAM,Direct Rambus Random Access Memory)。本发明实施例描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。It is understood that the memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. Among them, the non-volatile memory can be read-only memory (ROM, Read Only Memory), programmable read-only memory (PROM, Programmable Read-Only Memory), erasable programmable read-only memory (EPROM, Erasable Programmable Read- Only Memory), Electrically Erasable Programmable Read Only Memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), Magnetic Random Access Memory (FRAM, Ferromagnetic Random Access Memory), Flash Memory (Flash Memory), Magnetic Surface Memory , Compact disc, or read-only compact disc (CD-ROM, Compact, Read-Only Memory); the magnetic surface memory can be a disk storage or a tape storage. The volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache. 
By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synclink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memories described in the embodiments of the present invention are intended to include but are not limited to these and any other suitable types of memories.

上述本发明实施例揭示的方法可以应用于处理器中,或者由处理器实现。处理器可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、DSP,或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。处理器可以实现或者执行本发明实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本发明实施例所公开的方法的步骤,可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于存储介质中,该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成前述方法的步骤。The method disclosed in the foregoing embodiments of the present invention may be applied to a processor, or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software. The aforementioned processor may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The processor may implement or execute the disclosed methods, steps, and logical block diagrams in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be directly implemented and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium. The storage medium is located in a memory. The processor reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.

在实施例中，信息处理装置可以被一个或多个应用专用集成电路(ASIC，Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD，Programmable Logic Device)、复杂可编程逻辑器件(CPLD，Complex Programmable Logic Device)、FPGA、通用处理器、控制器、MCU、微处理器(Microprocessor)、或其他电子元件实现，用于执行前述方法。In an embodiment, the information processing apparatus may be implemented by one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), FPGAs, general-purpose processors, controllers, MCUs, microprocessors (Microprocessor), or other electronic components, configured to perform the foregoing methods.

在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are only schematics. For example, the division of the units is only a division of logical functions. In actual implementation, there may be other division methods, such as: multiple units or components may be combined, or Can be integrated into another system, or some features can be ignored, or not implemented. In addition, the coupling or direct coupling or communication connection between the displayed or discussed components may be through some interfaces, and the indirect coupling or communication connection of the device or unit may be electrical, mechanical, or other forms of.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes any medium that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.

Alternatively, if the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

1. An information processing method, comprising: clustering the texts in an original text set by a preset clustering mode to obtain a plurality of first cluster sets; and clustering the texts in each first cluster set by a preset clustering mode to obtain a plurality of second cluster sets; wherein the preset clustering mode is a first preset clustering mode or a second preset clustering mode.

2. The method according to claim 1, wherein the first preset clustering mode and the second preset clustering mode are determined based on an efficiency requirement and/or a precision requirement of text clustering.

3. The method according to claim 1, wherein clustering by a preset clustering mode comprises: extracting each text in a text set to be processed and representing each text as a signature vector; and segmenting the signature vector of each text to obtain a plurality of signature vector segments, and clustering each signature vector segment.

4. The method according to claim 3, wherein representing each text as a signature vector comprises: obtaining a word sequence of the text and a weight of each word in the word sequence, and obtaining a weighted word sequence based on the weight of each word; performing a hash operation on the weighted word sequence to obtain a weighted hash value sequence; merging the weighted hash values in the weighted hash value sequence to obtain a weighted hash value corresponding to the text; and binarizing the weighted hash value to obtain a binary signature vector.

5. The method according to claim 4, wherein segmenting the signature vector of each text to obtain a plurality of signature vector segments and clustering each signature vector segment comprises: segmenting the binary signature vector of each text to obtain a plurality of binary signature vector segments; performing a hash operation on each binary signature vector segment to obtain a hash value corresponding to the segment; and assigning texts whose corresponding hash values are identical to the same cluster set.

6. The method according to claim 1, wherein clustering by a preset clustering mode comprises: computing the similarity between any two texts in a text set to be processed, and clustering the texts in the text set to be processed based on the computed similarities.

7. The method according to claim 6, wherein clustering the texts in the text set to be processed based on the computed similarities comprises: computing a first similarity between two first texts in the text set to be processed, and determining whether the first similarity exceeds a first preset threshold, a first text being any text in the text set to be processed; and when the first similarity exceeds the first preset threshold, assigning the two first texts to the same cluster set.

8. The method according to claim 7, further comprising: computing second similarities between a second text in the text set to be processed and each first text in the cluster set, and determining whether all the second similarities exceed the first preset threshold, the second text being a text in the text set to be processed other than the first texts; when all the second similarities exceed the first preset threshold, obtaining the mean of the second similarities between the second text and the first texts in the cluster set, and determining whether the mean exceeds a second preset threshold; when the mean exceeds the second preset threshold, determining the cluster set as a candidate cluster set; and obtaining all candidate cluster sets corresponding to the second text, taking the candidate cluster set that satisfies a preset requirement as the target cluster set, and assigning the second text to the target cluster set.

9. The method according to any one of claims 3 to 8, wherein the text set to be processed is the original text set or a first cluster set.

10. An information processing apparatus, comprising: a first clusterer configured to cluster the texts in an original text set by a preset clustering mode to obtain a plurality of first cluster sets; and a second clusterer configured to cluster the texts in each first cluster set by a preset clustering mode to obtain a plurality of second cluster sets; wherein the preset clustering mode is a first preset clustering mode or a second preset clustering mode.

11. The apparatus according to claim 10, wherein the first preset clustering mode and the second preset clustering mode are determined based on an efficiency requirement and/or a precision requirement of text clustering.

12. The apparatus according to claim 10, wherein the first clusterer or the second clusterer is configured to extract each text in a text set to be processed, represent each text as a signature vector, segment the signature vector of each text to obtain a plurality of signature vector segments, and cluster each signature vector segment.

13. The apparatus according to claim 12, wherein the first clusterer or the second clusterer is configured to obtain a word sequence of the text and a weight of each word in the word sequence, obtain a weighted word sequence based on the weight of each word, perform a hash operation on the weighted word sequence to obtain a weighted hash value sequence, merge the weighted hash values in the sequence to obtain a weighted hash value corresponding to the text, and binarize the weighted hash value to obtain a binary signature vector.

14. The apparatus according to claim 13, wherein the first clusterer or the second clusterer is configured to segment the binary signature vector of each text to obtain a plurality of binary signature vector segments, perform a hash operation on each segment to obtain a hash value corresponding to the segment, and assign texts whose corresponding hash values are identical to the same cluster set.

15. The apparatus according to claim 10, wherein the first clusterer or the second clusterer is configured to compute the similarity between any two texts in a text set to be processed and cluster the texts in the text set to be processed based on the computed similarities.

16. The apparatus according to claim 15, wherein the first clusterer or the second clusterer is configured to compute a first similarity between two first texts in the text set to be processed, determine whether the first similarity exceeds a first preset threshold, a first text being any text in the text set to be processed, and, when the first similarity exceeds the first preset threshold, assign the two first texts to the same cluster set.

17. The apparatus according to claim 16, wherein the first clusterer or the second clusterer is configured to compute second similarities between a second text in the text set to be processed and each first text in the cluster set and determine whether all the second similarities exceed the first preset threshold, the second text being a text in the text set to be processed other than the first texts; when all the second similarities exceed the first preset threshold, obtain the mean of the second similarities and determine whether it exceeds a second preset threshold; when the mean exceeds the second preset threshold, determine the cluster set as a candidate cluster set; and obtain all candidate cluster sets corresponding to the second text, take the candidate cluster set that satisfies a preset requirement as the target cluster set, and assign the second text to the target cluster set.

18. The apparatus according to any one of claims 12 to 17, wherein the text set to be processed is the original text set or a first cluster set.

19. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.

20. An information processing apparatus, comprising a processor and a memory for storing a computer program runnable on the processor, wherein the processor is configured to execute the steps of the method according to any one of claims 1 to 9 when running the computer program.
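The signature-vector construction recited in claims 4 and 5 matches the well-known SimHash locality-sensitive hashing scheme: weighted per-bit tallies of word hashes are merged and binarized, and texts sharing any signature segment fall into the same coarse cluster. The sketch below is one possible reading in Python, not the patented implementation; the 64-bit signature width, the MD5 word hash, and the four-way segmentation are illustrative assumptions not taken from the claims.

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Claim 4 (as interpreted): hash each (word, weight) pair, keep a
    weighted tally per bit position, then binarize the tally."""
    tally = [0] * bits
    for word, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Merge step: add the weight where the bit is 1, subtract it
            # where the bit is 0.
            tally[i] += weight if (h >> i) & 1 else -weight
    # Binarization: positive tally -> bit 1, otherwise bit 0.
    return sum(1 << i for i, t in enumerate(tally) if t > 0)

def bucket_by_segment(signatures, bits=64, segments=4):
    """Claim 5 (as interpreted): split each signature into equal
    segments; texts sharing a segment value share a bucket."""
    seg_bits = bits // segments
    mask = (1 << seg_bits) - 1
    buckets = {}
    for text_id, sig in signatures.items():
        for s in range(segments):
            key = (s, (sig >> (s * seg_bits)) & mask)
            buckets.setdefault(key, set()).add(text_id)
    return buckets
```

Texts whose weighted word sequences are similar tend to collide on at least one segment, which is what makes this usable as the coarse first-stage clustering of claim 3.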
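Claims 7 and 8 describe the finer, similarity-threshold stage: two texts seed a cluster when their similarity exceeds a first threshold, and a further text joins a cluster only if its similarity to every member exceeds that threshold and the mean similarity exceeds a second threshold. A hedged Python sketch follows; the claims leave the "preset requirement" for picking among candidate clusters unspecified, so selecting the candidate with the highest mean similarity is an assumption of this sketch.

```python
def cluster_by_similarity(texts, sim, t1, t2):
    """Assign each text to a cluster per the two-threshold rule of
    claims 7-8. `texts` maps id -> content, `sim(a, b)` returns a
    similarity score, t1 is the per-pair threshold, t2 the
    mean-similarity threshold."""
    clusters = []  # each cluster is a list of text ids
    for tid, body in texts.items():
        candidates = []
        for cluster in clusters:
            sims = [sim(body, texts[m]) for m in cluster]
            # Every pairwise similarity must exceed t1 ...
            if all(s > t1 for s in sims):
                mean = sum(sims) / len(sims)
                # ... and the mean similarity must exceed t2.
                if mean > t2:
                    candidates.append((mean, cluster))
        if candidates:
            # Assumed "preset requirement": highest mean similarity.
            max(candidates, key=lambda mc: mc[0])[1].append(tid)
        else:
            clusters.append([tid])  # seed a new cluster
    return clusters
```

Run against a toy similarity such as `sim = lambda a, b: 1.0 - abs(a - b)` over numeric "texts", nearby values end up in one cluster and distant values in another, mirroring the claimed behavior.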
PCT/CN2019/111747 2018-12-06 2019-10-17 Information processing method and apparatus, and computer storage medium Ceased WO2020114100A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811488236.9A CN111291177B (en) 2018-12-06 2018-12-06 Information processing method, device and computer storage medium
CN201811488236.9 2018-12-06

Publications (1)

Publication Number Publication Date
WO2020114100A1

Family

ID=70974940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111747 Ceased WO2020114100A1 (en) 2018-12-06 2019-10-17 Information processing method and apparatus, and computer storage medium

Country Status (2)

Country Link
CN (1) CN111291177B (en)
WO (1) WO2020114100A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753060A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Information retrieval method, device, equipment and computer readable storage medium
CN113627561A (en) * 2021-08-20 2021-11-09 北京软通智慧科技有限公司 Data fusion method and device, electronic equipment and storage medium
CN113673215A (en) * 2021-07-13 2021-11-19 北京搜狗科技发展有限公司 Text abstract generation method and device, electronic equipment and readable medium
CN113704465A (en) * 2021-07-21 2021-11-26 大箴(杭州)科技有限公司 Text clustering method and device, electronic equipment and storage medium
CN114138972A (en) * 2021-11-30 2022-03-04 深圳集智数字科技有限公司 Text type identification method and device
CN114840675A (en) * 2022-05-17 2022-08-02 中国工商银行股份有限公司 Data processing method, device and computer program product
CN115186138A (en) * 2022-06-20 2022-10-14 国网福建省电力有限公司经济技术研究院 Comparison method and terminal for power distribution network data
CN115297189A (en) * 2022-07-29 2022-11-04 浙江树人学院 A fast industrial control protocol reverse analysis method and system for human-machine collaboration
CN115982347A (en) * 2021-10-14 2023-04-18 厦门快商通科技股份有限公司 A method for quality inspection of labeled data, terminal equipment, and storage medium
CN116089232A (en) * 2021-10-29 2023-05-09 中移(苏州)软件技术有限公司 User operation track analysis method, device, equipment and storage medium
CN118467724A (en) * 2024-07-15 2024-08-09 北京熵简科技有限公司 Abstract generation method and system based on financial big model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111177B (en) * 2021-03-02 2024-04-05 中信百信银行股份有限公司 Text data labeling method, device, electronic equipment and storage medium
CN113254640A (en) * 2021-05-27 2021-08-13 北京宝兰德软件股份有限公司 Work order data processing method and device, electronic equipment and storage medium
CN113808578B (en) * 2021-11-16 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Audio signal processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101989289A (en) * 2009-08-06 2011-03-23 富士通株式会社 Data clustering method and device
US20110231399A1 (en) * 2009-11-10 2011-09-22 Alibaba Group Holding Limited Clustering Method and System
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method
CN108304502A (en) * 2018-01-17 2018-07-20 中国科学院自动化研究所 Quick hot spot detecting method and system based on magnanimity news data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8510308B1 (en) * 2009-06-16 2013-08-13 Google Inc. Extracting semantic classes and instances from text
CN105354264B (en) * 2015-10-23 2018-08-03 华建宇通科技(北京)有限责任公司 A kind of quick adding method of theme label based on local sensitivity Hash
CN107291895B (en) * 2017-06-21 2020-05-26 浙江大学 A Fast Hierarchical Document Query Method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIANG BI ET AL: "A New Text Clustering Method Based on Semantic Similarity", NEW TECHNOLOGY OF LIBRARY AND INFORMATION SERVICE, vol. 32, no. 12, 31 December 2016 (2016-12-31), pages 9 - 16, XP055713209, DOI: 10.11925/infotech.1003-3513.2016.12.02 *
YONG WU ET AL: "Research on Chinese Text Clustering Algorithm Based on Context and Semantics", SCIENCE & TECHNOLOGY INFORMATION, no. 35, 31 December 2010 (2010-12-31), pages 671 - 672, XP009521583 *


Also Published As

Publication number Publication date
CN111291177B (en) 2024-08-02
CN111291177A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
US12259917B2 (en) Method of retrieving document and apparatus for retrieving document
CN112015900B (en) Medical attribute knowledge graph construction method, device, equipment and medium
US11048966B2 (en) Method and device for comparing similarities of high dimensional features of images
CN104169948B (en) Method, device and product for text semantic processing
WO2020244437A1 (en) Image processing method and apparatus, and computer device
CN112149410A (en) Semantic recognition method, apparatus, computer equipment and storage medium
CN110321537B (en) Method and device for generating file
JPWO2014033799A1 (en) Word semantic relation extraction device
CN114330335B (en) Keyword extraction method, device, equipment and storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
JP2009537901A (en) Annotation by search
CN105469096A (en) Feature bag image retrieval method based on Hash binary code
CN110083832B (en) Recognition method, device, device and readable storage medium for article reprint relationship
CN107169021A (en) Method and apparatus for predicting application feature labels
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN115982144A (en) Similar text duplicate removal method and device, storage medium and electronic device
CN120011535A (en) A retrieval enhancement generation method based on multi-level semantics
CN112149424A (en) Semantic matching method, apparatus, computer equipment and storage medium
CN110851560B (en) Information retrieval method, device and equipment
CN112287134B (en) Retrieval model training and recognition method, electronic device and storage medium
CN104484418B (en) A kind of characteristic quantification method and system based on dual resolution design

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19892648

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/10/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19892648

Country of ref document: EP

Kind code of ref document: A1