WO2023173543A1

WO2023173543A1 - Data classification model training method and apparatus, classification method and apparatus, device, and medium

Info

Publication number: WO2023173543A1
Application number: PCT/CN2022/090105
Authority: WO
Inventors: 王彦; 马骏; 王少军
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-14
Filing date: 2022-04-29
Publication date: 2023-09-21
Anticipated expiration: 2024-09-14
Also published as: CN114662580A; CN114662580B

Abstract

The present application relates to a data classification model training method and apparatus, a classification method and apparatus, a device, and a medium. The training method comprises: dividing a plurality of data samples into a minority class sample set and a majority class sample set; undersampling from the majority class sample set to obtain an undersampling set; performing first iterative training on a classification model on the basis of a training set composed of the minority class sample set and the undersampling set to obtain a classification model meeting a first preset condition; if the model does not meet a second preset condition, performing oversampling on the minority class sample set on the basis of the model, and adding obtained samples into the training set; and performing second iterative training on the model on the basis of the updated training set to obtain a data classification model meeting the second preset condition. According to the training method in the present application, data obtained by means of undersampling and data obtained by means of oversampling are used for training the classification model, data used for training the classification model has good balance, a good training effect is achieved, and the classification accuracy of the trained classification model is high.

Description

Training methods, classification methods, devices, equipment and media for data classification models

优先权申明priority statement

本申请要求于2022年3月14日提交中国专利局、申请号为202210248165.5，发明名称为“数据分类模型的训练方法、分类方法、装置、设备和介质”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application requires the priority of the Chinese patent application submitted to the China Patent Office on March 14, 2022, with the application number 202210248165.5, and the invention name is "Training method, classification method, device, equipment and medium for data classification model", all of which The contents are incorporated into this application by reference.

Technical field

本申请涉及人工智能领域，特别是涉及数据分类模型的训练方法、数据分类方法、装置、计算机设备和存储介质。This application relates to the field of artificial intelligence, and in particular to training methods of data classification models, data classification methods, devices, computer equipment and storage media.

Background technique

数据分类问题是机器学习领域最常见的问题之一。现有的常用分类模型例如有逻辑回归算法模型、k最近邻算法模型、决策树算法模型和支持向量机算法模型等等。随着机器学习算法应用在越来越多的应用场景中，分类模型的应用出现了一些问题，其中，由于不均衡数据对分类模型的训练效果不佳，导致训练得到的分类模型的分类准确率不高，数据分布不平衡对分类效果的影响尤其显著。在一些特定应用场景中获取分布平衡的数据十分困难。例如，在电话客服场景中，投诉类电话极少而咨询类电话极多，两种类型的电话数量相差达百倍甚至千倍，这对训练客户投诉分类模型带来了极大的困难，发明人意识到现有技术中，直接利用历史数据训练分类模型，由于未对用于训练的历史数据进行任何处理，导致训练效果不佳，训练出的分类模型会把大部分投诉电话误识别为咨询类电话，分类准确率低。因此，如何克服在训练分类模型时由于训练数据不平衡所导致的训练效果不佳、训练得到的分类模型分类准确率低的问题是当前待解决的技术问题。Data classification problem is one of the most common problems in the field of machine learning. Existing commonly used classification models include logistic regression algorithm model, k nearest neighbor algorithm model, decision tree algorithm model, support vector machine algorithm model, etc. As machine learning algorithms are used in more and more application scenarios, some problems have arisen in the application of classification models. Among them, due to the poor training effect of the classification model on imbalanced data, the classification accuracy of the trained classification model has been reduced. Not high, the imbalance of data distribution has a particularly significant impact on the classification effect. It is very difficult to obtain balanced data in some specific application scenarios. For example, in a telephone customer service scenario, there are very few complaint calls and many consultation calls. The number of calls between the two types differs by a hundred or even a thousand times, which brings great difficulties to training the customer complaint classification model. The inventor I realized that in the existing technology, historical data are directly used to train classification models. Since the historical data used for training is not processed in any way, the training effect is poor. The trained classification model will misidentify most of the complaint calls as consultation types. Phone,classification accuracy is low. Therefore, how to overcome the problems of poor training results and low classification accuracy of the trained classification model due to the imbalance of training data when training a classification model are currently technical issues to be solved.

发明内容Contents of the invention

基于此，有必要针对在训练分类模型时由于历史数据不平衡所导致的训练效果不佳、训练得到的分类模型分类准确率低的问题，提供一种数据分类模型的训练方法、数据分类方法、装置、计算机设备和存储介质。Based on this, it is necessary to provide a training method for a data classification model, a data classification method, and a training method for a data classification model to address the problems of poor training results due to imbalance of historical data and low classification accuracy of the trained classification model when training the classification model. Devices, computer equipment and storage media.

一种数据分类模型的训练方法，包括：A method for training a data classification model, including:

将预先获取的多个历史数据样本划分为少数类样本集合和多数类样本集合；Divide multiple historical data samples obtained in advance into a minority class sample set and a majority class sample set;

从所述多数类样本集合中欠采样得到欠采样集合；Obtain an undersampled set from undersampling the majority class sample set;

基于所述少数类样本集合和所述欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型；Perform first iterative training on the preset classification model based on the training set composed of the minority class sample set and the undersampled set to obtain a classification model that satisfies the first preset condition;

检测所述满足第一预设条件的分类模型是否满足第二预设条件；Detect whether the classification model that satisfies the first preset condition satisfies the second preset condition;

若不满足第二预设条件，则基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，将过采样得到的数据样本加入所述训练集；If the second preset condition is not met, oversampling the minority class sample set based on the classification model that satisfies the first preset condition, and adding the oversampled data samples to the training set;

基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型。Based on the updated training set, a second iterative training is performed on the classification model that satisfies the first preset condition to obtain a data classification model that satisfies the second preset condition.

一种数据分类方法，包括：A data classification method that includes:

获取待分类数据；Get the data to be classified;

上述的数据分类模型的训练方法的步骤；以及，The steps of the above-mentioned data classification model training method; and,

利用所述满足第二预设条件的数据分类模型对所述待分类数据进行分类。The data to be classified is classified using the data classification model that satisfies the second preset condition.

一种数据分类模型的训练装置，包括：A training device for data classification models, including:

划分模块，用于将预先获取的多个历史数据样本划分为少数类样本集合和多数类样本集合；A division module used to divide multiple pre-obtained historical data samples into a minority class sample set and a majority class sample set;

欠采样模块，用于从所述多数类样本集合中欠采样得到欠采样集合；An undersampling module, used to undersample and obtain an undersampled set from the majority class sample set;

第一迭代训练模块，用于基于所述少数类样本集合和所述欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型；A first iterative training module, configured to perform first iterative training on a preset classification model based on a training set composed of the minority class sample set and the undersampled set, to obtain a classification model that satisfies the first preset condition;

检测模块，用于检测所述满足第一预设条件的分类模型是否满足第二预设条件；A detection module configured to detect whether the classification model that satisfies the first preset condition satisfies the second preset condition;

过采样模块，用于若不满足第二预设条件，则基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，将过采样得到的数据样本加入所述训练集；An oversampling module, configured to oversample the minority class sample set based on the classification model that satisfies the first preset condition if the second preset condition is not met, and add the oversampled data samples to the training set;

第二迭代训练模块，用于基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型。The second iterative training module is configured to perform second iterative training on the classification model that satisfies the first preset condition based on the updated training set to obtain a data classification model that satisfies the second preset condition.

一种数据分类装置，包括：A data classification device including:

待分类数据获取模块，用于获取待分类数据；The data acquisition module to be classified is used to obtain the data to be classified;

上述的数据分类模型的训练装置；以及，The training device for the above-mentioned data classification model; and,

分类模块，用于利用达到所述预设训练停止条件的分类模型对所述待分类数据进行分类。A classification module, configured to classify the data to be classified using a classification model that meets the preset training stop condition.

一种计算机设备，包括存储器和处理器，所述存储器中存储有计算机可读指令，所述计算机可读指令被所述处理器执行时，使得所述处理器执行如下步骤的指令：A computer device includes a memory and a processor. Computer-readable instructions are stored in the memory. When the computer-readable instructions are executed by the processor, they cause the processor to perform the following steps:

一种存储有计算机可读指令的存储介质，所述计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行如下步骤的指令：A storage medium storing computer-readable instructions that, when executed by one or more processors, cause one or more processors to perform the following steps:

上述数据分类模型的训练方法、装置、计算机设备和存储介质，将预先获取的多个历史数据样本划分为少数类样本集合和多数类样本集合，从所述多数类样本集合中欠采样得到欠采样集合，基于所述少数类样本集合和所述欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型，检测所述满足第一预设条件的分类模型是否满足第二预设条件，若不满足第二预设条件，则基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，将过采样得到的数据样本加入所述训练集，基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型；由于对分类模型进行训练时采用了欠采样得到的数据和过采样得到的数据，因此，用于训练分类模型的数据平衡性较好，对分类模型的训练效果好、训练得到的分类模型分类准确率高，克服了现有技术中由于在训练分类模型时所采用的训练数据不平衡所导致的训练效果不佳、训练得到的分类模型分类准确率低的问题。The training method, device, computer equipment and storage medium of the above-mentioned data classification model divides multiple historical data samples acquired in advance into a minority class sample set and a majority class sample set, and obtains undersampling from the majority class sample set. set, perform the first iterative training on the preset classification model based on the training set composed of the minority class sample set and the under-sampled set, obtain a classification model that satisfies the first preset condition, and detect that the preset classification model satisfies the first preset condition. Assume whether the conditional classification model satisfies the second preset condition. If it does not satisfy the second preset condition, then oversample the minority class sample set based on the classification model that satisfies the first preset condition, and oversample to obtain The data samples are added to the training set, and the second iterative training is performed on the classification model that satisfies the first preset condition based on the updated training set to obtain the data classification model that satisfies the second preset condition; because when training the classification model Under-sampled data and over-sampled data are used. Therefore, the data used to train the classification model is well balanced, the training effect of the classification model is good, and the trained classification model has high classification accuracy, which overcomes the existing problems. In technology, due to the imbalance of training data used in training classification models, the training effect is poor and the classification accuracy of the trained classification model is low.

Description of the drawings

为了更清楚地说明本申请实施例的技术方案，下面将对本申请实施例的描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. , for those of ordinary skill in the art, other drawings can also be obtained based on these drawings without exerting creative labor.

图1为一个实施例中提供的数据分类模型的训练方法的应用环境图；Figure 1 is an application environment diagram of the training method of the data classification model provided in one embodiment;

图2为一个实施例中数据分类模型的训练方法的流程图；Figure 2 is a flow chart of a training method for a data classification model in one embodiment;

图3为一个具体示例的数据分类模型的训练方法的流程图；Figure 3 is a flow chart of a specific example of a training method for a data classification model;

图4为一个实施例中提供的数据分类模型的训练装置的结构框图；Figure 4 is a structural block diagram of a training device for a data classification model provided in one embodiment;

图5为一个实施例中计算机设备的内部结构框图。Figure 5 is a block diagram of the internal structure of a computer device in one embodiment.

Detailed ways

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solutions and advantages of the present application more clear, the present application will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit the present application.

可以理解，术语“第一”、“第二”、“第三”等仅用于区分描述，而不能理解为指示或暗示相对重要性。还应理解的是，虽然术语“第一”、“第二”、“第三”等在文本中在一些本申请实施例中用来描述各种元素，但是这些元素不应该受到这些术语的限制。这些术语仅用于区分各种元素。It will be understood that the terms "first", "second", "third", etc. are only used to differentiate descriptions and cannot be understood as indicating or implying relative importance. It should also be understood that, although the terms "first," "second," "third," etc. are used in the text to describe various elements in some embodiments of the application, these elements should not be limited by these terms. . These terms are only used to distinguish between various elements.

参考图1所示，本申请实施例提供的数据分类模型的训练方法，可应用在如图1的应用环境中，其中，客户端可以通过网络与服务端进行通信。服务端可以将从客户端获取的多个历史数据样本划分为少数类样本集合和多数类样本集合，从多数类样本集合中欠采样得到欠采样集合，基于少数类样本集合和欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型，然后检测所述满足第一预设条件的分类模型是否满足第二预设条件，若不满足第二预设条件，则基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，将过采样得到的数据样本加入所述训练集，基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型。客户端可以但不限于各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备。服务端可以用独立的服务器或者是多个服务器组成的服务器集群来实现。Referring to Figure 1, the training method of the data classification model provided by the embodiment of the present application can be applied in the application environment of Figure 1, where the client can communicate with the server through the network. The server can divide multiple historical data samples obtained from the client into a minority class sample set and a majority class sample set, and obtain an under-sampled set from the majority class sample set, which is composed of a minority class sample set and an under-sampled set. The training set performs the first iterative training on the preset classification model to obtain a classification model that satisfies the first preset condition, and then detects whether the classification model that satisfies the first preset condition satisfies the second preset condition. If it does not meet the The second preset condition is to oversample the minority class sample set based on the classification model that satisfies the first preset condition, add the oversampled data samples to the training set, and based on the updated training set The classification model that satisfies the first preset condition performs a second iterative training to obtain a data classification model that satisfies the second preset condition. Clients can be, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or a server cluster composed of multiple servers.

过采样(oversampling)和欠采样(undersampling)是两种处理不平衡数据的常用方法。训练分类模型时，过采样方法把占比极小的少数类数据样本重复多次以增加该类数据样本数量，而欠采样方法对占比极大的多数类数据样本做随机采样以减少该类数据样本数量。两种方法都能调整数据样本数量，使不同类别的数据趋于平衡。但是，发明人发现，传统的过采样方法从数据集中随机选取若干少数类数据样本进行复制并加入数据集，容易造成分类模型对这些数据样本过拟合，不利于分类模型的泛化；传统的欠采样方法随机抛弃一些多数类数据样本，这些被抛弃的数据样本可能包含了重要信息，分类模型丢失了这些信息就不能准确识别该类别。Oversampling and undersampling are two common methods for dealing with unbalanced data. When training a classification model, the oversampling method repeats a very small proportion of minority class data samples many times to increase the number of data samples of this type, while the undersampling method randomly samples a large proportion of majority class data samples to reduce the number of data samples of this type. Number of data samples. Both methods can adjust the number of data samples to balance different categories of data. However, the inventor found that the traditional oversampling method randomly selects a number of minority data samples from the data set to copy and add them to the data set, which can easily cause the classification model to overfit these data samples, which is not conducive to the generalization of the classification model; the traditional oversampling method The undersampling method randomly discards some majority class data samples. These discarded data samples may contain important information. If the classification model loses this information, it cannot accurately identify the category.

参考图2所示，在一个实施例中，提出了一种数据分类模型的训练方法，可以包括步骤S10至步骤S60：Referring to Figure 2, in one embodiment, a training method for a data classification model is proposed, which may include steps S10 to S60:

S10、将预先获取的多个历史数据样本划分为少数类样本集合和多数类样本集合。S10. Divide multiple historical data samples obtained in advance into a minority class sample set and a majority class sample set.

在某些实施方式中，所述预先获取的多个历史数据样本中包括两种数据样本；步骤S10可以包括：In some embodiments, the plurality of pre-acquired historical data samples include two types of data samples; step S10 may include:

分别统计所述多个历史数据样本中的所述两种数据样本的数量；Count the numbers of the two types of data samples in the plurality of historical data samples respectively;

对所述两种数据样本的数量进行比较，利用数量少的一种数据样本组成所述少数类样本集合，利用数量多的一种数据样本组成所述多数类样本集合。The numbers of the two data samples are compared, a data sample with a smaller number is used to form the minority class sample set, and a data sample with a larger number is used to form the majority class sample set.

例如，该多个数据样本中可以包括正数数据样本和负数数据样本，利用第一标签对属于正数数据样本的每个数据样本进行标记，利用第二标签对属于负数数据样本的每个数据样本进行标记。通过统计第一标签和第二标签的数量，即可确定出少数类数据样本和多数类数据样本。例如，可以设定第一标签为0，第二标签为1。假设标签0的数量为a，标签1的数量为b，且a小于b，则上述正数数据样本即为少数类数据样本，负数数据样本即为多数类数据样本。For example, the plurality of data samples may include positive data samples and negative data samples, use the first label to mark each data sample belonging to the positive data sample, and use the second label to mark each data belonging to the negative data sample. Samples are labeled. By counting the number of the first label and the second label, the minority class data sample and the majority class data sample can be determined. For example, you can set the first tag to 0 and the second tag to 1. Assume that the number of label 0 is a, the number of label 1 is b, and a is less than b, then the above positive data sample is the minority class data sample, and the negative data sample is the majority class data sample.

以电话客服场景为例，投诉类电话极少而咨询类电话极多，两种类型的电话数量相差达百倍甚至千倍，将预先获取的多个电话客服历史数据样本划分为少数类样本集合和多数类样本集合，其中少数类样本集合为投诉类电话数据样本的集合，多数类样本集合为咨询类电话数据样本的集合。可以用标签0来标记投诉类电话数据样本，可以用标签1来标记咨询类电话数据样本。通过统计标签0和标签1的数量，即可确定出投诉类电话数据样本的数量以及咨询类电话数据样本的数量。Take the telephone customer service scenario as an example. There are very few complaint calls and many consultation calls. The number of calls between the two types differs by hundreds or even a thousand times. Multiple pre-obtained historical telephone customer service data samples are divided into a minority sample set and Majority class sample set, in which the minority class sample set is a collection of complaint phone data samples, and the majority class sample set is a collection of consultation phone data samples. Label 0 can be used to mark complaint phone data samples, and label 1 can be used to mark consultation phone data samples. By counting the number of tags 0 and 1, the number of complaint phone data samples and the number of consultation phone data samples can be determined.

S20、从所述多数类样本集合中欠采样得到欠采样集合。S20. Obtain an under-sampled set from the majority class sample set by under-sampling.

在某些实施方式中，步骤S20可以包括：In some implementations, step S20 may include:

从所述多数类样本集合中随机欠采样出第一数目的多数类数据样本组成欠采样集合；其中，所述第一数目与所述少数类样本集合中的数据样本数目之差的绝对值小于预设阈值。A first number of majority class data samples are randomly undersampled from the majority class sample set to form an undersampled set; wherein the absolute value of the difference between the first number and the number of data samples in the minority class sample set is less than Preset threshold.

参考图3所示，在一个具体示例中，设多数类样本集合为N，少数类样本集合为P，欠采样集合为N ₀，预设欠采样迭代次数阈值为m _under，预设过采样迭代次数阈值为m _over。 Referring to Figure 3, in a specific example, assume that the majority class sample set is N, the minority class sample set is P, the undersampling set is N ₀ , the preset undersampling iteration number threshold is m _under , and the preset oversampling iteration The number of times threshold is m _over .

在该具体示例中，从所述多数类样本集合中欠采样得到欠采样集合可以包括：In this specific example, obtaining an undersampled set from the majority class sample set may include:

从N中随机欠采样出第一数目的多数类数据样本组成集合N ₀，其中，第一数目与P中的数据样本数目之差的绝对值小于预设阈值。 A first number of majority class data samples are randomly undersampled from N to form a set N ₀ , where the absolute value of the difference between the first number and the number of data samples in P is less than a preset threshold.

从N中随机采样出与P中的样本数目相近的多个多数类数据样本组成集合N ₀，其中，

且|P|≈|N ₀|。 Randomly sample multiple majority class data samples from N that are similar to the number of samples in P to form a set N ₀ , where,

And |P|≈|N ₀ |.

S30、基于所述少数类样本集合和所述欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型。S30. Perform first iterative training on the preset classification model based on the training set composed of the minority class sample set and the undersampled set, to obtain a classification model that satisfies the first preset condition.

在某些实施方式中，预设的分类模型可以采用现有技术的分类模型。所述第一预设条件为达到第一预设训练次数阈值或达到第一预设准确度阈值；所述第一迭代训练中的每一次迭代训练包括：In some embodiments, the preset classification model may adopt an existing technology classification model. The first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iterative training in the first iterative training includes:

利用所述少数类样本集合和所述欠采样集合所组成的训练集训练当前的分类模型；Using the training set composed of the minority class sample set and the undersampled set to train the current classification model;

判断本次训练是否达到第一预设训练次数阈值；Determine whether this training reaches the first preset training times threshold;

若未达到所述第一预设训练次数阈值，则利用本次训练后的分类模型对所述多数类样本集合中的剩余数据样本进行分类预测；If the first preset training times threshold is not reached, use the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set;

判断分类预测结果是否达到第一预设准确度阈值；Determine whether the classification prediction result reaches the first preset accuracy threshold;

若未达到所述第一预设准确度阈值，则将分类预测错误的数据样本加入所述欠采样集合，得到更新后的欠采样集合；所述更新后的欠采样集合用于所述第一迭代训练中的下一次迭代训练。If the first preset accuracy threshold is not reached, the data samples with incorrect classification predictions are added to the under-sampling set to obtain an updated under-sampling set; the updated under-sampling set is used for the first The next iteration of training in iterative training.

在某些实施方式中，所述利用本次训练后的分类模型对所述多数类样本集合中的剩余数据样本进行分类预测，包括：In some embodiments, using the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set includes:

利用所述本次训练后的分类模型预测所述多数类样本集合中的各剩余数据样本属于所述少数类样本集合的概率值以及属于所述多数类样本集合的概率值；Utilize the classification model after this training to predict the probability value that each remaining data sample in the majority class sample set belongs to the minority class sample set and the probability value that belongs to the majority class sample set;

所述分类预测错误的数据样本为属于所述少数类样本集合的概率值大于属于所述多数类样本集合的概率值的数据样本。The data samples with wrong classification predictions are data samples whose probability value belongs to the minority class sample set is greater than the probability value belonging to the majority class sample set.

在前述的具体示例中，基于所述少数类样本集合和所述欠采样集合所组成的训练集对预设的分类模型执行第一迭代训练，得到满足第一预设条件的分类模型，可以包括：In the aforementioned specific example, performing first iterative training on the preset classification model based on the training set composed of the minority class sample set and the undersampling set to obtain a classification model that satisfies the first preset condition may include :

建立一个误分类样本集合E _N；其中，初始的误分类样本集合E _N为空集； Establish a misclassified sample set E _N ; where the initial misclassified sample set E _N is an empty set;

使用P和N ₀训练预设的分类模型，得到训练后的分类模型； Use P and N ₀ to train the preset classification model to obtain the trained classification model;

利用该训练后的分类模型预测集合N-N ₀中每个数据样本在不同类别上的概率分布，将所有在少数类数据样本类别上的概率值大于预设概率阈值t _N的数据样本都加入误分类样本集合E _N； The trained classification model is used to predict the probability distribution of each data sample in different categories in the set NN ₀ , and all data samples whose probability value on the minority data sample category is greater than the preset probability threshold t _N are added to the misclassification Sample set E _N ;

若误分类样本集合

则停止欠采样；否则，合并E _N和N ₀，利用合并得到的集合更新N ₀；其中，合并E _N和N ₀得到N ₀∪E _N，然后利用N ₀∪E _N更新N ₀，即N ₀＝N ₀∪E _N； If the sample set is misclassified

Then stop undersampling; otherwise, merge E _N and N ₀ , and use the merged set to update N ₀ ; where, merge E _N and N ₀ to get N ₀ ∪ E _N , and then use N ₀ ∪ E _N to update N ₀ , that is N ₀ =N ₀ ∪E _N ;

判断当前欠采样次数是否达到预设欠采样迭代次数阈值m _under；若未达到m _under，则重复上述训练步骤继续训练直至当前欠采样次数达到m _under时停止训练。 Determine whether the current number of under-sampling reaches the preset under-sampling iteration threshold m _under ; if it does not reach m _under , repeat the above training steps and continue training until the current number of under-sampling reaches m _under and stop training.

在本实施例中，随机欠采样与少数类数据样本数量相近的多数类数据样本，组成类别平衡的训练集，并利用该训练集训练预设的分类模型，然后逐步向训练集中添加分类模型预测错误的数据样本，分类困难的多数类数据样本得以添加到训练集中。因此，该欠采样方法偏向于保留分类困难的多数类数据样本。这些分类困难的数据样本往往带有重要的类别信息，保留这些分类困难的数据样本有利于分类模型正确预测多数类数据样本。In this embodiment, majority class data samples with a similar number to the minority class data samples are randomly undersampled to form a class-balanced training set, and the training set is used to train a preset classification model, and then classification model predictions are gradually added to the training set. Wrong data samples, majority class data samples that are difficult to classify are added to the training set. Therefore, this undersampling method is biased towards retaining majority class data samples that are difficult to classify. These difficult-to-classify data samples often contain important category information. Retaining these difficult-to-classify data samples will help the classification model correctly predict most types of data samples.

S40、检测所述满足第一预设条件的分类模型是否满足第二预设条件。S40. Detect whether the classification model that satisfies the first preset condition satisfies the second preset condition.

在某些实施方式中，所述第二预设条件为达到第二预设训练次数阈值或达到第二预设准确度阈值；步骤S40包括：In some embodiments, the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; step S40 includes:

利用满足第一预设条件的分类模型对少数类样本集合进行分类预测，得到分类预测结果；Use a classification model that satisfies the first preset condition to perform classification prediction on the minority class sample set to obtain a classification prediction result;

将得到的分类预测结果与第二预设准确度阈值进行比较，判断分类预测结果是否达到第二预设准确度阈值；Compare the obtained classification prediction result with the second preset accuracy threshold, and determine whether the classification prediction result reaches the second preset accuracy threshold;

若达到第二预设准确度阈值，则判断本次训练次数是否达到第二预设训练次数阈值。If the second preset accuracy threshold is reached, it is determined whether the number of training times reaches the second preset training number threshold.

S50、若不满足第二预设条件，则基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，将过采样得到的数据样本加入所述训练集。S50. If the second preset condition is not met, oversample the minority class sample set based on the classification model that satisfies the first preset condition, and add the oversampled data samples to the training set.

在某些实施方式中，基于所述满足第一预设条件的分类模型对所述少数类样本集合进行过采样，包括：利用满足第一预设条件的分类模型对所述少数类样本集合进行分类预测，根据分类预测结果将分类预测错误的数据样本作为过采样得到的数据样本。In some embodiments, oversampling the minority class sample set based on the classification model that satisfies the first preset condition includes: using the classification model that satisfies the first preset condition to oversample the minority class sample set. Classification prediction: use the data samples with incorrect classification predictions as oversampled data samples based on the classification prediction results.

S60、基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型。S60: Perform a second iterative training on the classification model that satisfies the first preset condition based on the updated training set to obtain a data classification model that satisfies the second preset condition.

在某些实施方式中，所述第二迭代训练中的每一次迭代训练包括：In some embodiments, each iterative training in the second iterative training includes:

利用更新后的训练集训练当前的分类模型；Use the updated training set to train the current classification model;

判断本次训练是否达到第二预设训练次数阈值；Determine whether this training reaches the second preset training times threshold;

若未达到第二预设训练次数阈值，则利用本次训练后的分类模型对所述少数类样本集合进行分类预测；If the second preset training times threshold is not reached, use the classification model after this training to perform classification prediction on the minority class sample set;

判断分类预测结果是否达到第二预设准确度阈值；Determine whether the classification prediction result reaches the second preset accuracy threshold;

若未达到第二预设准确度阈值，则将分类预测错误的数据样本加入所述少数类样本集合，得到更新后的少数类样本集合；所述更新后的少数类样本集合用于作为所述第二迭代训练中的下一次迭代训练的更新后的训练集。If the second preset accuracy threshold is not reached, the data samples with incorrect classification predictions are added to the minority class sample set to obtain an updated minority class sample set; the updated minority class sample set is used as the The updated training set for the next iteration of training in the second iteration of training.

其中，第二预设准确度阈值例如可以为100％，也可以其他准确度值，具体可根据实际需要进行设定。The second preset accuracy threshold can be, for example, 100%, or other accuracy values, which can be set according to actual needs.

在某些实施方式中，所述判断分类预测结果是否达到第二预设准确度阈值，包括：In some embodiments, determining whether the classification prediction result reaches a second preset accuracy threshold includes:

根据分类预测结果中分类错误的少数类数据样本的数量，确定所述分类预测结果是否达到第二预设准确度阈值。Based on the number of misclassified minority class data samples in the classification prediction result, it is determined whether the classification prediction result reaches a second preset accuracy threshold.

在前述的示例中，基于更新后的训练集对满足第一预设条件的分类模型执行第二迭代训练，得到满足第二预设条件的数据分类模型，可以包括：In the aforementioned example, performing second iterative training on the classification model that satisfies the first preset condition based on the updated training set to obtain a data classification model that satisfies the second preset condition may include:

建立少数类样本集合P ₀，并用P初始化P ₀，即P ₀＝P； Establish a minority class sample set P ₀ and initialize P ₀ with P, that is, P ₀ =P;

建立一个误分类样本集合E _P；其中，初始的误分类样本集合E _P为空集； Establish a misclassified sample set _EP ; where the initial misclassified sample set _EP is an empty set;

利用P ₀和N ₀训练出的分类模型预测集合P中的每个数据样本，所有在多数类数据样本别上的概率值大于阈值t _P的数据样本都被加入误分类样本集合E _P； The classification model trained by P ₀ and N ₀ is used to predict each data sample in the set P. All data samples with probability values greater than the threshold t _P in the majority category of data samples are added to the misclassified sample set _EP ;

若

则停止过采样；否则，将E _P中的数据样本加入P ₀； like

Then stop oversampling; otherwise, add the data samples in _EP to P ₀ ;

判断当前过采样次数是否达到预设过采样迭代次数阈值m _over；若当前过采样次数未达到m _over，则重复上述步骤，直至当前过采样次数达到m _over时停止。 Determine whether the current number of oversampling reaches the preset threshold m _over for the number of oversampling iterations; if the current number of oversampling does not reach m _over , repeat the above steps until the current number of oversampling reaches m _over and stop.

在本实施例中，利用满足第一预设条件的分类模型预测全部少数类数据样本，把预测错误的数据样本重复多次加入训练集，然后利用更新后的训练集继续训练分类模型，并继续预测全部少数类数据样本，依此迭代，直至少数类数据样本全部预测正确为止。因此，与现有技术的随机过采样不同，本实施例的过采样偏向于增强分类困难的少数类数据样本，是一种有偏过采样，能够确保增强分类困难程度，以便于提高对分类模型的训练效果，得到分类准确率更高的分类模型。In this embodiment, a classification model that satisfies the first preset condition is used to predict all minority class data samples, and the data samples with prediction errors are added to the training set repeatedly multiple times, and then the updated training set is used to continue training the classification model, and continue Predict all minority class data samples, and iterate in this manner until all minority class data samples are predicted correctly. Therefore, unlike the random oversampling in the prior art, the oversampling in this embodiment is biased towards enhancing the minority class data samples that are difficult to classify. It is a kind of biased oversampling, which can ensure that the difficulty of classification is enhanced, so as to improve the classification model. The training effect is to obtain a classification model with higher classification accuracy.

本实施例的方法中，由于对分类模型进行训练时采用了欠采样得到的数据和过采样得到的数据，因此，用于训练分类模型的数据平衡性较好，对分类模型的训练效果好、训练得到的分类模型分类准确率高，克服了现有技术中由于在训练分类模型时所采用的训练数据不平衡所导致的训练效果不佳、训练得到的分类模型分类准确率低的问题。In the method of this embodiment, since under-sampled data and over-sampled data are used to train the classification model, the data used to train the classification model is well balanced, and the training effect of the classification model is good. The classification model obtained by training has high classification accuracy, which overcomes the problems in the prior art that the training effect is poor and the classification accuracy of the trained classification model is low due to the imbalance of training data used in training the classification model.

在一个实施例中，提出了一种数据分类方法，包括：In one embodiment, a data classification method is proposed, including:

S00、获取待分类数据。S00. Obtain the data to be classified.

以电话客服场景为例，待分类数据可以为客服接收到的电话数据，需要将这些电话数据分类为投诉类电话以及咨询类电话。Taking the telephone customer service scenario as an example, the data to be classified can be the telephone data received by the customer service, and these telephone data need to be classified into complaint calls and consultation calls.

上述任一实施方式的数据分类模型的训练方法的步骤；以及，The steps of the training method of the data classification model in any of the above embodiments; and,

S70、利用所述满足第二预设条件的数据分类模型对所述待分类数据进行分类。S70: Use the data classification model that satisfies the second preset condition to classify the data to be classified.

以电话客服场景为例，将待分类数据输入满足第二预设条件的数据分类模型中进行处理，得到分类结果。Taking the telephone customer service scenario as an example, the data to be classified is input into the data classification model that meets the second preset condition for processing, and the classification result is obtained.

参考图4所示，在一个实施例中，提出了一种数据分类模型的训练装置，包括：Referring to Figure 4, in one embodiment, a training device for a data classification model is proposed, including:

在某些实施方式中，所述预先获取的多个历史数据样本中包括两种数据样本；划分模块进一步具体用于：In some embodiments, the plurality of pre-acquired historical data samples include two types of data samples; the dividing module is further specifically used to:

对所述两种数据样本的数量进行比较，利用数量少的一种数据样本组成所述少数类样本集合，利用数量多的一种数据样本组成所述多数类样本集合。The numbers of the two data samples are compared, the smaller number of data samples is used to form the minority class sample set, and the larger number of data samples is used to form the majority class sample set.

在某些实施方式中，所述第一预设条件为达到第一预设训练次数阈值或达到第一预设准确度阈值；所述第一迭代训练中的每一次迭代训练包括：In some embodiments, the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iterative training in the first iterative training includes:

利用所述本次训练后的分类模型预测所述多数类样本集合中的各剩余数据样本属于所述少数类样本集合的概率值以及属于所述多数类样本集合的概率值；Using the classification model after this training to predict the probability value of each remaining data sample in the majority class sample set belonging to the minority class sample set and the probability value of belonging to the majority class sample set;

在某些实施方式中，所述第二预设条件为达到第二预设训练次数阈值或达到第二预设准确度阈值；所述第二迭代训练中的每一次迭代训练包括：In some embodiments, the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; each iterative training in the second iterative training includes:

在某些实施方式中，欠采样模块具体用于：In some implementations, the undersampling module is specifically used to:

在一个实施例中，提供了一种数据分类装置，包括：In one embodiment, a data classification device is provided, including:

上述任一实施方式所述的数据分类模型的训练装置；以及，The training device of the data classification model according to any of the above embodiments; and,

如图5所示，在一个实施例中，提出了一种计算机设备，所述计算机设备包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：As shown in Figure 5, in one embodiment, a computer device is proposed. The computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor implements the following steps when executing the computer program:

在某些实施方式中，所述第一预设条件为达到第一预设训练次数阈值或达到第一预设准确度阈值；所述处理器所执行的所述第一迭代训练中的每一次迭代训练包括：In some embodiments, the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each time in the first iterative training executed by the processor Iterative training includes:

在一个实施例中，所述处理器所执行的所述利用本次训练后的分类模型对所述多数类样本集合中的剩余数据样本进行分类预测，包括：In one embodiment, the step performed by the processor to use the trained classification model to perform classification prediction on the remaining data samples in the majority class sample set includes:

在某些实施方式中，所述第二预设条件为达到第二预设训练次数阈值或达到第二预设准确度阈值；所述处理器所执行的所述第二迭代训练中的每一次迭代训练包括：In some embodiments, the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; each time in the second iterative training performed by the processor Iterative training includes:

在一个实施例中，所述处理器所执行的所述判断分类预测结果是否达到第二预设准确度阈值，包括：In one embodiment, the step performed by the processor to determine whether the classification prediction result reaches a second preset accuracy threshold includes:

在一个实施例中，提出了一种计算机设备，所述计算机设备包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：In one embodiment, a computer device is proposed. The computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor executes the computer program. Perform the following steps during the program:

获取待分类数据；Get the data to be classified;

上述任一实施方式所述数据分类模型的训练方法的步骤；以及，The steps of the training method of the data classification model in any of the above embodiments; and,

在一个实施例中，提出了一种存储有计算机可读指令的存储介质，该计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行以下步骤：In one embodiment, a storage medium storing computer-readable instructions is proposed. When executed by one or more processors, the computer-readable instructions cause one or more processors to perform the following steps:

获取待分类数据；Get the data to be classified;

所述计算机可读存储介质可以是非易失性，也可以是易失性。The computer-readable storage medium may be non-volatile or volatile.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，该计算机程序可存储于一计算机可读取存储介质中，该程序在执行时，可包括如上述各方法的实施例的流程。其中，前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory，ROM)等非易失性存储介质，或随机存储记忆体(Random Access Memory，RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium. The program can be stored in a computer-readable storage medium. When executed, the process may include the processes of the above method embodiments. Among them, the aforementioned storage media can be non-volatile storage media such as magnetic disks, optical disks, read-only memory (Read-Only Memory, ROM), or random access memory (Random Access Memory, RAM), etc.

以上所述实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above-described embodiments can be combined in any way. To simplify the description, not all possible combinations of the technical features in the above-described embodiments are described. However, as long as there is no contradiction in the combination of these technical features, All should be considered to be within the scope of this manual.

以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对本申请专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。The above-described embodiments only express several implementation modes of the present application, and their descriptions are relatively specific and detailed, but should not be construed as limiting the patent scope of the present application. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

A method for training a data classification model, which includes:

Divide multiple historical data samples obtained in advance into a minority class sample set and a majority class sample set;

Obtain an undersampled set from undersampling the majority class sample set;

Perform first iterative training on the preset classification model based on the training set composed of the minority class sample set and the undersampled set to obtain a classification model that satisfies the first preset condition;

Detect whether the classification model that satisfies the first preset condition satisfies the second preset condition;

If the second preset condition is not met, oversampling the minority class sample set based on the classification model that satisfies the first preset condition, and adding the oversampled data samples to the training set;

Based on the updated training set, a second iterative training is performed on the classification model that satisfies the first preset condition to obtain a data classification model that satisfies the second preset condition.

The method of claim 1, wherein the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iterative training in the first iterative training includes :

Using the training set composed of the minority class sample set and the undersampled set to train the current classification model;

Determine whether this training reaches the first preset training times threshold;

If the first preset training times threshold is not reached, use the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set;

Determine whether the classification prediction result reaches the first preset accuracy threshold;

If the first preset accuracy threshold is not reached, the data samples with incorrect classification predictions are added to the under-sampling set to obtain an updated under-sampling set; the updated under-sampling set is used for the first The next iteration of training in iterative training.

The method of claim 2, wherein using the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set includes:

Using the classification model after this training to predict the probability value of each remaining data sample in the majority class sample set belonging to the minority class sample set and the probability value of belonging to the majority class sample set;

The data samples with wrong classification predictions are data samples whose probability value belongs to the minority class sample set is greater than the probability value belonging to the majority class sample set.

The method of claim 1, wherein the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; each iterative training in the second iterative training includes :

Use the updated training set to train the current classification model;

Determine whether this training reaches the second preset training times threshold;

If the second preset training times threshold is not reached, use the classification model after this training to perform classification prediction on the minority class sample set;

Determine whether the classification prediction result reaches the second preset accuracy threshold;

If the second preset accuracy threshold is not reached, the data samples with incorrect classification predictions are added to the minority class sample set to obtain an updated minority class sample set; the updated minority class sample set is used as the The updated training set for the next iteration of training in the second iteration of training.

The method of claim 4, wherein determining whether the classification prediction result reaches a second preset accuracy threshold includes:

Based on the number of misclassified minority class data samples in the classification prediction result, it is determined whether the classification prediction result reaches a second preset accuracy threshold.

A data classification method, including:

Get the data to be classified;

The steps of the method according to any one of claims 1-5; and,

The data to be classified is classified using the data classification model that satisfies the second preset condition.

A training device for a data classification model, which includes:

A division module used to divide multiple pre-obtained historical data samples into a minority class sample set and a majority class sample set;

An undersampling module, used to undersample and obtain an undersampled set from the majority class sample set;

A first iterative training module, configured to perform first iterative training on a preset classification model based on a training set composed of the minority class sample set and the undersampled set, to obtain a classification model that satisfies the first preset condition;

A detection module configured to detect whether the classification model that satisfies the first preset condition satisfies the second preset condition;

An oversampling module, configured to oversample the minority class sample set based on the classification model that satisfies the first preset condition if the second preset condition is not met, and add the oversampled data samples to the training set;

The second iterative training module is configured to perform second iterative training on the classification model that satisfies the first preset condition based on the updated training set to obtain a data classification model that satisfies the second preset condition.

The training device of claim 7, wherein the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iterative training in the first iterative training include:

The training device according to claim 8, wherein using the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set includes:

The data samples with classification prediction errors are data samples whose probability value belongs to the minority class sample set is greater than the probability value belonging to the majority class sample set.

A data classification device, which includes:

The data acquisition module to be classified is used to obtain the data to be classified;

The training device according to any one of claims 7-9; and,

A classification module, configured to classify the data to be classified using a classification model that meets the preset training stop condition.

A computer device includes a memory and a processor. Computer-readable instructions are stored in the memory. When the computer-readable instructions are executed by the processor, they cause the processor to perform the following steps:

Obtain an undersampled set from undersampling the majority class sample set;

The computer device of claim 11, wherein the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iteration training in the first iterative training include:

The computer device according to claim 12, wherein using the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set includes:

The computer device of claim 11, wherein the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; each iteration training in the second iterative training include:

Use the updated training set to train the current classification model;

The computer device of claim 14, wherein the determining whether the classification prediction result reaches a second preset accuracy threshold includes:

A storage medium storing computer-readable instructions that, when executed by one or more processors, cause one or more processors to perform the following steps:

Obtain an undersampled set from undersampling the majority class sample set;

The storage medium of claim 16, wherein the first preset condition is reaching a first preset training times threshold or reaching a first preset accuracy threshold; each iteration training in the first iterative training include:

The storage medium of claim 17, wherein using the classification model after this training to perform classification prediction on the remaining data samples in the majority class sample set includes:

The storage medium of claim 16, wherein the second preset condition is reaching a second preset training times threshold or reaching a second preset accuracy threshold; each iteration training in the second iterative training include:

Use the updated training set to train the current classification model;

The storage medium of claim 19, wherein the determining whether the classification prediction result reaches a second preset accuracy threshold includes: