WO2019061187A1

WO2019061187A1 - Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus

Info

Publication number: WO2019061187A1
Application number: PCT/CN2017/104069
Authority: WO
Inventors: 赵敏; 林磊
Original assignee: Shenzhen Lexin Software Technology Co Ltd
Current assignee: Shenzhen Lexin Software Technology Co Ltd
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2019-04-04
Anticipated expiration: 2020-03-28
Also published as: CN109496322A; CN109496322B

Abstract

A credit evaluation method and apparatus and a gradient boosting decision tree parameter adjustment method and apparatus, the credit evaluation method comprising: respectively inputting first sample data into at least two gradient boosting decision tree GBDT models to obtain a first overdue credit probability set, the first sample data being credit data of a first user set; respectively inputting second sample data into at least two GBDT models to obtain a second overdue credit probability set, the second sample data being credit data of a second user set; the GBDT parameters of the at least two GBDT models are different; on the basis of the first overdue credit probability set and the second overdue credit probability set, implementing KS value calculation and, on the basis of the calculation result, determining a target GBDT model from amongst the at least two GBDT models; and, on the basis of the target GBDT model, performing a credit evaluation of a user.

Description

Credit evaluation method and device, and gradient progressive decision tree parameter adjustment method and device

Technical field

本公开涉及信息处理技术领域，例如涉及一种信用评价方法和装置以及梯度渐进决策树参数调整方法和装置。The present disclosure relates to the field of information processing technologies, for example, to a credit evaluation method and apparatus, and a gradient progressive decision tree parameter adjustment method and apparatus.

Background technique

梯度渐进决策树(Gradient Boost Decision Tree，GBDT)是一种解决分类问题和回归问题中常用的算法，优点是具备很强的拟合能力和分类能力，但是过强的拟合能力可能会在测试集上出现过拟合现象。The Gradient Boost Decision Tree (GBDT) is a commonly used algorithm for solving classification problems and regression problems. The advantage is that it has strong fitting ability and classification ability, but the strong fitting ability may be tested. There has been a fitting phenomenon on the set.

相关技术中，采用GBDT模型对用户进行信用评价时，通常需要手动对GBDT模型中的多个参数逐一进行调整，以使GBDT模型输出的信用逾期概率接近用户真实的信用逾期概率，但是，在GBDT参数调整过程中，往往基于人为确定的参数值对参数进行调整的，参数的精度不高，逐个参数调优的方式得到的模型不稳定，参数调整效率较低，对用户进行信用评价的准确性较低。In the related art, when the GBDT model is used for credit evaluation of users, it is usually necessary to manually adjust multiple parameters in the GBDT model one by one, so that the credit overdue probability outputted by the GBDT model is close to the user's true credit overdue probability, but in GBDT In the parameter adjustment process, the parameters are often adjusted based on the artificially determined parameter values, the accuracy of the parameters is not high, the model obtained by the parameter tuning method is unstable, the parameter adjustment efficiency is low, and the accuracy of the credit evaluation for the user is performed. Lower.

发明内容Summary of the invention

本公开提供了一种信用评价方法和装置以及梯度渐进决策树参数调整方法和装置，可以实现提高GBDT模型的参数调整效率，提高GBDT模型的稳定性保证对用户进行信用评价的准确性。The present disclosure provides a credit evaluation method and apparatus and a gradient progressive decision tree parameter adjustment method and apparatus, which can improve the parameter adjustment efficiency of the GBDT model, improve the stability of the GBDT model, and ensure the accuracy of credit evaluation for users.

一实施例提供了一种信用评价方法，可以包括：An embodiment provides a credit evaluation method, which may include:

将第一样本数据分别输入至少两个梯度渐进决策树GBDT模型中，得到第一信用逾期概率集，所述第一样本数据为第一用户集的信用数据；The first sample data is respectively input into at least two gradient progressive decision tree GBDT models to obtain a first credit overdue probability set, and the first sample data is credit data of the first user set;

将第二样本数据分别输入所述至少两个GBDT模型中，得到第二信用逾期概率集，所述第二样本数据为第二用户集的信用数据；所述至少两个GBDT模型的GBDT参数不同；Entering the second sample data into the at least two GBDT models respectively to obtain a second credit overdue summary a rate set, the second sample data is credit data of a second user set; and the GBDT parameters of the at least two GBDT models are different;

根据所述第一信用逾期概率集和所述第二信用逾期概率集进行KS值计算，根据计算结果，从所述至少两个GBDT模型中确定目标GBDT模型；以及根据所述目标GBDT模型对用户进行信用评价。Performing a KS value calculation according to the first credit overdue probability set and the second credit overdue probability set, determining a target GBDT model from the at least two GBDT models according to the calculation result; and pairing the user according to the target GBDT model Conduct a credit evaluation.

一实施例提供了一种信用评价装置，可以包括：An embodiment provides a credit evaluation apparatus, which may include:

第一信用逾期概率获取模块，设置为将第一样本数据分别输入至少两个梯度渐进决策树GBDT模型中，得到第一信用逾期概率集，所述第一样本数据为第一用户集的信用数据；The first credit overdue probability obtaining module is configured to input the first sample data into the at least two gradient progressive decision tree GBDT models respectively, to obtain a first credit overdue probability set, where the first sample data is the first user set Credit data

第二信用逾期概率获取模块，设置为将第二样本数据分别输入所述至少两个GBDT模型中，得到第二信用逾期概率集，所述第二样本数据为第二用户集的信用数据；所述至少两个GBDT模型的GBDT参数不同；a second credit overdue probability obtaining module, configured to input second sample data into the at least two GBDT models respectively to obtain a second credit overdue probability set, where the second sample data is credit data of the second user set; The GBDT parameters of at least two GBDT models are different;

模型确定模块，设置为根据所述第一信用逾期概率集和所述第二信用逾期概率集进行KS值计算，根据计算结果，从所述至少两个GBDT模型中确定目标GBDT模型；以及评价模块，设置为根据所述目标GBDT模型对用户进行信用评价。a model determining module, configured to perform a KS value calculation according to the first credit overdue probability set and the second credit overdue probability set, and determine a target GBDT model from the at least two GBDT models according to the calculation result; and an evaluation module And set to perform credit evaluation on the user according to the target GBDT model.

一实施例提供了一种梯度渐进决策树参数调整方法，可以包括：An embodiment provides a gradient progressive decision tree parameter adjustment method, which may include:

依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围；Determining the domain dimension and the domain range of the particle swarm optimization algorithm according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter;

设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点；依据所述轨迹最优点确定梯度渐进决策树的参数值。Setting an initial parameter of the particle swarm optimization algorithm, obtaining a trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension, and the domain definition; determining the best advantage according to the trajectory The parameter value of the gradient progressive decision tree.

一实施例提供了一种梯度渐进决策树参数调整装置，包括：An embodiment provides a gradient progressive decision tree parameter adjustment apparatus, including:

映射模块，设置为依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围；Mapping module, set to adjust the number of parameters in the progressive decision tree according to the gradient and each parameter pair The range of values to be determined determines the domain dimensions of the particle swarm optimization algorithm and the scope of the domain;

轨迹最优点确定模块，设置为设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点；a trajectory most advantageous determining module, configured to set an initial parameter of the particle swarm optimization algorithm, and obtain a trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension, and the domain definition range ;

参数确定模块，设置为依据所述轨迹最优点确定梯度渐进决策树的参数值。The parameter determination module is configured to determine a parameter value of the gradient progressive decision tree according to the most advantageous of the trajectory.

一实施例提供一种计算机可读存储介质，存储有计算机可执行指令，所述计算机可执行指令用于执行上述任意一种方法。An embodiment provides a computer readable storage medium storing computer executable instructions for performing any of the methods described above.

一实施例还提供一种数据处理设备，该数据处理设备包括一个或多个处理器、存储器以及一个或多个程序，所述一个或多个程序存储在存储器中，当被一个或多个处理器执行时，执行上述任意一种方法。An embodiment further provides a data processing device comprising one or more processors, a memory, and one or more programs, the one or more programs being stored in the memory when processed by one or more When the device is executed, perform any of the above methods.

一实施例还提供了一种计算机程序产品，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行上述任意一种方法。An embodiment further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, when the program instructions are executed by a computer Having the computer perform any of the methods described above.

本公开可以实现提高梯度渐进决策树的参数调整效率，避免调整过程中陷入单一区域的局部最优搜索，对参数空间的搜索范围更广。The present disclosure can improve the parameter adjustment efficiency of the gradient progressive decision tree, avoid the local optimal search that falls into a single region during the adjustment process, and have a wider search range for the parameter space.

DRAWINGS

图1a是一实施例提供的一种信用评价方法的流程示意图；FIG. 1a is a schematic flowchart of a credit evaluation method according to an embodiment; FIG.

图1b是一实施例提供的一种信用评价方法的子流程示意图；FIG. 1b is a schematic diagram of a sub-flow of a credit evaluation method according to an embodiment; FIG.

图1c是一实施例提供的一种信用评价方法的另一子流程示意图；FIG. 1c is a schematic diagram of another sub-flow of a credit evaluation method according to an embodiment;

图2a是一实施例提供的一种信用评价的流程示意图；2a is a schematic flow chart of a credit evaluation provided by an embodiment;

图2b是一实施例提供的一种信用评价的子流程示意图；2b is a schematic diagram of a sub-flow of credit evaluation provided by an embodiment;

图3是一实施例提供的一种信用评价装置的结构示意图；FIG. 3 is a schematic structural diagram of a credit evaluation apparatus according to an embodiment; FIG.

图4是一实施例提供的一种梯度渐进决策树参数调整方法的流程示意图； 4 is a schematic flow chart of a gradient progressive decision tree parameter adjustment method according to an embodiment;

图5是一实施例提供的另一种梯度渐进决策树参数调整方法的流程示意图；FIG. 5 is a schematic flowchart diagram of another gradient progressive decision tree parameter adjustment method according to an embodiment; FIG.

图6是一实施例提供的另一种梯度渐进决策树参数调整方法的流程示意图；FIG. 6 is a schematic flowchart diagram of another gradient progressive decision tree parameter adjustment method according to an embodiment; FIG.

图7是一实施例提供的一种梯度渐进决策树参数调整装置的结构示意图；FIG. 7 is a schematic structural diagram of a gradient progressive decision tree parameter adjusting apparatus according to an embodiment; FIG.

图8是一实施例提供的数据处理设备的硬件结构示意图。FIG. 8 is a schematic diagram of a hardware structure of a data processing device according to an embodiment.

Detailed ways

图1a是一实施例提供的一种信用评价方法的流程示意图，该方法可以应用与数据处理设备中，例如计算设备，如图1a所示，该方法可以包括步骤110-步骤140。FIG. 1a is a schematic flowchart of a credit evaluation method according to an embodiment. The method may be applied to a data processing device, such as a computing device. As shown in FIG. 1a, the method may include steps 110-140.

在步骤110中，将第一样本数据分别输入至少两个梯度渐进决策树GBDT模型中，得到第一信用逾期概率集，所述第一样本数据为第一用户集的信用数据。In step 110, the first sample data is respectively input into at least two gradient progressive decision tree GBDT models to obtain a first credit overdue probability set, and the first sample data is credit data of the first user set.

在步骤120中，将第二样本数据分别输入所述至少两个GBDT模型中，得到第二信用逾期概率集，所述第二样本数据为第二用户集的信用数据；所述至少两个GBDT模型的GBDT参数不同。In step 120, the second sample data is separately input into the at least two GBDT models to obtain a second credit overdue probability set, the second sample data is credit data of the second user set; the at least two GBDTs The model's GBDT parameters are different.

例如，用户的信用数据可以包括用户的履约能力、多头数据、信用时长、欠款总额及行为偏好等信息，将样本数据输入到GBDT模型后，可以得到用户的信用逾期概率。For example, the user's credit data may include information such as the user's performance capability, long-term data, credit duration, total amount of arrears, and behavioral preference. After inputting the sample data into the GBDT model, the user's credit overdue probability may be obtained.

其中，履约能力可以包括用户历史逾期记录，例如历史最大逾期天数和90天或180天内逾期次数等信息；多头数据可以包括用户在过往30天、60天、90天、120天和180天的时间内，在金融平台和非金融平台借款次数等信息；信用时长可以包括用户开户的时间长度、第一笔交易开始时间以及手机在网时长等信息；欠款总额可以包括，个人用户当前的在袋总额或者机构内部在贷总额及机构外部在贷总额；行为偏好可以包括用户网上注册时是否在多类网页浏览或者购买消费品、用户进行取现、虚拟交易或电商实物类交易的金额比例等信息。 The performance capability may include user history overdue records, such as historical maximum overdue days and 90 days or 180 days of overdue times; the multi-head data may include the user's past 30 days, 60 days, 90 days, 120 days, and 180 days. Within, information on the number of borrowings on the financial platform and non-financial platforms; credit duration may include information such as the length of time the user opens an account, the start time of the first transaction, and the length of time the mobile phone is on the network; the total amount of the arrears may include the current pocket of the individual user. The total amount or the total amount of loans in the organization and the total amount of loans outside the organization; behavioral preferences may include information such as whether the user is browsing on multiple types of web pages or purchasing consumer goods, the user withdraws cash, virtual transactions, or the proportion of e-commerce transactions in the online registration.

本实施例中，第一样本数据和第二样本数据中均包括多个用户的信用数据，将第一样本数据中每一个用户的所有信用数据输入到每一个GBDT模型中，得到第一样本数据中的多个用户的信用预期概率，构成上述第一信用逾期概率集，同理，得到上述第二信用逾期概率集。In this embodiment, the first sample data and the second sample data respectively include credit data of a plurality of users, and all credit data of each user in the first sample data is input into each GBDT model to obtain the first The credit expected probability of the plurality of users in the sample data constitutes the first credit overdue probability set, and similarly, the second credit overdue probability set is obtained.

在步骤130中，根据所述第一信用逾期概率集和所述第二信用逾期概率集进行KS值计算，根据计算结果，从所述至少两个GBDT模型中确定目标GBDT模型。In step 130, the KS value calculation is performed according to the first credit overdue probability set and the second credit overdue probability set, and the target GBDT model is determined from the at least two GBDT models according to the calculation result.

可选地，如图1b所示，上述步骤130可以包括步骤1310-步骤1330。Optionally, as shown in FIG. 1b, the foregoing step 130 may include steps 1310 to 1330.

在步骤1310中，根据所述第一信用逾期概率集以及所述第一用户集对应的第一实际信用逾期概率集进行KS值计算，得到第一KS集。In step 1310, the KS value calculation is performed according to the first credit overdue probability set and the first actual credit overdue probability set corresponding to the first user set, to obtain a first KS set.

在步骤1320中，根据所述第二信用逾期概率集以及所述第二用户集对应的第二实际信用逾期概率集进行KS值计算，得到第二KS集。In step 1320, the KS value calculation is performed according to the second credit overdue probability set and the second actual credit overdue probability set corresponding to the second user set, to obtain a second KS set.

例如，可以根据上述第一信用逾期概率集和所述第二信用逾期概率集，选取相应的概率阈值，根据K-S曲线的计算原理，得到第一样本数据输入每一个个GBDT模型对应的KS值，构成上述第一KS集，同理得到上述第二KS集。For example, according to the first credit overdue probability set and the second credit overdue probability set, a corresponding probability threshold may be selected, and according to the calculation principle of the KS curve, the KS value corresponding to each GBDT model of the first sample data input is obtained. The first KS set is formed, and the second KS set is obtained in the same manner.

在步骤1330中，对所述第一KS集和所述第二KS集进行比较计算，根据计算结果，从所述至少两个GBDT模型中确定所述目标GBDT模型。In step 1330, a comparison calculation is performed on the first KS set and the second KS set, and the target GBDT model is determined from the at least two GBDT models according to a calculation result.

例如，可以将第一样本确定为训练样本，将第二样本确定为测试样本，则第一KS集中的KS值可以表示为KS_train，第二KS集中的KS值可以表示为KS_test。For example, the first sample may be determined as a training sample, and the second sample may be determined as a test sample, and the KS value in the first KS set may be represented as KS_train, and the KS value in the second KS set may be represented as KS_test.

可选地，如图1c所示，上述步骤1330可以包括步骤1332-步骤1336。Optionally, as shown in FIG. 1c, the above step 1330 may include step 1332 - step 1336.

在步骤1322中，将根据相同GBDT模型得到的所述第一KS集中的KS值与所述第二KS集中的KS值进行取最小值计算，得到第三KS集。 In step 1322, the KS value of the first KS set obtained according to the same GBDT model and the KS value of the second KS set are subjected to a minimum value calculation to obtain a third KS set.

例如，可以通过函数min(KS_train，KS_test)，将根据同一GBDT模型计算得到的第一KS集中的KS值与第二KS集中的KS值进行取最小值计算，得到多个最小值，构成第三KS集。For example, the KS value of the first KS set calculated according to the same GBDT model and the KS value of the second KS set may be calculated by using the min function (KS_train, KS_test) to obtain a minimum value, which is a third value. KS set.

在步骤1334中，对所述第三KS集中包含的KS值进行取最大值计算，得到目标KS值。In step 1334, the maximum value of the KS value included in the third KS set is calculated to obtain a target KS value.

例如，将第三KS集中的多个KS值依据函数max(min(KS_train，KS_test))进行计算，得到目标的KS值，即计算出第三KS集中的KS最大值，即为目标KS值。For example, the plurality of KS values in the third KS set are calculated according to the function max(min(KS_train, KS_test)), and the KS value of the target is obtained, that is, the KS maximum value in the third KS set is calculated, that is, the target KS value.

在步骤1336中，将所述至少两个GBDT模型中与所述目标KS值对应的GBDT模型确定为所述目标GBDT模型。In step 1336, the GBDT model corresponding to the target KS value in the at least two GBDT models is determined as the target GBDT model.

在步骤140中，根据所述目标GBDT模型对用户进行信用评价。In step 140, the user is credited according to the target GBDT model.

例如，将新输入的用户信用数据输入目标GBDT模型，得到该用户的信用逾期概率，根据该用户的信用逾期概率可以评价用户的信用情况是否良好。For example, the newly input user credit data is input into the target GBDT model to obtain the credit overdue probability of the user, and the user's credit condition can be evaluated according to the credit overdue probability of the user.

可选地，如图2a所示，在上述步骤110之前，还包括步骤100。Optionally, as shown in FIG. 2a, before step 110, step 100 is further included.

在步骤100中，根据粒子群优化PSO算法，确定所述至少两个GBDT模型的GBDT参数。In step 100, the GBDT parameters of the at least two GBDT models are determined according to a particle swarm optimization PSO algorithm.

可选地，如图2b所示，步骤100可以包括：步骤1010-步骤1050。Optionally, as shown in FIG. 2b, step 100 may include: step 1010 - step 1050.

在步骤1010中，将GBDT模型中的参数个数映射为PSO算法的定义域维度。In step 1010, the number of parameters in the GBDT model is mapped to the domain dimension of the PSO algorithm.

在步骤1020中，将GBDT模型中每个所述参数的取值范围映射为PSO算法的定义域范围。In step 1020, the range of values of each of the parameters in the GBDT model is mapped to the domain of the PSO algorithm.

在步骤1030中，从所述定义域维度对应的定义域范围内抽取至少两组维度值数据，作为至少两个粒子。In step 1030, at least two sets of dimension value data are extracted from the domain of the domain corresponding to the domain dimension as at least two particles.

在步骤1040中，通过PSO算法计算所述至少两个粒子的轨迹最优点。 In step 1040, the trajectory of the at least two particles is calculated by the PSO algorithm.

其中，所述轨迹最优点是指粒子走过的轨迹中使目标函数达到最大值的点，所述目标函数为对所述第一KS集中的KS值与所述第二KS集中的KS值取最小值的函数以及The most advantageous of the trajectory refers to a point in the trajectory of the particle that makes the objective function reach a maximum value, and the objective function is that the KS value in the first KS set and the KS value in the second KS set are taken. The function of the minimum and

在步骤1050中，将所述至少两个粒子的轨迹最优点对应的维度值数据映射回GBDT模型中，得到至少两组GBDT参数。In step 1050, the dimension value data corresponding to the trajectory of the at least two particles is mapped back into the GBDT model to obtain at least two sets of GBDT parameters.

其中，PSO算法属于粒子群理论，该算法中定义N维空间中的粒子xi＝(x1，x2，……，xN)，粒子在空间的飞行速度为vi＝(v1，v2，……，vN)，每个粒子都有一个目标函数决定的适应值(fitness value)，并且每个粒子都追随整个粒子群中最优粒子在空间中进行搜索，经过多次迭代找到整个空间中的最好位置。Among them, the PSO algorithm belongs to the particle swarm theory, which defines the particles xi=(x1,x2,...,xN) in the N-dimensional space, and the flying speed of the particles in space is vi=(v1,v2,...,vN ), each particle has a fitness value determined by the objective function, and each particle follows the optimal particle in the entire particle group to search in space, and finds the best position in the whole space after multiple iterations. .

可选地，根据所述目标GBDT模型对用户进行信用评价，包括：Optionally, performing credit evaluation on the user according to the target GBDT model, including:

将所述用户的信用数据输入所述目标GBDT模型，得到所述用户的信用逾期概率；以及将所述用户的信用逾期概率与预设信用逾期概率阈值进行比较，得到所述用户的信用评价结果。Entering the credit data of the user into the target GBDT model to obtain a credit overdue probability of the user; and comparing the credit overdue probability of the user with a preset credit overdue probability threshold to obtain a credit evaluation result of the user .

例如，可以设定相应的信用逾期概率阈值，例如，当用户的信用逾期概率大于等于80％，确定用户信用较差；当用户的信用逾期概率小于80％，大于等于50％，确定用户信用一般；当用户的信用逾期概率小于50％，大于等于10％，确定用户信用良好，当用户的信用逾期概率小于10％，确定用户信用优秀。For example, the corresponding credit overdue probability threshold may be set, for example, when the user's credit overdue probability is greater than or equal to 80%, the user credit is determined to be poor; when the user's credit overdue probability is less than 80%, greater than or equal to 50%, the user credit is generally determined. When the user's credit overdue probability is less than 50%, greater than or equal to 10%, it is determined that the user's credit is good, and when the user's credit overdue probability is less than 10%, it is determined that the user credit is excellent.

图3是一实施例提供的一种信用评价装置的结构示意图，该装置可执行上述实施例提供的信用评价方法，本实施例中的模块的功能可以参考上述实施例提供的方法步骤，如图3所示，该装置可以包括：FIG. 3 is a schematic structural diagram of a credit evaluation apparatus according to an embodiment. The apparatus may perform the credit evaluation method provided by the foregoing embodiment. The function of the module in this embodiment may refer to the method steps provided in the foregoing embodiment. As shown in Figure 3, the device can include:

第一信用逾期概率获取模块310，设置为将第一样本数据分别输入至少两个梯度渐进决策树GBDT模型中，得到第一信用逾期概率集，所述第一样本数据为第一用户集的信用数据；The first credit overdue probability obtaining module 310 is configured to input the first sample data into the at least two gradient progressive decision tree GBDT models respectively, to obtain a first credit overdue probability set, where the first sample data is Credit data of the first user set;

第二信用逾期概率获取模块320，设置为将第二样本数据分别输入所述至少两个GBDT模型中，得到第二信用逾期概率集，所述第二样本数据为第二用户集的信用数据；所述至少两个GBDT模型的GBDT参数不同；The second credit overdue probability obtaining module 320 is configured to input the second sample data into the at least two GBDT models respectively to obtain a second credit overdue probability set, where the second sample data is the credit data of the second user set; The GBDT parameters of the at least two GBDT models are different;

模型确定模块330，设置为根据所述第一信用逾期概率集和所述第二信用逾期概率集进行KS值计算，根据计算结果，从所述至少两个GBDT模型中确定目标GBDT模型；以及The model determining module 330 is configured to perform KS value calculation according to the first credit overdue probability set and the second credit overdue probability set, and determine a target GBDT model from the at least two GBDT models according to the calculation result;

评价模块340，设置为根据所述目标GBDT模型对用户进行信用评价。The evaluation module 340 is configured to perform credit evaluation on the user according to the target GBDT model.

可选地，该装置还可以包括参数确定模块300，设置为在将第一样本数据分别输入至少两个梯度渐进决策树GBDT模型中之前，根据粒子群优化PSO算法，确定所述至少两个GBDT模型的GBDT参数。Optionally, the apparatus may further include a parameter determining module 300 configured to determine the at least two according to the particle swarm optimization PSO algorithm before inputting the first sample data into the at least two gradient progressive decision tree GBDT models, respectively. GBDT parameters of the GBDT model.

图4是一实施例提供的一种梯度渐进决策树参数调整方法的流程图，该方法可适用于在采用梯度渐进决策树进行建模或机器学习等计算时，对梯度渐进决策树中的参数调整的情况，该方法可以由计算设备如计算机来执行，也可以由梯度渐进决策树参数调整装置来执行，该梯度渐进决策树参数调整装置可采用软件和硬件中的至少一种方式实现，如图4所示，该方法可以包括步骤410-步骤430。FIG. 4 is a flowchart of a gradient progressive decision tree parameter adjustment method according to an embodiment. The method is applicable to parameters in a gradient progressive decision tree when performing calculations such as modeling or machine learning using a gradient progressive decision tree. In the case of adjustment, the method may be performed by a computing device such as a computer, or may be performed by a gradient progressive decision tree parameter adjustment device, and the gradient progressive decision tree parameter adjustment device may be implemented by at least one of software and hardware, such as As shown in FIG. 4, the method can include steps 410-430.

在步骤410中，依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围。In step 410, the domain dimension and the domain scope of the particle swarm optimization algorithm are determined according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter.

例如，梯度渐进决策树中的调节参数共有8个，分别为n_estimators、learning_rate、subsample、max_features、max_depth、min_samples_split、min_samples_leaf和random_state，相应的，粒子群优化算法的定义域维度为8维。 For example, there are 8 adjustment parameters in the gradient progressive decision tree, namely n_estimators, learning_rate, subsample, max_features, max_depth, min_samples_split, min_samples_leaf, and random_state. Correspondingly, the domain dimension of the particle swarm optimization algorithm is 8 dimensions.

其中，n_estimators指弱学习器的最大迭代次数，n_estimators值若太小则容易欠拟合，n_estimators值太大又容易过拟合，对n_estimators值的大小进行调节选择一个适中的值，n_estimators的取值范围可定义为[1，1000]。learning_rate指每个弱学习器的权重缩减系数，也称作步长，对于同样的训练集拟合效果，较小的步长表示需要更多的弱学习器的迭代次数，learning_rate的取值范围可定义为(0，1)。subsample指子采样，取值范围为(0，1)。max_features指最大特征数比例，取值范围可设定为(0，1)。max_depth指决策树的最大深度，其取值范围可以是(0，10)中的任一整数。min_samples_split指内部节点划分所需最小样本数，该值限制了子树继续划分的条件，如果某一节点的样本数少于min_samples_split，则不会继续再尝试选择最优特征来进行划分，min_samples_split的取值范围可设定为[1，1000]。min_samples_leaf指叶子节点的最少样本数，如果叶子节点数目小于上述最少样本数，则叶子节点会和兄弟节点一起被剪枝，当样本量不大时，该值起到的作用较小，当样本量数量级非常大，则适应性的调高该值。random_state参数用于随机划分训练样本(即建模样本)和测试样本，取值范围可定义为[1，1000]。Among them, n_estimators refers to the maximum number of iterations of the weak learner. If the value of n_estimators is too small, it is easy to underfit. The value of n_estimators is too large and easy to overfit. Adjust the size of the value of n_estimators to choose a moderate value. The value of n_estimators The range can be defined as [1,1000]. Learning_rate refers to the weight reduction coefficient of each weak learner, also called the step size. For the same training set fitting effect, the smaller step size indicates that more weak learner iterations are needed, and the learning_rate value range can be Defined as (0,1). Subsample refers to subsampling, with a range of (0,1). Max_features refers to the maximum number of features, the value range can be set to (0,1). Max_depth refers to the maximum depth of the decision tree, which can be any integer in (0, 10). Min_samples_split refers to the minimum number of samples required for internal node division. This value limits the condition for subtree to continue to divide. If the number of samples of a node is less than min_samples_split, then it will not continue to try to select the optimal feature for partitioning. The min_samples_split is taken. The value range can be set to [1,1000]. Min_samples_leaf refers to the minimum number of samples of leaf nodes. If the number of leaf nodes is less than the minimum number of samples mentioned above, the leaf nodes will be pruned together with the sibling nodes. When the sample size is not large, the value plays a small role when the sample size If the order of magnitude is very large, adjust the value adaptively. The random_state parameter is used to randomly divide the training samples (ie, modeled samples) and test samples, and the range of values can be defined as [1,1000].

将上述参数以及对应的取值范围映射到粒子群算法的定义域中，得到粒子群优化算法的定义域维度以及定义域范围。其中，粒子群优化算法(Particle Swarm Optimization，PSO)为一种基于种群的随机优化算法，该算法可以模仿昆虫、兽群、鸟群和鱼群等的群集行为，这些群体按照一种合作的方式寻找食物，群体中的每个成员通过学习自身的经验和其他成员的经验来不断改变其搜索模式。本实施例中选择粒子群优化算法进行决策树参数的调整为例进行说明，还可以使用其它随机优化算法进行决策树参数的调整。The above parameters and corresponding range of values are mapped into the domain of the particle swarm optimization algorithm to obtain the domain dimension and the domain scope of the particle swarm optimization algorithm. Among them, Particle Swarm Optimization (PSO) is a population-based stochastic optimization algorithm that can mimic the clustering behavior of insects, herds, flocks and fish groups. These groups are in a cooperative way. Looking for food, each member of the group constantly changes its search model by learning its own experience and the experience of other members. In this embodiment, the particle swarm optimization algorithm is selected to adjust the decision tree parameters as an example, and other stochastic optimization algorithms may also be used to adjust the decision tree parameters.

在步骤420中，设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围，得到粒子群中每个粒子的轨迹最优点。In step 420, setting initial parameters of the particle swarm optimization algorithm according to the particle group optimization The algorithm, the domain dimension, and the scope of the domain are derived to obtain the most advantageous trajectory of each particle in the particle group.

其中，粒子群优化算法的初始参数可以设置为(ω，φ₁，φ₂)，其中，ω为冲量项，取值在(0，1)之间(可定义为0.5)，φ₁大小可自定义，如定义为0.5，φ₂大小为PSO的设定参数，可定义为0.5，指定粒子种群的数量(popsize)为100，对此100个粒子的速度和位置进行随机赋值，通过粒子的当前位置以及当前速度进行粒子位置的更新，根据目标函数的值更新粒子的速度。例如，PSO算法根据每个粒子曾经走过的轨迹最优点以及100个粒子中全局轨道最优点结合当前粒子的速度来对粒子的下一速度和下一的位置进行更新，公式如下：Among them, the initial parameters of the particle swarm optimization algorithm can be set to (ω, φ ₁ , φ ₂ ), where ω is the impulse term, the value is between (0, 1) (can be defined as 0.5), and the size of φ ₁ can be Custom, if defined as 0.5, φ ₂ size is the setting parameter of PSO, which can be defined as 0.5, the number of specified particle populations (popsize) is 100, and the speed and position of 100 particles are randomly assigned through the particles. The current position and the current speed are updated by the particle position, and the speed of the particle is updated according to the value of the objective function. For example, the PSO algorithm updates the next and next positions of the particle based on the best merit of the trajectory that each particle has traveled and the global orbital best of the 100 particles combined with the speed of the current particle. The formula is as follows:

其中，v_i+1表示粒子的下一速度，v_i代表粒子的当前速度，ω为冲量项，U(0，φ₁)为均匀分布在(0，φ₁)之间的随机数，U(0，φ₂)为均匀分布在(0，φ₂)之间的随机数，

为该粒子的轨迹最优点，代表粒子曾经走过的使目标函数达到最大值的点，

为全局最优点，即所有粒子走过的点中使目标函数达到最大值的点，x_i表示粒子的当前位置，x_i+1表示粒子的下一位置，。Where v _i+1 represents the next velocity of the particle, v _i represents the current velocity of the particle, ω is the impulse term, and U(0, φ ₁ ) is a random number uniformly distributed between (0, φ ₁ ), U (0, φ ₂ ) is a random number uniformly distributed between (0, φ ₂ ),

The best point for the trajectory of the particle, which represents the point at which the particle has reached the maximum value of the objective function.

For the global best, the point at which all particles travel through the point where the objective function reaches its maximum value, x _i represents the current position of the particle, and x _i+1 represents the next position of the particle.

可选地，将通过PSO算法计算得到的100个粒子中每个粒子的轨迹最优点进行记录。Optionally, the trajectory of each of the 100 particles calculated by the PSO algorithm is recorded as the most advantageous.

在步骤430中，依据所述轨迹最优点确定梯度渐进决策树的参数值。In step 430, the parameter values of the gradient progressive decision tree are determined based on the trajectory best.

依据记录的粒子的轨迹最优点来确定得到最终梯度渐进决策树的参数值。其中，目标函数为训练样本的KS值和测试样本的KS值的最小值函数，即min(KS_train，KS_test)，粒子的轨迹最优点依据粒子群优化算法最大化目标函数得到，即根据max(min(KS_test，KS_train))函数得到。 The parameter values of the final gradient progressive decision tree are determined based on the trajectory of the recorded particles. Wherein, the objective function is the minimum function of the KS value of the training sample and the KS value of the test sample, ie min(KS_train, KS_test), and the most advantageous trajectory of the particle is obtained according to the maximum objective function of the particle swarm optimization algorithm, that is, according to max(min) (KS_test, KS_train)) function is obtained.

其中，依据所述轨迹最优点以及对应的目标函数的值的大小确定梯度渐进决策树的参数值，所述目标函数为训练样本KS值和测试样本的KS值的最小值函数。其中，KS值是在模型中用于区分预测正负样本分隔程度的评价指标，KS值的取值范围是[0，1]，表示模型的分隔能力。The parameter value of the gradient progressive decision tree is determined according to the best advantage of the trajectory and the value of the corresponding objective function, and the objective function is a minimum function of the KS value of the training sample and the KS value of the test sample. Among them, the KS value is an evaluation index used to distinguish the degree of separation between positive and negative samples in the model. The value range of the KS value is [0, 1], indicating the separation ability of the model.

本实施例中的GBDT模型可以作为信用评分模型，样本数据可以为用户的信用信息，如用户的履约能力、多头数据、信用时长、欠款总额及行为偏好等信息，将样本数据输入到GBDT模型后，可以得到用户的信用逾期概率。本实施例中调整梯度渐进决策树参数的过程可以包括步骤11-步骤18：The GBDT model in this embodiment can be used as a credit scoring model, and the sample data can be the user's credit information, such as the user's performance capability, long data, credit duration, total amount of arrears, and behavioral preferences, and the sample data is input into the GBDT model. After that, you can get the user's credit overdue probability. The process of adjusting the gradient progressive decision tree parameters in this embodiment may include steps 11 - 18:

在步骤11中，根据GBDT中的参数个数及每个参数的取值范围，映射到PSO算法的定义域中，得到PSO算法的定义域维度以及定义域范围。In step 11, according to the number of parameters in the GBDT and the range of values of each parameter, the domain is mapped to the domain of the PSO algorithm, and the domain dimension and the domain scope of the PSO algorithm are obtained.

在步骤12中，可以在PSO算法的定义域维度以及定义域范围随机抽取100组数据，即上述100个粒子。In step 12, 100 sets of data, that is, the above 100 particles, can be randomly extracted in the domain dimension and the domain range of the PSO algorithm.

在步骤13中，可以根据上述抽取的100个粒子的轨迹最优点，及全局轨迹最优点，依据上述公式(1)进行计算，并更新每个粒子下一步的位置，直至根据每个粒子的适应值(fitness value)比较确定出每个粒子的轨迹最优点。In step 13, according to the trajectory of the extracted 100 particles, and the best advantage of the global trajectory, the calculation is performed according to the above formula (1), and the next position of each particle is updated until the adaptation according to each particle The fitness value comparison determines the most advantageous trajectory of each particle.

例如，上述粒子可以为：For example, the above particles can be:

[n_estimators，learning_rate，subsample，max_features，max_depth，min_samples_split，min_samples_leaf，random_state]，更新粒子的位置可以理解为，上一步该粒子的位置为：[50，0.1，0.8，0.7，5，900，500，70]，根据PSO的公式可以将该粒子的位置更新到另一个位置为：[52，0.096，0.73，0.65，4，903，495，69]。[n_estimators, learning_rate, subsample, max_features, max_depth, min_samples_split, min_samples_leaf, random_state], updating the position of the particle can be understood as the position of the particle in the previous step: [50, 0.1, 0.8, 0.7, 5, 900, 500, 70 ], according to the formula of PSO, the position of the particle can be updated to another position: [52, 0.096, 0.73, 0.65, 4, 903, 495, 69].

在步骤14中，根据上述100个粒子的轨迹最优点对应的维度值，映射回GBDT中，得到对应的100组GBDT参数取值。 In step 14, according to the dimension value corresponding to the trajectory of the above 100 particles, the mapping is returned to the GBDT, and the corresponding 100 sets of GBDT parameters are obtained.

在S15中，将上述得到的100组GBDT参数，逐组带入用于进行信用卡评分的GBDT模型中，并分别代入训练样本数据和测试样本数据，得到相应用户的信用逾期概率值。In S15, the 100 sets of GBDT parameters obtained above are grouped into the GBDT model for credit card scoring, and the training sample data and the test sample data are respectively substituted to obtain the credit overdue probability value of the corresponding user.

在步骤16中，根据用户的真实信用逾期概率和根据GBDT模型得到的信用逾期概率，对每组用户的信用逾期概率值进行KS值计算，得到训练样本数据的100个KS值(即KS-test)和测试样本数据的100个KS值(即KS-train)。In step 16, according to the real credit overdue probability of the user and the credit overdue probability obtained according to the GBDT model, the KS value is calculated for each group of users' credit overdue probability values, and 100 KS values of the training sample data are obtained (ie, KS-test) And test the sample data for 100 KS values (ie KS-train).

在步骤17中，根据max(min(KS-train，KS-test))，得到目标KS-test值。In step 17, the target KS-test value is obtained according to max(min(KS-train, KS-test)).

其中，KS-train是根据训练样本数据计算得到的KS值，KS-test为根据测试样本数据计算得到的KS值，对于一组GBDT的参数，对应一个KS-train值和一个KS-test值，本实施例在PSO算法中设置了100组粒子，因而就有100组GBDT参数，对应100个KS-train和100个KS-test值，将每组GBDT参数对应的KS-train和KS-test，根据max(min(KS-train，KS-test))进行比较计算，从而得到目标KS-test值。KS-train is the KS value calculated based on the training sample data, KS-test is the KS value calculated according to the test sample data, and for a set of GBDT parameters, corresponding to a KS-train value and a KS-test value, In this embodiment, 100 sets of particles are set in the PSO algorithm, so there are 100 sets of GBDT parameters, corresponding to 100 KS-trains and 100 KS-test values, and KS-train and KS-test corresponding to each set of GBDT parameters, The comparison calculation is performed according to max(min(KS-train, KS-test)), thereby obtaining the target KS-test value.

例如，对100组GBDT参数对应的KS-train和KS-test，根据min(KS-train，KS-test)进行比较，得到100个较小的KS值，从100个较小的KS值中选择最大的KS值，从而得到目标KS-test值。For example, comparing KS-train and KS-test corresponding to 100 sets of GBDT parameters, according to min (KS-train, KS-test), 100 smaller KS values are obtained, and 100 smaller KS values are selected. The maximum KS value, resulting in the target KS-test value.

在步骤18中，采用目标KS-test值对应的目标GBDT模型对用户进行信用评价。In step 18, the user is credit evaluated using the target GBDT model corresponding to the target KS-test value.

例如，将目标KS-test对应的GBDT参数值作为GBDT模型的参数值，得到目标GBDT模型，将新用户的信用信息输入到目标GBDT模型中，得到该新用户的信用逾期概率，可以设定逾期概率阈值，当用户的信用逾期概率达到该概率阈值时，则该用户的信用较低。也可以设定多个信用逾期概率范围及对应的信用等级。For example, the GBDT parameter value corresponding to the target KS-test is used as the parameter value of the GBDT model to obtain the target GBDT model, and the credit information of the new user is input into the target GBDT model, and the credit overdue probability of the new user is obtained, and the overdue period can be set. The probability threshold, when the user's credit overdue probability reaches the probability threshold, then the user's credit is lower. You can also set multiple credit overdue probability ranges and corresponding letters. Use rank.

在本实施例中，选取目标函数值为最大的值时的轨迹最优点对应的参数值作为决策树的参数值，由此可以最大化训练样本的KS值和测试样本的KS值。选择的目标函数是最大化min(KS_train，KS_test)可以有效防止测试样本KS高于训练样本KS，而且可以很好地使训练和测试样本的KS值接近，由此得到泛化能力较强的模型。In this embodiment, the parameter value corresponding to the trajectory best advantage when the objective function value is the largest value is selected as the parameter value of the decision tree, thereby maximizing the KS value of the training sample and the KS value of the test sample. The selected objective function is to maximize min(KS_train, KS_test), which can effectively prevent the test sample KS from being higher than the training sample KS, and can well make the KS values of the training and test samples close, thereby obtaining a model with strong generalization ability. .

可选地，对原始数据集进行分类，划分为训练样本和测试样本，其中，原始数据集可以为预测信用逾期概率的建模样本数据。Optionally, the original data set is classified into training samples and test samples, wherein the original data set may be modeled sample data for predicting credit overdue probability.

定义PSO算法中的popsize＝100，generation＝100，ω＝0.5，φ₁＝0.5，φ₂＝0.5，运算得到轨迹最优点集合中目标函数值(fitness value)最大的轨迹最优点对应的参数值如下(其中fitness value为0.44368566870386)：Define popsize=100, generation=100, ω=0.5, φ ₁ =0.5, φ ₂ =0.5 in the PSO algorithm, and calculate the parameter value corresponding to the most advantageous trajectory with the largest objective value (the fitness value) in the most advantageous set of trajectories. As follows (where the fitness value is 0.44368566870386):

n_estimators＝89.9755412363669，learning_rate＝0.255267311338214，Subsample＝0.861905071771738，max_features＝0.786393083477439，max_depth＝5.51493470652752，min_samples_split＝788.538534238246，min_samples_leaf＝318.682482373024，random_state＝678.303928724576。N_estimators=89.9755412363669, learning_rate=0.255267311338214, Subsample=0.861905071771738, max_features=0.786393083477439, max_depth=5.51493470652752, min_samples_split=788.538534238246, min_samples_leaf=318.682482373024, random_state=678.303928724576.

将轨迹最优点对应的上述参数映射回决策树参数时，对需要取整的参数进行自动取整，如参数n_estimators取值必须为整数，则相应的，对参数值进行向下取整，得到结果为89。When the above parameters corresponding to the most advantageous trajectory are mapped back to the decision tree parameter, the parameters that need to be rounded are automatically rounded. For example, the value of the parameter n_estimators must be an integer. Then, the parameter value is rounded down to obtain the result. Is 89.

相关技术在进行决策树的参数调整时无法进行进行全局搜索，所调整参数的精度不高，而手动调参需要不断的人工设定梯度决策树的参数值，再根据结果逐个进行多次调整。本实施例提供了一种梯度渐进决策树参数调整方法，能够避免陷入单一区域的局部最优搜索，无需人工确定参数的取值及进行参数的逐一测试，通过本实施例提供的GBDT的参数值调整方法比手动调参在测试样本上的KS值更高，得到的模型更加稳定。The related technology cannot perform global search when adjusting the parameters of the decision tree, and the accuracy of the adjusted parameters is not high, and manual parameter adjustment needs to manually set the parameter values of the gradient decision tree, and then perform multiple adjustments one by one according to the results. The embodiment provides a gradient progressive decision tree parameter adjustment method, which can avoid the local optimal search in a single region, without manually determining the value of the parameter and performing the parameter test one by one, and the parameter value of the GBDT provided by the embodiment. Adjustment method is more than manual adjustment in the test sample The upper KS value is higher and the resulting model is more stable.

图5是一实施例提供的另一种梯度渐进决策树参数调整方法的流程图，如图5所示，本实施例提供的方法可以包括步骤510-步骤530。FIG. 5 is a flowchart of another gradient progressive decision tree parameter adjustment method according to an embodiment. As shown in FIG. 5, the method provided in this embodiment may include steps 510-530.

在步骤510中，依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围。In step 510, the domain dimension and the domain scope of the particle swarm optimization algorithm are determined according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter.

在步骤520中，设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点。In step 520, an initial parameter of the particle swarm optimization algorithm is set, and a trajectory of each particle in the particle group is obtained according to the particle swarm optimization algorithm, the domain dimension, and the domain definition.

在步骤530中，依据所述轨迹最优点确定对应的周边点，依据所述周边点对应的目标函数的值的大小确定梯度渐进决策树的参数值。In step 530, the corresponding peripheral point is determined according to the trajectory best advantage, and the parameter value of the gradient progressive decision tree is determined according to the magnitude of the value of the objective function corresponding to the peripheral point.

其中，轨迹最优点的周边点以所述轨迹最优点为起始点依据爬山(Hill Climbing)算法得到，所述目标函数为训练样本和测试样本的KS值的最小值函数，例如，轨迹最优点的周边点由爬山算法最大化目标函数(即max(min(KS_test，KS_train)))得到，使得确定出的梯度渐进决策树的参数更优。爬山算法是一种局部择优的方法，采用启发式方法，是对深度优先搜索的一种改进，该算法利用反馈信息生成解的决策。由于本实施例中轨迹最优点的周边点中可能存在更优的轨迹点，故采用爬山算法进行运算以找到比轨迹最优点更优的周边点。Wherein, the peripheral point of the most advantageous trajectory is obtained according to the Hill Climbing algorithm with the best advantage of the trajectory as a starting point, and the objective function is a minimum function of the KS value of the training sample and the test sample, for example, the most advantageous trajectory The surrounding points are obtained by the hill climbing algorithm to maximize the objective function (ie, max(KS_test, KS_train)), so that the parameters of the determined gradient progressive decision tree are better. Mountain climbing algorithm is a local optimization method. Heuristic method is an improvement on depth-first search. The algorithm uses feedback information to generate solution decisions. Since there may be better trajectory points in the peripheral points of the most advantageous trajectory in this embodiment, the hill climbing algorithm is used to perform operations to find peripheral points that are superior to the trajectory best.

例如，定义爬山算法中8个参数的步长，可以是如下所示的步长：For example, define the step size of the 8 parameters in the hill climbing algorithm, which can be the step size as follows:

n_estimators步长为1，learning_rate步长为0.01，Subsample步长为0.01，max_features步长为0.01，max_depth步长为1，min_samples_split步长为20，min_samples_leaf步长为20，random_state步长为1。The n_estimators step size is 1, the learning_rate step size is 0.01, the Subsample step size is 0.01, the max_features step size is 0.01, the max_depth step size is 1, the min_samples_split step size is 20, the min_samples_leaf step size is 20, and the random_state step size is 1.

根据上述定义的步长，逐个测试轨迹最优点的周边点，测试过程中，找到目标函数值上升最大点的为下一步的起始点，如果不存在目标函数值增长的点则停止运算，停止运算时对应的周边点则为轨迹最优点。According to the step size defined above, test the peripheral points of the most advantageous trajectory one by one, during the test, find The maximum point of the increase of the objective function value is the starting point of the next step. If there is no point where the value of the objective function increases, the operation is stopped, and the corresponding peripheral point when the operation is stopped is the best advantage of the track.

本实施例提供了一种梯度渐进决策树参数调整方法，依据轨迹最优点确定对应的周边点，依据周边点对应的目标函数的值的大小确定梯度渐进决策树的参数值，改善了参数调节的结果。The embodiment provides a gradient progressive decision tree parameter adjustment method, and determines a corresponding peripheral point according to the trajectory best advantage, and determines a parameter value of the gradient progressive decision tree according to the value of the objective function corresponding to the surrounding point, thereby improving parameter adjustment. result.

例如，对同一GBDT模型，手动调参得到的KS_train值为58.19％，KS_test值为41.57％，使用PSO算法调参确定的KS_train值为45.19％，KS_test值为44.12％，使用PSO算法加爬山算法得到的KS_train值为50.37％，KS_test值为45.22％，由此可知，使用PSO算法加爬山算法确定的KS值高于采用PSO算法得到的KS值，同时采用PSO算法及采用PSO和爬山算法两种方式得到的训练样本的KS值与测试样本的KS值的差值均小于手动调参得到的训练样本的KS值与测试样本的KS值的差值。For example, for the same GBDT model, the KS_train value obtained by manual tuning is 58.19%, the KS_test value is 41.57%, the KS_train value determined by the PSO algorithm is 45.19%, and the KS_test value is 44.12%. The PSO algorithm is used to add the hill climbing algorithm. The KS_train value is 50.37% and the KS_test value is 45.22%. It can be seen that the KS value determined by the PSO algorithm plus the hill climbing algorithm is higher than the KS value obtained by the PSO algorithm, and the PSO algorithm and the PSO and the hill climbing algorithm are adopted. The difference between the obtained KS value of the training sample and the KS value of the test sample is smaller than the difference between the KS value of the training sample obtained by the manual adjustment and the KS value of the test sample.

例如，还可根据PSO算法得到的全局最优点进行爬山算法的进一步优化，对应得到的KS_train值为45.54％，KS_test值为44.46％，效果介于仅用PSO算法以及使用PSO算法结合爬山算法之间。For example, the hill-climbing algorithm can be further optimized according to the global best obtained by the PSO algorithm, and the corresponding KS_train value is 45.54%, and the KS_test value is 44.46%. The effect is between using only the PSO algorithm and using the PSO algorithm combined with the hill-climbing algorithm. .

图6是一实施例提供的另一种梯度渐进决策树参数调整方法的流程图，如图6所示，本实施例提供的方法可以包括步骤610-步骤630。FIG. 6 is a flowchart of another gradient progressive decision tree parameter adjustment method according to an embodiment. As shown in FIG. 6, the method provided in this embodiment may include steps 610-630.

在步骤610中，依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围。In step 610, the domain dimension and the domain scope of the particle swarm optimization algorithm are determined according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter.

在步骤620中，设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点。In step 620, an initial parameter of the particle swarm optimization algorithm is set, and a trajectory of each particle in the particle group is obtained according to the particle swarm optimization algorithm, the domain dimension, and the domain definition.

在步骤630中，依据所述轨迹最优点确定对应的周边点，对所述周边点对应的目标函数的值的大小进行排序，选择最大的目标函数的值对应的周边点对应的参数值作为梯度渐进决策树的参数值。In step 630, a corresponding peripheral point is determined according to the most advantageous trajectory, and the peripheral point pair is determined. The value of the value of the objective function should be sorted, and the parameter value corresponding to the peripheral point corresponding to the value of the largest objective function is selected as the parameter value of the gradient progressive decision tree.

可选地，对轨迹最优点的周边点对应的目标函数的值的大小进行排序，选择最大的目标函数的值对应的周边点对应的参数值作为梯度渐进决策树的参数值。通过自动排序选取排序结果中目标函数值最大的周边点对应的参数值，例如，目标函数值最大的周边点对应的参数值如下(其中fitness value为0.456814121199906)：Optionally, the value of the value of the objective function corresponding to the peripheral point of the most advantageous trajectory is sorted, and the parameter value corresponding to the peripheral point corresponding to the value of the largest objective function is selected as the parameter value of the gradient progressive decision tree. The parameter values corresponding to the peripheral points having the largest target function value in the sort result are selected by automatic sorting. For example, the parameter values corresponding to the peripheral points having the largest target function value are as follows (where the fitness value is 0.456814121199906):

n_estimators＝89.944668235715，learning_rate＝0.253604654375516，subsample＝0.84938040034035，max_features＝0.791557099759923，max_depth＝5.52083587628895，min_samples_split＝785.648574406732，min_samples_leaf＝323.345684890637，random_state＝683.655366674717。N_estimators=89.944668235715, learning_rate=0.253604654375516, subsample=0.84938040034035, max_features=0.791557099759923, max_depth=5.52083587628895, min_samples_split=785.648574406732, min_samples_leaf=323.345684890637, random_state=683.655366674717.

还可以仅选取轨迹最优点中目标函数值最大的点进行爬山算法得到周边点，将周边点对应的各个维度的值确定为梯度渐进决策树的参数值。It is also possible to select only the point where the objective function value is the largest among the trajectories, and the climbing algorithm obtains the surrounding points, and the values of the respective dimensions corresponding to the surrounding points are determined as the parameter values of the gradient progressive decision tree.

本实施例提供的梯度渐进决策树参数调整方法，可以提高了梯度渐进决策树的参数调整效率，避免了调整过程中陷入单一区域的局部最优搜索，对参数空间的搜索范围更广。The gradient progressive decision tree parameter adjustment method provided in this embodiment can improve the parameter adjustment efficiency of the gradient progressive decision tree, avoids the local optimal search that falls into a single region during the adjustment process, and has a wider search range for the parameter space.

大部分银行信用卡评分模型开发框架是基于数理统计理论的，变量(即参数)要在模型中产生发挥作用，则变量与输出变量是统计显著，对数据量和变量信息强度要求很高。Most bank credit card scoring model development frameworks are based on mathematical statistics theory. Variables (ie, parameters) must play a role in the model. Variables and output variables are statistically significant, and data volume and variable information strength are very high.

相较于相关技术中的统计学方法，梯度渐进决策树在解决分类问题和回归问题时，具备更强的拟合能力和分类能力，能够更有效地利用样本数据中的弱变量信息，但是过强的拟合能力可能会在测试集上出现过拟合现象。为了克服过拟合现象，算法参数选择非常重要。目前实践中大量依赖于人工选择参数，本实例提供了自动化选择参数的方案。Compared with the statistical methods in the related art, the gradient progressive decision tree has stronger fitting ability and classification ability when solving classification problems and regression problems, and can more effectively utilize the weak variable information in the sample data, but A strong fit may result in over-fitting on the test set. In order to overcome the over-fitting phenomenon, algorithm parameter selection is very important. In practice, a large amount of practice relies on manual selection of parameters. This example provides a scheme for automatically selecting parameters.

图7是一实施例提供的一种梯度渐进决策树参数调整装置的结构示意图，该装置可执行上述实施例提供的梯度渐进决策树参数调整方法，具备执行方法相应的功能模块和有益效果。如图7所示，该装置可以包括：映射模块701、轨迹最优点确定模块702和参数确定模块703。FIG. 7 is a schematic structural diagram of a gradient progressive decision tree parameter adjustment apparatus according to an embodiment. The apparatus may perform the gradient progressive decision tree parameter adjustment method provided by the foregoing embodiment, and has a corresponding functional module and a beneficial effect of the execution method. As shown in FIG. 7, the apparatus may include: a mapping module 701, a trajectory best advantage determining module 702, and a parameter determining module 703.

其中，映射模块701设置为依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围；The mapping module 701 is configured to determine a domain dimension and a domain range of the particle group optimization algorithm according to the number of adjustment parameters in the gradient progressive decision tree and the value range corresponding to each parameter;

轨迹最优点确定模块702设置为设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点；The trajectory most advantageous determination module 702 is configured to set initial parameters of the particle swarm optimization algorithm, and obtain the trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension, and the domain definition range. ;

参数确定模块703设置为依据所述轨迹最优点确定梯度渐进决策树的参数值。The parameter determination module 703 is arranged to determine a parameter value of the gradient progressive decision tree based on the trajectory best.

在本实施例中，依据梯度渐进决策树中调节参数的数目以及每个参数对应的取值范围确定粒子群优化算法的定义域维度以及定义域范围，设定所述粒子群优化算法的初始参数，根据所述粒子群优化算法、所述定义域维度以及所述定义域范围得到粒子群中每个粒子的轨迹最优点，依据所述轨迹最优点确定梯度渐进决策树的参数值，提高了梯度渐进决策树的参数调整效率，避免了调整过程中陷入单一区域的局部最优搜索，对参数空间的搜索范围更广。In this embodiment, determining the domain dimension and the domain range of the particle swarm optimization algorithm according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter, and setting initial parameters of the particle swarm optimization algorithm Determining the trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension and the domain definition, determining the parameter value of the gradient progressive tree according to the trajectory best, and improving the gradient The parameter adjustment efficiency of the progressive decision tree avoids the local optimal search that falls into a single region during the adjustment process, and has a wider search range for the parameter space.

可选的，所述参数确定模块703设置为：Optionally, the parameter determining module 703 is configured to:

依据所述轨迹最优点以及对应的目标函数的值的大小确定梯度渐进决策树的参数值，所述目标函数为训练样本和测试样本的KS值的最小值函数。A parameter value of the gradient progressive decision tree is determined according to the trajectory best advantage and the magnitude of the value of the corresponding objective function, the objective function being a minimum function of the KS value of the training sample and the test sample.

依据所述轨迹最优点确定对应的周边点，所述轨迹最优点的周边点以所述轨迹最优点为起始点依据爬山算法得到；Determining corresponding peripheral points according to the most advantageous trajectory, and the peripheral points of the most advantageous trajectory are as described The most advantageous advantage of the trajectory is that the starting point is obtained according to the hill climbing algorithm;

依据所述周边点对应的目标函数的值的大小确定梯度渐进决策树的参数值，所述目标函数为训练样本和测试样本的KS值的最小值函数。The parameter value of the gradient progressive decision tree is determined according to the magnitude of the value of the objective function corresponding to the peripheral point, and the objective function is a minimum function of the KS value of the training sample and the test sample.

对所述周边点对应的目标函数的值的大小进行排序，选择最大的目标函数的值对应的周边点对应的参数值作为梯度渐进决策树的参数值。The size of the value of the objective function corresponding to the peripheral point is sorted, and the parameter value corresponding to the peripheral point corresponding to the value of the largest objective function is selected as the parameter value of the gradient progressive decision tree.

可选的，所述梯度渐进决策树的调节参数的数目为8，所述定义域范围为每个调节参数的最小值到最大值的区间。Optionally, the number of adjustment parameters of the gradient progressive decision tree is 8, and the range of the definition domain is a range from a minimum value to a maximum value of each adjustment parameter.

一实施例还提供一种计算机可读存储介质，存储有计算机可执行指令，所述计算机可执行指令用于执行上述任意一种信用评价方法。An embodiment further provides a computer readable storage medium storing computer executable instructions for performing any of the credit evaluation methods described above.

一实施例还提供一种包含计算机可执行指令的存储介质，所述计算机可执行指令在由计算机处理器执行时可以执行上述实施例提供的任意一种梯度渐进决策树参数调整方法。An embodiment further provides a storage medium containing computer executable instructions that, when executed by a computer processor, can perform any of the gradient progressive decision tree parameter adjustment methods provided by the above embodiments.

上述存储介质可以是不同类型的存储器设备或存储设备。可以包括：安装介质，例如CD-ROM、软盘或磁带装置；计算机系统存储器或随机存取存储器，诸如DRAM、DDR RAM、SRAM、EDO RAM，兰巴斯(Rambus)RAM等；非易失性存储器，诸如闪存、磁介质(例如硬盘或光存储)；寄存器或其它相似类型的存储器元件等。存储介质可以还包括其它类型的存储器或组合。另外，存储介质可以位于程序在其中被执行的第一计算机系统中，或者可以位于不同的第二计算机系统中，第二计算机系统通过网络(诸如因特网)连接到第一计算机系统。第二计算机系统可以提供程序指令给第一计算机用于执行。存储介质还可以包括驻留在不同位置中(例如在通过网络连接的不同计算机系统中)的两个或更多存储介质。存储介质可以存储可由一个或多个处理器执行的程序指令(例如计算机程序)。The above storage medium may be a different type of memory device or storage device. These may include: a mounting medium such as a CD-ROM, a floppy disk or a tape device; a computer system memory or a random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory , such as flash memory, magnetic media (such as hard disk or optical storage); registers or other similar types of memory components, and the like. The storage medium may also include other types of memory or combinations. Additionally, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system, the second computer system being coupled to the first computer system via a network, such as the Internet. The second computer system can provide program instructions to the first computer for execution. Storage media may also include two or more storage media that reside in different locations (eg, in different computer systems connected through a network). A storage medium may store program instructions executable by one or more processors (eg, computer programs) sequence).

一实施例提供一种数据处理设备，该数据处理设备可以为填补器，如图8所示，是一实施例提供的一种数据处理设备的硬件结构示意图，该数据处理设备可以包括：处理器(processor)810和存储器(memory)820；还可以包括通信接口(Communications lnterface)830和总线840。An embodiment provides a data processing device, which may be a filler, as shown in FIG. 8, is a hardware structure diagram of a data processing device provided by an embodiment, and the data processing device may include: a processor (processor) 810 and memory 820; may also include a communication interface 830 and a bus 840.

其中，处理器810、存储器820和通信接口830可以通过总线840完成相互间的通信。通信接口830可以用于信息传输。处理器810可以调用存储器820中的逻辑指令，以执行上述实施例的任意一种方法。The processor 810, the memory 820, and the communication interface 830 can complete communication with each other through the bus 840. Communication interface 830 can be used for information transfer. Processor 810 can invoke logic instructions in memory 820 to perform any of the methods of the above-described embodiments.

存储器820可以包括存储程序区和存储数据区，存储程序区可以存储操作系统和至少一个功能所需的应用程序。存储数据区可以存储根据数据处理设备的使用所创建的数据等。此外，存储器可以包括，例如，随机存取存储器的易失性存储器，还可以包括非易失性存储器。例如至少一个磁盘存储器件、闪存器件或者其他非暂态固态存储器件。The memory 820 may include a storage program area and a storage data area, and the storage program area may store an operating system and an application required for at least one function. The storage data area can store data and the like created according to the use of the data processing device. Further, the memory may include, for example, a volatile memory of a random access memory, and may also include a non-volatile memory. For example, at least one disk storage device, flash memory device, or other non-transitory solid state storage device.

此外，在上述存储器820中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，该逻辑指令可以存储在一个计算机可读取存储介质中。本公开的技术方案可以以计算机软件产品的形式体现出来，该计算机软件产品可以存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本实施例所述方法的全部或部分步骤。Moreover, when the logic instructions in the memory 820 described above can be implemented in the form of software functional units and sold or used as separate products, the logic instructions can be stored in a computer readable storage medium. The technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium, and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) All or part of the steps of the method described in this embodiment are performed.

上述实施例方法中的全部或部分流程，是可以通过计算机程序来指示相关的硬件完成的，该程序可存储于一个非暂态计算机可读存储介质中，该程序被执行时，可包括如上述方法的实施例的流程。All or part of the processes in the foregoing embodiment may be completed by a computer program indicating related hardware, and the program may be stored in a non-transitory computer readable storage medium, and when the program is executed, may include the above The flow of an embodiment of the method.

Industrial applicability

本公开提供的信用评价方法和装置以及梯度渐进决策树参数调整方法和装置，可以实现提高GBDT模型的参数调整效率，提高GBDT模型的稳定性。 The credit evaluation method and device provided by the present disclosure and the gradient progressive decision tree parameter adjustment method and device can improve the parameter adjustment efficiency of the GBDT model and improve the stability of the GBDT model.

Claims

A credit evaluation method, including:

The first sample data is respectively input into at least two gradient progressive decision tree GBDT models to obtain a first credit overdue probability set, and the first sample data is credit data of the first user set;

Entering the second sample data into the at least two GBDT models respectively to obtain a second credit overdue probability set, the second sample data being credit data of the second user set; the GBDT parameters of the at least two GBDT models are different ;

Performing a KS value calculation according to the first credit overdue probability set and the second credit overdue probability set, and determining a target GBDT model from the at least two GBDT models according to the calculation result;

The user is credit evaluated according to the target GBDT model.

The method according to claim 1, wherein the KS value calculation is performed according to the first credit overdue probability set and the second credit overdue probability set, and the target GBDT model is determined from the at least two GBDT models according to the calculation result, include:

And performing KS value calculation according to the first credit overdue probability set and the first actual credit overdue probability set corresponding to the first user set, to obtain a first KS set;

And performing a KS value calculation according to the second credit overdue probability set and the second actual credit overdue probability set corresponding to the second user set, to obtain a second KS set;

Performing a comparison calculation on the first KS set and the second KS set, and determining the target GBDT model from the at least two GBDT models according to a calculation result.

The method according to claim 2, wherein the first KS set and the second KS set are compared and calculated, and the target GBDT model is determined from the at least two GBDT models according to a calculation result, including :

Calculating a minimum value of the KS value of the first KS set obtained from the same GBDT model and the KS value of the second KS set to obtain a third KS set;

Calculating a maximum value of the KS value included in the third KS set to obtain a target KS value;

A GBDT model corresponding to the target KS value among the at least two GBDT models is determined as the target GBDT model.

The method according to claim 2, wherein before the first sample data is separately input into the at least two gradient progressive decision tree GBDT models, the method further comprises:

The GBDT parameters of the at least two GBDT models are determined according to a particle swarm optimization PSO algorithm.

The method according to claim 4, wherein the GBDT parameters of the at least two GBDT models are determined according to a particle swarm optimization algorithm PSO algorithm, including:

Mapping the number of parameters in the GBDT model to the domain dimension of the PSO algorithm;

Mapping the range of values of each of the parameters in the GBDT model to the domain of the PSO algorithm;

Extracting at least two sets of dimension value data from the domain of the domain corresponding to the domain dimension as at least two particles;

Calculating the most advantageous trajectory of the at least two particles by a PSO algorithm; wherein the trajectory most advantageous is a point in the trajectory through which the particle passes to maximize the objective function, the objective function being the first KS a function of the concentrated KS value and the minimum value of the KS value in the second KS set and

Mapping the dimension value data corresponding to the trajectory of the at least two particles to the GBDT model to obtain at least two sets of GBDT parameters.

The method of claim 1, wherein the credit evaluation of the user based on the target GBDT model comprises:

Entering the credit data of the user into the target GBDT model to obtain a credit overdue probability of the user;

Comparing the credit overdue probability of the user with a preset credit overdue probability threshold to obtain a credit evaluation result of the user.

A credit evaluation device comprising:

The first credit overdue probability obtaining module is configured to input the first sample data into the at least two gradient progressive decision tree GBDT models respectively, to obtain a first credit overdue probability set, where the first sample data is the first user set Credit data

a second credit overdue probability obtaining module, configured to input second sample data into the at least two GBDT models respectively to obtain a second credit overdue probability set, where the second sample data is credit data of the second user set; The GBDT parameters of at least two GBDT models are different;

a model determining module, configured to perform a KS value calculation according to the first credit overdue probability set and the second credit overdue probability set, and determine a target GBDT model from the at least two GBDT models according to the calculation result;

The evaluation module is configured to perform credit evaluation on the user according to the target GBDT model.

A gradient progressive decision tree parameter adjustment method includes:

Determining the domain dimension and the domain range of the particle swarm optimization algorithm according to the number of adjustment parameters in the gradient progressive decision tree and the range of values corresponding to each parameter;

Setting an initial parameter of the particle swarm optimization algorithm, and obtaining a trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension, and the domain definition; and

The parameter values of the gradient progressive decision tree are determined according to the best merit of the trajectory.

The method of claim 8 wherein determining a parameter value of the gradient progressive decision tree based on the best merit of the trajectory comprises:

Determining a parameter value of the gradient progressive decision tree according to the maximum merit of the trajectory and the value of the corresponding objective function, the objective function being the minimum value of the Kolmogorov-Smirnov KS value of the training sample and the test sample Value function.

The method of claim 8 wherein the gradient is determined based on the most advantageous trajectory The parameter values of the decision tree, including:

Determining a corresponding peripheral point according to the most advantageous trajectory, wherein the peripheral point of the most advantageous trajectory is obtained according to the hill climbing algorithm with the best advantage of the trajectory as a starting point;

The parameter value of the gradient progressive decision tree is determined according to the magnitude of the value of the objective function corresponding to the peripheral point, and the objective function is a minimum function of the KS value of the training sample and the test sample.

The method according to claim 10, wherein determining a parameter value of the gradient progressive decision tree according to a magnitude of a value of the objective function corresponding to the peripheral point comprises:

The size of the value of the objective function corresponding to the peripheral point is sorted, and the parameter value corresponding to the corresponding peripheral point when the value of the objective function takes the maximum value is selected as the parameter value of the gradient progressive decision tree.

The method according to any of claims 8-11, wherein the number of adjustment parameters of the gradient progressive decision tree is 8, the range of the definition domain being the interval from the minimum value to the maximum value of each adjustment parameter.

A gradient progressive decision tree parameter adjustment device, comprising:

The mapping module is configured to determine a domain dimension and a domain range of the particle swarm optimization algorithm according to the number of adjustment parameters in the gradient progressive decision tree and the value range corresponding to each parameter;

a trajectory most advantageous determining module, configured to set an initial parameter of the particle swarm optimization algorithm, and obtain a trajectory of each particle in the particle group according to the particle swarm optimization algorithm, the domain dimension, and the domain definition range ;as well as

The parameter determination module is configured to determine a parameter value of the gradient progressive decision tree according to the most advantageous of the trajectory.

The apparatus of claim 13 wherein said parameter determination module is configured to:

A parameter value of the gradient progressive decision tree is determined according to the trajectory best advantage and the magnitude of the value of the corresponding objective function, the objective function being a minimum function of the KS value of the training sample and the test sample.

Determining a parameter of the gradient progressive decision tree according to a magnitude of a value of the objective function corresponding to the peripheral point, the objective function being a minimum function of the KS value of the training sample and the test sample.

The apparatus of claim 15 wherein said parameter determination module is configured to:

The size of the value of the objective function corresponding to the peripheral point is sorted, and the parameter value corresponding to the peripheral point corresponding to the value of the largest objective function is selected as the parameter value of the gradient progressive decision tree.

Apparatus according to any one of claims 13-16, wherein the number of adjustment parameters of the gradient progressive decision tree is 8, the range of the definition domain being the interval from the minimum value to the maximum value of each adjustment parameter.

A computer readable storage medium storing computer executable instructions for performing the method of any of claims 1-6 and 8-12.