CN116318813A

CN116318813A - A method and system for detecting domain name abuse based on cluster analysis

Info

Publication number: CN116318813A
Application number: CN202211705047.9A
Authority: CN
Inventors: 陈勇; 张志勇; 董科军; 延志伟; 沙晓爽
Original assignee: China Internet Network Information Center
Current assignee: China Internet Network Information Center
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-06-23
Anticipated expiration: 2042-12-28
Also published as: CN116318813B; WO2024139862A1

Abstract

The application discloses a method and system for detecting domain name abuse based on cluster analysis. The method comprises: selecting multi-dimensional features of domain name applications; establishing a corresponding one-dimensional feature coordinate system for each feature, where each one-dimensional feature coordinate system marks the position of the value obtained by converting the feature according to a preset numerical standard; establishing a multi-dimensional feature coordinate system from all the one-dimensional feature coordinate systems; obtaining a preset number of domain name applications from a domain name list or a URL list; collecting the multi-dimensional features of those domain name applications; converting each collected feature, according to the preset numerical standard, into a value to be marked in the multi-dimensional feature coordinate system; and calculating the aggregation of the domain name applications from the marked values and obtaining the domain name abuse detection result from that aggregation. The method therefore does not depend on any single feature and achieves high detection efficiency and high accuracy.

Description

A method and system for detecting domain name abuse based on cluster analysis

Technical field

The invention relates to the technical field of domain name maintenance, and in particular to a method and system for detecting domain name abuse based on cluster analysis.

Background

Because domain names are easy to remember, they have become the entry point for accessing all kinds of Internet applications (websites, email, etc.). As the Internet has developed, the number of domain names has grown rapidly; according to third-party data, global top-level domain registrations had reached 350 million by early 2022. Domain name abuse is defined broadly: in general, any malicious behavior that abuses, misuses, or attacks the existing domain name system can be called domain name abuse.

Common forms of domain name abuse include illegal websites and phishing websites built on registered domain names, spam, botnets, malware distribution, and other gray- and black-market domain name businesses. Domain name abuse has proliferated as the Internet has developed. These abuses threaten the Internet, disrupt users' normal network access, seriously degrade the experience of using the Internet, and can even cause major economic or social losses.

Current techniques for detecting domain name abuse mainly match or screen each domain name application against specific features, including domain name features, URL features, domain name resolution features, text features, website image features, and website structure features (for abuse in the form of websites). There are two specific approaches: feature library matching and machine learning detection. Both detect already-discovered abuse well, but they have drawbacks: heavy manual effort, low efficiency, poor scalability, and detection performance that is hard to sustain over time.

Summary of the invention

This application provides a method and system for detecting domain name abuse based on cluster analysis, which addresses the heavy manual effort, low efficiency, poor scalability, and hard-to-sustain detection performance of existing approaches.

In a first aspect, the present application provides a method for detecting domain name abuse based on cluster analysis, the method comprising:

selecting multi-dimensional features of domain name applications;

establishing a corresponding one-dimensional feature coordinate system for each feature in the multi-dimensional features, where the one-dimensional feature coordinate system is used to mark the position, in that coordinate system, of the value obtained by converting the feature according to a preset numerical standard;

establishing a multi-dimensional feature coordinate system from all the one-dimensional feature coordinate systems;

obtaining a preset number of domain name applications from a domain name list or a URL list, and collecting the multi-dimensional features of the domain name applications;

converting, according to the preset numerical standard, each of the collected multi-dimensional features into a value to be marked in the multi-dimensional feature coordinate system;

calculating the aggregation of the domain name applications from the values marked in the multi-dimensional feature coordinate system, and obtaining a domain name abuse detection result from the aggregation.

In one implementation, the multi-dimensional features include: domain name features, URL features, IPv4 address, country of the IP address, domain name resolution features, text features, website image features, and website structure features.

In one implementation, the preset numerical standard is configured as follows:

if the multi-dimensional features include a feature that is difficult to convert into a value, the feature is split into multiple dimensions according to a preset rule, and the values in each dimension are numbered sequentially so that different feature values correspond to different numbers.

In one implementation, the step of calculating the aggregation of the domain name applications from the values marked in the multi-dimensional feature coordinate system, and obtaining the domain name abuse detection result from the aggregation, comprises:

splitting the multi-dimensional features of the domain name applications into multiple dimensions;

setting a unit distance for each of the dimensions;

generating a unit space from the unit distances of all the dimensions;

calculating the proportion of the domain name applications located in the unit space relative to all domain name applications in the entire multi-dimensional space;

judging whether the proportion is greater than or equal to a preset threshold;

if so, aggregated domain name applications exist.

In one implementation, the step after judging whether the proportion is greater than or equal to the preset threshold comprises:

if not, no aggregated domain name applications exist.

In one implementation, the step of calculating the proportion of the domain name applications in the unit space uses:

proportion = number of domain name applications in the unit space / number of all domain name applications.

In a second aspect, the present application provides a system for detecting domain name abuse based on cluster analysis, comprising: a configuration module, a data processing module, a first conversion module, a collection module, a second conversion module, and a calculation module;

the configuration module is used to select multi-dimensional features of domain name applications;

the data processing module is used to establish a corresponding one-dimensional feature coordinate system for each feature in the multi-dimensional features, where the one-dimensional feature coordinate system is used to mark the position, in that coordinate system, of the value obtained by converting the feature according to a preset numerical standard;

the first conversion module is used to establish a multi-dimensional feature coordinate system from all the one-dimensional feature coordinate systems;

the collection module is used to obtain a preset number of domain name applications from a domain name list or a URL list, and to collect the multi-dimensional features of the domain name applications;

the second conversion module is used to convert, according to the preset numerical standard, each of the collected multi-dimensional features into a value to be marked in the multi-dimensional feature coordinate system;

the calculation module is used to calculate the aggregation of the domain name applications from the values marked in the multi-dimensional feature coordinate system, and to obtain a domain name abuse detection result from the aggregation.

In a third aspect, the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of the method for detecting domain name abuse based on cluster analysis described in the first aspect.

In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for detecting domain name abuse based on cluster analysis described in the first aspect.

As the above technical solution shows, abuse is detected from the aggregation of abused domain names across multi-dimensional features. The method does not depend on any single instance of domain name abuse or on any single feature (for example, it does not depend on the text features of the abusive application), and it can still operate on the remaining features when data for one or more features cannot be obtained. It is applicable to all types of domain name abuse. It does not require separate modeling and training for each feature, so the up-front workload is small. Because it exploits the aggregation of domain name abuse across multi-dimensional features, a single detection run flags suspected abusive domain names in batches, and the yield per run is very high: in our abuse detection over 20 million domain names, a single run yielded between 200,000 and 400,000 abused domain names. The follow-up manual handling required for these batches of suspected results is also very small. As a technique intended to yield the most domain name abuse detections for a given investment of computing power, it detects suspected domain name abuse with relatively high accuracy.

Description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the method and system for detecting domain name abuse based on cluster analysis provided by the present application;

FIG. 2 is a flowchart of calculating the aggregation of domain name applications in the method and system for detecting domain name abuse based on cluster analysis provided by the present application;

FIG. 3 is a schematic diagram of the system for detecting domain name abuse based on cluster analysis provided by the present application.

Detailed description

It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

The process of dividing a collection of physical or abstract objects into multiple classes of similar objects is called clustering. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. As the saying goes, "birds of a feather flock together"; classification problems abound in both the natural and social sciences. Cluster analysis, also known as group analysis, is a statistical method for studying the classification of samples or indicators.

Current techniques for detecting domain name abuse mainly match or screen each domain name application against specific features, including domain name features, URL features, domain name resolution features, text features, website image features, and website structure features (for abuse in the form of websites). There are two specific approaches:

1. Feature library matching: domain name applications are compared systematically against feature libraries prepared in advance (text libraries, image libraries, other feature libraries) to find those that match specific features;

2. Machine learning detection: a model is trained on positive samples using one or more of the above features, and the model is then used for systematic detection.

These detection techniques work well on already-discovered abuse, but they have the following drawbacks:

(1) Heavy manual effort: feature library detection depends on a large amount of prior manual screening or handling of manual reports, while machine learning detection requires selecting and manually labeling a large number of positive and negative samples for modeling.

(2) Low efficiency: for domain name applications with different text, image, and domain name features, existing techniques must build separate models or collect separate feature libraries. Detection must then run each application through the different models or compare it against the different feature libraries, so detection efficiency is low.

(3) Poor scalability and hard-to-sustain detection performance: the Internet changes rapidly and domain name abuse takes ever-changing forms. As time passes and abuse patterns change, the accuracy and recall of detection based on an existing feature library or an established detection model steadily decline. The feature library must be updated continuously, and the detection model must be rebuilt from newly collected samples.

This application addresses the following problem in the prior art. Current techniques for detecting domain name abuse mainly match or screen each domain name application against specific features, including domain name features, URL features, domain name resolution features, text features, website image features, and website structure features (for abuse in the form of websites). The two specific approaches, feature library matching and machine learning detection, work well on already-discovered abuse but suffer from heavy manual effort, low efficiency, poor scalability, and hard-to-sustain detection performance. For these reasons, this application provides a method and system for detecting domain name abuse based on cluster analysis.

Domain name abuse has developed into multiple upstream and downstream industry chains: professionals provide services for everything from domain name registration and server rental and setup to the development and promotion of abusive applications. To extract more profit, domain name abuse tends to occur in clusters rather than as isolated cases. In our routine data analysis, we have also found that a considerable portion of domain name abuse exhibits aggregation. By analyzing these aggregation characteristics, large numbers of abused domain names can be discovered quickly and in batches.

The method and system of the present invention are further described below with reference to specific embodiments.

In a first aspect, as shown in FIG. 1, the present application provides a method for detecting domain name abuse based on cluster analysis, the method comprising:

S100: selecting multi-dimensional features of domain name applications;

In step S100, the multi-dimensional features of common domain name applications include: domain name features, URL features, IPv4 address, country of the IP address, domain name resolution features, text features, website image features, website structure features, and so on.

S200: establishing a corresponding one-dimensional feature coordinate system for each feature in the multi-dimensional features, where the one-dimensional feature coordinate system is used to mark the position, in that coordinate system, of the value obtained by converting the feature according to a preset numerical standard;

In step S200, the preset numerical standard is configured as follows: if the multi-dimensional features include a feature that is difficult to convert into a value, the feature is split into multiple dimensions according to a preset rule, and the values in each dimension are numbered sequentially so that different feature values correspond to different numbers.

In a practical application scenario, because this application exploits the aggregation of domain name abuse across multi-dimensional features, the multi-dimensional features used by the system must be selected first. They may be domain name features, URL features, IPv4 address, country of the IP address, domain name resolution features, text features, website image features, website structure features, and so on. Once the required features are determined, each feature is converted into a value according to the preset numerical standard. If none of the features is difficult to convert, the selected features can be converted into values directly. For example, for the IPv4 address feature, the address 193.168.32.2 converts to the decimal integer 3249020930. Conversion can also be done by numbering the feature values sequentially. For example, if the feature is the country of the IP address, the 200-plus countries can be numbered, such as 1 for China and 2 for the United States; the converted value for an IP address whose country is China is then 1.

If the multi-dimensional features include a feature that is difficult to convert into a value, it must be split into multiple dimensions according to the preset rule, and each dimension numbered sequentially so that different feature values correspond to different numbers. Take the IP address location (the geographic location of the IP) as an example. Because the location is not itself a number, it must be converted into values. A geographic location is normally expressed as latitude and longitude, but because latitude and longitude are difficult to obtain, two dimensions, country and province, can be used instead, depending on the actual situation. Countries are numbered as above, such as 1 for China and 2 for the United States, so the country value for an IP address in China is 1. Provinces are handled the same way: if 1 is Liaoning Province and 2 is Zhejiang Province, the province value for an IP address in Zhejiang Province is 2. Two one-dimensional feature coordinate systems are then established from the converted values.
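The two conversions described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `ipv4_to_int` and `SequentialEncoder` are hypothetical, and numbering labels in order of first appearance is an assumption (the source only requires that different values receive different numbers).

```python
import ipaddress
from typing import Dict


def ipv4_to_int(address: str) -> int:
    """Convert a dotted-quad IPv4 address to its decimal integer value."""
    return int(ipaddress.IPv4Address(address))


class SequentialEncoder:
    """Assign 1, 2, 3, ... to labels in order of first appearance (assumed rule)."""

    def __init__(self) -> None:
        self._codes: Dict[str, int] = {}

    def encode(self, label: str) -> int:
        # A label seen before keeps its number; a new label gets the next one.
        if label not in self._codes:
            self._codes[label] = len(self._codes) + 1
        return self._codes[label]


countries = SequentialEncoder()
print(ipv4_to_int("193.168.32.2"))        # 3249020930
print(countries.encode("China"))          # 1
print(countries.encode("United States"))  # 2
```

A fixed lookup table would serve equally well for the country and province dimensions; the encoder is only a convenient way to keep numbers stable across a run.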

S300: establishing a multi-dimensional feature coordinate system from all the one-dimensional feature coordinate systems;

In a practical application scenario, because the one-dimensional feature coordinate systems have already been established from the converted values, a point on a one-dimensional feature coordinate system represents the position of the corresponding feature value. The multi-dimensional feature coordinate system is established from all the one-dimensional feature coordinate systems obtained.
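One way to picture the multi-dimensional feature coordinate system is as a fixed ordering of the one-dimensional axes, so that each domain name application becomes a tuple with one value per axis. The sketch below assumes that representation; `to_point` and the axis names are hypothetical and are not taken from the source.

```python
from typing import Mapping, Sequence, Tuple


def to_point(values: Mapping[str, int], axes: Sequence[str]) -> Tuple[int, ...]:
    """Place one application's converted values in the axis order given by `axes`."""
    missing = [axis for axis in axes if axis not in values]
    if missing:
        raise KeyError(f"no converted value for axes: {missing}")
    return tuple(values[axis] for axis in axes)


# Axis order is fixed once; every application is then a point in the same space.
AXES = ("ipv4", "country", "province")
point = to_point({"country": 1, "province": 2, "ipv4": 3249020930}, AXES)
print(point)  # (3249020930, 1, 2)
```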

S400: obtaining a preset number of domain name applications from a domain name list or a URL list, and collecting the multi-dimensional features of the domain name applications;

In a practical application scenario, a domain name list is a collection of website domain names. It serves as the entry point for detection and can usually be gathered from the Internet or obtained through third-party channels. A URL (Uniform Resource Locator) is the notation used by World Wide Web services on the Internet to specify the location of information. A URL is a string of characters, which may be letters, digits, and special symbols. A URL contains the following information: the protocol used to access the resource, the location of the server (as an IPv4 address or a domain name), the port number on the server, the location of the resource within the server's directory structure, and a fragment identifier. A URL list likewise serves as an entry point for detection and can usually be gathered from the Internet or obtained through third-party channels. A preset number of domain name applications is obtained from the domain name list or URL list; to judge aggregation reliably, the number of domain name applications should be at least in the tens of thousands. The multi-dimensional features of the domain name applications are obtained by crawling web pages with a crawler and probing data with a probing program.
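The URL components listed above can be pulled apart with Python's standard `urllib.parse` module. The sketch below is illustrative only: `describe_url` is a hypothetical helper and the example URL is made up.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlsplit


def describe_url(url: str) -> Dict[str, Optional[Union[str, int]]]:
    """Split a URL into the components named in the description."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # protocol used to access the resource
        "host": parts.hostname,      # server location: domain name or IP address
        "port": parts.port,          # None when the URL does not state a port
        "path": parts.path,          # location within the server's directory structure
        "fragment": parts.fragment,  # fragment identifier
    }


print(describe_url("https://example.com:8443/path/page.html#section"))
```

The `host` value is what a domain name list entry would correspond to, so a URL list can be reduced to a domain name list with the same helper.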

S500: converting, according to the preset numerical standard, each of the collected multi-dimensional features into a value to be marked in the multi-dimensional feature coordinate system;

In a practical application scenario, a preset numerical standard must be established first. Each collected feature is then converted into a value according to that standard, and the value marks the feature's position in the multi-dimensional feature coordinate system. If none of the collected features is difficult to convert, they can be converted into values directly according to the preset rule described above, or by numbering the feature values sequentially. If any collected feature is difficult to convert, it must be split into multiple dimensions according to the preset rule, and each dimension numbered sequentially so that different feature values correspond to different numbers.

S600: calculating the aggregation of the domain name applications from the values marked in the multi-dimensional feature coordinate system, and obtaining a domain name abuse detection result from the aggregation.

In step S600, as shown in FIG. 2, the step of calculating the aggregation of the domain name applications from the values marked in the multi-dimensional feature coordinate system, and obtaining the domain name abuse detection result from the aggregation, comprises:

S610: splitting the multi-dimensional features of the domain name applications into multiple dimensions;

In step S610, if the collected features include a feature that is difficult to convert into a value, it is split into multiple dimensions according to the preset rule so that it can be converted into values.

S620: setting a unit distance for each of the dimensions;

In step S620, a unit distance is set for each of the $W$ split dimensions, from $L_1$ to $L_W$.

S630: generating a unit space from the unit distances of all the dimensions;

In step S630, the unit distances set for the dimensions together form a unit space, whose volume is $L_1 \times L_2 \times \dots \times L_W$.
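The source does not say how unit spaces are positioned within the coordinate system. The sketch below assumes the simplest reading: unit spaces tile the space as a grid aligned at the origin, so the unit space containing a point is found by integer-dividing each coordinate by that dimension's unit distance. The function names and the example unit distances are hypothetical.

```python
import math
from typing import Sequence, Tuple


def unit_cell(point: Sequence[int], unit_distances: Sequence[int]) -> Tuple[int, ...]:
    """Index of the grid-aligned unit space containing `point` (assumed layout)."""
    if len(point) != len(unit_distances):
        raise ValueError("point and unit_distances must have the same length")
    if any(d <= 0 for d in unit_distances):
        raise ValueError("every unit distance must be positive")
    return tuple(v // d for v, d in zip(point, unit_distances))


def unit_volume(unit_distances: Sequence[int]) -> int:
    """Volume of one unit space: the product of all unit distances."""
    return math.prod(unit_distances)


# Example: a unit distance of 65536 on the IPv4 axis groups addresses by their
# first two octets; country and province use a unit distance of 1.
distances = (65536, 1, 1)
print(unit_cell((3249020930, 1, 2), distances))  # (49576, 1, 2)
print(unit_volume(distances))                    # 65536
```

With this layout, two applications share a unit space exactly when their cell indices are equal, which makes counting per unit space a dictionary lookup.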

S640: calculating the proportion of the domain name applications located in the unit space relative to all domain name applications in the entire multi-dimensional space (the multi-dimensional feature coordinate system);

In step S640, to determine whether aggregated applications exist, the proportion of domain name applications in the unit space is calculated first, where proportion = number of domain name applications in the unit space / number of all domain name applications.
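Written as a formula, the proportion and the threshold test that follows it take the form below. $R$ and $K$ are the symbols the description itself uses for the proportion and the preset threshold; $N_{\text{unit}}$ and $N_{\text{total}}$ are notation introduced here for the two counts.

```latex
R = \frac{N_{\text{unit}}}{N_{\text{total}}}, \qquad
\text{aggregated domain name applications exist} \iff R \ge K, \quad 0 < K < 1
```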

S650，判断所述占比是否大于或等于预设阈值；S650, judging whether the proportion is greater than or equal to a preset threshold;

在步骤S650中，由于判断是否存在聚集性应用需要确定占比是否大于或等于预设阈值，其中预设阈值K为0到1之间的一个小数。而且K值大小实际取决于多种因素：包括检测的域名应用是否足够多并有代表性、多维特征的选择、每一个特征维度坐标系的建立标准(特征转换为数值的计算方式)。In step S650, it is necessary to determine whether the proportion is greater than or equal to a preset threshold value because of judging whether there is an aggregation application, wherein the preset threshold value K is a decimal between 0 and 1. Moreover, the value of K actually depends on a variety of factors: including whether the detected domain name applications are sufficient and representative, the selection of multi-dimensional features, and the establishment standard of the coordinate system of each feature dimension (the calculation method of converting features into values).

S660，若是，存在聚集性域名应用；S660, if yes, there is an aggregated domain name application;

S670，若否，不存在聚集性域名应用。S670, if not, there is no aggregation domain name application.

In a practical application scenario, the domain name multi-dimensional feature is first split into multiple dimensions, and a unit distance is set for each dimension. If the feature is split into three dimensions with unit distances L₁, L₂ and L₃, the unit space generated by the unit distances of all dimensions is V = L₁*L₂*L₃. The proportion R of domain name applications in the unit space is then calculated as: proportion = the number of domain name applications in the unit space / the number of all domain name applications. Suppose the preset threshold K is 0.1. If the unit space contains 500 domain name applications and there are 50,000 domain name applications in total, then R = 500 ÷ 50000 = 0.01, so R < K and no aggregated domain name application exists. If the unit space contains 5,000 domain name applications out of 50,000, then R = 5000 ÷ 50000 = 0.1, so R = K and an aggregated domain name application exists; it can be handled manually right away, or handled manually after further detection in combination with other features. If the unit space contains 10,000 domain name applications out of 50,000, then R = 10000 ÷ 50000 = 0.2, so R > K and an aggregated domain name application exists; again it can be handled manually right away, or after further detection in combination with other features.
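The threshold comparison of steps S650 to S670 can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example above; the function name `has_aggregation` and its parameter names are hypothetical and do not come from the application.

```python
def has_aggregation(count_in_unit_space: int, total: int, k: float) -> bool:
    """Return True when R = count_in_unit_space / total is >= the threshold K.

    Hypothetical helper: mirrors steps S650-S670, where K is a preset
    decimal strictly between 0 and 1.
    """
    if total <= 0:
        raise ValueError("total number of domain name applications must be positive")
    if not 0 < k < 1:
        raise ValueError("K must be a decimal between 0 and 1")
    r = count_in_unit_space / total  # proportion R of the unit space
    return r >= k


# The three cases from the description, all with K = 0.1 and 50,000 applications.
for count in (500, 5000, 10000):
    print(count, has_aggregation(count, 50000, 0.1))
```

The boundary case R = K counts as aggregation because the comparison is "greater than or equal to", as stated in step S650.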

Taking website domain name abuse detection as an example: the multi-dimensional features of the domain name application are obtained first, where the multi-dimensional features include domain name features, URL features, IPv4 address, country to which the IP address belongs, domain name resolution features, text features, website image features, and website structure features. After the required multi-dimensional features are determined, each feature is converted into a numerical value according to the preset numerical standard. If none of the features is difficult to convert into a numerical value, the selected features can be converted directly, for example by numbering the feature values sequentially. If a feature is difficult to convert into a numerical value, it is split into multiple dimensions according to preset rules, and each dimension is numbered sequentially so that different features correspond to different numbers. A one-dimensional feature coordinate system is then established from the converted values, and a multi-dimensional feature coordinate system is established from all the one-dimensional feature coordinate systems.

A preset number of website applications is obtained according to a domain name list or URL list, and the domain name multi-dimensional features of each website application are collected. For website domain name abuse detection, the selected domain name multi-dimensional features are the website IPv4 address, the website keywords, and the country or region where the website server is located. First, the website IPv4 address is converted into a numerical value. Second, for the website keywords, an abuse text feature library is selected, and whether the website contains text from the feature library, together with the frequency at which the feature text appears, is converted into a numerical value according to the preset numerical standard. Finally, the country or region where the website server is located is converted into a numerical value according to the preset numerical standard, which requires assigning a number to each country or region, for example 1 for region A and 5 for region B. Each individual website application thus obtains specific three-dimensional coordinates from the three features above, and abuse analysis is performed according to how densely all website applications aggregate in the three-dimensional space: if the website applications appearing within a given three-dimensional space account for a proportion of all website applications that reaches a given threshold, the domain name applications aggregated in that space are judged to be suspected domain name abuse, and are subsequently handled manually, either directly or after further screening in combination with other specific features.
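A minimal Python sketch of the three-dimensional mapping described above follows. It rests on several assumptions that are not part of the application: the keyword list `ABUSE_KEYWORDS`, the scoring rule (a plain occurrence count), the function names, and the reading of "unit space" as a cell of a fixed grid whose side lengths are the unit distances. Only the region numbers (region A = 1, region B = 5) are taken from the text.

```python
import ipaddress
from collections import Counter

# Region numbering from the example in the text: region A -> 1, region B -> 5.
REGION_CODES = {"A": 1, "B": 5}
# Hypothetical abuse text feature library (placeholder terms for illustration).
ABUSE_KEYWORDS = ("casino", "lottery")


def keyword_score(text: str) -> int:
    """Assumed numerical standard: total occurrences of library terms."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in ABUSE_KEYWORDS)


def site_coordinates(ip: str, text: str, region: str) -> tuple:
    """Map one website application to (IPv4 as integer, keyword score, region code)."""
    return (int(ipaddress.IPv4Address(ip)), keyword_score(text), REGION_CODES[region])


def densest_cell(points, unit):
    """Bin points into grid cells of size `unit`; return (cell, share of all points)."""
    cells = Counter(tuple(v // u for v, u in zip(p, unit)) for p in points)
    cell, count = cells.most_common(1)[0]
    return cell, count / len(points)


sites = [
    site_coordinates("1.2.3.4", "Casino casino bonus", "A"),
    site_coordinates("1.2.3.5", "casino and CASINO", "A"),
    site_coordinates("192.168.0.1", "weather report", "B"),
]
cell, share = densest_cell(sites, unit=(256, 1, 1))
print(cell, share)
```

With a unit distance of 256 on the address axis, sites in the same /24 block that share a keyword score and a region fall into one cell; the resulting share would then be compared with the threshold K.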

Taking spam detection as an example: the multi-dimensional features of the domain name application are obtained first, where the multi-dimensional features include domain name features, URL features, IPv4 address, country to which the IP address belongs, domain name resolution features, text features, website image features, and website structure features. After the required multi-dimensional features are determined, each feature is converted into a numerical value according to the preset numerical standard. If none of the features is difficult to convert into a numerical value, the selected features can be converted directly, for example by numbering the feature values sequentially. If a feature is difficult to convert into a numerical value, it is split into multiple dimensions according to preset rules, and each dimension is numbered sequentially so that different features correspond to different numbers. A one-dimensional feature coordinate system is then established from the converted values, and a multi-dimensional feature coordinate system is established from all the one-dimensional feature coordinate systems.

A preset number of emails is obtained according to a domain name list or URL list, and the domain name multi-dimensional features of each email are collected. For spam detection, the selected domain name multi-dimensional features are: the mail sending IPv4 address, the mail server IPv4 address, the mail sending time, and the mail sender. Once the features are confirmed, each is converted into a numerical value according to the preset numerical standard: the mail sending IPv4 address is converted in base 256, the mail sending server IPv4 address is converted in base 256, the mail sending time is converted from its timestamp in base 300 (a 5-minute period), and the mail sender is converted by means of a lookup table (a new record is added each time a new mail sender is encountered during processing). Each email thus obtains specific four-dimensional coordinates from the four features above, and abuse analysis is performed according to how densely all emails aggregate in the four-dimensional space: if the emails appearing within a given four-dimensional space account for a proportion of all emails that reaches a given threshold, the emails aggregated in that space are judged to be suspected spam and can be blocked directly.
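The four conversions above can be sketched as follows. This is an illustrative reading, not the applicant's implementation: the "base 300" conversion of the sending time is assumed to mean integer division of a Unix timestamp (in seconds) by 300, and the class `SenderTable` and function `email_coordinates` are hypothetical names.

```python
import ipaddress


class SenderTable:
    """Lookup table that assigns the next sequential number to each new sender."""

    def __init__(self):
        self._ids = {}

    def lookup(self, sender: str) -> int:
        # A new record is added the first time a sender is seen.
        if sender not in self._ids:
            self._ids[sender] = len(self._ids) + 1
        return self._ids[sender]


def email_coordinates(sending_ip, server_ip, sent_epoch, sender, table):
    """Map one email to its four-dimensional coordinates."""
    return (
        int(ipaddress.IPv4Address(sending_ip)),  # base-256 value of a.b.c.d
        int(ipaddress.IPv4Address(server_ip)),   # base-256 value of a.b.c.d
        sent_epoch // 300,                       # assumed: 5-minute bucket index
        table.lookup(sender),                    # sequential sender number
    )


table = SenderTable()
print(email_coordinates("10.0.0.1", "10.0.0.2", 1700000000, "a@example.com", table))
```

Emails sent from the same address through the same server within the same 5-minute window by the same sender map to identical coordinates, which is what allows bursts of similar mail to show up as aggregation.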

In a second aspect, as shown in Figure 3, the present application provides a domain name abuse detection system based on cluster analysis, comprising: a configuration module, a data processing module, a first conversion module, a collection module, a second conversion module, and a calculation module.

For the functions and effects of the above system when applying the foregoing method, reference may be made to the description of the foregoing method embodiments, which is not repeated here.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Other embodiments of the invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the claims of this application.

Claims

1. A domain name abuse detection method based on cluster analysis, said method comprising:

selecting multi-dimensional features of a domain name application;

establishing a corresponding one-dimensional feature coordinate system for each feature in the multi-dimensional features, the one-dimensional feature coordinate system being used to mark, within the coordinate system, the position of the numerical value into which the feature is converted according to a preset numerical standard;

establishing a multi-dimensional feature coordinate system according to all the one-dimensional feature coordinate systems;

obtaining a preset number of domain name applications according to a domain name list or URL list, and collecting domain name multi-dimensional features of the domain name applications;

converting each feature in the domain name multi-dimensional features, according to the preset numerical standard, into a numerical value to be marked in the multi-dimensional feature coordinate system;

calculating the aggregation of the domain name applications according to the numerical values marked in the multi-dimensional feature coordinate system, and obtaining a domain name abuse detection result according to the aggregation.

2. The domain name abuse detection method based on cluster analysis according to claim 1, wherein the multi-dimensional features comprise: domain name features, URL features, IPv4 address, country to which the IP address belongs, domain name resolution features, text features, website image features, and website structure features.

3. The domain name abuse detection method based on cluster analysis according to claim 1, wherein the preset numerical standard is configured as follows:

if a feature among the multi-dimensional features is difficult to convert into a numerical value, the feature is split into multiple dimensions according to preset rules, and each dimension is numbered sequentially, so that different features correspond to different numbers.

4. The domain name abuse detection method based on cluster analysis according to claim 1, wherein the step of calculating the aggregation of the domain name applications according to the numerical values marked in the multi-dimensional feature coordinate system, and obtaining the domain name abuse detection result according to the aggregation, comprises:

splitting the domain name multi-dimensional features into multiple dimensions;

setting a unit distance for each of the dimensions;

generating a unit space based on the unit distances of all the dimensions;

calculating the proportion of the domain name applications located in the unit space relative to the entire multi-dimensional space;

judging whether the proportion is greater than or equal to a preset threshold;

if yes, an aggregated domain name application exists.

5. The domain name abuse detection method based on cluster analysis according to claim 4, wherein the step following the judging of whether the proportion is greater than or equal to the preset threshold comprises:

if not, no aggregated domain name application exists.

6. The domain name abuse detection method based on cluster analysis according to claim 4, wherein the step of calculating the proportion of the domain name applications in the unit space comprises:

proportion = the number of domain name applications in the unit space / the number of all domain name applications.

7. A domain name abuse detection system based on cluster analysis, applied to the domain name abuse detection method based on cluster analysis according to any one of claims 1 to 6, comprising: a configuration module, a data processing module, a first conversion module, a collection module, a second conversion module, and a calculation module;

the configuration module being configured to select multi-dimensional features of a domain name application;

the data processing module being configured to establish a corresponding one-dimensional feature coordinate system for each feature in the multi-dimensional features, the one-dimensional feature coordinate system being used to mark, within the coordinate system, the position of the numerical value into which the feature is converted according to a preset numerical standard;

the first conversion module being configured to establish a multi-dimensional feature coordinate system according to all the one-dimensional feature coordinate systems;

the collection module being configured to obtain a preset number of domain name applications according to a domain name list or URL list, and to collect domain name multi-dimensional features of the domain name applications;

the second conversion module being configured to convert each feature in the domain name multi-dimensional features, according to the preset numerical standard, into a numerical value to be marked in the multi-dimensional feature coordinate system;

the calculation module being configured to calculate the aggregation of the domain name applications according to the numerical values marked in the multi-dimensional feature coordinate system, and to obtain a domain name abuse detection result according to the aggregation.

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the domain name abuse detection method based on cluster analysis according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the domain name abuse detection method based on cluster analysis according to any one of claims 1 to 6.