CN118233180A

CN118233180A - A method for identifying abnormal users based on behavior analysis and traffic detection

Info

Publication number: CN118233180A
Application number: CN202410340769.1A
Authority: CN
Inventors: 陈明亮; 蔺子卿; 刘星宇; 谢国强; 余滢婷; 黄可; 张晓娟; 朱亚运; 胡柏吉; 曹靖怡; 姚爽; 李梦琳; 张小松
Original assignee: University of Electronic Science and Technology of China; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Jiangxi Electric Power Co Ltd
Current assignee: University of Electronic Science and Technology of China; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Jiangxi Electric Power Co Ltd
Priority date: 2024-03-25
Filing date: 2024-03-25
Publication date: 2024-06-21

Abstract

The invention discloses a method for identifying abnormal users based on behavior analysis and traffic detection. First, when a user logs in, the login information is checked against a legal range mined from the user's login history by association rule mining, and abnormal logins are intercepted. Second, while the user operates the system, an ensemble learning model analyzes the traffic data and sensitive operation information generated as the user interacts with the server, so that abnormal operations performed by an illegitimate user who has logged in with valid credentials are intercepted. The invention completes detection quickly without affecting the user's login experience, continuously analyzes abnormal behavior during use, and achieves efficient identification of abnormal users.

Description

A method for identifying abnormal users based on behavior analysis and traffic detection

Technical Field

The present invention relates to Internet user behavior recognition technology, and in particular to abnormal user identification based on behavior analysis and traffic detection.

Background Art

With the spread and widespread use of the Internet, users' online activity keeps growing. The network environment, however, carries various security threats and risks. Phishing, for example, steals users' passwords through deception and then uses their legitimate accounts to commit illegal acts. To keep a system secure, it is therefore particularly important to detect abnormal behavior among users who access it with valid credentials.

Traditional security protection methods are mainly based on rules and signatures and are used to detect and defend against known attack patterns. As network attack methods continue to evolve, the limitations of these methods have become increasingly apparent. Abnormal user identification methods based on behavior analysis and traffic detection have emerged in response.

Behavior-analysis-based methods monitor and analyze users' online behavior to identify abnormal behavior patterns. Such behavior includes login patterns, access patterns, data transmission patterns, and so on. Once a model of a user's normal behavior is established, behavior that does not conform to it can be identified and flagged. For example, a user who logs in from multiple locations within a short period, or who transfers a large amount of data at an unusual time of day, may be judged an abnormal user.

Traffic detection is another commonly used method for identifying abnormal users. It monitors information such as packet size, protocol type, and transmission rate in network traffic, and learns and compares against normal traffic patterns to discover abnormal traffic behavior such as large-scale DDoS attacks and port scans.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a method that combines behavior analysis and traffic detection to identify abnormal users both at login and during continued use.

The technical solution adopted by the present invention to solve the above technical problem is a method for identifying abnormal users based on behavior analysis and traffic detection, characterized by comprising:

Detection steps when a user logs into the system:

S1. After the user's password verification passes during login, perform feature extraction and data preprocessing on the login information of the current login; the login information includes the browser type used by the user, the user's operating system type, the user's real address, the MAC address of the user's device, the user login time, and the user's password entry time;

S2. Perform association rule mining on the user's login history in the database to construct the legal range for login; the history consists of the login information from the user's previous logins that satisfied the legal login rules;

S3. Determine whether the current login information is within the legal range; if so, allow the user to log in, otherwise intercept the abnormal behavior;

Detection steps while the user uses the system:

S4. Periodically collect the traffic data and sensitive interface access records within a set time window, and extract features from the collected traffic data to obtain the traffic features of the corresponding IP over that period; a sensitive interface is a preset operation interface related to sensitive behavior;

S6. Input the traffic features and the sensitive interface access records into the trained anomaly detection model, which determines whether the user's behavior is abnormal; when it is, the abnormal behavior is intercepted.

The present invention aggregates multiple features to analyze whether abnormal behavior occurs during login while keeping the login process stable and fast, and during use it continuously checks for anomalies by monitoring traffic data and access to sensitive interfaces. Combining behavior analysis with traffic detection improves both the accuracy and the timeliness of abnormal user identification. By establishing and updating a model of each user's normal behavior and combining it with real-time traffic analysis and monitoring, new types of network attacks and abnormal behavior can be discovered and handled promptly, protecting the security of the network and the interests of users.

Specifically, the anomaly detection model is built using previously collected, known normal traffic data and abnormal traffic data as the training data set; the anomaly detection model is the ensemble learning model EasyEnsemble.

Specifically, the data preprocessing in step S1 is as follows:

Convert the user login time into one of six periods of the corresponding date: morning, noon, afternoon, evening, night, and early morning;

Convert the browser type used by the user into IE, SOGOU_EXPLORER, CHROME, SAFARI, EDGE, FIREFOX, ANDROID_BROWSER, or other;

Convert the operating system type used by the user into WINDOWS, MAC, ANDROID, IOS, or other;

Convert the password entry time into a duration range, determined by the earliest and latest times recorded by the timer;

The real address is the city name.

Specifically, in step S2, the method of mining association rules from the user's login history in the database and constructing the legal range of the login request information is as follows:

The browser type, operating system type, and user login time from the user's past logins are queried from the database and symbolically represented as triples; using the association analysis algorithm FP-Growth, an FP-Tree is built from the triples of the historical records, and frequent patterns are then mined from the FP-Tree to obtain the frequent itemsets, which constitute the legal range for the user's login.

Step S3 determines whether the current login is within the legal range, specifically:

Obtain the symbolically represented triple of the browser type, operating system type, and user login time of the current login; determine whether any combination from the triple of the current login exists in the legal range; if so, proceed to the other user behavior pattern checks; if not, intercept the abnormal behavior;

Other user behavior pattern checks: the user's real address, the MAC address of the user's device, the user login time, and the password entry duration of the current login are compared against the configured legal behavior pattern; if they conform to it, the user is allowed to log in, otherwise the abnormal behavior is intercepted.

Preferably, the other user behavior pattern checks are specifically: determine whether the user's real address and the MAC address of the user's device at the current login have both appeared in the history; whether the current login time is within a preset number of days of the last login time; and whether the password entry duration is within a limited range; if all conditions are met, the user is allowed to log in, otherwise the abnormal behavior is intercepted.

The beneficial effects of the present invention are:

(1) Login-time interception and continuous online detection are used together to detect whether a user exhibits abnormal behavior. Login requests are analyzed quickly with traditional data mining algorithms, which keeps the login process smooth. User traffic behavior is then monitored continuously during actual use, which guards against malicious operations by an unauthorized user who has logged in through the normal process.

(2) For login requests, association rule mining analyzes multiple features in combination, avoiding the erroneous interceptions that judging on a single feature can cause. Furthermore, abnormal traffic detection uses the ensemble learning method EasyEnsemble, which addresses the poor training results caused by the imbalance between the amounts of abnormal and normal traffic training data.

Brief Description of the Drawings

Fig. 1 is a flow chart of the invention.

Detailed Description

The technical solution adopted by the abnormal user identification method based on behavior analysis and traffic detection is shown in Fig. 1 and includes the following steps:

S1. When the user logs in, extract the browser used by the user, the user's operating system, the user's real address, the device MAC address, the login time, and the password entry time, and preprocess the data;

S2. Perform association rule mining on the user's past login records in the database to construct the legal rules for the login state;

S3. Determine whether the user's login falls within the legal range of the rules; if it does, allow the user to enter the system, otherwise intercept the abnormal behavior;

S4. Monitor the server's traffic data and sensitive interface activity over a period of time, and parse the data and extract features to obtain the multi-dimensional features of the corresponding IP over that period;

S5. Process previously collected normal and abnormal traffic data with the same parsing and feature extraction methods as in S4, and use the processed data to train a traffic analysis model based on ensemble learning;

S6. Use the trained model to continuously perform anomaly detection on the traffic data collected while the service runs, and intercept the abnormal behavior of the corresponding user once an anomaly is found.

S1 comprises the following steps:

S1.1 When the user clicks to log in, the user's information is sent to the server as an HTTP request, and the required information can be obtained by parsing that request;

S1.2 After the HTTP request header is parsed, the browser and operating system used by the user when sending the request can be extracted from the User-Agent field;

S1.3 After the HTTP request header is parsed, the IP address of the user's request can be obtained, and the MAC address of the target device is obtained through the ARP command. On the server side, code can invoke the command-line operation to obtain the MAC address. These two user behavior features form the user address behavior vector. An IP address library is used to map the user's IP address to the user's actual location;

S1.4 The password entry time is obtained through a front-end keyboard listener event in the web application. Timing starts when the user begins entering the password and ends when the user clicks elsewhere and the password box loses focus. The time at which the user sends the HTTP request to the web application is recorded as the user's login time;

S1.5 After the above features are extracted, the user's behavior is defined as the set of six features (the browser used, the operating system used, the user's real address, the MAC address of the user's device, the login time, and the password entry time) and is represented as a vector;

S1.6 Each of the six features is processed: the user login time is converted into one of six periods (morning, noon, afternoon, evening, night, and early morning); the browser type is converted into IE, SOGOU_EXPLORER, CHROME, SAFARI, EDGE, FIREFOX, ANDROID_BROWSER, or other; the operating system type is converted into WINDOWS, MAC, ANDROID, IOS, or other; the password entry time is converted into a range bounded by the longest and shortest times; and the user's login IP is converted into a specific location, such as Beijing or Shanghai.
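The normalization in S1.6 can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation: the hour boundaries of the six periods and the User-Agent substring rules are assumptions chosen for illustration (the source names the categories but not how they are derived), and the function names are hypothetical.

```python
from datetime import datetime

# Assumed end-exclusive hour boundaries; the source only names the six periods.
PERIODS = [(6, "EARLY_MORNING"), (11, "MORNING"), (13, "NOON"),
           (17, "AFTERNOON"), (19, "EVENING"), (24, "NIGHT")]


def login_period(ts: datetime) -> str:
    """Map a login timestamp to one of the six periods."""
    for end_hour, name in PERIODS:
        if ts.hour < end_hour:
            return name
    raise ValueError("hour out of range")


def browser_type(ua: str) -> str:
    """Rough heuristic: order matters because many User-Agents mention Chrome and Safari."""
    if "Edg/" in ua or "Edge/" in ua:
        return "EDGE"
    if "MetaSr" in ua or "SE 2.X" in ua:
        return "SOGOU_EXPLORER"
    if "MSIE" in ua or "Trident/" in ua:
        return "IE"
    if "Firefox/" in ua:
        return "FIREFOX"
    if "Chrome/" in ua:
        return "CHROME"
    if "Android" in ua and "Version/" in ua:
        return "ANDROID_BROWSER"
    if "Safari/" in ua:
        return "SAFARI"
    return "OTHER"


def os_type(ua: str) -> str:
    """Rough heuristic: iOS is tested before Mac because iOS User-Agents say 'like Mac OS X'."""
    if "Windows" in ua:
        return "WINDOWS"
    if "Android" in ua:
        return "ANDROID"
    if "iPhone" in ua or "iPad" in ua:
        return "IOS"
    if "Mac OS X" in ua or "Macintosh" in ua:
        return "MAC"
    return "OTHER"


def login_triple(ua: str, ts: datetime) -> tuple:
    """Symbolic (browser, operating system, period) triple used later by S2 and S3."""
    return (browser_type(ua), os_type(ua), login_period(ts))
```

A production system would more likely rely on a maintained User-Agent parsing library than on substring checks; the sketch only shows the shape of the categorical output.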

S2 comprises the following steps:

S2.1 Query the database for the browser, operating system, and login time from the user's past logins, and represent them symbolically as triples;

S2.2 Following the FP-Growth algorithm, build the frequent itemset table from the triples: traverse the data set and compute the support of each item. Using the minimum support threshold, select the frequent items and sort them in descending order of support. Specifically, the minimum support is set to 3, and the support of an item is how often it appears in the data set;

S2.3 Following the FP-Growth algorithm, build the FP-Tree from the historical triples using the frequent itemset table and the data set. For each triple, construct a path following the item order in the frequent itemset table; if a node on the path already exists, increase its support count, otherwise create a new node. Repeat this for each triple until all transactions have been processed;

S2.4 Following the FP-Growth algorithm, mine frequent patterns from the FP-Tree by traversing it recursively. For each item, start from the leaf nodes and walk upward through each node's parents to build the conditional pattern base (that is, all nodes on the path from the node to the root);

For each frequent itemset, generate its conditional pattern base;

For each conditional pattern base, construct a conditional FP-Tree and the corresponding frequent itemset table;

If the conditional FP-Tree is not empty, execute S2.4 recursively until no more frequent itemsets can be generated;

This process yields all frequent patterns and their supports; the generated frequent patterns are the legal rules for the user's login state.
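Steps S2.2 to S2.4 can be sketched as a compact FP-Growth in Python. This is an illustrative sketch under stated assumptions, not the patent's code: items are encoded as hypothetical `key=value` strings so that a browser and an operating system can never collide, support is treated as an occurrence count (consistent with the minimum support of 3 given in the text), and the function names are invented for the example.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(transactions, min_support):
    """S2.2 + S2.3: count items, keep the frequent ones, insert sorted paths."""
    counts = Counter()
    for items, n in transactions:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, n in transactions:
        # Descending support, ties broken by name so the order is deterministic.
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += n
            node = child
    return header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """S2.4: recurse on the conditional pattern base of every frequent item."""
    header, frequent = build_tree(transactions, min_support)
    patterns = {}
    for item, support in frequent.items():
        pattern = suffix | {item}
        patterns[pattern] = support
        base = []  # conditional pattern base: prefix paths with their counts
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        patterns.update(fp_growth(base, min_support, pattern))
    return patterns


def mine_legal_range(history, min_support=3):
    """history: iterable of (browser, os, period) triples from past legal logins."""
    transactions = [([f"browser={b}", f"os={o}", f"time={t}"], 1)
                    for b, o, t in history]
    return fp_growth(transactions, min_support)
```

The returned dictionary maps each frequent itemset (a `frozenset`) to its support; its keys play the role of the legal rules for the user's login state.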

S3 comprises the following steps:

After the user's password is successfully verified, determine whether the user's behavior conforms to the legal login-state rules mined in step S2 and whether it satisfies the user behavior identification conditions;

S3.1 Obtain three user behavior features (the operating system used by the user currently logging in, the browser used, and the current login time) and convert the data as in S1 and S2 to obtain a triple;

S3.2 Determine in turn whether any combination from the triple exists in the legal rules;

S3.3 Check whether the user's current login address has appeared in the user's behavior pattern;

S3.4 Check whether the MAC address of the user's current device has appeared in the user's behavior pattern;

S3.5 Check whether the user's current login time is within 30 days of the last login time;

S3.6 Check whether the user's password entry time is within the limited range.
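The checks in S3.1 to S3.6 can be sketched in Python as below. Several points are assumptions made for illustration: the text says "any combination" without defining it, so this sketch accepts the triple when at least one pair or the full triple is a frequent pattern; the `LoginProfile` container, its field names, and the `key=value` item encoding are hypothetical; only the 30-day limit comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass
class LoginProfile:
    legal_range: set      # frozensets of "key=value" items mined in S2
    known_cities: set     # cities seen in the user's history
    known_macs: set       # device MAC addresses seen in the user's history
    last_login: datetime
    min_typing_s: float   # assumed bounds of the password entry duration
    max_typing_s: float


def triple_is_legal(triple, legal_range, min_size=2):
    """S3.2 (assumed reading): some combination of >= min_size items is frequent."""
    for size in range(min_size, len(triple) + 1):
        for combo in combinations(triple, size):
            if frozenset(combo) in legal_range:
                return True
    return False


def check_login(profile, triple, city, mac, now, typing_s, max_gap_days=30):
    """Return True to admit the login, False to trigger interception."""
    if not triple_is_legal(triple, profile.legal_range):          # S3.2
        return False
    if city not in profile.known_cities:                          # S3.3
        return False
    if mac not in profile.known_macs:                             # S3.4
        return False
    if now - profile.last_login > timedelta(days=max_gap_days):   # S3.5
        return False
    return profile.min_typing_s <= typing_s <= profile.max_typing_s  # S3.6
```

Raising `min_size` to 3 would require the exact triple to be frequent, which is stricter and would reject more first-time combinations; which reading the authors intend is not stated.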

S4 comprises the following steps:

S4.1 Use a packet capture tool to collect the traffic data in the server system over a period of time, and parse the packets to obtain the specific protocol information;

S4.2 Use CICFlowMeter to extract features from the captured packets, including flow duration, total packet count, packet size, and other feature information.

S5 comprises the following steps:

S5.1 Collect past system traffic data and open-source traffic data, parse the traffic packets with the same CICFlowMeter tool, and label the collected data as benign or malicious;

S5.2 Input the data collected in S5.1 into the ensemble learning model EasyEnsemble for training to obtain a malicious traffic classification model.

S6 comprises the following steps:

A scheduled task in the server system periodically collects traffic data, extracts features, and feeds the data to the malicious traffic detection model for classification. If traffic is marked as malicious, the user associated with that traffic is identified and checked for being online. If the user is online, abnormal user handling is performed immediately; if not, it is performed at the user's next login.
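The online/offline branch described for S6 can be sketched as follows. This is only an illustration: the mapping from IP to user, the set of online users, and the set of users awaiting handling are hypothetical in-memory structures, and the returned strings are placeholders for whatever interception mechanism a real system applies.

```python
def handle_malicious_flow(src_ip, ip_to_user, online_users, pending_users):
    """Decide how to treat the user behind a flow the model marked as malicious.

    ip_to_user    -- dict mapping a source IP to a user id (assumed to exist)
    online_users  -- set of user ids with an active session
    pending_users -- set collecting users to be handled at their next login
    """
    user = ip_to_user.get(src_ip)
    if user is None:
        return "unknown_source"          # no account is tied to this IP
    if user in online_users:
        return "intercept_now"           # user is online: handle immediately
    pending_users.add(user)              # user is offline: defer to next login
    return "intercept_at_next_login"
```

At login time, a system built this way would consult `pending_users` before admitting the user.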

Example

S1: Collect the data sets for traffic anomaly detection, comprising the system's historical traffic data and open-source data sets. The open-source data sets contain both benign and malicious traffic, while all historical traffic generated by the system is treated as benign. Divide the data into training, validation, and test sets in a ratio of 8:1:1;
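The 8:1:1 division in this step can be sketched in Python as below. The shuffle, the fixed seed, and the function name are assumptions for illustration; the source only states the ratio.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle the samples and cut them into train/validation/test at 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # seeded so the split is reproducible
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test
```

Because malicious samples are rare, a stratified split that preserves the class ratio in each part may be preferable in practice; the source does not say which was used.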

S2: Use the CICFlowMeter tool to parse the traffic data and extract traffic features, including flow duration, total packet count, packet size, and so on, then clean and normalize the data;

S2.1: Parse the traffic packets, which comprises packet header parsing, packet data parsing, Ethernet header parsing, IP packet header parsing, and parsing of the relevant specific protocols;

S2.2: Group packets by source IP, destination IP, source port, destination port, and protocol; packets for which all five values are identical are recorded as one flow, and features are extracted for each group;

S2.3: All data are expressed as numerical or discrete values and stored in CSV format for subsequent training;
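The five-tuple grouping of S2.2 can be sketched in Python as below. This does not reproduce CICFlowMeter, which computes many more features and treats flows bidirectionally; it only illustrates grouping parsed packets into flows and deriving the three features the text names. The packet dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def flow_features(packets):
    """Group parsed packets by five-tuple and compute per-flow features.

    Each packet is a dict with the assumed keys:
    ts (seconds), src_ip, dst_ip, src_port, dst_port, proto, size (bytes).
    """
    flows = defaultdict(list)
    for p in packets:
        key = (p["src_ip"], p["dst_ip"], p["src_port"], p["dst_port"], p["proto"])
        flows[key].append(p)

    rows = []
    for key, pkts in flows.items():
        times = [p["ts"] for p in pkts]
        sizes = [p["size"] for p in pkts]
        rows.append({
            "flow": key,
            "duration": max(times) - min(times),  # flow duration
            "packets": len(pkts),                 # total packet count
            "mean_size": mean(sizes),             # average packet size
            "total_bytes": sum(sizes),
        })
    return rows
```

Each returned row corresponds to one line of the CSV mentioned in S2.3.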

S3: Train the ensemble learning model EasyEnsemble, using AdaBoost as the base learner, to obtain a classification model capable of detecting abnormal traffic;
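The idea behind EasyEnsemble, which is to train several learners on balanced subsets made of all minority samples plus an equally sized random draw of majority samples, can be sketched in pure Python as below. This is a deliberately simplified illustration: the base learner is a one-feature threshold stump standing in for AdaBoost, the subsets are combined by majority vote, and each sample is a single number (for example a packet rate). None of these simplifications comes from the source, and all names are hypothetical.

```python
import random


def balanced_subsets(benign, malicious, n_subsets, seed=0):
    """Undersample the majority class: every subset keeps all malicious
    samples and an equally sized random draw of benign samples."""
    rng = random.Random(seed)
    return [(rng.sample(benign, len(malicious)), list(malicious))
            for _ in range(n_subsets)]


def train_stump(benign, malicious):
    """Placeholder base learner (NOT AdaBoost): threshold midway between class means."""
    mean_b = sum(benign) / len(benign)
    mean_m = sum(malicious) / len(malicious)
    threshold = (mean_b + mean_m) / 2
    sign = 1 if mean_m > mean_b else -1
    return lambda x: sign * (x - threshold) > 0   # True means "malicious"


def easy_ensemble(benign, malicious, n_subsets=5, seed=0):
    """Train one learner per balanced subset and combine them by majority vote."""
    models = [train_stump(b, m)
              for b, m in balanced_subsets(benign, malicious, n_subsets, seed)]

    def predict(x):
        votes = sum(model(x) for model in models)
        return votes * 2 > len(models)

    return predict
```

With real multi-dimensional flow features, an existing library implementation of EasyEnsemble with AdaBoost base learners would replace the stump and the voting shown here.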

S4: Configure the information to be monitored in the system's login module, and perform data cleaning, standardization, and related operations;

S4.1 Read the User-Agent attribute from the HTTP header to obtain the user's operating system and browser, and standardize the information. The browser type is recorded as IE, SOGOU_EXPLORER, CHROME, SAFARI, EDGE, FIREFOX, ANDROID_BROWSER, or other; the operating system type is recorded as WINDOWS, MAC, ANDROID, IOS, or other;

S4.2 Obtain the user's IP from the HTTP header, then obtain the MAC address of the target device through the ARP command. The IP must also be converted into an actual region, such as Beijing or Shanghai;

S4.3 Monitor the input box on the front end to obtain the time the user spends entering the password and the user's actual login time, and convert the actual login time into one of six periods: morning, noon, afternoon, evening, night, and early morning;

S4.4 Store these data together in the database for later use when computing the legal login rules;

S5: Monitor the user login process for abnormal behavior, which comprises extracting the legal login rules and checking that the login complies with them;

S5.1: Query the historical login information of the user currently logging in, and use the FP-Growth data mining algorithm to analyze the relationships among login time, browser, and operating system in the user's login history, producing a set of frequent patterns;

S5.2: Check whether any combination of the login time, browser, and operating system of the current login appears in the frequent patterns; if none does, trigger the abnormal user behavior handling mechanism;

S5.3: Continue by checking whether the user's real address, the device MAC address, and the password entry time are within the legal range of the historical data; if any check fails, trigger the abnormal user behavior handling mechanism;

S5.4 If all checks pass, the user logs into the system normally;

S6 While the user uses the system, continuously monitor traffic and sensitive interfaces, and use the model trained in S3 to detect malicious behavior;

S6.1 Set up a scheduled task in the system to periodically collect the system's traffic packets and each user's access counts for the download and query interfaces;

S6.2 Parse the packets and extract features as in S2, and input the data into the detection model for analysis. If abnormal traffic behavior is found, obtain the source IP of that traffic, determine which user the IP corresponds to, and force that user through the abnormal user behavior handling mechanism;

S6.3 Determine whether the number of times a user has accessed sensitive interfaces within a period of time exceeds the maximum limit. If it does, force that user through the abnormal user behavior handling mechanism.
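The access-count check in S6.3 can be sketched with a sliding window per user. The class name, the window length, and the limit are assumptions for illustration; the source only says that a maximum number of accesses within a period must not be exceeded.

```python
from collections import defaultdict, deque


class SensitiveAccessMonitor:
    """Track per-user accesses to sensitive interfaces in a sliding time window."""

    def __init__(self, max_calls, window_s):
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = defaultdict(deque)   # user id -> timestamps in the window

    def record(self, user, ts):
        """Record one access at time ts (seconds); return True if the limit is exceeded."""
        calls = self._calls[user]
        calls.append(ts)
        # Drop accesses that have fallen out of the window.
        while calls and calls[0] <= ts - self.window_s:
            calls.popleft()
        return len(calls) > self.max_calls
```

A `True` result would trigger the abnormal user behavior handling mechanism for that user.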

Finally, it should be noted that the embodiments described above are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of the technical features. Such modifications, variations, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all of them shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. A method for identifying abnormal users based on behavior analysis and traffic detection, characterized by comprising the following steps:

detection steps when a user logs into the system:

S1. after user password verification passes during a user login process, performing feature extraction and data preprocessing on the login information of the current login; the login information comprises the browser type used by the user, the user's operating system type, the user's real address, the MAC address of the user's device, the user login time, and the user's password entry time;

S2. performing association rule mining on the user's login history in a database to construct a legal range for login; the history is the login information from before the current login that satisfies the legal login rules;

S3. determining whether the user's current login information is within the legal range; if so, allowing the user to log in, otherwise intercepting the abnormal behavior;

detection steps while the user uses the system:

S4. periodically collecting traffic data and sensitive interface access records within a set time, and extracting features from the collected traffic data to obtain the traffic features of the corresponding IP over a period of time; the sensitive interface is a preset operation interface related to sensitive behavior;

S6. inputting the traffic features and the sensitive interface access records into a trained anomaly detection model, the anomaly detection model determining whether the user's behavior is abnormal, and intercepting the abnormal behavior when it is.

2. The method of claim 1, wherein the anomaly detection model is built using previously collected, known normal traffic data and abnormal traffic data as the training data set;

the anomaly detection model is the ensemble learning model EasyEnsemble.

3. The method of claim 1, wherein in step S1, when the user clicks to log in and interacts with the server in the form of HTTP, the HTTP header information is parsed to obtain the current log-in information:

After the HTTP request header is analyzed, extracting the browser type and the User operating system type used by a User when sending a request from a User-Agent field;

after the HTTP request header is parsed, acquiring the user's IP address, and acquiring the MAC address of the equipment through the Address Resolution Protocol (ARP); mapping the user's IP address to the user's real address by using an IP address library;

acquiring the password input time through a front-end keyboard listening event of the web application system: timing starts when the user begins to input the password and ends when the user clicks elsewhere so that the password box loses focus;

and when the user sends the HTTP request to the back end of the web application system, acquiring the current time and recording it as the user login time.
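The User-Agent extraction described above can be illustrated with a minimal sketch based on substring rules. The function name `parse_user_agent`, the `UNKNOWN` fallback and the specific matching rules are assumptions made for this example; the output labels are the ones enumerated in claim 4. Real User-Agent strings are irregular, so a maintained parsing library would likely be preferable in practice.

```python
def parse_user_agent(ua):
    """Return (browser, os) labels derived from a User-Agent string.

    The rule order matters: Edge and Sogou strings also contain
    "Chrome", Chrome strings also contain "Safari", and iOS strings
    contain "Mac OS X".
    """
    s = ua.lower()

    if "edg" in s:
        browser = "EDGE"
    elif "metasr" in s or "sogou" in s:
        browser = "SOGOU_EXPLORER"
    elif "firefox" in s:
        browser = "FIREFOX"
    elif "msie" in s or "trident" in s:
        browser = "IE"
    elif "chrome" in s or "crios" in s:
        browser = "CHROME"
    elif "android" in s and "safari" in s:
        browser = "ANDROID_BROWSER"
    elif "safari" in s:
        browser = "SAFARI"
    else:
        browser = "UNKNOWN"  # hypothetical fallback, not part of the claims

    if "android" in s:
        os_name = "ANDROID"
    elif "iphone" in s or "ipad" in s:
        os_name = "IOS"
    elif "windows" in s:
        os_name = "WINDOWS"
    elif "mac os x" in s or "macintosh" in s:
        os_name = "MAC"
    else:
        os_name = "UNKNOWN"  # hypothetical fallback, not part of the claims

    return browser, os_name

chrome_on_windows = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
safari_on_iphone = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 "
    "Mobile/15E148 Safari/604.1"
)
```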

4. The method according to claim 1, wherein the data preprocessing in step S1 is specifically:

the user login time is converted into one of six time periods of the corresponding date, such as morning, noon, afternoon, evening and early morning;

converting the browser type used by the user into IE, SOGOU_EXPLORER, CHROME, SAFARI, EDGE, FIREFOX or ANDROID_BROWSER;

converting the type of the operating system used by the user into WINDOWS, MAC, ANDROID or IOS;

converting the user's password input time into a duration range, wherein the duration range is determined by the earliest time and the latest time of the timing;

the real address is represented as a city name.

5. The method of claim 1, wherein in step S2, association rule mining is performed on the history records of the user's logins in the database, and the legal range of the login request information is constructed as follows:

querying the database for the browser type, operating system type and user login time of the user's previous logins, and symbolizing them as triples; according to the association analysis algorithm FP-Growth, constructing an FP-tree from the triples of the history records, and then mining frequent patterns from the FP-tree to obtain frequent itemsets, which form the legal range for the user's login.
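The construction of the legal range can be sketched as follows. Note the substitution: instead of building an FP-tree, this sketch enumerates and counts every sub-combination of each triple directly, which produces the same frequent itemsets as FP-Growth on such short transactions but does not scale the way FP-Growth does. The function name, the `field=value` symbolization and the `min_support` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("browser", "os", "period")

def symbolize(triple):
    """Tag each value with its field so values from different fields cannot collide."""
    return [f"{field}={value}" for field, value in zip(FIELDS, triple)]

def mine_legal_range(history, min_support=2):
    """Return {itemset: support} for every itemset seen at least min_support times.

    Brute-force counting stands in for FP-Growth here; both yield the
    same frequent itemsets for the same support threshold.
    """
    counts = Counter()
    for triple in history:
        items = sorted(symbolize(triple))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical login history: (browser, operating system, login period).
history = [
    ("CHROME", "WINDOWS", "morning"),
    ("CHROME", "WINDOWS", "morning"),
    ("CHROME", "WINDOWS", "evening"),
    ("FIREFOX", "MAC", "evening"),
]
legal_range = mine_legal_range(history)
```

With this history, the one-off FIREFOX/MAC login does not reach the support threshold, so it contributes nothing to the legal range beyond the shared `period=evening` item.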

6. The method of claim 5, wherein judging in step S3 whether the current login is within the legal range specifically comprises:

obtaining the symbolized triple of browser type, operating system type and user login time for the current login; judging whether any combination in the triple of the current login exists in the legal range; if so, proceeding to the judgment of other user behavior patterns, and if not, intercepting the abnormal behavior;

judging other user behavior patterns: comparing the user's real address, the MAC address of the user equipment, the user login time and the password input duration of the current login against the set legal behavior pattern; if they conform to the legal behavior pattern, allowing the user to log in, and otherwise intercepting the abnormal behavior.
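The first-stage check of this claim can be sketched as a lookup of the current triple's combinations in the legal range. The claim does not state how large a matching combination must be, so this sketch exposes that as a hypothetical `min_size` parameter; the function name and the sample legal range are likewise assumptions.

```python
from itertools import combinations

def login_in_legal_range(triple_items, legal_range, min_size=1):
    """Return True if any combination of the current login's items is legal.

    triple_items: symbolized items of the current login.
    legal_range:  collection of frozensets (the mined frequent itemsets).
    min_size:     smallest combination size that counts as a match
                  (an assumption; the claim says only "any combination").
    """
    items = sorted(set(triple_items))
    for size in range(min_size, len(items) + 1):
        for combo in combinations(items, size):
            if frozenset(combo) in legal_range:
                return True
    return False

# Hypothetical legal range, as it might come out of the mining step.
legal_range = {
    frozenset({"browser=CHROME", "os=WINDOWS"}),
    frozenset({"period=morning"}),
}
usual_login = ("browser=CHROME", "os=WINDOWS", "period=evening")
odd_login = ("browser=FIREFOX", "os=MAC", "period=evening")
```

A login that passes this check is not yet accepted; it only proceeds to the second-stage judgment of the other behavior patterns.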

7. The method of claim 6, wherein the judgment of other user behavior patterns specifically comprises: judging whether the current login satisfies the following conditions: the user's real address and the MAC address of the user equipment appear in the history records; the user login time is within a preset number of days of the last login time; and the password input duration is within a limited range; if all of the conditions are satisfied, the user is allowed to log in, and otherwise the abnormal behavior is intercepted.
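The three conditions of this claim can be sketched as a single predicate. The record layout (dictionaries with `city`, `mac`, `time` and `password_seconds` keys), the 30-day gap and the 1 to 20 second duration range are hypothetical values chosen for illustration; the claim leaves the preset number of days and the limited range unspecified.

```python
from datetime import datetime, timedelta

def passes_behavior_checks(login, history, max_gap_days=30,
                           duration_range=(1.0, 20.0)):
    """Return True only if the login satisfies all three conditions.

    1. City and MAC address both appear in the history records.
    2. Login time is within max_gap_days of the most recent login.
    3. Password input duration lies inside duration_range (seconds).
    """
    if not history:
        return False  # nothing to compare against; treated as abnormal here

    known_cities = {record["city"] for record in history}
    known_macs = {record["mac"] for record in history}
    last_login = max(record["time"] for record in history)
    low, high = duration_range

    return (
        login["city"] in known_cities
        and login["mac"] in known_macs
        and login["time"] - last_login <= timedelta(days=max_gap_days)
        and low <= login["password_seconds"] <= high
    )

history = [
    {"city": "Chengdu", "mac": "aa:bb:cc:dd:ee:ff",
     "time": datetime(2024, 3, 1, 9, 0)},
]
login = {
    "city": "Chengdu", "mac": "aa:bb:cc:dd:ee:ff",
    "time": datetime(2024, 3, 10, 9, 0), "password_seconds": 4.2,
}
```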

8. The method of claim 1, wherein the traffic characteristics include traffic duration, total number of packets, and packet size.
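The per-IP traffic characteristics named in this claim can be sketched as a simple aggregation over captured packets. The input layout `(timestamp, source IP, size in bytes)` and the function name are assumptions, and "packet size" is interpreted here as the mean packet size within the window, which the claim does not specify.

```python
from collections import defaultdict

def traffic_features_by_ip(packets):
    """Aggregate packets from one collection window into per-IP features.

    packets: iterable of (timestamp_seconds, source_ip, size_bytes).
    Returns {ip: {"duration", "packet_count", "mean_packet_size"}}.
    """
    grouped = defaultdict(list)
    for timestamp, ip, size in packets:
        grouped[ip].append((timestamp, size))

    features = {}
    for ip, rows in grouped.items():
        times = [t for t, _ in rows]
        sizes = [s for _, s in rows]
        features[ip] = {
            "duration": max(times) - min(times),           # traffic duration
            "packet_count": len(rows),                     # total number of packets
            "mean_packet_size": sum(sizes) / len(sizes),   # assumed reading of "packet size"
        }
    return features

# Hypothetical packets captured within one window.
packets = [
    (0.0, "10.0.0.1", 100),
    (1.5, "10.0.0.1", 300),
    (0.2, "10.0.0.2", 60),
]
features = traffic_features_by_ip(packets)
```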

9. The method of claim 1, wherein when the anomaly detection model in step S6 judges that the user behavior is abnormal, the abnormal behavior is intercepted and the user associated with the abnormal access is identified; whether the user is online is checked; if so, abnormal user processing is performed immediately, and if not, abnormal user processing is performed at the user's next login.
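The online/offline branching of this claim can be sketched with a pending set that defers processing until the next login. The class name, the method names and the `processed` log are hypothetical; what "abnormal user processing" actually does is not specified in the claim, so the sketch only records when it would happen.

```python
class AbnormalUserHandler:
    """Process abnormal users immediately if online, otherwise at next login."""

    def __init__(self):
        self.online = set()    # users with an active session
        self.pending = set()   # abnormal users awaiting their next login
        self.processed = []    # (user, when) records, for illustration only

    def on_abnormal_access(self, user):
        """Called after the detection model flags an access tied to this user."""
        if user in self.online:
            self.processed.append((user, "immediately"))
        else:
            self.pending.add(user)

    def on_login(self, user):
        """Called at login; handles any deferred abnormal-user processing."""
        self.online.add(user)
        if user in self.pending:
            self.pending.discard(user)
            self.processed.append((user, "at next login"))

    def on_logout(self, user):
        self.online.discard(user)

handler = AbnormalUserHandler()
handler.on_login("alice")
handler.on_abnormal_access("alice")   # alice is online
handler.on_abnormal_access("bob")     # bob is offline, so processing is deferred
handler.on_login("bob")
```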

10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of claim 1.