WO2025123802A1

WO2025123802A1 - Packet inspection and analysis system and method based on deep flow inspection

Info

Publication number: WO2025123802A1
Application number: PCT/CN2024/116693
Authority: WO
Inventors: 张巍; 代天雄
Original assignee: Exands Shanghai Information Technology Co Ltd
Current assignee: Exands Shanghai Information Technology Co Ltd
Priority date: 2023-12-11
Filing date: 2024-09-04
Publication date: 2025-06-19
Anticipated expiration: 2026-06-11
Also published as: CN117792701A; CN117792701B

Abstract

The present invention relates to the technical field of packet inspection. Disclosed are a packet inspection and analysis system and method based on deep flow inspection. The system comprises a data acquisition module, a deep flow inspection module, a packet analysis module, a data storage module, and a security module. The data acquisition module acquires the data flow information generated when packets are transmitted in a network; the deep flow inspection module extracts flow features from the packet data flow and determines its application type; the packet analysis module determines whether a packet is normal; the data storage module stores packet traffic data; and the security module checks the packet information and the network connection status. According to the present invention, deep flow inspection is used to identify the application type of a data flow and is therefore unaffected by encryption of the application layer data; a Bayesian classifier then continues the classification within the broad category, relying only on the statistical properties of the data, which achieves high efficiency and accuracy when determining whether an encrypted packet is normal.

Description

A packet inspection and analysis system and method based on deep flow inspection

Technical Field

The present invention relates to the technical field of packet inspection, and in particular to a packet inspection and analysis system and method based on deep flow inspection.

Background Art

Deep flow inspection (DFI) is an application identification technology based on traffic behavior. It takes the flow as its basic object of study and extracts flow features, such as flow size and flow rate, from large volumes of network flow data. DFI consists of three main parts: flow feature selection, flow feature extraction, and a classifier. By analyzing the packets in network traffic, DFI detects and identifies different application types; because different application types behave differently in network traffic, DFI can distinguish them by the traffic behavior of their packets. However, DFI can only classify application types into broad categories and cannot pinpoint the specific application type. This is because DFI focuses on the behavioral characteristics of network traffic: when traffic matches a particular behavior model, DFI assigns it to the corresponding category, such as P2P traffic or VoIP traffic.

Determining whether a service is normal usually requires knowing the specific application type, because different application types have different traffic characteristics and behavior patterns. HTTP and HTTPS are both TCP-based web browsing application types, and their traffic characteristics and behavior patterns are very similar. However, the distributions of their flow features are not identical: the same flow feature value occupies different positions in the HTTP and HTTPS flow feature distributions, so the two cases must be judged separately when determining whether the service is normal. Therefore, given that deep flow inspection can only distinguish data flow application types by broad category, how to determine whether a service is normal is a problem that needs to be solved.

Summary of the Invention

The purpose of the present invention is to provide a packet inspection and analysis system and method based on deep flow inspection, so as to solve the problems raised in the background art above.

In one aspect of the present invention, a packet inspection and analysis system based on deep flow inspection is provided, comprising: a data acquisition module, a deep flow inspection module, a packet analysis module, a data storage module, and a security module. The output of the data acquisition module is connected to the inputs of the deep flow inspection module and the data storage module; the data acquisition module collects the data flow information generated when packets are transmitted in the network. The output of the deep flow inspection module is connected to the input of the packet analysis module; the deep flow inspection module extracts flow features from the packet data flow and determines the application type of the packet data flow. The output of the packet analysis module is connected to the input of the security module; the packet analysis module determines whether a packet is normal. The data storage module is connected to the deep flow inspection module and the packet analysis module and stores packet traffic data. When the packet analysis module determines that a packet data flow is abnormal, the security module checks the packet information and the network connection status and notifies the administrator to take action.

Specifically, the deep flow inspection module further includes a data preprocessing unit, a flow feature extraction unit, a data flow classification unit, and an optimization unit. The data preprocessing unit preprocesses the packet traffic data; the flow feature extraction unit extracts the flow features of the packet traffic from the packet traffic data; the data flow classification unit performs supervised classification and identification of the application type of the packet data flow; and the optimization unit trains and optimizes the data flow classification unit using known packet traffic data and application types, to improve accuracy and robustness. The packet analysis module further includes an unsupervised classification unit, an application type determination unit, and a detection unit. The unsupervised classification unit finds historical data flows related to the packet data flow; the application type determination unit determines the probability that the packet data flow belongs to each application type; and the detection unit determines whether the packet data flow is normal. The data storage module uses a 3×n matrix to store the probability that a data flow belongs to each application type and the abnormality rate of each application type. The elements of the first row are 1, 2, …, n, representing n different application types; the elements of the second row are the probabilities that the data flow belongs to each application type; and the elements of the third row are the probabilities that the packet data flow is abnormal given that it belongs to each application type. Each 3×n matrix corresponds to the flow feature data of one packet. As flow feature data accumulates, only the entries at the corresponding positions of the 3×n matrix need to be updated; when the same flow feature data appears again, whether the packet is normal is determined from the data in the matrix, and the packet traffic data does not need to be analyzed again.
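The 3×n storage layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the dictionary cache keyed by the flow feature tuple, and the sample values are all hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def new_flow_matrix(n: int) -> Matrix:
    """Build a 3 x n matrix for n application types.

    Row 0: type indices 1..n.
    Row 1: probability that the flow belongs to each type.
    Row 2: probability that the flow is abnormal given each type.
    """
    return [
        [float(i) for i in range(1, n + 1)],
        [0.0] * n,
        [0.0] * n,
    ]


def update_entry(matrix: Matrix, type_index: int,
                 p_type: float, p_abnormal: float) -> None:
    """Overwrite the column of one application type (1-based index)."""
    col = type_index - 1
    matrix[1][col] = p_type
    matrix[2][col] = p_abnormal


# Hypothetical cache keyed by the flow feature tuple, so that a repeated
# feature vector can be answered from the stored matrix without re-analysis.
cache: Dict[Tuple[float, ...], Matrix] = {}
features = (512.0, 40.0, 1.5, 12.0)  # illustrative values only
m = cache.setdefault(features, new_flow_matrix(3))
update_entry(m, 2, 0.7, 0.1)
```

Keying the cache by the exact feature tuple mirrors the statement that identical flow feature data is judged from the stored matrix; a real system would likely need to discretize continuous features first, which the text does not specify.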

In another aspect of the present invention, a packet inspection and analysis method based on deep flow inspection is provided, comprising the following steps:

S5-1, obtain the flow feature data of a data flow, where the data flow is the data flow generated when a packet is transmitted in the network;

S5-2, based on the flow feature data of the data flow, make a first determination of the application type of the data flow using a supervised classification model; if the application type of the data flow is not unique, proceed to step S5-3; if the application type of the data flow is unique, proceed to step S5-4;

If the flow features of the data flow do not match the flow features of any known application type, the packet data flow is directly judged to be abnormal. Such anomalous flow features may be caused by network problems or by problems with the data itself, so the flow is handed to the security module, which checks the network connection status and the packet information;

S5-3, based on the application types from the first determination, determine the probability that the data flow belongs to each application type; from the application type and the flow feature data of the data flow, determine the probability that the data flow is abnormal given each application type; combine all of these abnormality probabilities to determine whether the data flow is normal;

S5-4, determine whether the data flow is normal from its application type and flow feature data.
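The branching among steps S5-2, S5-3, and S5-4 can be summarized as a small routing function. The sketch below is illustrative only; the function name and the string labels it returns are assumptions, not identifiers from the source.

```python
from typing import Sequence


def route_flow(candidate_types: Sequence[str]) -> str:
    """Return the next step for a flow, given the application types
    that matched it in the first determination (step S5-2)."""
    if len(candidate_types) == 0:
        # No known application type matches: judged abnormal directly
        # and handed to the security module.
        return "abnormal"
    if len(candidate_types) == 1:
        # Unique application type.
        return "S5-4"
    # Several candidate application types.
    return "S5-3"
```

For example, `route_flow(["HTTP", "HTTPS"])` selects step S5-3, because the first determination could not narrow the flow down to a single application type.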

In step S5-2, making the first determination of the application type of the data flow using the supervised classification model specifically includes the following steps:

S6-1, obtain historical data flow information, including the flow features and the application types of the data flows;

S6-2, for each application type, train a binary classification model with the flow features of the historical data flows as input and the application type as output, obtaining n binary classification models in total, where n is the number of application types;

S6-3, input the flow features of the current data flow into the n binary classification models to determine the one or more application types that match the flow features of the current data flow;

For data flows whose application types belong to the same broad category, the flow features are very similar, and it is difficult for a classification model to distinguish the application types finely. Here, the supervised classification model is used to determine the one or more application types that match the flow features of the current data flow, that is, to find the possible application types of the current data flow;
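Steps S6-1 to S6-3 amount to a one-model-per-type scheme in which several models may accept the same flow. The source does not name a classifier, so the sketch below substitutes a deliberately simple stand-in (a per-feature min/max box learned from the historical flows of each type); the function names and sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Bounds = List[Tuple[float, float]]
Sample = Tuple[Sequence[float], str]


def train_binary_models(history: Sequence[Sample]) -> Dict[str, Bounds]:
    """Train one toy binary model per application type.

    Each model is the per-feature [min, max] box of that type's
    historical flows (an assumed stand-in for a real classifier).
    """
    models: Dict[str, Bounds] = {}
    for features, app_type in history:
        if app_type not in models:
            models[app_type] = [(v, v) for v in features]
        else:
            models[app_type] = [
                (min(lo, v), max(hi, v))
                for (lo, hi), v in zip(models[app_type], features)
            ]
    return models


def matching_types(models: Dict[str, Bounds],
                   features: Sequence[float]) -> List[str]:
    """Return every application type whose binary model accepts the flow."""
    return [
        app_type
        for app_type, bounds in models.items()
        if all(lo <= v <= hi for (lo, hi), v in zip(bounds, features))
    ]


# Illustrative history: (mean packet size, packet count) -> application type.
history = [
    ((500.0, 10.0), "HTTP"),
    ((700.0, 20.0), "HTTP"),
    ((600.0, 15.0), "HTTPS"),
    ((900.0, 30.0), "HTTPS"),
    ((100.0, 200.0), "VoIP"),
]
models = train_binary_models(history)
candidates = matching_types(models, (650.0, 18.0))  # matches HTTP and HTTPS
```

The point of the design is that the result is a set of candidates rather than a single label: similar types such as HTTP and HTTPS can both match, and the later probabilistic steps resolve the ambiguity.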

In step S5-3, determining the probability that the data flow belongs to each application type includes the following steps:

S7-1, let $a_1, a_2, \ldots, a_m$ denote the application types identified for the data flow in the first determination, where m is the number of application types that match the flow features of the current data flow; when m is 2, the corresponding elements are $a_1$ and $a_2$;

S7-2, let $X_1, X_2, \ldots, X_k$ denote the flow features of the data flow and let Y denote the random variable for the application type, where the value range of Y is $\{a_1, a_2, \ldots, a_m\}$ and k is the number of flow features;

S7-3, calculate the probabilities $P_{a_1}, P_{a_2}, \ldots, P_{a_m}$ that the data flow belongs to application types $a_1, a_2, \ldots, a_m$;

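Calculate the m conditional probabilities

$$P\{Y = a_i \mid X_1, X_2, \ldots, X_k\} = \frac{P\{X_1, X_2, \ldots, X_k \mid Y = a_i\}\, P\{Y = a_i\}}{P\{X_1, X_2, \ldots, X_k\}},$$

and determine $P_{a_1}, P_{a_2}, \ldots, P_{a_m}$ as $P_{a_i} = P\{Y = a_i \mid X_1, X_2, \ldots, X_k\}$, where i is a positive integer in the interval [1, m];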

The flow features $X_1, X_2, \ldots, X_k$ of the data flow are known. Given these flow features, the probability that the data flow belongs to each application type is calculated with the conditional probability formula: the higher the conditional probability, the more likely the data flow belongs to that application type. Because the denominators in the formula are identical for all application types, only the numerators need to be compared. Note that the application type with the largest conditional probability is not simply taken as the application type of the current data flow; instead, the application type of the current data flow is expressed in probabilistic form;

$P\{Y = a_i\}$ is the probability that a data flow belongs to application type $a_i$, obtained as the ratio of the number of data flows of application type $a_i$ in the data storage module to the total number of data flows in the data storage module. $P\{X_1, X_2, \ldots, X_k \mid Y = a_i\}$ is taken to be equal to $P\{X_1, X_2, \ldots, X_k, Y = a_i\}$: the number of data flows in the data storage module whose application type is $a_i$ and whose flow features are $X_1, X_2, \ldots, X_k$, denoted quantity, is determined, and the ratio of quantity to the total number of data flows in the data storage module gives $P\{X_1, X_2, \ldots, X_k \mid Y = a_i\}$;

S7-4, fill $P_{a_1}, P_{a_2}, \ldots, P_{a_m}$ into the corresponding positions of the second row of the 3×n matrix.
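The count-based estimate in steps S7-1 to S7-4 can be sketched as follows. The code follows the text literally, including taking the likelihood term as quantity divided by the total number of stored flows. Two points are assumptions rather than statements from the source: the numerators are normalized over the m candidate types so that the results sum to 1, and a uniform fallback is used when no stored flow has the same features. The function name and data are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Record = Tuple[Hashable, str]  # (flow feature tuple, application type)


def type_probabilities(history: Sequence[Record],
                       features: Hashable,
                       candidates: Sequence[str]) -> Dict[str, float]:
    """Estimate the probability of each candidate application type."""
    total = len(history)
    per_type = Counter(app_type for _, app_type in history)
    matching = Counter(a for f, a in history if f == features)

    numerators = {}
    for a in candidates:
        prior = per_type[a] / total       # P{Y = a_i}
        likelihood = matching[a] / total  # quantity / total, as in the text
        numerators[a] = likelihood * prior

    norm = sum(numerators.values())
    if norm == 0:
        # Assumed fallback: no stored flow shares these features.
        return {a: 1.0 / len(candidates) for a in candidates}
    # The shared denominator cancels, so normalizing the numerators suffices.
    return {a: v / norm for a, v in numerators.items()}


F, G, H = (1, 1), (2, 2), (3, 3)  # illustrative feature tuples
history = (
    [(F, "HTTP")] * 3 + [(G, "HTTP")]
    + [(F, "HTTPS")] + [(G, "HTTPS")] * 3
    + [(H, "VoIP")] * 2
)
probs = type_probabilities(history, F, ["HTTP", "HTTPS"])
```

With this data the numerators are 0.3 × 0.4 for HTTP and 0.1 × 0.4 for HTTPS, so the normalized probabilities come out at roughly 0.75 and 0.25. These are the values that would be written into the second row of the 3×n matrix.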

In step S5-3, determining the probability that the data flow is abnormal given each application type includes the following steps:

S8-1, take the flow features of the current data flow and the flow features of the historical data flows as input, perform unsupervised classification, and determine the cluster to which the flow features of the current data flow belong;

S8-2, within the cluster to which the flow features of the current data flow belong, find the flow feature data whose application types are $a_1, a_2, \ldots, a_m$, denoted respectively $F^{a_1}_1, \ldots, F^{a_1}_{b_1}$; $F^{a_2}_1, \ldots, F^{a_2}_{b_2}$; …; $F^{a_m}_1, \ldots, F^{a_m}_{b_m}$. The superscript indicates the application type of the flow feature data, and the subscripts $b_1, b_2, \ldots, b_m$ indicate the number of flow feature data items of application types $a_1, a_2, \ldots, a_m$, respectively, in that cluster;

S8-3, for each flow feature data item in step S8-2, calculate its probability of being abnormal, denoted $P(F^{a_i}_x)$, where $F^{a_i}_x$ is the x-th flow feature data item of application type $a_i$ in the cluster;

In particular, in step S5-4 the flow features of the current data flow match only one application type, so only the abnormality probability for that one application type needs to be calculated, and it is the probability that the current data flow is abnormal. If it is greater than or equal to the threshold, the data flow generated when the current packet is transmitted in the network is judged to be abnormal; if it is less than the threshold, that data flow is judged to be normal;

S8-4, determine the abnormality probability of each application type $a_1, a_2, \ldots, a_m$: $PE_{a_i} = \sum_{x=1}^{b_i} w^{a_i}_x \, P(F^{a_i}_x)$, where $w^{a_i}_x$ is the connection weight of flow feature data item $F^{a_i}_x$ and $P(F^{a_i}_x)$ is its abnormality probability; fill $PE_{a_1}, PE_{a_2}, \ldots, PE_{a_m}$ into the corresponding positions of the third row of the 3×n matrix;

The same flow feature value lies at different positions in the distributions of different application types, and therefore affects the judgment of whether a packet is normal differently. For example, the traffic characteristics and behavior patterns of HTTP and HTTPS are very similar, but because HTTPS requires encryption, additional information such as the identifier of the encryption algorithm and the key must be added at the beginning of the data packet, which increases the overall packet size. Consider a packet size quan that matches the traffic characteristics of both HTTP and HTTPS: in the HTTP flow feature distribution quan may lie at the average, whereas in the HTTPS flow feature distribution quan lies below the average, so its effect on whether the packet data is judged normal is not the same in the two cases;

S8-5, calculate the probability PE that the current data flow is abnormal, $PE = \sum_{i=1}^{m} P_{a_i} \, PE_{a_i}$, where $P_{a_i}$ is the probability that the data flow belongs to application type $a_i$ and $PE_{a_i}$ is the abnormality probability for that application type. If PE is greater than or equal to the threshold, the data flow generated when the current packet is transmitted in the network is judged to be abnormal; if PE is less than the threshold, that data flow is judged to be normal;

The threshold is obtained as follows: for the flow feature data of the abnormal historical data flows in the data storage module, the abnormality probability of each data flow is calculated according to the steps above, and the 25% quantile of these abnormality probabilities is taken as the threshold. Depending on requirements, another value such as the mean, minimum, or median may be used as the threshold instead.
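The threshold selection and the final decision can be sketched with the Python standard library. The sketch assumes that the overall probability PE combines the per-type values as a probability-weighted sum, which is one reading of "combining all abnormality probabilities"; the function names are hypothetical, and the choice of the "inclusive" quantile method is an assumption, since the text does not say how the 25% quantile is interpolated.

```python
from statistics import quantiles
from typing import Dict, Sequence


def overall_abnormal_probability(p_type: Dict[str, float],
                                 p_abnormal: Dict[str, float]) -> float:
    """Assumed combination: sum over types of
    P(type) * P(abnormal | type)."""
    return sum(p_type[a] * p_abnormal[a] for a in p_type)


def threshold_from_history(abnormal_scores: Sequence[float]) -> float:
    """25% quantile of the abnormality probabilities computed for
    known-abnormal historical flows (needs at least two scores)."""
    return quantiles(abnormal_scores, n=4, method="inclusive")[0]


def is_abnormal(pe: float, threshold: float) -> bool:
    """A flow is abnormal when PE is greater than or equal to the threshold."""
    return pe >= threshold


threshold = threshold_from_history([0.2, 0.4, 0.6, 0.8, 1.0])
pe = overall_abnormal_probability(
    {"HTTP": 0.75, "HTTPS": 0.25},  # second row of the matrix
    {"HTTP": 0.2, "HTTPS": 0.8},    # third row of the matrix
)
```

Taking a low quantile of the scores of known-abnormal flows makes the detector err toward flagging: about three quarters of the historical abnormal flows score at or above the threshold. Swapping in the median or mean, as the text allows, trades some of that sensitivity for fewer false alarms.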

In step S8-4, the connection weights of the flow feature data are determined by the following steps:

Calculate the Euclidean distance between each flow feature data item $F^{a_i}_x$ and the flow feature data of the current data flow, denoted $d^{a_i}_x$; map the Euclidean distances through the inverse-correlation function f to obtain $f(d^{a_i}_x)$;

Calculate the weight of each flow feature data item, $w^{a_i}_x = \dfrac{f(d^{a_i}_x)}{\sum_{j=1}^{b_i} f(d^{a_i}_j)}$; in the summation, j is only an index and has no other meaning;

The inverse-correlation function f may be, for example, $f(x) = \frac{1}{x + c}$ or $f(x) = e^{-x}$, where c is a positive constant and x is the input of f. If c were 0 and the Euclidean distance between some flow feature data item and the flow feature data of the current data flow were 0, the value of $\frac{1}{x + c}$ would tend to infinity, so c must be a positive constant greater than 0;
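The weighting scheme can be sketched as follows: distances are mapped through a decreasing function, normalized within one application type, and used to average the per-record abnormality probabilities. The form $1/(x + c)$ and the normalization over the records of a single type are assumptions reconstructed from the surrounding description, and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def inverse_map(distance: float, c: float = 1.0) -> float:
    """Assumed inverse-correlation function 1 / (x + c), with c > 0
    so that a distance of 0 does not produce an infinite value."""
    return 1.0 / (distance + c)


def connection_weights(current: Sequence[float],
                       records: Sequence[Sequence[float]],
                       f: Callable[[float], float] = inverse_map
                       ) -> List[float]:
    """Normalized weights for the records of one application type.

    Assumes at least one record; closer records get larger weights.
    """
    mapped = [f(math.dist(current, r)) for r in records]
    total = sum(mapped)
    return [v / total for v in mapped]


def type_abnormal_probability(weights: Sequence[float],
                              record_probs: Sequence[float]) -> float:
    """Weighted sum of the per-record abnormality probabilities."""
    return sum(w * p for w, p in zip(weights, record_probs))


current = (0.0, 0.0)
records = [(0.0, 0.0), (3.0, 4.0)]  # distances 0 and 5
weights = connection_weights(current, records)
pe_type = type_abnormal_probability(weights, [0.0, 0.7])
exp_weights = connection_weights(current, records, f=lambda d: math.exp(-d))
```

With c = 1 the mapped values are 1 and 1/6, so the weights are 6/7 and 1/7 and the per-type abnormality probability is about 0.1. Passing `f=lambda d: math.exp(-d)` switches to the exponential mapping, which decays much faster with distance.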

In particular, the weights $w^{a_i}_x$ of the flow feature data can also be used to calculate the probabilities $P_{a_1}, P_{a_2}, \ldots, P_{a_m}$ that the data flow belongs to application types $a_1, a_2, \ldots, a_m$: execute steps S8-1 and S8-2 and obtain the weights $w^{a_i}_x$ of the flow feature data; for each flow feature data item, execute step S7-3 to obtain a set of values $P_{a_1}, P_{a_2}, \ldots, P_{a_m}$; then combine, through the weights, the values obtained from flow feature data of the same application type to obtain the probability that the current data flow belongs to each of the application types $a_1, a_2, \ldots, a_m$.

Specifically, in step S8-3, calculating the abnormality probability for each flow feature data item in step S8-2 includes the following steps:

Let $F^{a_i}_x$ denote any flow feature data item, where x is a positive integer in [1, $b_i$]. Among the flow feature data of the historical data flows of application type $a_i$, determine the total number of flow feature data items identical to $F^{a_i}_x$ and the number of times those identical items were abnormal; the ratio of the number of abnormal occurrences to the total is the abnormality probability of $F^{a_i}_x$.
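The per-record abnormality probability is a simple frequency count over the stored history. The sketch below assumes each historical record carries its feature tuple, its application type, and a flag saying whether it was judged abnormal; that record layout, the function name, and the fallback of 0.0 when no identical record exists are assumptions for illustration.

```python
from typing import Hashable, Iterable, Tuple

HistoryRecord = Tuple[Hashable, str, bool]  # (features, type, was_abnormal)


def record_abnormal_probability(history: Iterable[HistoryRecord],
                                app_type: str,
                                features: Hashable) -> float:
    """Ratio of abnormal occurrences to all occurrences of `features`
    among historical flows of the given application type."""
    total = 0
    abnormal = 0
    for f, a, was_abnormal in history:
        if a == app_type and f == features:
            total += 1
            abnormal += int(was_abnormal)
    # Assumed fallback: no identical record of this type was stored.
    return abnormal / total if total else 0.0


history = [
    ((1, 2), "HTTP", True),
    ((1, 2), "HTTP", False),
    ((1, 2), "HTTP", False),
    ((1, 2), "HTTPS", True),
    ((3, 4), "HTTP", True),
]
p = record_abnormal_probability(history, "HTTP", (1, 2))  # one of three
```

Note that the count is restricted to one application type, so the same feature tuple can have different abnormality probabilities under HTTP and HTTPS, which is exactly the distinction the method relies on.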

Compared with the prior art, the present invention achieves the following beneficial effects. Deep flow inspection identifies the application type of a data flow from the characteristics of the flow: different application types exhibit different states on the session connection or data flow, and the flow features of the whole data flow, such as the average packet length of each flow and the inter-arrival time of packets, are analyzed without inspecting the application layer data, so the inspection result is not affected by encryption of the application layer data. After the broad category of the application type has been determined, a Bayesian classifier continues the classification within that broad category. This classifier is insensitive to noise and outliers and relies only on the statistical properties of the data, without needing to know its specific content, and therefore achieves high efficiency and accuracy when judging whether an encrypted packet is normal.

Brief Description of the Drawings

The accompanying drawings provide a further understanding of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they serve to explain the present invention and do not limit it. In the accompanying drawings:

FIG. 1 is a schematic structural diagram of a packet inspection and analysis system based on deep flow inspection according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

In an embodiment of the present invention, referring to FIG. 1, a packet inspection and analysis system based on deep flow inspection is provided, comprising: a data acquisition module, a deep flow inspection module, a packet analysis module, a data storage module, and a security module. The output of the data acquisition module is connected to the inputs of the deep flow inspection module and the data storage module; the data acquisition module collects the data flow information generated when packets are transmitted in the network. The output of the deep flow inspection module is connected to the input of the packet analysis module; the deep flow inspection module extracts flow features from the packet data flow and determines the application type of the packet data flow. The output of the packet analysis module is connected to the input of the security module; the packet analysis module determines whether a packet is normal. The data storage module is connected to the deep flow inspection module and the packet analysis module and stores packet traffic data. When the packet analysis module determines that a packet data flow is abnormal, the security module checks the packet information and the network connection status and notifies the administrator to take action.

The deep flow inspection module further includes a data preprocessing unit, a flow feature extraction unit, a data flow classification unit, and an optimization unit. The data preprocessing unit preprocesses the packet traffic data; the flow feature extraction unit extracts the flow features of the packet traffic from the packet traffic data; the data flow classification unit performs supervised classification and identification of the application type of the packet data flow; and the optimization unit trains and optimizes the data flow classification unit using known packet traffic data and application types, to improve accuracy and robustness.

The packet analysis module further includes an unsupervised classification unit, an application type determination unit, and a detection unit. The unsupervised classification unit finds historical data flows related to the packet data flow; the application type determination unit determines the probability that the packet data flow belongs to each application type; and the detection unit determines whether the packet data flow is normal.

The data storage module uses a 3×n matrix to store the probability that a data flow belongs to each application type and the abnormality rate of each application type. The elements of the first row are 1, 2, …, n, representing n different application types; the elements of the second row are the probabilities that the data flow belongs to each application type; and the elements of the third row are the probabilities that the packet data flow is abnormal given that it belongs to each application type. Each 3×n matrix corresponds to the flow feature data of one packet. As flow feature data accumulates, only the entries at the corresponding positions of the 3×n matrix need to be updated; when the same flow feature data appears again, whether the packet is normal is determined from the data in the matrix, and the packet traffic data does not need to be analyzed again.

In an embodiment of the present invention, a packet inspection and analysis method based on deep flow inspection is provided, comprising the following steps:

S5-1, obtain the flow feature data of a data flow, where the data flow is the data flow generated when a packet is transmitted in the network; the flow feature data is extracted from the data flow by the deep flow inspection module and includes, but is not limited to, the packet size, the number of packets, the flow rate, and the duration of the flow;
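The four flow features named in this step can be computed from a list of captured packets. The sketch below is illustrative: it assumes each packet is represented as a (timestamp in seconds, size in bytes) pair, interprets "packet size" as the mean packet size of the flow, and measures the flow rate in bytes per second. None of these representation choices, nor the names used, are specified by the source.

```python
from typing import NamedTuple, Sequence, Tuple


class FlowFeatures(NamedTuple):
    mean_packet_size: float  # bytes (assumed interpretation of "packet size")
    packet_count: int
    rate: float              # bytes per second
    duration: float          # seconds


def extract_flow_features(packets: Sequence[Tuple[float, int]]) -> FlowFeatures:
    """Compute flow features from (timestamp_seconds, size_bytes) pairs."""
    if not packets:
        raise ValueError("a flow must contain at least one packet")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = max(times) - min(times)
    total_bytes = sum(sizes)
    # A single-packet flow has zero duration; report a rate of 0 for it.
    rate = total_bytes / duration if duration > 0 else 0.0
    return FlowFeatures(total_bytes / len(sizes), len(sizes), rate, duration)


flow = extract_flow_features([(0.0, 100), (1.0, 300), (2.0, 200)])
```

Because only sizes and timestamps are used, the same extraction works for encrypted traffic, which is the property the method depends on.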

The first determination of the application type of the data flow proceeds as follows:

Obtain historical data flow information, including the flow features and the application types of the data flows;

For each application type, train a binary classification model with the flow features of the historical data flows as input and the application type as output, obtaining n binary classification models in total, where n is the number of application types;

Input the flow features of the current data flow into the n binary classification models to determine the one or more application types that match the flow features of the current data flow;

首先确定数据流属于各个应用类型的概率，具体包括以下步骤：First, determine the probability that the data flow belongs to each application type, which includes the following steps:

以a₁、a₂、…a_m表示首次判断的数据流应用类型对应的元素，m是与当前数据流的流特征相符合的应用类型的数量；当m为2时，数据流应用类型对应的元素为a₁和a₂；a ₁ , a ₂ , ... a _m represent the elements corresponding to the data stream application type determined for the first time, where m is the number of application types that match the flow characteristics of the current data stream; when m is 2, the elements corresponding to the data stream application type are a ₁ and a ₂ ;

以X₁、X₂、X₃、X₄表示数据包大小、数据包数量、流量速率和流的持续时间，以Y表示应用类型的随机变量，Y的取值范围是{a₁、a₂、…a_m}；Let _X1 , _X2 , _X3 , _X4 represent the packet size, number of packets, flow rate and duration of the flow. Let Y represent the random variable of application type, and the value range of Y is {a ₁ , a ₂ , … a _m };

Calculate the probabilities Pa₁, Pa₂, …, Pa_m that the data flow belongs to application types a₁, a₂, …, a_m;

Determine Pa₁, Pa₂, …, Pa_m, where the index i ranges over the positive integers in the interval [1, m];

Fill Pa₁, Pa₂, …, Pa_m into the corresponding positions of the second row of the 3×n matrix;
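The expression for Pa_i is not reproduced in this text. One plausible reading, consistent with the Bayesian classifier named in the abstract, is a naive Bayes posterior normalized over the m candidate types. The conditional independence of X₁ to X₄ given Y is an assumption made here for illustration and is not stated in the source:

```latex
P_{a_i} = P(Y = a_i \mid X_1, X_2, X_3, X_4)
        = \frac{P(Y = a_i)\,\prod_{k=1}^{4} P(X_k \mid Y = a_i)}
               {\sum_{j=1}^{m} P(Y = a_j)\,\prod_{k=1}^{4} P(X_k \mid Y = a_j)},
\qquad i = 1, \dots, m
```

Under this reading the m values sum to 1, so they can be written directly into the second row of the matrix as the probabilities of the candidate types.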

Next, determine the probability that the current data flow is abnormal when it belongs to each application type, through the following steps:

S8-1, take the packet size, number of packets, flow rate, and flow duration of the current data flow and of the historical data flows as input, perform unsupervised classification, and determine the classification cluster to which the flow characteristics of the current data flow belong;

For any one item of flow feature data, indexed by x, where x is a positive integer in [1, b_i]: among the flow feature data of the historical data flows of application type a_i, determine the total number of items identical to it and the number of times those identical items were abnormal, and take the ratio of the abnormal count to the total as the abnormal probability of that item of flow feature data;
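The ratio described above can be sketched as a simple count. The record layout and the sample values below are hypothetical, and returning `None` when no identical item exists is an assumption, since the text does not cover that case:

```python
def abnormal_ratio(item, type_history):
    """Abnormal count divided by total count over identical feature items.

    type_history holds (features, was_abnormal) pairs for ONE application type.
    Returns None when the history contains no item identical to `item`.
    """
    flags = [was_abnormal for features, was_abnormal in type_history
             if features == item]
    if not flags:
        return None
    return sum(flags) / len(flags)

# Hypothetical history for one application type:
# (packet size, packet count, flow rate, duration), abnormal flag.
TYPE_HISTORY = [
    ((1200, 40, 3.5, 12), False),
    ((1200, 40, 3.5, 12), True),
    ((1200, 40, 3.5, 12), False),
    ((1200, 40, 3.5, 12), False),
    ((600, 10, 1.0, 2), True),
]

ratio = abnormal_ratio((1200, 40, 3.5, 12), TYPE_HISTORY)
```

Exact tuple equality is used here because the text speaks of "the same" flow feature data; a real deployment would probably bin or round continuous features first.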

S8-4, calculate the Euclidean distance between each item of flow feature data and the flow feature data of the current data flow; map each Euclidean distance through the inverse-correlation function e^(-x); then calculate the weight of each item of flow feature data. In the summation of the weight formula, j is only a running index and has no practical meaning;
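The weight formula itself is not reproduced in this text. A reconstruction consistent with the description (a mapping through e^(-x) followed by a summation over j) is a normalized exponential. The symbols d and w are hypothetical names introduced here, and the normalization is an assumption:

```latex
w_x^{a_i} = \frac{e^{-d_x^{a_i}}}{\sum_{j=1}^{b_i} e^{-d_j^{a_i}}},
\qquad x = 1, \dots, b_i
```

Here d_x^{a_i} stands for the Euclidean distance between the x-th historical item of type a_i and the current flow. Under this form, historical items closer to the current flow receive larger weights, and the weights within one application type sum to 1.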

Determine the probability that each application type a₁, a₂, …, a_m is abnormal, where the formula uses the connection weight of each item of flow feature data; fill the results into the corresponding positions of the third row of the 3×n matrix;

When the deep flow detection module determines through binary classification that the application types matching the flow characteristics of the current data flow are a₁, a₂, and a₃, and, in the classification cluster to which the flow characteristics of the current data flow belong, the Euclidean distances between the historical flow feature data and the flow feature data of the current data flow are 0, 0, 0.1, and 0.1 for application type a₁, 0.3, 0.3, and 0.2 for application type a₂, and 0.5 and 0.4 for application type a₃, the weight of each item of flow feature data in each application type is calculated first.
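The computed weights for this example are not reproduced in this text. The sketch below recomputes them from the stated distances, assuming that each weight is the e^(-x) mapping of its distance divided by the sum of the mapped values within the same application type. That normalization is an assumption, and the names `connection_weights` and `EXAMPLE_DISTANCES` are hypothetical:

```python
from math import exp

def connection_weights(distances):
    """Map each distance through e^(-x), then normalize so the weights sum to 1."""
    mapped = [exp(-d) for d in distances]
    total = sum(mapped)
    return [value / total for value in mapped]

# Euclidean distances taken from the example in the text.
EXAMPLE_DISTANCES = {
    "a1": [0, 0, 0.1, 0.1],
    "a2": [0.3, 0.3, 0.2],
    "a3": [0.5, 0.4],
}

WEIGHTS = {app: connection_weights(d) for app, d in EXAMPLE_DISTANCES.items()}

for app, weights in WEIGHTS.items():
    print(app, [round(w, 4) for w in weights])
```

Within each type, the items at the smallest distance from the current flow get the largest weight, which matches the stated purpose of the inverse-correlation mapping.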

The abnormal probabilities for the cases where the application type of the current data flow is a₁, a₂, and a₃ are then determined, and from them the abnormal probability of the current data flow.
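The numerical results of this example are not reproduced in this text, and neither are the two formulas behind them. A reconstruction consistent with the surrounding description is a weighted sum per application type, followed by a combination over the candidate types. The symbols w, p, and PE_{a_i} are hypothetical names, and the total-probability form of PE is an assumption:

```latex
PE_{a_i} = \sum_{x=1}^{b_i} w_x^{a_i}\, p_x^{a_i},
\qquad
PE = \sum_{i=1}^{m} P_{a_i}\, PE_{a_i}
```

Here w_x^{a_i} is the connection weight of the x-th historical item of type a_i, p_x^{a_i} is the abnormal probability of that item, and P_{a_i} is the probability that the flow belongs to type a_i, written Pa_i in the text.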

S5-4, from the single application type determined for the current data flow and its flow characteristic data, calculate the probability that the current data flow is abnormal; if that probability is greater than or equal to the threshold, the data flow generated when the current message is transmitted in the network is judged abnormal; if it is less than the threshold, that data flow is judged normal.
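The threshold rule can be sketched as below. The function name and the threshold value are hypothetical, since the text leaves the threshold open. The one detail taken from the source is that a probability exactly equal to the threshold counts as abnormal:

```python
def judge_flow(abnormal_probability: float, threshold: float) -> str:
    """Return "abnormal" when the probability reaches the threshold, else "normal"."""
    if not 0.0 <= abnormal_probability <= 1.0:
        raise ValueError("abnormal_probability must lie in [0, 1]")
    return "abnormal" if abnormal_probability >= threshold else "normal"

THRESHOLD = 0.5  # hypothetical value; the text does not specify one

at_boundary = judge_flow(0.5, THRESHOLD)    # equal to the threshold
below = judge_flow(0.49, THRESHOLD)         # strictly below the threshold
```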

It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.

Finally, it should be noted that the above is only a preferred embodiment of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

A message detection and analysis system based on deep flow detection, characterized in that it includes: a data acquisition module, a deep flow detection module, a message analysis module, a data storage module and a security module; the output end of the data acquisition module is interconnected with the input end of the deep flow detection module and the data storage module, and is used to collect data flow information produced by message transmission in the network; the output end of the deep flow detection module is interconnected with the input end of the message analysis module, and is used to extract flow features of the message data flow and determine the application type of the message data flow; the output end of the message analysis module is interconnected with the input end of the security module, and is used to determine whether the message is normal; the data storage module is interconnected with the deep flow detection module and the message analysis module, and is used to store message traffic data; when the message analysis module determines that the message data flow is abnormal, the security module checks the message information and network connection status, and notifies the administrator to take measures.

The message detection and analysis system based on deep flow detection according to claim 1, characterized in that the deep flow detection module further includes a data preprocessing unit, a flow feature extraction unit, a data flow classification unit, and an optimization unit; the data preprocessing unit is used to preprocess the message flow data; the flow feature extraction unit is used to extract the flow features of the message flow from the message flow data; the data flow classification unit is used to identify the application type of the message data flow by supervised classification; the optimization unit trains and optimizes the data flow classification unit on known message flow data and application types to improve its accuracy and robustness.

The message detection and analysis system based on deep flow detection according to claim 1, characterized in that the message analysis module further includes an unsupervised classification unit, an application type judgment unit, and a detection unit; the unsupervised classification unit is used to find historical data flows related to the message data flow; the application type judgment unit is used to determine the probability that the message data flow belongs to each of the different application types; the detection unit is used to determine whether the message data flow is normal.

The message detection and analysis system based on deep flow detection according to claim 1, characterized in that the data storage module stores the probability that the data flow belongs to each application type and the abnormal rate of each application type in a 3×n matrix; the elements of the first row are 1, 2, …, n, indicating n different application types; the elements of the second row represent the probability that the data flow belongs to each application type; the elements of the third row represent the probability that the message data flow is abnormal when it belongs to each application type; each 3×n matrix corresponds to the flow feature data of one message; as the flow feature data of messages accumulates, only the data at the corresponding positions in the 3×n matrix needs to be updated; when the same flow feature data appears again, whether the message is normal is determined from the data in the matrix, without analyzing the message traffic data again.
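The storage layout in this claim can be sketched as a list of three rows kept per distinct flow feature tuple. This is only an illustration: the function names, the dictionary used as the store, and the sample values are hypothetical.

```python
def new_matrix(n):
    """3×n matrix: row 0 holds type indices 1..n, row 1 the probability that
    the flow belongs to each type, row 2 the abnormal probability per type."""
    return [list(range(1, n + 1)), [0.0] * n, [0.0] * n]

def update_column(matrix, type_index, type_probability, abnormal_probability):
    """Overwrite only the column belonging to one application type."""
    col = type_index - 1  # row 0 holds 1..n, so type i sits in column i - 1
    matrix[1][col] = type_probability
    matrix[2][col] = abnormal_probability

def matrix_for(store, features, n):
    """Return the matrix stored for this feature tuple, creating it if absent."""
    if features not in store:
        store[features] = new_matrix(n)
    return store[features]

store = {}
features = (1200, 40, 3.5, 12.0)  # hypothetical flow feature data
matrix = matrix_for(store, features, 3)
update_column(matrix, 2, 0.7, 0.1)
same_matrix = matrix_for(store, features, 3)  # a repeated flow reuses the stored matrix
```

Keying the store by the feature tuple is what allows a repeated flow to be judged from the stored values without re-running the analysis.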

A packet detection and analysis method based on deep flow inspection, characterized in that it comprises the following steps:

S5-1, obtaining flow characteristic data of a data flow, where the data flow refers to a data flow generated when a message is transmitted in a network;

S5-2, based on the flow characteristic data of the data flow, the application type of the data flow is first determined by a supervised classification model. If the application type of the data flow is not unique, the process proceeds to step S5-3; if the application type of the data flow is unique, the process proceeds to step S5-4;

S5-3, based on the data flow application type determined for the first time, determine the probability that the data flow belongs to each application type; determine the probability that the data flow is abnormal when it belongs to different application types according to the application type and flow characteristic data of the data flow; and determine whether the data flow is normal by combining all abnormal probabilities;

S5-4, judging whether the data flow is normal according to the application type and flow characteristic data of the data flow.

The method for packet detection and analysis based on deep flow inspection according to claim 5, characterized in that in step S5-2, the first determination of the application type of the data flow by the supervised classification model specifically includes the following steps:

S6-1, obtaining historical data flow information, including flow characteristics and application types of data flows;

S6-2, for each application type, a binary classification model is trained with the flow features of the historical data flow as input and the application type as output, and a total of n binary classification models are obtained, where n is the number of application types;

S6-3, input the flow characteristics of the current data flow into n binary classification models to determine one or more application types that are consistent with the flow characteristics of the current data flow.

The method for packet detection and analysis based on deep flow detection according to claim 6, characterized in that in step S5-3, determining the probability that the data flow belongs to each application type includes the following steps:

S7-1, a₁, a₂, …, a_m represent the elements corresponding to the data flow application types determined for the first time, where m is the number of application types that match the flow characteristics of the current data flow; when m is 2, the elements are a₁ and a₂;

S7-2, X₁, X₂, …, X_k represent the flow characteristics of the data flow, Y represents the random variable of the application type, the value range of Y is {a₁, a₂, …, a_m}, and k is the number of flow characteristics;

S7-3, calculating the probabilities Pa₁, Pa₂, …, Pa_m that the data flow belongs to application types a₁, a₂, …, a_m;

Calculate m conditional probabilities and from them determine Pa₁, Pa₂, …, Pa_m, where i is a positive integer in the interval [1, m];

S7-4, fill Pa₁, Pa₂, …, Pa_m into the corresponding positions of the second row of the 3×n matrix.

The method for packet detection and analysis based on deep flow inspection according to claim 7, characterized in that in step S5-3, determining the probability of abnormality when the data flow belongs to different application types comprises the following steps:

S8-1, taking the flow characteristics of the current data flow and the flow characteristics of the historical data flow as input, performing unsupervised classification, and determining the classification cluster to which the flow characteristics of the current data flow belong;

S8-2, in the classification cluster to which the flow characteristics of the current data flow belong, find the flow feature data of application types a₁, a₂, …, a_m and record them, where the superscript of each item indicates the application type of the flow feature data, and the subscripts b₁, b₂, …, b_m respectively indicate the number of items of flow feature data with application types a₁, a₂, …, a_m in that classification cluster;

S8-3, for each item of flow feature data in step S8-2, calculate its probability of abnormality;

S8-4, determine the probability that each application type a₁, a₂, …, a_m is abnormal, where the formula uses the connection weight of each item of flow feature data; fill the results into the corresponding positions of the third row of the 3×n matrix;

S8-5, calculate the probability PE that the current data flow is abnormal; if PE is greater than or equal to the threshold, it is determined that the data flow generated when the current message is transmitted in the network is abnormal; if PE is less than the threshold, it is determined that this data flow is normal.

The packet detection and analysis method based on deep flow inspection according to claim 8, characterized in that in step S8-4, the connection weight of the flow feature data is determined by the following steps:

Calculate the Euclidean distance between each item of flow feature data and the flow feature data of the current data flow, and map each Euclidean distance through the inverse-correlation function f;

Calculate the weight of each item of flow feature data; in the summation formula, j is only a running index and has no practical meaning.

The method for packet detection and analysis based on deep flow inspection according to claim 9, characterized in that in step S8-3, calculating the probability of abnormality for each item of flow feature data in step S8-2 comprises the following steps:

For any one item of flow feature data, indexed by x, where x is a positive integer in [1, b_i]: among the flow feature data of the historical data flows of application type a_i, determine the total number of items identical to it and the number of times those identical items were abnormal, and take the ratio of the abnormal count to the total as the abnormal probability of that item of flow feature data.