CN113162818A - Method and system for realizing distributed flow acquisition and analysis

Info

Publication number: CN113162818A
Application number: CN202110138388.1A
Authority: CN
Inventors: 颜靖华; 刘阳; 王益静; 黄雨晨; 王晗
Original assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority date: 2021-02-01
Filing date: 2021-02-01
Publication date: 2021-07-23

Abstract

The invention relates to a method and a system for realizing distributed traffic collection and analysis. The method comprises the following steps: collecting network traffic samples, marking them, and adding index labels of different dimensions; storing the collected network traffic samples in the Elasticsearch distributed search engine and retrieving them according to different dimensions; counting the network traffic samples and storing them in the Redis database; and playing back the network traffic samples with the TCPREPLAY tool, replaying the traffic either at the speed at which the packets were captured or at a specified speed. By collecting and analyzing network data packets, the invention realizes the marking, retrieval, storage and playback of network traffic, and the state of the network can be known in real time by analyzing the captured traffic.

Description

Method and system for realizing distributed flow acquisition and analysis

Technical Field

The invention belongs to the field of distributed traffic acquisition and analysis, and relates to a method and a system for realizing distributed traffic acquisition and analysis, which can realize acquisition, analysis, marking, retrieval, storage and playback of PCAP traffic.

Background

With the rapid development of the internet and network applications, network traffic is exhibiting explosive growth, and its potential value is being continuously mined and utilized. Networks are bearing increasing data transmission requirements as basic conditions for data exchange and sharing, and how to realize real-time acquisition, storage and analysis of network data is a problem that must be faced by network traffic analysis. Currently, the performance of a single server is far from meeting the requirement of network data analysis, and a distributed network data acquisition and analysis mode is a development direction and a necessary means of the work. Therefore, the adoption of a distributed architecture is currently a necessary option.

The distributed network traffic analysis system mainly provides network data acquisition, data storage, data analysis, visualization and similar capabilities at ultra-high speeds, and deploys each functional module in a distributed, loosely coupled manner. Although existing network traffic analysis technologies in the industry can analyze the network, the dimensions of analysis are not fine-grained enough, and the analysis results leave room for improvement.

Disclosure of Invention

The invention provides a method and a system for realizing distributed traffic collection and analysis, which can realize marking, searching, storing and replaying of network traffic through collection and analysis of network data packets, and can know the state of a network in real time through analyzing the captured network traffic.

The method divides network traffic by the region dimension, the bandwidth dimension, the time dimension, the code address dimension, the two-layer protocol dimension, the three-layer protocol dimension, the keyword dimension, the length range dimension, the flow length range dimension and so on. It supports queries on various combinations of index conditions to obtain the corresponding traffic samples and generate an information summary file, and it can play back the samples, compute statistics on them, and so on.

The technical scheme adopted by the invention is as follows:

A method for realizing distributed traffic collection and analysis comprises the following steps:

collecting network traffic samples, and adding index labels of different dimensions to the network traffic samples;

storing the collected network traffic samples in an Elasticsearch distributed search engine, and retrieving the network traffic samples according to different dimensions;

counting the network traffic samples and storing them in a Redis database;

and playing back the network traffic samples.

Further, the different dimensions include: region dimension, bandwidth dimension, time dimension, code address dimension, protocol dimension, keyword dimension, length range dimension, and traffic length range dimension.

Further, the Elasticsearch distributed search engine shards the index; when an index is created, the number of shards of the index needs to be specified, and the shards are divided into primary shards and replica shards; when a document is stored, the Elasticsearch distributed search engine stores it in the corresponding primary shard through calculation and then synchronizes it to the replica shards; a replica shard not only provides redundancy for its primary shard, but can also perform queries and calculations to share the load of the primary shard.

Further, statistics are computed on the network traffic samples; the statistical values include the number of bytes, the number of packets, the number of flows, and the average, maximum and minimum duration; the proportion of each protocol is computed, and the sample traffic statistics are displayed as bar charts, pie charts and line charts so that a user can easily understand the retrieval result.

Further, for the data stored in the Redis database, a dual-guarantee storage structure consisting of a MySQL master-slave cluster and an HDFS high-availability cluster is adopted for persistent data storage.

Further, the playback of the network traffic sample is to replay the network traffic by using TCPREPLAY technology.

Further, when the network traffic samples are played back, replaying the network traffic at the speed at which the packets were captured or at a specified speed is supported, and the order of the transmitted packets is strictly guaranteed to be consistent with the order of the real traffic packets at capture time; the number of packets played back, the playback time and the current playback rate are fed back in real time during playback, and the played-back packets are dynamically modified according to MAC address.

A system for realizing distributed traffic collection and analysis by adopting the above method comprises:

a sample traffic capture module, used for collecting network traffic samples;

a sample traffic marking module, used for adding index labels of different dimensions to the collected network traffic samples;

a sample traffic retrieval module, used for storing the collected network traffic samples in an Elasticsearch distributed search engine and retrieving the network traffic samples according to different dimensions;

a sample traffic statistics module, used for counting the network traffic samples and storing them in a Redis database;

and a sample traffic playback module, used for playing back the network traffic samples.

The invention has the following advantages and positive effects:

(1) Elasticsearch (ES) can run as a large distributed cluster, and a new server can easily be added to an ES cluster; it can also run on a single machine as a lightweight search engine. Compared with a traditional relational database, ES provides full-text retrieval, synonym processing, relevance ranking, complex data analysis, near-real-time processing of massive data and similar functions. One index is divided into a plurality of shards (Shard), which improves processing efficiency through divide and conquer. A replica (Replica) mechanism is provided: one shard can have a plurality of replicas, so the cluster can still work normally even if some servers are down. A simple, easy-to-use API is provided, so services are easy to build, deploy and use.

(2) With TCPREPLAY, the sample traffic can be played back to a designated location either as captured or after arbitrary modification; the speed at which the sample traffic is replayed can also be specified.

(3) For parsing PCAP files, the Data Plane Development Kit (DPDK) is used to parse the PCAP file into ELOG log form, so that data packets can be processed quickly; secondary development on top of DPDK can greatly improve data processing performance and throughput and improve the working efficiency of data plane applications.

(4) In the aspect of data storage, the invention simultaneously utilizes two schemes of MySQL and HDFS, and also adopts a high-availability mode in the aspect of cluster construction, thereby avoiding the problem of single-point failure and further ensuring the persistence of data.

Drawings

Fig. 1 is a flow chart of the functional implementation of the present invention.

FIG. 2 is a general deployment diagram of the present invention. The master control node gathers the data collected by the collection server cluster and parses the PCAP files into ELOG logs; the distributed Redis cache realizes distributed storage of the PCAP traffic samples; the MySQL cluster and the HDFS file system persist the data; and the ES distributed retrieval system supports search queries according to different dimensions.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

The PCAP (packet capture, an application programming interface for capturing network traffic) network traffic analysis in the invention is mainly divided into three parts: marking, retrieval and playback. Marking mainly parses the captured PCAP file to generate an ELOG file, marks the generated file data, and adds indexes for some commonly used fields. Retrieval mainly stores the data parsed from the captured PCAP file in the Elasticsearch distributed search engine, and queries it with various combinations of index conditions to obtain traffic samples and generate an information summary file; during retrieval the data is classified and counted at minute granularity, and the traffic samples are presented as bar charts, lists, pie charts and line charts so that a user can easily understand the retrieval results. Playback mainly uses the TCPREPLAY tool to replay network traffic from the captured PCAP file, either at the speed at which the packets were captured or at a specified speed, provided the speed is within what the hardware can sustain. The tool can also split traffic between two network cards, write it to files, and filter and edit it in various ways as required, thereby providing a means of testing firewalls, NIDS and other network devices.

The key technologies used by the system are as follows: data acquisition and data analysis; the Elasticsearch distributed search engine; Redis distributed storage, MySQL persistence, and the HDFS file system.

The invention provides a method for realizing PCAP network flow analysis, which comprises the following steps:

Step 1: PCAP traffic sample analysis and labeling. After a PCAP traffic sample is received, it is parsed, and index labels of different dimensions are added to commonly used fields. The index labels mainly comprise a region dimension, a bandwidth dimension, a time dimension, a code address dimension, a two-layer protocol dimension, a three-layer protocol dimension, a keyword dimension, a length range dimension, a flow length range dimension and the like.

a) The region dimension. It mainly comprises the access office point, the access operator and the access direction.

b) The bandwidth dimension. Samples are classified according to 10GE, 100GE, 10GPS, etc.

c) The time dimension. Samples are analyzed by minute, hour, day, month and year.

d) The code address dimension. Samples are analyzed according to the source and destination IP address ranges and port number ranges of the traffic.

e) The two-layer protocol dimension. Samples are analyzed according to the source and destination MAC address ranges, VLAN, MPLS, ICMP, ARP, LACP, LLDP, etc.

f) The three-layer protocol dimension. Samples are analyzed according to typical applications such as DHCP, DNS, HTTP, SMTP, POP3, IMAP, FTP, etc.

g) The keyword dimension. User-defined keywords at any position are supported, as are two-, three- and four-layer feature code definitions.

h) The length range dimension. Samples are analyzed according to the ranges of the two-layer frame check, the three-layer packet length and the four-layer packet length.

i) The flow length range dimension. Samples are analyzed according to the flow packet data range and the flow duration range.
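As an illustration of the labeling in step 1, the following minimal Python sketch derives a few index labels from one parsed record. The record fields, the label names and the length bucket boundaries are all assumptions made for this example; the specification does not define a record layout or bucket boundaries.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FlowRecord:
    """Hypothetical parsed record; every field name here is illustrative."""
    timestamp: float    # capture time, seconds since the epoch (UTC)
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str       # e.g. "HTTP", "DNS"
    packet_length: int  # bytes
    access_point: str   # region dimension: access office point
    link_type: str      # bandwidth dimension, e.g. "10GE"


# Assumed buckets (upper bound exclusive); not taken from the specification.
LENGTH_BUCKETS = [(0, 128), (128, 512), (512, 1519)]


def length_range(length: int) -> str:
    """Map a packet length to the label of the bucket that contains it."""
    for low, high in LENGTH_BUCKETS:
        if low <= length < high:
            return f"{low}-{high - 1}"
    return "jumbo"


def index_labels(rec: FlowRecord) -> dict:
    """Build the index labels that would be stored alongside the sample."""
    t = datetime.fromtimestamp(rec.timestamp, tz=timezone.utc)
    return {
        "region": rec.access_point,
        "bandwidth": rec.link_type,
        "minute": t.strftime("%Y-%m-%dT%H:%M"),  # time dimension, minute level
        "src_ip": rec.src_ip,
        "dst_ip": rec.dst_ip,
        "src_port": rec.src_port,
        "dst_port": rec.dst_port,
        "protocol": rec.protocol.upper(),
        "length_range": length_range(rec.packet_length),
    }
```

A combined query then reduces to filtering on several of these labels at once, for example region plus protocol plus minute.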

Step 2: Retrieving the PCAP traffic sample data. The sample traffic is further divided according to the dimensions listed in step 1 and stored in the Elasticsearch distributed search engine, and the index is sharded to avoid the data loss that would occur when the hardware of a single node reaches its storage limit.

The index is sharded as follows. Elasticsearch is built on Lucene: a node is an instance of Elasticsearch, and each shard is an instance of Lucene that contains all the basic functions of Lucene. When an index is created, its number of shards must be specified; shards are divided into primary shards and replica shards. When a document is stored, Elasticsearch computes the corresponding primary shard, stores the document there, and then synchronizes it to the replica shards, so a replica shard can be regarded as a redundant copy of its primary shard. A replica shard does more than provide redundancy, however: it can also serve queries and calculations to share the load of the primary shard.
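The "calculation" that selects a primary shard can be pictured as hashing a routing value (by default the document ID) and taking the remainder modulo the number of primary shards. The sketch below is a simplified model under that assumption: it uses CRC32 as a stand-in hash (Elasticsearch itself uses a Murmur3-based hash), and plain dictionaries stand in for shards. All function names are hypothetical.

```python
import zlib


def primary_shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard: hash(routing value) mod number of primaries.

    CRC32 is only a stand-in for the hash Elasticsearch really uses.
    """
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(doc_id, doc, primaries, replicas):
    """Store doc in its primary shard, then copy it to that shard's replicas.

    primaries: list of dicts, one per primary shard.
    replicas:  list (one entry per primary) of lists of replica dicts.
    """
    shard = primary_shard_for(doc_id, len(primaries))
    primaries[shard][doc_id] = doc
    for replica in replicas[shard]:
        replica[doc_id] = doc
    return shard
```

Because the shard number depends on the number of primary shards, this model also shows why that number has to be fixed when the index is created.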

For the retrieval of data with different dimensions, the process of reading the data is specifically as follows:

a) The client sends a read request (GET request) to any node; that node then acts as the coordinating node.

b) The coordinating node routes the document and forwards the request to the corresponding node. At this point a round-robin (random polling) algorithm selects one copy from among the primary shard and all of its replica shards, so that read requests are load-balanced.

c) The node that receives the request returns the document to the coordinating node.

d) The coordinating node returns the document to the client.
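The load balancing in the read path can be sketched as a round-robin over all copies of a shard, primary and replicas alike. This is a toy model rather than Elasticsearch code: the class name is hypothetical and dictionaries stand in for shard copies.

```python
import itertools


class ShardGroup:
    """One primary shard plus its replicas, read in round-robin order."""

    def __init__(self, primary: dict, replicas: list):
        self.copies = [primary, *replicas]
        self._next = itertools.cycle(range(len(self.copies)))
        self.reads = [0] * len(self.copies)  # per-copy read counter

    def get(self, doc_id):
        """Serve a read from the next copy in turn, as the coordinating node would."""
        i = next(self._next)
        self.reads[i] += 1
        return self.copies[i].get(doc_id)
```

With one primary and two replicas, six consecutive reads land twice on each copy, which is the sense in which replicas share the load of the primary.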

Traffic samples are obtained by querying with various combinations of index conditions. For a given time node, the system can query the statistics of each traffic sample within that time node, including the number of bytes, the number of packets, the number of flows, and the average, maximum and minimum duration. The proportion of each protocol (HTTP, TCP, UDP and the like) is computed, and the sample traffic statistics are displayed as bar charts, pie charts, line charts and the like, so that a user can more easily understand the retrieval result and make decisions based on it.
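The statistics named above can be computed per minute with a simple grouping pass. The sketch below assumes each flow is a dictionary with `start` and `end` times in seconds and `bytes` and `packets` counters; these key names are illustrative and not taken from the specification.

```python
from collections import defaultdict


def minute_statistics(flows):
    """Group flows by the minute in which they start and summarize each group."""
    buckets = defaultdict(list)
    for f in flows:
        minute = int(f["start"] // 60) * 60  # start of the minute, in seconds
        buckets[minute].append(f)

    stats = {}
    for minute, group in buckets.items():
        durations = [f["end"] - f["start"] for f in group]
        stats[minute] = {
            "bytes": sum(f["bytes"] for f in group),
            "packets": sum(f["packets"] for f in group),
            "flows": len(group),
            "avg_duration": sum(durations) / len(durations),
            "max_duration": max(durations),
            "min_duration": min(durations),
        }
    return stats
```

Each resulting dictionary maps directly onto one bar, pie slice or line point in the charts described above.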

Step 3: Storing the PCAP traffic sample data. In step 2 the traffic samples are further divided according to different dimensions; the sample traffic is counted by the time dimension and stored in a Redis database. Redis is an open-source, log-type key-value database written in ANSI C; it supports networking, can be memory-based or persistent, and provides APIs for multiple languages.

Because Redis is memory-based, its data is lost on power failure. Both the AOF and RDB persistence methods are therefore enabled. In this case, when Redis restarts, it preferentially loads the AOF file to restore the original data, because in general the AOF file holds a more complete data set than the RDB file.

Redis offers strong real-time performance, but durable data persistence requires a further step: a dual-guarantee storage structure consisting of a MySQL master-slave cluster and an HDFS high-availability cluster is adopted, and the data is persisted to both places simultaneously.
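One way to picture the dual-guarantee write is a helper that sends the same counters to every configured sink and reports the outcome per sink, so a failure in one store does not prevent the write to the other. This is only a sketch under that assumption: the specification does not describe the write logic, and the sink callables here are placeholders for real MySQL and HDFS clients.

```python
from typing import Callable, Dict, Mapping, Optional

# A sink is any callable that persists (key, counters); hypothetical interface.
Sink = Callable[[str, Mapping[str, int]], None]


def persist_everywhere(
    key: str, counters: Mapping[str, int], sinks: Mapping[str, Sink]
) -> Dict[str, Optional[str]]:
    """Write counters to every sink; return {sink name: error text or None}."""
    results: Dict[str, Optional[str]] = {}
    for name, write in sinks.items():
        try:
            write(key, counters)
            results[name] = None
        except Exception as exc:  # keep going so the other store still gets the data
            results[name] = str(exc)
    return results
```

A caller could retry only the sinks whose entry in the result is not `None`.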

Step 4: Playback of sample traffic. The purpose of traffic playback is to prepare for testing firewalls, NIDS and other network devices. The TCPREPLAY tool is used to play back the sample traffic. The network traffic can be replayed at the speed at which the packets were captured or at a specified speed, and packets can be sent through a designated port of a dedicated network card at a specific moment, realizing controllable simulation of live network traffic. During playback, the order of the transmitted packets is strictly guaranteed to be consistent with the order of the real traffic packets at capture time; real-time feedback of statistics such as the number of packets played back, the playback time and the current playback rate is supported; and dynamic modification of the played-back packets according to MAC address is supported.
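The two replay modes can be illustrated by building a `tcpreplay` command line. The sketch below uses only the `--intf1`, `--multiplier` and `--mbps` options of the tcpreplay tool; with neither speed option, tcpreplay replays at the captured rate. The wrapper function and its parameter names are hypothetical, and the sketch does not cover MAC address rewriting or progress feedback.

```python
def tcpreplay_command(pcap_path, interface, multiplier=None, mbps=None):
    """Build an argument list for tcpreplay.

    multiplier: replay at a multiple of the captured speed (e.g. 2.0).
    mbps:       replay at a fixed rate in Mbit/s.
    With neither set, tcpreplay replays at the captured speed.
    """
    if multiplier is not None and mbps is not None:
        raise ValueError("choose either multiplier or mbps, not both")
    cmd = ["tcpreplay", f"--intf1={interface}"]
    if multiplier is not None:
        cmd.append(f"--multiplier={multiplier}")
    elif mbps is not None:
        cmd.append(f"--mbps={mbps}")
    cmd.append(pcap_path)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host where tcpreplay is installed and the process has permission to send on the chosen interface.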

Based on the same inventive concept, another embodiment of the present invention provides a system for implementing distributed traffic collection and analysis, including:

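a sample traffic capture module, used for collecting network traffic samples;

a sample traffic marking module, used for adding index labels of different dimensions to the collected network traffic samples;

a sample traffic retrieval module, used for storing the collected network traffic samples in an Elasticsearch distributed search engine and retrieving the network traffic samples according to different dimensions;

a sample traffic statistics module, used for counting the network traffic samples and storing them in a Redis database;

and a sample traffic playback module, used for playing back the network traffic samples.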

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.

Parts of the invention not described in detail are well known to the person skilled in the art.

The foregoing disclosure of the specific embodiments of the present invention and the accompanying drawings is directed to an understanding of the present invention and its implementation, and it will be appreciated by those skilled in the art that various alternatives, modifications, and variations may be made without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims

1. A method for realizing distributed traffic collection and analysis, characterized by comprising the following steps:

collecting network traffic samples, and adding index labels of different dimensions to the network traffic samples;

storing the collected network traffic samples in the Elasticsearch distributed search engine, and retrieving the network traffic samples according to different dimensions;

counting the network traffic samples and storing them in the Redis database;

playing back the network traffic samples.

2. The method according to claim 1, wherein the different dimensions include: a region dimension, a bandwidth dimension, a time dimension, a code address dimension, a protocol dimension, a keyword dimension, a length range dimension, and a traffic length range dimension.

3. The method according to claim 1, characterized in that the Elasticsearch distributed search engine shards the index; when an index is created, the number of shards of the index needs to be specified, and the shards are divided into primary shards and replica shards; when a document is stored, the Elasticsearch distributed search engine stores it in the corresponding primary shard through calculation and then synchronizes it to the replica shards of that primary shard; a replica shard not only provides redundancy for its primary shard, but can also perform queries and calculations to share the load of its primary shard.

4. The method according to claim 1, wherein statistics are computed on the network traffic samples, and the statistical values include the number of bytes, the number of packets, the number of flows, the average duration, the maximum duration, and the minimum duration; and the proportion of each protocol is computed, and the sample traffic statistics are displayed as bar charts, pie charts, and line charts so that users can easily understand the retrieval results.

5. The method according to claim 1, wherein, for the data stored in the Redis database, a dual-guarantee storage structure consisting of a MySQL master-slave cluster and an HDFS high-availability cluster is used for persistent data storage.

6. The method according to claim 1, wherein playing back the network traffic samples comprises replaying the network traffic by using the TCPREPLAY technology.

7. The method according to claim 1, characterized in that, when the network traffic samples are played back, replaying the network traffic at the speed at which the packets were captured or at a specified speed is supported; during playback, the order of the transmitted packets is strictly guaranteed to be consistent with the order of the real traffic packets at capture time; real-time feedback of the number of packets played back, the playback time and the current playback rate is supported; and dynamic modification of the played-back packets according to MAC address during playback is supported.

8. A system for realizing distributed traffic collection and analysis using the method according to any one of claims 1 to 7, characterized by comprising:

a sample traffic capture module, used to collect network traffic samples;

a sample traffic labeling module, used to add index labels of different dimensions to the collected network traffic samples;

a sample traffic retrieval module, used to store the collected network traffic samples in the Elasticsearch distributed search engine and retrieve the network traffic samples according to different dimensions;

a sample traffic statistics module, used to collect statistics on the network traffic samples and store them in the Redis database;

a sample traffic playback module, used to play back the network traffic samples.

9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for executing the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the method according to any one of claims 1 to 7 is implemented.