Disclosure of Invention
The invention provides a method and a system for realizing distributed traffic collection and analysis, which mark, search, store and replay network traffic by collecting and analyzing network data packets, so that the state of the network can be known in real time from the captured traffic.
The method classifies network traffic along the region dimension, the bandwidth dimension, the time dimension, the code address dimension, the two-layer protocol dimension, the three-layer protocol dimension, the keyword dimension, the length range dimension, the flow length range dimension and the like. It supports queries by various combinations of index conditions to obtain the corresponding traffic samples and generate an information summary file, and the samples can then be played back, counted and so on.
The technical scheme adopted by the invention is as follows:
a method for realizing distributed traffic collection and analysis comprises the following steps:
collecting a network flow sample, and adding index labels with different dimensions to the network flow sample;
storing the collected network traffic samples into an Elasticsearch distributed search engine, and retrieving the network traffic samples according to different dimensions;
counting network flow samples and storing the network flow samples into a Redis database;
and playing back the network traffic sample.
Further, the different dimensions include: region dimension, bandwidth dimension, time dimension, code address dimension, protocol dimension, keyword dimension, length range dimension, and traffic length range dimension.
Further, the Elasticsearch distributed search engine shards the index; when an index is created, the number of shards of the index needs to be specified, and the shards are divided into primary shards and replica shards; when a document is stored, Elasticsearch computes the corresponding primary shard, stores the document there, and then synchronizes it to the replica shards; the replica shards not only provide redundancy for the primary shard, but can also serve queries and computation to share the load of the primary shard.
Further, the network flow samples are counted, and the counted values include byte number, packet number, stream number, average duration value, maximum duration value and minimum duration value; and the proportion of each protocol is computed, and the sample flow statistics are displayed through a bar chart, a pie chart and a line chart, so that a user can conveniently understand a retrieval result.
Further, for the data stored in the Redis database, a dual-guarantee storage structure of a MySQL master-slave cluster and an HDFS high-availability cluster is adopted for persistent data storage.
Further, the playback of the network traffic sample is to replay the network traffic by using TCPREPLAY technology.
Further, when the network flow sample is played back, the network flow can be played back either at the packet rate at which the sample flow was captured or at a designated rate, and during playback the order of the transmitted data packets is strictly kept consistent with the order of the real data packets at capture time; the number of played back packets, the playback time and the current playback rate are fed back in real time during playback, and the played back data packets can be dynamically modified according to their MAC addresses during playback.
A system for realizing distributed flow acquisition and analysis by adopting the method comprises:
the sample flow capturing module is used for collecting network flow samples;
the sample traffic marking module is used for adding index labels with different dimensions to the acquired network traffic samples;
the sample traffic retrieval module is used for storing the acquired network traffic samples into an Elasticsearch distributed search engine and retrieving the network traffic samples according to different dimensions;
the sample flow counting module is used for counting network flow samples and storing the network flow samples into a Redis database;
and the sample flow playback module is used for playing back the network flow sample.
The invention has the following advantages and positive effects:
(1) Elasticsearch (ES) is used as a large distributed cluster, so that a new server can easily be added to the ES cluster; it can also run on a single machine as a lightweight search engine; compared with a traditional relational database, ES provides full-text retrieval, synonym processing, relevance ranking, complex data analysis, near-real-time processing of mass data and the like; the same index is divided into a plurality of shards (Shard), and processing efficiency is improved by divide and conquer; a replica (Replica) mechanism is provided, one shard can have a plurality of replicas, and the cluster can still work normally even if some servers are down; and a simple, easy-to-use API is provided, so that building, deploying and using the service is easy.
(2) With TCPREPLAY, the sample flow can be played back to a designated location either as captured or after arbitrary modification, and the speed at which the sample flow is replayed can be specified.
(3) For parsing PCAP packets, the Data Plane Development Kit (DPDK) is used to parse them into ELOG log form, so that data packets can be processed rapidly; secondary development on top of DPDK can greatly improve data processing performance and throughput and the working efficiency of data plane applications.
(4) For data storage, the invention uses both MySQL and HDFS, and builds each cluster in a high-availability mode, thereby avoiding single points of failure and further ensuring the persistence of data.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
The PCAP (packet capture, an application program interface for capturing network traffic) network traffic analysis in the invention is mainly divided into three parts: marking, retrieval and playback. Marking mainly parses the captured PCAP packets to generate an ELOG file, marks the generated file data, and adds indexes for some commonly used fields. Retrieval mainly stores the data parsed from the captured PCAP packets into an Elasticsearch distributed search engine, queries it through various combinations of index conditions to obtain traffic samples, and generates an information summary file; during retrieval the data are classified and counted at minute granularity, and the traffic samples are presented through a bar chart, a list, a pie chart and a line chart, so that a user can conveniently understand the retrieval results. Playback mainly uses the TCPREPLAY tool to replay network traffic from the captured PCAP file at the packet rate at capture time or at a designated rate, as long as the rate is within what the hardware can sustain. Traffic can be split between two network cards, written to files, and filtered and edited in various ways as required, thereby providing a method for testing firewalls, NIDS and other network devices.
The key technologies used by the system are as follows: data acquisition and data analysis; the Elasticsearch distributed search engine; Redis distributed storage, MySQL persistence, and the HDFS file system.
The invention provides a method for realizing PCAP network flow analysis, which comprises the following steps:
Step 1: PCAP traffic sample analysis and labeling. After the PCAP traffic sample is received, it is parsed, and index labels with different dimensions are added to commonly used fields. The index labels mainly comprise a region dimension, a bandwidth dimension, a time dimension, a code address dimension, a two-layer protocol dimension, a three-layer protocol dimension, a keyword dimension, a length range dimension, a flow length range dimension and the like.
a) The region dimension. It mainly comprises the access office point, the access operator and the access direction.
b) The bandwidth dimension. Samples are classified by link bandwidth, such as 10GE, 100GE, 10GPS, etc.
c) The time dimension. The samples are analyzed by minute, hour, day, month and year.
d) The code address dimension. The analysis is performed according to the source and destination IP address ranges and port number ranges of the traffic.
e) The two-layer protocol dimension. The analysis is performed according to the source and destination MAC address ranges, VLAN, MPLS, ICMP, ARP, LACP, LLDP, etc.
f) The three-layer protocol dimension. The analysis is performed according to typical applications such as DHCP, DNS, HTTP, SMTP, POP3, IMAP, FTP, etc.
g) The keyword dimension. Custom keywords at any position are supported, as are feature code definitions at layers two, three and four.
h) The length range dimension. The analysis is performed according to the ranges of two-layer frame check, three-layer packet length and four-layer packet length.
i) The flow length range dimension. The analysis is performed according to the flow packet data range and the flow duration range.
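The labeling in Step 1 can be illustrated with a minimal Python sketch. It is not the parser of the invention: the record layout, every field name (`site`, `carrier`, `direction` and so on) and the length buckets are hypothetical and chosen only to show how one parsed packet record might receive labels for several of the dimensions listed above.

```python
from datetime import datetime, timezone

# Hypothetical packet-length buckets in bytes; the text does not define the ranges.
LENGTH_BUCKETS = [(0, 64), (65, 512), (513, 1518), (1519, 65535)]


def length_range(length):
    """Return the label of the first bucket that contains `length`."""
    for low, high in LENGTH_BUCKETS:
        if low <= length <= high:
            return f"{low}-{high}"
    return "other"


def label_record(record):
    """Attach index labels for several dimensions to one parsed packet record."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return {
        # a) region dimension: office point, operator, direction
        "region": {
            "site": record["site"],
            "carrier": record["carrier"],
            "direction": record["direction"],
        },
        # b) bandwidth dimension
        "bandwidth": record["bandwidth"],
        # c) time dimension at several granularities
        "time": {
            "minute": ts.strftime("%Y-%m-%dT%H:%M"),
            "hour": ts.strftime("%Y-%m-%dT%H"),
            "day": ts.strftime("%Y-%m-%d"),
        },
        # d) code address dimension
        "address": {
            "src": f'{record["src_ip"]}:{record["src_port"]}',
            "dst": f'{record["dst_ip"]}:{record["dst_port"]}',
        },
        # e) and f) protocol dimensions
        "l2_protocol": record["l2_protocol"],
        "l3_protocol": record["l3_protocol"],
        # h) length range dimension
        "length_range": length_range(record["length"]),
    }
```

The resulting dictionary could then be stored together with the sample as the indexed fields of a search document.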
Step 2: Retrieval of PCAP traffic sample data. The sample traffic is further divided according to the dimensions listed in Step 1 and stored in an Elasticsearch distributed search engine, and the index is sharded to avoid the data loss that can occur when the hardware of a single node reaches its storage limit.
The index is sharded as follows. Elasticsearch is built on Lucene; one node is an instance of Elasticsearch, and each shard is an instance of Lucene that contains all the basic functions of Lucene. When an index is created, the number of shards of the index needs to be specified, and the shards are divided into primary shards and replica shards. When a document is stored, Elasticsearch computes the corresponding primary shard, stores the document there, and then synchronizes it to the replica shards, so the replica shards can be regarded as redundant copies of the primary shard. However, a replica shard does more than provide redundancy: it can also serve queries, computation and the like to share the load of the primary shard.
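The write path described above can be sketched in a few lines of Python. This is a toy model rather than Elasticsearch itself: `zlib.crc32` is used here only as a stand-in for the engine's internal routing hash, the in-memory dictionaries stand in for Lucene instances, and the shard and replica counts are example values. The settings dictionary uses the `number_of_shards` and `number_of_replicas` index settings that are supplied when an index is created.

```python
import zlib

# Example index-creation body: 3 primary shards, 1 replica of each (example values).
INDEX_SETTINGS = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}


def primary_shard_for(doc_id, num_primary_shards):
    """Map a document id to a primary shard: hash(routing value) mod shard count.

    crc32 is an assumption made for this sketch, not the hash the engine uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def store(doc_id, doc, primaries, replicas):
    """Write to the computed primary shard, then synchronize to its replicas."""
    shard = primary_shard_for(doc_id, len(primaries))
    primaries[shard][doc_id] = doc
    for replica in replicas[shard]:
        replica[doc_id] = doc
    return shard


# Build empty in-memory shards matching INDEX_SETTINGS.
n_shards = INDEX_SETTINGS["settings"]["number_of_shards"]
n_replicas = INDEX_SETTINGS["settings"]["number_of_replicas"]
primaries = [{} for _ in range(n_shards)]
replicas = [[{} for _ in range(n_replicas)] for _ in range(n_shards)]
```

Because the shard number is derived from the hash modulo the number of primary shards, the same document id always lands on the same shard as long as that number does not change.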
For the retrieval of data with different dimensions, the process of reading the data is specifically as follows:
a) The client sends a read request (GET request) to any node, which then acts as the coordinating node.
b) The coordinating node routes the document and forwards the request to the corresponding node; at this point a random polling algorithm selects either the primary shard or one of its replica shards, so that the read requests are load-balanced.
c) The node receiving the request returns the document to the coordinating node.
d) The coordinating node returns the document to the client.
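Steps a) to d) can be mimicked with a small in-memory model. The sketch below is an illustration under stated assumptions, not the engine's code: `MiniCluster` is a hypothetical class, `zlib.crc32` stands in for the real routing hash, and plain round-robin is used as a simple stand-in for the random polling selection described in step b).

```python
import itertools
import zlib


class MiniCluster:
    """Toy cluster: each shard has one primary copy (index 0) plus replicas."""

    def __init__(self, num_shards=3, num_replicas=1):
        copies_per_shard = 1 + num_replicas
        self.copies = [[{} for _ in range(copies_per_shard)] for _ in range(num_shards)]
        # One round-robin cursor per shard, cycling over primary and replicas.
        self._turn = [itertools.cycle(range(copies_per_shard)) for _ in range(num_shards)]

    def _shard_of(self, doc_id):
        # Routing step performed by the coordinating node.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.copies)

    def index(self, doc_id, doc):
        # Write to the primary and synchronize to every replica.
        for copy in self.copies[self._shard_of(doc_id)]:
            copy[doc_id] = doc

    def get(self, doc_id):
        """Coordinating-node read: route, pick a copy, return the document."""
        shard = self._shard_of(doc_id)
        copy_index = next(self._turn[shard])  # balances reads across copies
        return self.copies[shard][copy_index].get(doc_id), copy_index
```

Two consecutive `get` calls for the same id are served by different copies of the same shard, which is how replicas share the read load of the primary.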
The traffic samples are obtained by querying with various combinations of index conditions, and for a given time node the system can query the statistics of each traffic sample at that time node, including the number of bytes, the number of packets, the number of streams, and the average, maximum and minimum duration. The proportion of each protocol (HTTP, TCP, UDP and the like) is also computed, and the sample traffic statistics are displayed through a bar chart, a pie chart, a line chart and the like, so that a user can understand the retrieval result more conveniently and make decisions based on it.
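A minimal Python sketch of the combined-condition query and of the statistics named above follows. The field names (`timestamp`, `start`, `bytes`, `packets`, `duration`, `protocol`) are hypothetical; only the `bool`/`filter`/`term`/`range` structure of the query body follows the Elasticsearch query DSL. The statistics are computed locally here purely for illustration.

```python
from collections import defaultdict


def build_query(term_filters, start, end):
    """Combine exact-match conditions with a time range into one bool query body."""
    filters = [{"term": {field: value}} for field, value in term_filters.items()]
    filters.append({"range": {"timestamp": {"gte": start, "lt": end}}})
    return {"query": {"bool": {"filter": filters}}}


def minute_stats(flows):
    """Per-minute byte, packet and flow counts plus duration average, max and min."""
    buckets = defaultdict(list)
    for flow in flows:
        buckets[int(flow["start"]) // 60 * 60].append(flow)
    stats = {}
    for minute, group in buckets.items():
        durations = [f["duration"] for f in group]
        stats[minute] = {
            "bytes": sum(f["bytes"] for f in group),
            "packets": sum(f["packets"] for f in group),
            "flows": len(group),
            "avg_duration": sum(durations) / len(durations),
            "max_duration": max(durations),
            "min_duration": min(durations),
        }
    return stats


def protocol_share(flows):
    """Fraction of total bytes carried by each protocol (input for a pie chart)."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["protocol"]] += flow["bytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {proto: count / grand_total for proto, count in totals.items()}
```

The dictionaries returned by `minute_stats` and `protocol_share` map directly onto the series needed for line, bar and pie charts.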
Step 3: Storage of PCAP traffic sample data. In Step 2 the traffic samples are further divided according to different dimensions; the sample traffic is counted along the time dimension and stored in a Redis database. Redis is an open-source, log-type key-value database written in ANSI C; it supports network access, can run in memory or persist data to disk, and provides APIs for multiple languages.
Because Redis is memory-based, its data is lost on power failure. Both the AOF and RDB persistence methods are therefore enabled; in this case, when Redis restarts, the AOF file is loaded preferentially to restore the original data, because the AOF file generally holds a more complete data set than the RDB file.
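The counting and persistence settings can be sketched as follows. The sketch is written against any client object that offers `hincrby` and `config_set` methods (the redis-py client `redis.Redis` provides methods with these names); `InMemoryClient` is a hypothetical stand-in so that the example runs without a server, and the key layout `traffic:stats:<minute>` and the snapshot thresholds are illustrative assumptions.

```python
class InMemoryClient:
    """Tiny stand-in exposing the two calls used below (no Redis server needed)."""

    def __init__(self):
        self.hashes = {}
        self.config = {}

    def hincrby(self, name, key, amount=1):
        bucket = self.hashes.setdefault(name, {})
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]

    def config_set(self, name, value):
        self.config[name] = value
        return True


def enable_persistence(client):
    """Turn on both persistence methods mentioned in the text."""
    client.config_set("appendonly", "yes")      # AOF: append every write to a log
    client.config_set("save", "900 1 300 10")   # RDB: snapshot thresholds (example values)


def record_flow(client, start_epoch, byte_count, packet_count):
    """Add one flow to the per-minute counters held in a Redis hash."""
    minute = int(start_epoch) // 60 * 60
    key = f"traffic:stats:{minute}"  # hypothetical key layout
    client.hincrby(key, "bytes", byte_count)
    client.hincrby(key, "packets", packet_count)
    client.hincrby(key, "flows", 1)
    return key
```

In a deployment the same two functions would be called with a real client connected to the Redis instance, and the persistence options would more commonly be set in the server configuration file.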
The Redis database offers strong real-time performance, but durable storage requires a further step: the data are persisted to two places simultaneously by adopting a dual-guarantee storage structure of a MySQL master-slave cluster and an HDFS high-availability cluster.
Step 4: Playback of sample traffic. The purpose of traffic playback is to support testing of firewalls, NIDS and other network devices. TCPREPLAY is used to play back the sample traffic. The network traffic can be played back at the packet rate at which the sample was captured or at a designated rate; packets can be sent through a designated port of a dedicated network card at a specific moment, which realizes controllable simulation of live network traffic; during playback the order of the transmitted data packets is strictly kept consistent with the order of the real data packets at capture time; statistical information such as the number of played back packets, the playback time and the current playback rate is fed back in real time; and the played back data packets can be dynamically modified according to their MAC addresses during playback.
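As an illustration of how such a playback might be launched, the Python sketch below only assembles a command line and does not send any packets. The helper `build_replay_command` is hypothetical, and the option names (`--intf1`, `--stats`, `--mbps`, `--enet-dmac`) are taken from the Tcpreplay 4.x tools as best understood here, so they should be checked against the installed version before use. Without a rate option, Tcpreplay replays at the timing recorded in the capture file.

```python
def build_replay_command(pcap_path, interface, mbps=None, dst_mac=None, stats_interval=1):
    """Assemble an argument list for replaying a capture file on one interface.

    mbps=None keeps the packet timing recorded in the capture; a number requests
    a designated rate. dst_mac, if given, asks for the destination MAC address to
    be rewritten during replay.
    """
    # tcpreplay-edit is the variant that accepts packet-editing options.
    program = "tcpreplay-edit" if dst_mac else "tcpreplay"
    cmd = [program, f"--intf1={interface}", f"--stats={stats_interval}"]
    if mbps is not None:
        cmd.append(f"--mbps={mbps}")
    if dst_mac:
        cmd.append(f"--enet-dmac={dst_mac}")
    cmd.append(pcap_path)
    return cmd
```

The returned list can be passed to `subprocess.run` on a host where Tcpreplay is installed and the caller has permission to transmit on the chosen interface; the periodic statistics printed by the tool correspond to the real-time feedback described above.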
Based on the same inventive concept, another embodiment of the present invention provides a system for implementing distributed traffic collection and analysis, including:
the sample flow capturing module is used for collecting network flow samples;
the sample traffic marking module is used for adding index labels with different dimensions to the acquired network traffic samples;
the sample traffic retrieval module is used for storing the acquired network traffic samples into an Elasticsearch distributed search engine and retrieving the network traffic samples according to different dimensions;
the sample flow counting module is used for counting network flow samples and storing the network flow samples into a Redis database;
and the sample flow playback module is used for playing back the network flow sample.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, and the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
Parts of the invention not described in detail are well known to the person skilled in the art.
The foregoing disclosure of the specific embodiments of the present invention and the accompanying drawings is intended to aid understanding of the present invention and its implementation, and it will be appreciated by those skilled in the art that various alternatives, modifications, and variations may be made without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.