Low-overhead file operation log collection method
Technical Field
The invention relates to the field of data protection, in particular to a low-overhead file operation log collection method.
Background
With the rapid development of the internet, social media, cloud computing, the internet of things, mobile short video, e-commerce and similar fields, the volume of data generated worldwide each year is growing explosively. The age of big data has arrived, and data has become one of the world's most important digital assets. These technologies bring great convenience to daily life, such as mobile payment, face recognition, intelligent voice and unmanned supermarkets, but they also bring the risk of data leakage. Data leakage incidents of all kinds occur one after another, and the demand for data protection keeps increasing. At present, roughly 80% of data is stored in files, so logging file operations is one of the important measures for data protection: when data leaks, the file operation log can be traced back to find the source of the leak. However, the main problem of existing log collection methods is that the system overhead is too large, for the following reasons:
(1) File operation logs are recorded by intercepting system calls, a high-overhead approach in which every system call related to file operations is intercepted and logged.
(2) Log information is transferred from kernel space to user space through the costly printk function.
(3) File operation logs contain a large number of redundant logs and logs generated by temporary files, so the system logs grow too large and the disk IO overhead is high.
Existing file operation log collection methods therefore impose a large system overhead, which hinders deployment in production environments, and the excessive log volume also causes storage overhead. To address this, existing solutions collect file operation logs through a low-overhead stackable file system and then record the operations on only some files rather than all file operations in the system, or record all file operations of only some users rather than of all users. Although such methods reduce system overhead, they cannot record all file operations of all users; when a file whose operations were not logged is leaked, the leak cannot be traced back through the file operation log to identify the leaker and the leakage channel.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a low-overhead file operation log collection method.
The purpose of the invention can be realized by the following technical scheme:
a low-overhead file operation log collection method comprises the following steps:
1) collecting file operation log information in the kernel with a kernel probe;
2) setting up a shared memory region in kernel space into which the kernel probe writes the collected information, and reading that information from the shared memory in user space;
3) reducing the number of logs through a deduplication algorithm, thereby reducing log collection overhead.
In step 2), user space reads the information collected by the kernel probe from the shared memory in real time through the mmap mechanism.
In step 3), deduplication is performed by constructing a hash table whose keys and values are both structs: the key holds the parts that are identical across file operation logs, and the value holds the deduplicated log information.
The deduplication method comprises a filtering module and a merging module. The filtering module comprises kernel-layer filtering, which filters out unwanted file operation logs, and user-layer filtering, which filters out temporary files. The merging module merges file read and write operations: when the same file undergoes multiple read/write operations, multiple consecutive read operations are merged into one read log and multiple consecutive write operations are merged into one write log.
The specific operation flow of the merging module is as follows:
first, search the existing log information for the incoming log; if it exists, merge it, and if not, insert it into the hash table.
The identical parts of a file operation log comprise file information, process information and user information, specifically the process ID, parent process ID, user ID, file name and type of file operation.
The lookup complexity of the hash table is O(1); hash collisions are resolved by chaining (linked lists), and the hash function uses the division method.
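The chained hash table with division-method hashing described above can be sketched as follows. This is a minimal illustration, not the invention's implementation; the bucket count and the tuple key layout are assumptions made for the example.

```python
class ChainedHashTable:
    """Hash table using the division method h(k) = hash(k) mod m,
    with linked-list (chaining) collision resolution."""

    def __init__(self, num_buckets=1024):
        self.num_buckets = num_buckets
        # Each bucket is a chain (list) of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Division hash method: reduce the key modulo the bucket count.
        return hash(key) % self.num_buckets

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None  # not present

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # same log: update, do not duplicate
                return
        chain.append((key, value))      # new log: append to the chain

table = ChainedHashTable()
key = (1234, "data.csv", "read")      # illustrative (pid, file, op) key
table.insert(key, 1)
table.insert(key, 2)                  # identical log updates the entry
print(table.lookup(key))              # 2
```

Average lookup and insert stay O(1) as long as the number of entries per bucket remains small, which is the property the text relies on when comparing against O(n) lists and O(log n) trees.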
In the step 1), file operation log information is collected by adopting eBPF in a virtual file layer of a kernel.
Temporary files are filtered by file name, including temporary files with the suffixes .swp and .tmp.
Compared with the prior art, the invention has the following advantages:
first, the invention uses a low-overhead kernel probe to collect file operation information in the kernel, transfers the kernel information to user space through low-overhead shared memory, and then reduces the log volume in user space through a deduplication algorithm, so the overall overhead is low.
Second, the system overhead is reduced while all file operation logs of all users are still recorded.
Drawings
Fig. 1 is a frame diagram of the present invention.
Figure 2 is a system overhead diagram of the present invention.
FIG. 3 is a bar chart comparing the overhead of the present invention with that of the prior log collection method.
FIG. 4 is a flow chart of the deduplication algorithm of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
As shown in fig. 1, the invention provides a low-overhead file operation log collection method that addresses the large overhead of current file operation logging systems. It not only reduces system overhead but also records all file operations. Specifically:
the method collects file operation information in the kernel with a low-overhead kernel probe, transfers the information from the kernel to the user layer through shared memory, and reduces the number of logs with a deduplication algorithm, thereby reducing log collection overhead.
The specific design scheme comprises the following steps:
The method first collects kernel information related to all file operations with a kernel probe. A kernel probe can trace almost any kernel function; for collecting file operation information, the probe specifically traces functions of the virtual file system layer. Tracing at the system call layer is unattractive: there are too many file-related system calls, each would need its own handling, and different system calls may ultimately invoke the same virtual file system function. Tracing at the concrete file system layer is also unattractive: different systems use different file systems, each with its own file operation functions, so too many file systems would have to be handled and a different hook function would be needed for each. Therefore, file operation information is collected by tracing the functions of the virtual file system layer.
Then, the invention allocates a memory region in kernel space and maps its contents to user space through the mmap mechanism, realizing a shared memory design: the kernel probe writes the collected information into the shared memory, and user space continuously reads the kernel probe's information from that memory through mmap. The mmap mechanism is essentially a technique for mapping a file into memory, and shared memory can be built on top of it.
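The kernel-to-user transfer above can be illustrated with Python's `mmap` module. This is a user-space sketch only: a regular file stands in for the kernel-shared buffer (a real eBPF perf or ring buffer is likewise exposed to user space as a memory-mapped region), and the record format is an illustrative assumption.

```python
import mmap
import os
import tempfile

PAGE = 4096
path = os.path.join(tempfile.mkdtemp(), "shared_buf")
with open(path, "wb") as f:
    f.write(b"\x00" * PAGE)  # reserve one page for the shared region

# "Kernel side" stand-in: the writer places a log record into the region.
with open(path, "r+b") as f:
    writer = mmap.mmap(f.fileno(), PAGE)
    record = b"pid=1234 op=read file=/etc/hosts\n"
    writer[:len(record)] = record
    writer.flush()
    writer.close()

# "User side": the reader maps the same region and consumes the record
# without any copy through a syscall like read() -- this is the low-overhead
# property the mmap-based shared memory design relies on.
with open(path, "r+b") as f:
    reader = mmap.mmap(f.fileno(), PAGE)
    data = reader[:]
    line = data[:data.find(b"\n")]
    reader.close()

print(line.decode())  # pid=1234 op=read file=/etc/hosts
```

In the real design the writer is the in-kernel probe rather than a second process, but the reader-side pattern (map once, then poll the region) is the same.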
Finally, the log volume is reduced with an online deduplication algorithm. Compared with a traditional sequential deduplication method, performing deduplication by building a hash table saves a large amount of time, improves deduplication efficiency, and meets the requirement of zero deduplication errors. The deduplication algorithm consists of a filtering module and a merging module. The filtering module mainly filters the operation logs of temporary files, such as the temporary files generated when vim opens a file. The merging module targets file read/write operations: when the same file undergoes multiple read/write operations, multiple consecutive reads are merged into one read log and multiple consecutive writes into one write log. The search time complexity of a traditional linked list or array is O(n), and that of a B-tree or B+ tree is O(log n), which is still too high; to reduce overhead, the invention chooses a hash table with O(1) time complexity. Different logs are hashed to obtain distinct keys; exploiting this property of the hash table, distinct logs are inserted while identical logs are not, and the values in the hash table are the deduplicated log information.
For file operation logs, both the keys and the values of the hash table designed by the invention are structs. The key is designed to contain the parts that are identical across file operation logs, namely file information, process information and user information: specifically, the process ID, parent process ID, user ID, file name and file operation type. The value is designed to contain the parts that differ between read/write logs, including the amount of data read or written and the number of read/write operations. The value design also prevents read/write data information from being lost when read/write logs are deduplicated; for example, when a file is read ten times, the amount of data read each time differs. By continuously updating the hash table value, the specific read/write byte counts from the logs are preserved. In this embodiment, hash collisions are resolved by chaining, and the hash function uses the division method.
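The key/value struct design above can be sketched as follows: the key holds the invariant fields named in the text, and the value accumulates the varying per-operation counts so that byte counts survive deduplication. The field names and the use of Python dataclasses are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen => hashable, usable as a hash table key
class LogKey:
    """The parts that are identical across logs of the same operation."""
    pid: int
    ppid: int
    uid: int
    filename: str
    op: str  # "read" or "write"

@dataclass
class LogValue:
    """The parts that differ per operation, accumulated rather than lost."""
    count: int = 0
    total_bytes: int = 0

def merge(table: dict, key: LogKey, nbytes: int) -> None:
    # Continuously update the value so each call's byte count is preserved.
    val = table.setdefault(key, LogValue())
    val.count += 1
    val.total_bytes += nbytes

table: dict = {}
k = LogKey(1234, 1, 1000, "/home/user/data.csv", "read")
for nbytes in (4096, 8192, 512):  # three consecutive reads of the same file
    merge(table, k, nbytes)

# One entry instead of three logs; read count and total bytes retained.
print(len(table), table[k].count, table[k].total_bytes)  # 1 3 12800
```

Three consecutive read logs collapse into a single entry, while the per-read data volumes remain recoverable in aggregate, which is exactly the loss the value design guards against.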
Writing a kernel module directly around the kernel probe is difficult to develop and debug, affects system stability, and is not compatible across operating system versions. eBPF, by contrast, is safe, stable, compatible across operating system versions, usable in production environments, and supports kernel probes, so this embodiment uses eBPF to collect file system information. The eBPF collection of file operation logs concentrates on the virtual file layer of the kernel: file systems are numerous and different systems choose different ones, so collecting logs at the concrete file system layer would require handling every file system, an excessive workload. File operations come in many types, such as read, write, copy, delete and attribute modification. The corresponding kernel functions are selected for each file operation, and these kernel functions are traced through eBPF so that all kernel-side file operation information can be collected. eBPF also supports transferring the file operation information from kernel space to user space through shared memory.
After the file operation information is transferred from kernel space to user space, the deduplication algorithm implemented by the invention effectively reduces the volume of logs written to file, which reduces disk IO and hence system overhead. The deduplication algorithm comprises a filtering module and a merging module. The filtering module is further divided into kernel-layer filtering and user-layer filtering. Kernel-layer filtering drops certain file operation logs inside the eBPF code, such as logs of kernel daemons that continuously read configuration files; these processes are filtered by their pid at the kernel layer. User-layer filtering is performed at the user layer after the information arrives from the kernel layer, and mainly filters temporary files. At the present stage, temporary files are filtered by file name, for example temporary files with the suffixes .swp and .tmp. Merging is performed after the log information has been transferred from kernel space to user space through the shared memory: read/write logs are merged through the hash table designed by the invention, removing redundant read/write logs and reducing the number of logs.
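The two filtering layers can be sketched as simple predicates. The suffixes .swp and .tmp come from the text; the helper names, the event dictionary layout, and the example kernel-daemon pid set are illustrative assumptions (in the real design the pid filter runs inside the eBPF code, not in Python).

```python
TEMP_SUFFIXES = (".swp", ".tmp")
KERNEL_DAEMON_PIDS = {2, 15}  # example pids filtered at the kernel layer

def kernel_layer_filter(event: dict) -> bool:
    """Keep the event unless it came from a known kernel daemon."""
    return event["pid"] not in KERNEL_DAEMON_PIDS

def user_layer_filter(event: dict) -> bool:
    """Keep the event unless it targets a temporary file."""
    return not event["filename"].endswith(TEMP_SUFFIXES)

events = [
    {"pid": 1234, "filename": "/home/u/report.doc"},
    {"pid": 1234, "filename": "/home/u/.report.doc.swp"},  # vim temp file
    {"pid": 2,    "filename": "/etc/config"},              # daemon read
]
kept = [e for e in events if kernel_layer_filter(e) and user_layer_filter(e)]
print([e["filename"] for e in kept])  # ['/home/u/report.doc']
```

Of the three events, only the ordinary document survives: the swap file is dropped at the user layer and the daemon read would never have left the kernel layer.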
As shown in fig. 4, the deduplication algorithm process implemented by the present invention is as follows:
(1) First, judge whether the incoming log information is a temporary-file operation log or another log that needs filtering; if so, filter it out. If not, proceed to the next step.
(2) Create the hash table and store the log information.
(3) Search the hash table and judge whether the log is redundant; if so, update the hash table and merge the log information. If not, proceed to the next step.
(4) Write the log information in the hash table to the log file.
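The four steps above can be tied together in one pass. This is a hedged end-to-end sketch under the same assumptions as before: the raw-log dictionary layout and field names are illustrative, and the deduplicated entries are returned rather than written to a file.

```python
def deduplicate(raw_logs):
    """Sketch of the four-step flow: filter, build the table, merge, emit."""
    table = {}
    for log in raw_logs:
        # Step (1): filter temporary-file logs and other unwanted logs.
        if log["filename"].endswith((".swp", ".tmp")):
            continue
        # Steps (2)-(3): look up the hash table; merge if the key already
        # exists (redundant log), otherwise insert a new entry.
        key = (log["pid"], log["ppid"], log["uid"], log["filename"], log["op"])
        if key in table:
            table[key]["count"] += 1
            table[key]["bytes"] += log["bytes"]
        else:
            table[key] = {"count": 1, "bytes": log["bytes"]}
    # Step (4): emit the deduplicated entries (written to the log file
    # in the real system; returned here for illustration).
    return [{"key": k, **v} for k, v in table.items()]

raw = [
    {"pid": 1, "ppid": 0, "uid": 0, "filename": "/a",     "op": "read",  "bytes": 100},
    {"pid": 1, "ppid": 0, "uid": 0, "filename": "/a",     "op": "read",  "bytes": 200},
    {"pid": 1, "ppid": 0, "uid": 0, "filename": "/a.tmp", "op": "write", "bytes": 50},
]
out = deduplicate(raw)
print(len(out), out[0]["count"], out[0]["bytes"])  # 1 2 300
```

Two redundant reads collapse into one entry with their byte counts summed, and the temporary-file write never reaches the output.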
Examples
The system overhead of the log collection method was tested in the following environment: a cluster of two 1.87 GHz machines, each with a 16-core Intel Xeon processor, 8 GB of memory, a 40 GB hard disk, and the Linux 4.15.9 operating system.
As shown in figs. 2 to 3, to test the system performance overhead of the collection method provided by the invention, the read/write performance of a machine running the log collection system implemented herein is compared with that of a machine not running it, to observe the influence of the proposed log collection method on system overhead. Meanwhile, the performance improvement of the proposed log system is observed by comparing its system performance with that of the existing log collection system Progger. For the overhead performance test, the Bonnie++ tool is used to create 100 small files of 1 KB each; the IO counts of the system with and without the log collection tool are measured, the percentage performance loss is computed, and the performance losses of the log collection system DataLogger designed with this method and of the existing open-source log system Progger are calculated respectively, as shown in table 1.
TABLE 1 DataLogger overhead performance test