
WO2020168756A1 - Cluster log feature extraction method, and apparatus, device and storage medium - Google Patents

Cluster log feature extraction method, and apparatus, device and storage medium

Info

Publication number
WO2020168756A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
value
log
collected
log data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/118288
Other languages
French (fr)
Chinese (zh)
Inventor
吴超勇
陈仕财
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2020168756A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • This application relates to infrastructure operation and maintenance, and in particular to a cluster log feature extraction method, apparatus, device, and storage medium.
  • In the era of explosive information growth, file sizes and data scales of terabytes or even petabytes have become a reality, and cluster storage systems have grown to clusters of 64 nodes.
  • Managing such a large cluster system has become a severe challenge for data centers, making it especially important to track the running status of cluster nodes in time and to accurately locate node error information.
  • A commonly used cluster storage system log management method can send system logs periodically or in real time, achieving centralized transmission of the logs; however, the logs are neither analyzed nor managed, so the operating status of the entire cluster storage system cannot be understood globally and error messages cannot be located quickly.
  • To solve these problems, this application provides a cluster log feature extraction method, applied to an electronic device, comprising the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface; cleaning the log data with Hadoop to filter out the raw data, where the raw data includes at least server disk occupancy, memory usage, CPU occupancy, and business interface call volume; extracting feature values from the raw data, including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index; and using the Pearson correlation coefficient to screen out effective features, computing the Pearson correlation coefficient between each extracted feature value and the raw data, comparing the result with a correlation threshold, and treating data above the threshold as valid and data below the threshold as invalid and removing it.
  • This application also provides a cluster log feature extraction apparatus, comprising a log collection module, a data cleaning module, a feature extraction module, and an effective feature screening module.
  • The log collection module is used to collect the logs of the server cluster through the Flume client and send them to the HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface.
  • The data cleaning module is used to clean the log data with Hadoop and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.
  • The feature extraction module is used to extract feature values from the raw data, including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index; the effective feature screening module uses the Pearson correlation coefficient to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data and compared with the correlation threshold, with data above the threshold treated as valid and data below the threshold treated as invalid and removed.
  • This application also provides an electronic device comprising a memory and a processor, the memory storing a cluster log feature extraction program which, when executed by the processor, implements the following steps:
  • the Flume client collects the logs of the server cluster and sends them to the HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through the API interface;
  • the log data is cleaned with Hadoop to filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume;
  • feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data; and
  • the Pearson correlation coefficient is used to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data and compared with the correlation threshold, with data above the threshold considered valid and data below the threshold considered invalid and removed.
  • This application also provides a non-volatile computer-readable storage medium storing a computer program that includes program instructions which, when executed by a processor, implement the cluster log feature extraction method described above.
  • This application can effectively screen out the valid information in the production data of each host in a server cluster and extract feature values of the production data from that information, which facilitates failure prediction and failure classification for the production system and reduces the occurrence of production accidents.
  • FIG. 1 is a schematic flowchart of a cluster log feature extraction method according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the hardware architecture of an electronic device according to an embodiment of the present application;
  • FIG. 3 is a block diagram of a cluster log feature extraction program according to an embodiment of the present application;
  • FIG. 4 is a unit structure diagram of a log collection module according to an embodiment of the present application;
  • FIG. 5 is a unit structure diagram of a feature extraction module according to an embodiment of the present application;
  • FIG. 6 is a unit structure diagram of a data cleaning module according to an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a Flume Agent process reading data.
  • The cluster log feature extraction method of this embodiment includes the following steps:
  • Step S10: The logs of the server cluster are collected by the Flume (distributed mass log collection, aggregation, and transmission system) client and sent to the HBase database server.
  • Flume takes the Agent process as its smallest independent operating unit; one Agent process is a complete data collection tool.
  • An Agent comprises the components Source (data collection component), Channel (temporary transit storage), and Sink: the Source collects data from the server and passes it to the Channel; the Channel stores the Events (data units) passed in by the Source component; and the Sink reads and removes Events from the Channel and passes them to the backend.
  • Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.
  • Step S30: Hadoop (a distributed system infrastructure) is used to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.
  • Step S50: Feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data.
  • Step S70: The Pearson correlation coefficient is used to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data, and the result is compared with the correlation threshold. Data above the threshold is considered valid; data below the threshold is considered invalid and is removed.
  • During data cleaning, the Laida criterion (the 3σ rule) is used to remove data containing gross errors, through the following steps:
  • for the log data x_1, x_2, ..., x_n, compute the arithmetic mean, the residual errors, and the standard deviation, where x_i is the log data from a single Agent collection;
  • if the residual error of a value x_b exceeds three times the standard deviation, x_b is considered a singular value containing a gross error, and the singular value is removed.
  • The median here means that the variable values x_1, x_2, ..., x_n are arranged in order of magnitude to form a sequence; the value in the middle of that sequence is called the median.
  • Feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data; the square root amplitude, waveform index, impulse index, and kurtosis index are calculated using the formulas given in the detailed description below.
  • Here x_i is the log data from a single Agent collection; N is the number of log data collections; x̄ is the arithmetic mean of the collected log data; X_rms is its effective value; X_p is its peak value; X_r is its square root amplitude; X_ws is its waveform index; X_if is its impulse index; and X_kv is its kurtosis index.
  • The Pearson correlation coefficient is used to screen out effective features. Specifically, the Pearson correlation coefficient is computed between each of the above feature values and the raw data, and the result is compared with the correlation threshold: a value above the threshold indicates valid data, while a value below the threshold indicates invalid data that must be removed, so that only valid data is retained. For example, with a correlation threshold of 0.7, if the correlation coefficient between the square root amplitude and the raw data is 0.2, the square root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is deemed valid data. The formula for the Pearson correlation coefficient is given in the detailed description below.
  • Here x_i is a data value from a single Agent collection; y_i is a feature value extracted from the data collected by a single Agent; and N is the number of log data collections.
  • Flume includes multiple first-level Agents and one second-level Agent.
  • Each first-level Agent collects the log data of one server, and the log data collected by the first-level Agents is aggregated at the second-level Agent, which transmits it to HDFS.
  • The electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions.
  • It can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
  • The electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can be communicatively connected to each other through a system bus.
  • The memory 21 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • The memory 21 may be an internal storage unit of the electronic device 2, for example, a hard disk or internal memory of the electronic device 2.
  • The memory 21 may also be an external storage device of the electronic device 2, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 2.
  • The memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device.
  • The memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the cluster log feature extraction program code.
  • The memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 22 is generally used to control the overall operation of the electronic device 2, for example, performing control and processing related to the data interaction or communication of the electronic device 2.
  • The processor 22 is configured to run the program code or process the data stored in the memory 21, for example, to run the cluster log feature extraction program.
  • The network interface 23 may include a wireless network interface or a wired network interface.
  • The network interface 23 is usually used to establish a communication connection between the electronic device 2 and other electronic devices.
  • The network interface 23 is used to connect the electronic device 2 to a push platform through a network, establishing a data transmission channel and a communication connection between the electronic device 2 and the push platform.
  • The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • The electronic device 2 may also include a display, which may also be called a display screen or display unit.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display, etc.
  • The display is used to show the information processed in the electronic device 2 and to display a visualized user interface.
  • FIG. 2 only shows the electronic device 2 with components 21-23, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • The memory 21, which contains a readable storage medium, may include an operating system, a cluster log feature extraction program 50, and the like.
  • The processor 22 implements the following steps when executing the cluster log feature extraction program 50 in the memory 21:
  • Step S10: The logs of the server cluster are collected by the Flume (distributed mass log collection, aggregation, and transmission system) client and sent to the HBase database server.
  • Flume takes the Agent component as its smallest independent operating unit; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.
  • Step S30: Hadoop (a distributed system infrastructure) is used to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.
  • Step S50: Feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data.
  • Step S70: The Pearson correlation coefficient is used to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data, and the result is compared with the correlation threshold. Data above the threshold is considered valid; data below the threshold is considered invalid and is removed.
  • The cluster log feature extraction program stored in the memory 21 can be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete this application.
  • FIG. 3 shows a schematic diagram of the program modules of the cluster log feature extraction program.
  • In this embodiment, the cluster log feature extraction program 50 can be divided into a log collection module 501, a data cleaning module 502, a feature extraction module 503, and an effective feature screening module 504.
  • A program module referred to in this application is a series of computer program instruction segments capable of completing specific functions, better suited than a whole program to describing the execution of the cluster log feature extraction program in the electronic device 2.
  • The cluster log feature extraction method is implemented through the specific functions of these program modules.
  • This application also provides a cluster log feature extraction apparatus, comprising a log collection module 501, a data cleaning module 502, a feature extraction module 503, and an effective feature screening module 504.
  • The log collection module 501 is configured to collect the logs of the server cluster through the Flume (distributed mass log collection, aggregation, and transmission system) client and send them to the HBase database server.
  • Flume takes the Agent component as its smallest independent operating unit; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.
  • The data cleaning module 502 is configured to use Hadoop (a distributed system infrastructure) to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.
  • The feature extraction module 503 is used to extract feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index from the raw data.
  • The effective feature screening module 504 uses the Pearson correlation coefficient to screen out effective features: it computes the Pearson correlation coefficient between each extracted feature value and the raw data and compares the result with the correlation threshold. Data above the threshold is considered valid; data below the threshold is considered invalid and is removed.
  • The data cleaning module 502 includes a Laida criterion determination unit 5021, which uses the Laida criterion to remove data containing gross errors through the steps described above: for the data values x_i collected by a single Agent, compute the arithmetic mean, the residual errors, and the standard deviation; any value x_b whose residual error exceeds three times the standard deviation is considered a singular value containing a gross error and is removed.
  • The data cleaning module 502 further includes a singular value replacement unit 5022, which replaces each identified singular value in the log data with the median so as to preprocess the production data information.
  • The median here means that the variable values x_1, x_2, ..., x_n are arranged in order of magnitude to form a sequence; the value in the middle of that sequence is called the median.
  • The feature extraction module 503 includes a mean extraction unit 5031, an effective value extraction unit 5032, a peak value extraction unit 5033, a square root amplitude extraction unit 5034, a waveform index extraction unit 5035, an impulse index extraction unit 5036, and a kurtosis index extraction unit 5037, which extract from the raw data the feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index.
  • The square root amplitude, waveform index, impulse index, and kurtosis index are calculated using the formulas given in the detailed description.
  • Here x_i is the log data from a single Agent collection; N is the number of log data collections; x̄ is the arithmetic mean of the collected log data; X_rms is its effective value; X_p is its peak value; X_r is its square root amplitude; X_ws is its waveform index; X_if is its impulse index; and X_kv is its kurtosis index.
  • The Pearson correlation coefficient is used to screen out effective features. Specifically, the Pearson correlation coefficient is computed between each of the above feature values and the raw data, and the result is compared with the correlation threshold: a value above the threshold indicates valid data, while a value below the threshold indicates invalid data that must be removed, so that only valid data is retained. For example, with a correlation threshold of 0.7, if the correlation coefficient between the square root amplitude and the raw data is 0.2, the square root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is deemed valid data. The formula for the Pearson correlation coefficient is given above.
  • Here x_i is a data value from a single Agent collection; y_i is a feature value extracted from that data; and N is the number of data collections.
  • The log collection module 501 further includes an Agent setting unit 5011, which is configured to set up multiple first-level Agents and one second-level Agent for Flume; each first-level Agent collects the log data of one server, the log data collected by the first-level Agents is aggregated at the second-level Agent, and the second-level Agent transmits it to HDFS.
  • The specific implementation of the cluster log feature extraction apparatus of this application is substantially the same as that of the cluster log feature extraction method and the electronic device described above, and will not be repeated here.
  • An embodiment of this application also provides a non-volatile computer-readable storage medium.
  • The computer-readable storage medium may be a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, etc., or any combination of these.
  • The non-volatile computer-readable storage medium includes a cluster log feature extraction program, and the cluster log feature extraction program 50 implements the following operations when executed by the processor 22:
  • Step S10: The logs of the server cluster are collected through the Flume client and sent to the HBase database server.
  • Flume takes the Agent component as its smallest independent operating unit; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.
  • Step S30: Hadoop is used to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.
  • Step S50: Feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data.
  • Step S70: The Pearson correlation coefficient is used to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data, and the result is compared with the correlation threshold. Data above the threshold is considered valid; data below the threshold is considered invalid and is removed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A cluster log feature extraction method, and an apparatus, a device and a storage medium. The method comprises: collecting, by a Flume client, the logs of a server cluster and sending them to a database (S10); performing data cleaning on the log data to screen out raw data (S30); extracting feature values of the raw data, comprising the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index (S50); and computing the Pearson correlation coefficient between each extracted feature value and the raw data, comparing the calculated correlation coefficients with a correlation threshold, regarding data whose correlation coefficient is above the threshold as valid, and regarding data whose correlation coefficient is below the threshold as invalid and removing it (S70). Valid information in the production data of each host in a server cluster can be effectively screened out, and feature values of the production data extracted from that information, thereby facilitating failure prediction and failure classification for a production system and reducing the occurrence of production accidents.

Description

Cluster log feature extraction method, apparatus, device and storage medium

This application claims priority to Chinese Patent Application No. 201910123928.1, filed on February 19, 2019, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to infrastructure operation and maintenance, and in particular to a cluster log feature extraction method, apparatus, device, and storage medium.

Background

In the era of explosive information growth, file sizes and data scales of terabytes or even petabytes have become a reality, and cluster storage systems have grown to clusters of 64 nodes. Managing such a large cluster system has become a severe challenge for data centers, making it especially important to track the running status of cluster nodes in time and to accurately locate node error information. In the actual operation of cluster storage systems, a commonly used log management method can send system logs periodically or in real time, achieving centralized transmission of the logs; however, the logs are neither analyzed nor managed, so the operating status of the entire cluster storage system cannot be understood globally and error messages cannot be located quickly. Moreover, as the number of cluster nodes increases, managing the cluster system becomes more and more complicated. It is therefore particularly important to extract, from massive server data, the features that reflect server performance, to accurately locate potential failures of cluster nodes, and to carry out the corresponding performance checks in advance.

Summary of the Invention

To solve the above problems, this application provides a cluster log feature extraction method, applied to an electronic device, comprising the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface; cleaning the log data with Hadoop to filter out the raw data, where the raw data includes at least server disk occupancy, memory usage, CPU occupancy, and business interface call volume; extracting feature values from the raw data, including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index; and using the Pearson correlation coefficient to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data and compared with a correlation threshold, with data above the threshold considered valid and data below the threshold considered invalid and removed.

This application also provides a cluster log feature extraction apparatus, comprising a log collection module, a data cleaning module, a feature extraction module, and an effective feature screening module. The log collection module is used to collect the logs of the server cluster through the Flume client and send them to the HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface. The data cleaning module is used to clean the log data with Hadoop and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume. The feature extraction module is used to extract feature values from the raw data, including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index. The effective feature screening module uses the Pearson correlation coefficient to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data and compared with the correlation threshold; data above the threshold is valid, and data below the threshold is invalid and is removed.

This application also provides an electronic device comprising a memory and a processor, the memory storing a cluster log feature extraction program which, when executed by the processor, implements the following steps: collecting the logs of the server cluster through the Flume client and sending them to the HBase database, where the Flume client collects the logs of each server in the cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface; cleaning the log data with Hadoop to filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume; extracting feature values from the raw data, including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index; and using the Pearson correlation coefficient to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data and compared with the correlation threshold, with data above the threshold considered valid and data below the threshold considered invalid and removed.

This application also provides a non-volatile computer-readable storage medium storing a computer program that includes program instructions which, when executed by a processor, implement the cluster log feature extraction method described above.

This application can effectively screen out the valid information in the production data of each host in a server cluster and extract feature values of the production data from that information, which facilitates failure prediction and failure classification for the production system and reduces the occurrence of production accidents.

Brief Description of the Drawings

The above features and technical advantages of the present application will become clearer and easier to understand from the following description of its embodiments in conjunction with the accompanying drawings.

FIG. 1 is a schematic flowchart of a cluster log feature extraction method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the hardware architecture of an electronic device according to an embodiment of the present application;

FIG. 3 is a block diagram of a cluster log feature extraction program according to an embodiment of the present application;

FIG. 4 is a unit structure diagram of a log collection module according to an embodiment of the present application;

FIG. 5 is a unit structure diagram of a feature extraction module according to an embodiment of the present application;

FIG. 6 is a unit structure diagram of a data cleaning module according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a Flume Agent process reading data.

Detailed Description

Embodiments of the cluster log feature extraction method, apparatus, and storage medium described in this application will be described below with reference to the accompanying drawings. Those of ordinary skill in the art will recognize that the described embodiments can be modified in various ways, or combinations thereof, without departing from the spirit and scope of this application. Therefore, the drawings and description are illustrative in nature and are not intended to limit the scope of protection of the claims. In addition, in this specification the drawings are not drawn to scale, and the same reference numerals denote the same parts.

As shown in FIG. 1, the cluster log feature extraction method of this embodiment includes the following steps:

Step S10: The logs of the server cluster are collected by the Flume (distributed mass log collection, aggregation, and transmission system) client and sent to the HBase database server. Flume takes the Agent process as its smallest independent operating unit; one Agent process is a complete data collection tool. As shown in FIG. 7, an Agent comprises the components Source (data collection component), Channel (temporary transit storage), and Sink, which together form the Agent: the Source collects data from the server and passes it to the Channel; the Channel stores the Events (data units) passed in by the Source component; and the Sink reads and removes Events from the Channel and passes them to the backend. Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.
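In production, each Agent is driven by Flume configuration rather than custom code, but the collect-and-ship loop that an Agent performs can be illustrated with a minimal stand-in sketch. The sketch below is not Flume itself and does not come from the patent: the backend endpoint, the payload shape, and the use of the psutil and requests libraries are illustrative assumptions.

```python
# A minimal stand-in for one per-server Agent's periodic collect-and-ship loop.
# NOT Flume itself; the endpoint and payload shape are hypothetical.
import time

import psutil    # third-party: pip install psutil
import requests  # third-party: pip install requests

BACKEND_API = "http://hbase-gateway.example.com/api/logs"  # hypothetical endpoint

def collect_once(host: str) -> dict:
    """Gather the raw metrics the method names: disk, memory, CPU.
    Business interface call volume would come from application logs."""
    return {
        "host": host,
        "ts": int(time.time()),
        "disk_pct": psutil.disk_usage("/").percent,
        "mem_pct": psutil.virtual_memory().percent,
        "cpu_pct": psutil.cpu_percent(interval=1),
    }

def run_agent(host: str, period_s: int = 60) -> None:
    """Periodically collect the local metrics and ship them to the backend,
    as each per-server Agent does on its schedule."""
    while True:
        requests.post(BACKEND_API, json=collect_once(host), timeout=10)
        time.sleep(period_s)
```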

Step S30: Hadoop (a distributed system infrastructure) is used to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.

Step S50: Feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data.

Step S70: The Pearson correlation coefficient is used to screen out effective features: the Pearson correlation coefficient is computed between each extracted feature value and the raw data, and the result is compared with the correlation threshold. Data above the threshold is considered valid; data below the threshold is considered invalid and is removed.

Further, during data cleaning, the Laida criterion (the 3σ rule) is used to remove data containing gross errors, through the following steps:

For the log data $x_1, x_2, \ldots, x_n$, compute the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}, \qquad i = 1, 2, \ldots, n,$$

where $x_i$ is the log data from a single Agent collection.

Calculate the standard deviation $S_x$:

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^{2}}$$

If the residual error $v_b$ of some $x_b$ in the log data ($1 \le b \le n$) satisfies

$$|v_b| > 3S_x,$$

then $x_b$ is considered a singular value containing a gross error, and the singular value is removed.

Further, the Laida rule can effectively identify singular values in the production data, but removing them leaves null values. Therefore, each identified singular value in the log data is replaced with the median, preprocessing the production data information. The median here means that the variable values $x_1, x_2, \ldots, x_n$ are arranged in order of magnitude to form a sequence; the value in the middle of that sequence is called the median.
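A compact sketch of this cleaning step follows, assuming the collected values arrive as a plain list of floats and that the sample standard deviation (n−1 denominator) is intended; values flagged by the 3σ test are replaced with the median rather than dropped, so no null values remain.

```python
import statistics

def clean_with_laida(values: list[float]) -> list[float]:
    """Flag gross-error outliers with the Laida (3-sigma) criterion and
    replace each one with the median of the series."""
    mean = statistics.fmean(values)
    s_x = statistics.stdev(values)   # sample std dev, n-1 denominator assumed
    median = statistics.median(values)
    return [x if abs(x - mean) <= 3 * s_x else median for x in values]
```

Note that with the sample standard deviation the largest possible residual is only (n−1)/√n standard deviations, so the 3σ test can only fire once roughly a dozen or more collections are available; the criterion is meant for reasonably long series.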

In an optional embodiment, feature values including the mean, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data, where:

The effective value is calculated using the following formula:

$$X_{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^{2}}$$

The peak value is calculated using the following formula: $X_p = \max(x_i)$

The square root amplitude is calculated using the following formula:

$$X_r = \left(\frac{1}{N}\sum_{i=1}^{N}\sqrt{|x_i|}\right)^{2}$$

The waveform index is calculated using the following formula:

$$X_{ws} = \frac{X_{rms}}{\dfrac{1}{N}\sum_{i=1}^{N}|x_i|}$$

The impulse index is calculated using the following formula:

$$X_{if} = \frac{X_p}{\dfrac{1}{N}\sum_{i=1}^{N}|x_i|}$$

The kurtosis index is calculated using the following formula:

$$X_{kv} = \frac{\dfrac{1}{N}\sum_{i=1}^{N}x_i^{4}}{X_{rms}^{4}}$$

where $x_i$ is the log data from a single Agent collection; $N$ is the number of log data collections; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the effective value of the collected log data; $X_p$ is the peak value of the collected log data; $X_r$ is the square root amplitude of the collected log data; $X_{ws}$ is the waveform index of the collected log data; $X_{if}$ is the impulse index of the collected log data; and $X_{kv}$ is the kurtosis index of the collected log data.
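Read literally, the formulas above map to a short routine. The sketch below assumes one cleaned series per metric, provided as a plain list of floats; it is an illustrative reading of the formulas, not code from the patent.

```python
import math

def extract_features(x: list[float]) -> dict[str, float]:
    """Compute the seven feature values defined by the formulas above."""
    n = len(x)
    mean = sum(x) / n                                    # arithmetic mean
    rms = math.sqrt(sum(v * v for v in x) / n)           # effective value X_rms
    peak = max(x)                                        # peak value X_p
    sra = (sum(math.sqrt(abs(v)) for v in x) / n) ** 2   # square root amplitude X_r
    abs_mean = sum(abs(v) for v in x) / n                # mean of absolute values
    return {
        "mean": mean,
        "rms": rms,
        "peak": peak,
        "square_root_amplitude": sra,
        "waveform_index": rms / abs_mean,                       # X_ws
        "impulse_index": peak / abs_mean,                       # X_if
        "kurtosis_index": (sum(v**4 for v in x) / n) / rms**4,  # X_kv
    }
```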

The Pearson correlation coefficient is used to screen out effective features. Specifically, the Pearson correlation coefficient is computed between each of the above feature values and the raw data, and the calculated correlation coefficient is compared with the correlation threshold: a value above the threshold indicates valid data, while a value below the threshold indicates invalid data that must be removed, so that only valid data is retained. For example, with a correlation threshold of 0.7, if the correlation coefficient between the square root amplitude and the raw data is 0.2, the square root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is deemed valid data. The formula for the Pearson correlation coefficient is as follows:

$$r = \frac{\sum_{i=1}^{N}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^{2}}}$$

where $x_i$ is a data value from a single Agent collection; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $N$ is the number of log data collections.
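Continuing the sketch, the screening step correlates each feature's per-collection series against the raw series using the formula just given. The 0.7 threshold is taken from the worked example above, and comparing the signed coefficient (rather than its absolute value) against the threshold follows the text as written.

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series,
    per the formula above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_features(raw: list[float],
                    feature_series: dict[str, list[float]],
                    threshold: float = 0.7) -> dict[str, float]:
    """Keep only the features whose correlation with the raw data is
    above the threshold; the rest are discarded as invalid."""
    kept = {}
    for name, series in feature_series.items():
        r = pearson(raw, series)
        if r > threshold:  # above the correlation threshold: valid
            kept[name] = r
    return kept
```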

In an optional embodiment, Flume includes multiple first-level Agents and one second-level Agent. Each first-level Agent collects the log data of one server; the log data collected by the first-level Agents is aggregated at the second-level Agent, which transmits it to HDFS (a distributed file system).

Referring to FIG. 2, a schematic diagram of the hardware architecture of an embodiment of the electronic device of this application is shown. In this embodiment, the electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers). As shown in FIG. 2, the electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can be communicatively connected to each other through a system bus. The memory 21 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and so on. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, for example, a hard disk or internal memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 2. Of course, the memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the cluster log feature extraction program code. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example, performing control and processing related to the data interaction or communication of the electronic device 2. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example, to run the cluster log feature extraction program.

The network interface 23 may include a wireless network interface or a wired network interface, and is usually used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to a push platform through a network, establishing a data transmission channel and a communication connection between the electronic device 2 and the push platform. The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.

Optionally, the electronic device 2 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display, or the like. The display is used to show the information processed in the electronic device 2 and to display a visualized user interface.

It should be pointed out that FIG. 2 only shows the electronic device 2 with components 21-23, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.

The memory 21, which contains a readable storage medium, may include an operating system, a cluster log feature extraction program 50, and the like. The processor 22 implements the following steps when executing the cluster log feature extraction program 50 in the memory 21:

Step S10: The logs of the server cluster are collected by the Flume (distributed mass log collection, aggregation, and transmission system) client and sent to the HBase database server. Flume takes the Agent component as its smallest independent operating unit; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple Agents: one Agent is set up for each server, periodically collecting the log data on that server and sending it to the backend through the API interface.

Step S30: Hadoop (a distributed system infrastructure) is used to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy, memory usage, CPU occupancy, and business interface call volume.

Step S50: extract from the raw data the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index.

Step S70: use the Pearson correlation coefficient to screen out the effective features. The Pearson correlation coefficient between each extracted feature value and the raw data is calculated and compared with a correlation threshold: data above the threshold is considered valid, and data below the threshold is considered invalid and is removed.

It should be noted that the specific implementation of the electronic device of the present application is substantially the same as the specific implementation of the cluster log feature extraction method described above, and will not be repeated here.

In this embodiment, the cluster log feature extraction program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present application. For example, FIG. 3 shows a schematic diagram of the program modules of the cluster log feature extraction program. In this embodiment, the cluster log feature extraction program 50 may be divided into a log collection module 501, a data cleaning module 502, a feature extraction module 503, and an effective feature screening module 504. The program module referred to in this application is a series of computer program instruction segments that can complete specific functions, and is more suitable than a whole program for describing the execution process of the cluster log feature extraction program in the electronic device 2. The cluster log feature extraction method is realized through the specific functions of these program modules.

The present application also provides a cluster log feature extraction apparatus, including a log collection module 501, a data cleaning module 502, a feature extraction module 503, and an effective feature screening module 504.

The log collection module 501 is configured to collect the logs of the server cluster through the flume client and send them to the HBase database server. Flume takes the Agent component as its smallest independently running unit; one Agent component is a complete data collection tool. Flume collects log data from the servers through multiple Agents, one Agent being set up for each server. Each Agent periodically collects the log data on its server and sends it to the backend through an API interface.

The data cleaning module 502 is configured to use Hadoop to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy rate, memory usage rate, CPU occupancy rate, and service interface call volume.

The feature extraction module 503 is configured to extract from the raw data the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index.

The effective feature screening module 504 uses the Pearson correlation coefficient to screen out the effective features: the Pearson correlation coefficient between each extracted feature value and the raw data is calculated and compared with a correlation threshold; data above the threshold is considered valid, and data below the threshold is considered invalid and is removed.

In an optional embodiment, as shown in FIG. 6, the data cleaning module 502 includes a Laida criterion determination unit 5021, which uses the Laida criterion (the 3σ rule) to remove data with gross errors through the following steps:

For the log data $x_1, x_2, \ldots, x_n$, calculate the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}, \quad i = 1, 2, \ldots, n$$

where $x_i$ is the data value collected by a single Agent.

Calculate the standard deviation $S_x$:

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$$

If the residual error $v_b$ ($1 \le b \le n$) of a data point $x_b$ satisfies

$$|v_b| > 3S_x$$

then $x_b$ is regarded as a singular value containing a gross error, and that singular value is removed.

Further, the data cleaning module 502 also includes a singular value replacement unit 5022. The Laida criterion can effectively identify the singular values in the production data, but removing them leaves null values. The singular value replacement unit 5022 therefore replaces each identified singular value in the log data with the median, completing the preprocessing of the production data. Here the median is obtained by arranging the values $x_1, x_2, \ldots, x_n$ in order of magnitude to form a sequence; the value in the middle position of that sequence is called the median.
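Taken together, the two cleaning units amount to a few lines of code. The following is a minimal Python sketch, assuming the samples of one metric arrive as a list of numbers: values whose residual exceeds three standard deviations are treated as singular values and replaced with the median so that no null values remain.

    import statistics

    def laida_clean(values):
        """Apply the Laida (3-sigma) criterion and replace outliers with the median."""
        n = len(values)
        if n < 2:
            return list(values)
        mean = sum(values) / n
        residuals = [x - mean for x in values]
        # Standard deviation with the (n - 1) denominator, matching S_x above.
        s_x = (sum(v * v for v in residuals) / (n - 1)) ** 0.5
        median = statistics.median(values)
        return [median if abs(x - mean) > 3 * s_x else x for x in values]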

In an optional embodiment, as shown in FIG. 5, the feature extraction module 503 includes a mean value extraction unit 5031, an effective value extraction unit 5032, a peak value extraction unit 5033, a square root amplitude extraction unit 5034, a waveform index extraction unit 5035, an impulse index extraction unit 5036, and a kurtosis index extraction unit 5037, which respectively extract from the raw data the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index, where:

The effective value is calculated with the formula

$$X_{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$

The peak value is calculated with the formula $X_p = \max(x_i)$.

The square root amplitude is calculated with the formula

$$X_r = \left(\frac{1}{N}\sum_{i=1}^{N} \sqrt{|x_i|}\right)^2$$

The waveform index is calculated with the formula

$$X_{ws} = \frac{X_{rms}}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

The impulse index is calculated with the formula

$$X_{if} = \frac{X_p}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

The kurtosis index is calculated with the formula

$$X_{kv} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i^4}{X_{rms}^4}$$

where $x_i$ is the log data collected by a single Agent; $N$ is the number of log data collections; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the effective value, $X_p$ the peak value, $X_r$ the square root amplitude, $X_{ws}$ the waveform index, $X_{if}$ the impulse index, and $X_{kv}$ the kurtosis index of the collected log data.
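These seven feature values translate directly into code. The sketch below follows the formulas as reconstructed above, assuming the cleaned samples of one metric are held in a non-empty list x:

    def extract_features(x):
        """Compute the seven feature values for one metric's samples."""
        n = len(x)
        mean = sum(x) / n
        rms = (sum(v * v for v in x) / n) ** 0.5                # effective value
        peak = max(x)                                           # peak value
        sra = (sum(abs(v) ** 0.5 for v in x) / n) ** 2          # square root amplitude
        mean_abs = sum(abs(v) for v in x) / n
        waveform = rms / mean_abs                               # waveform index
        impulse = peak / mean_abs                               # impulse index
        kurtosis = (sum(v ** 4 for v in x) / n) / rms ** 4      # kurtosis index
        return {
            "mean": mean,
            "rms": rms,
            "peak": peak,
            "square_root_amplitude": sra,
            "waveform_index": waveform,
            "impulse_index": impulse,
            "kurtosis_index": kurtosis,
        }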

The Pearson correlation coefficient is used to screen out the effective features. Specifically, the Pearson correlation coefficient between each of the above feature values and the raw data is calculated and compared with a correlation threshold: a feature above the threshold is considered valid data, while a feature below the threshold is considered invalid data and must be removed, so that only the valid data is retained. For example, with a correlation threshold of 0.7, if the correlation coefficient between the square root amplitude and the raw data is 0.2, the square root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is deemed valid data. The formula of the Pearson correlation coefficient is as follows:

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

where $x_i$ is the data value collected by a single Agent; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $N$ is the number of data collections.
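A sketch of the screening step under the same assumptions: pearson implements the formula above, and each feature series is kept only if its correlation with the raw series reaches the threshold. The 0.7 default mirrors the example in the text; taking the absolute value of negative correlations is an assumption of this sketch.

    def pearson(x, y):
        """Pearson correlation coefficient between two equal-length series."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        if sx == 0 or sy == 0:
            return 0.0  # a constant series carries no correlation information
        return cov / (sx * sy)

    def screen_features(raw, feature_series, threshold=0.7):
        """Keep feature series whose |r| against the raw series passes the threshold."""
        return {
            name: series
            for name, series in feature_series.items()
            if abs(pearson(raw, series)) >= threshold
        }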

In an optional embodiment, as shown in FIG. 4, the log collection module 501 also includes an Agent setting unit 5011, configured to set up Flume with multiple first-level Agents and one second-level Agent. Each first-level Agent collects the log data of one server; the log data collected by the multiple first-level Agents is gathered at the second-level Agent and transmitted by the second-level Agent to HDFS.
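To make the two-tier topology concrete, here is a hedged Python sketch of the second-level role only: it takes batches handed over by first-level collectors and appends them to a file in HDFS. It uses the third-party hdfs (WebHDFS) client; the NameNode address, user, and target path are invented for the example, and the original design realizes this role with a Flume Agent rather than custom code.

    from hdfs import InsecureClient  # third-party WebHDFS client, assumed available

    # Hypothetical second-level aggregator configuration.
    client = InsecureClient("http://namenode.example.com:9870", user="flume")

    def forward_batch(batch_lines, hdfs_path="/cluster-logs/current.log"):
        """Append one first-level batch to the aggregated log file in HDFS."""
        data = "".join(batch_lines)
        # append=True assumes the target file already exists in HDFS.
        client.write(hdfs_path, data=data, encoding="utf-8", append=True)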

It should be noted that the specific implementation of the cluster log feature extraction apparatus of the present application is substantially the same as the specific implementations of the cluster log feature extraction method and the electronic device described above, and will not be repeated here.

In addition, an embodiment of the present application also proposes a computer non-volatile readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer non-volatile readable storage medium includes a cluster log feature extraction program and the like, and the cluster log feature extraction program 50 implements the following operations when executed by the processor 22:

Step S10: collect the logs of the server cluster through the flume client and send them to the HBase database server. Flume takes the Agent component as its smallest independently running unit; one Agent component is a complete data collection tool. Flume collects log data from the servers through multiple Agents, one Agent being set up for each server. Each Agent periodically collects the log data on its server and sends it to the backend through an API interface.

Step S30: use Hadoop to clean the log data and filter out the raw data, where the raw data includes at least the server disk occupancy rate, memory usage rate, CPU occupancy rate, and service interface call volume.

Step S50: extract from the raw data the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index.

Step S70: use the Pearson correlation coefficient to screen out the effective features. The Pearson correlation coefficient between each extracted feature value and the raw data is calculated and compared with a correlation threshold: data above the threshold is considered valid, and data below the threshold is considered invalid and is removed.

The specific implementation of the computer non-volatile readable storage medium of the present application is substantially the same as the specific implementations of the cluster log feature extraction method and the electronic device 2 described above, and will not be repeated here.

The foregoing are only preferred embodiments of the present application and are not intended to limit it; for those skilled in the art, the application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A cluster log feature extraction method, applied to an electronic device, comprising the following steps:

collecting the logs of a server cluster through a flume client and sending them to an HBase database, wherein the flume client collects the logs of each server in the server cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;

using Hadoop to clean the log data and filter out raw data, wherein the raw data includes at least the server disk occupancy rate, memory usage rate, CPU occupancy rate, and service interface call volume;

extracting from the raw data feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index;

using the Pearson correlation coefficient to screen out effective features: calculating the Pearson correlation coefficient between each extracted feature value and the raw data and comparing it with a correlation threshold, wherein data above the threshold is valid data, and data below the threshold is invalid data and is removed.

2. The cluster log feature extraction method according to claim 1, wherein in the data cleaning process the Laida criterion is used to remove data with gross errors, comprising the following steps:

for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}$$

where $x_i$ is the log data collected by a single Agent;

calculating the standard deviation

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$$

and, if the residual error $v_b$ ($1 \le b \le n$) of $x_b$ in the log data satisfies

$$|v_b| > 3S_x$$

determining that $x_b$ is a singular value containing a gross error and removing the singular value.
3. The cluster log feature extraction method according to claim 2, wherein the singular values of the log data are replaced with the median, the median being obtained by arranging the log data $x_1, x_2, \ldots, x_n$ in order of magnitude, the value in the middle position being called the median.

4. The cluster log feature extraction method according to claim 2, wherein the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index are extracted from the raw data, wherein:

the effective value is calculated with the formula

$$X_{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$

the peak value is calculated with the formula $X_p = \max(x_i)$;

the square root amplitude is calculated with the formula

$$X_r = \left(\frac{1}{N}\sum_{i=1}^{N} \sqrt{|x_i|}\right)^2$$

the waveform index is calculated with the formula

$$X_{ws} = \frac{X_{rms}}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

the impulse index is calculated with the formula

$$X_{if} = \frac{X_p}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

the kurtosis index is calculated with the formula

$$X_{kv} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i^4}{X_{rms}^4}$$

where $x_i$ is the log data collected by a single Agent; $N$ is the number of data collections; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the effective value, $X_p$ the peak value, $X_r$ the square root amplitude, $X_{ws}$ the waveform index, $X_{if}$ the impulse index, and $X_{kv}$ the kurtosis index of the collected log data.
5. The cluster log feature extraction method according to claim 2, wherein the formula of the Pearson correlation coefficient is as follows:

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

where $x_i$ is the log data collected by a single Agent; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $N$ is the number of log data collections.
6. The cluster log feature extraction method according to claim 1, wherein Flume includes multiple first-level Agents and one second-level Agent, each first-level Agent collects the log data of one server, and the log data collected by the multiple first-level Agents is gathered at the second-level Agent and transmitted by the second-level Agent to HDFS.

7. A cluster log feature extraction apparatus, comprising: a log collection module, a data cleaning module, a feature extraction module, and an effective feature screening module, wherein:

the log collection module is configured to collect the logs of a server cluster through a flume client and send them to an HBase database, wherein the flume client collects the logs of each server in the server cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;

the data cleaning module is configured to use Hadoop to clean the log data and filter out raw data, wherein the raw data includes at least the server disk occupancy rate, memory usage rate, CPU occupancy rate, and service interface call volume;

the feature extraction module is configured to extract from the raw data feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index;

the effective feature screening module uses the Pearson correlation coefficient to screen out effective features: calculating the Pearson correlation coefficient between each extracted feature value and the raw data and comparing it with a correlation threshold, wherein data above the threshold is valid data, and data below the threshold is invalid data and is removed.

8. The cluster log feature extraction apparatus according to claim 7, wherein the data cleaning module includes a Laida criterion determination unit, which uses the Laida criterion to remove data with gross errors, comprising the following steps:

for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}$$

where $x_i$ is the data value collected by a single Agent;

calculating the standard deviation

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$$

and, if the residual error $v_b$ ($1 \le b \le n$) of the data $x_b$ satisfies

$$|v_b| > 3S_x$$

regarding $x_b$ as a singular value containing a gross error and removing the singular value.
9. The cluster log feature extraction apparatus according to claim 8, wherein the data cleaning module further includes a singular value replacement unit, which replaces the singular values of the log data with the median, the median being obtained by arranging the log data $x_1, x_2, \ldots, x_n$ in order of magnitude, the value in the middle position being called the median.

10. The cluster log feature extraction apparatus according to claim 8, wherein the feature extraction module includes a mean value extraction unit, an effective value extraction unit, a peak value extraction unit, a square root amplitude extraction unit, a waveform index extraction unit, an impulse index extraction unit, and a kurtosis index extraction unit, which respectively extract from the raw data the feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index, wherein:

the effective value is calculated with the formula

$$X_{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$

the peak value is calculated with the formula $X_p = \max(x_i)$;

the square root amplitude is calculated with the formula

$$X_r = \left(\frac{1}{N}\sum_{i=1}^{N} \sqrt{|x_i|}\right)^2$$

the waveform index is calculated with the formula

$$X_{ws} = \frac{X_{rms}}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

the impulse index is calculated with the formula

$$X_{if} = \frac{X_p}{\frac{1}{N}\sum_{i=1}^{N} |x_i|}$$

the kurtosis index is calculated with the formula

$$X_{kv} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i^4}{X_{rms}^4}$$

where $x_i$ is the log data collected by a single Agent; $N$ is the number of log data collections; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the effective value, $X_p$ the peak value, $X_r$ the square root amplitude, $X_{ws}$ the waveform index, $X_{if}$ the impulse index, and $X_{kv}$ the kurtosis index of the collected log data.
11. The cluster log feature extraction apparatus according to claim 8, wherein the formula of the Pearson correlation coefficient is as follows:

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

where $x_i$ is the log data collected by a single Agent; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $N$ is the number of log data collections.
12. The cluster log feature extraction apparatus according to claim 7, wherein the log collection module further includes an Agent setting unit configured to set up Flume with multiple first-level Agents and one second-level Agent, each first-level Agent collecting the log data of one server, and the log data collected by the multiple first-level Agents being gathered at the second-level Agent and transmitted by the second-level Agent to HDFS.

13. An electronic device, comprising a memory and a processor, wherein the memory stores a cluster log feature extraction program which, when executed by the processor, implements the following steps:

collecting the logs of a server cluster through a flume client and sending them to an HBase database, wherein the flume client collects the logs of each server in the server cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;

using Hadoop to clean the log data and filter out raw data, wherein the raw data includes at least the server disk occupancy rate, memory usage rate, CPU occupancy rate, and service interface call volume;

extracting from the raw data feature values including the mean value, effective value, peak value, square root amplitude, waveform index, impulse index, and kurtosis index;

using the Pearson correlation coefficient to screen out effective features: calculating the Pearson correlation coefficient between each extracted feature value and the raw data and comparing it with a correlation threshold, wherein data above the threshold is valid data, and data below the threshold is invalid data and is removed.

14. The electronic device according to claim 13, wherein in the data cleaning the Laida criterion is used to remove data with gross errors, comprising the following steps:

for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}$$

where $x_i$ is the data value collected by a single Agent;

calculating the standard deviation

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$$

and, if the residual error $v_b$ ($1 \le b \le n$) of $x_b$ in the log data satisfies

$$|v_b| > 3S_x$$

regarding $x_b$ as a singular value containing a gross error and removing the singular value.
15. The electronic device according to claim 14, wherein the singular values in the log data are replaced with the median, the median being obtained by arranging the log data $x_1, x_2, \ldots, x_n$ in order of magnitude, the value in the middle position being called the median.

16. The electronic device according to claim 13, wherein Flume includes multiple first-level Agents and one second-level Agent, each first-level Agent collects the log data of one server, and the log data collected by the multiple first-level Agents is gathered at the second-level Agent and transmitted by the second-level Agent to HDFS.

17. A computer non-volatile readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, implement the cluster log feature extraction method of claim 1.

18. The computer non-volatile readable storage medium according to claim 17, wherein in the data cleaning process the Laida criterion is used to remove data with gross errors, comprising the following steps:

for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the residual errors

$$v_i = x_i - \bar{x}$$

where $x_i$ is the log data collected by a single Agent;

calculating the standard deviation

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$$

and, if the residual error $v_b$ ($1 \le b \le n$) of $x_b$ in the log data satisfies

$$|v_b| > 3S_x$$

determining that $x_b$ is a singular value containing a gross error and removing the singular value.
19. The computer non-volatile readable storage medium according to claim 18, wherein the singular values of the log data are replaced with the median, the median being obtained by arranging the log data $x_1, x_2, \ldots, x_n$ in order of magnitude, the value in the middle position being called the median.

20. The computer non-volatile readable storage medium according to claim 17, wherein Flume includes multiple first-level Agents and one second-level Agent, each first-level Agent collects the log data of one server, and the log data collected by the multiple first-level Agents is gathered at the second-level Agent and transmitted by the second-level Agent to HDFS.
PCT/CN2019/118288 2019-02-19 2019-11-14 Cluster log feature extraction method, and apparatus, device and storage medium Ceased WO2020168756A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910123928.1 2019-02-19
CN201910123928.1A CN109992569A (en) 2019-02-19 2019-02-19 Cluster log feature extracting method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2020168756A1 true WO2020168756A1 (en) 2020-08-27

Family

ID=67129790

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118288 Ceased WO2020168756A1 (en) 2019-02-19 2019-11-14 Cluster log feature extraction method, and apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN109992569A (en)
WO (1) WO2020168756A1 (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium
CN110737648B (en) * 2019-09-17 2024-05-07 平安科技(深圳)有限公司 Performance feature dimension reduction method and device, electronic equipment and storage medium
CN111290916B (en) * 2020-02-18 2022-11-25 深圳前海微众银行股份有限公司 Big data monitoring method, device, equipment and computer-readable storage medium
CN111984499B (en) * 2020-08-04 2024-05-28 中国建设银行股份有限公司 Fault detection method and device for big data cluster
CN112069036B (en) * 2020-11-10 2021-09-03 南京信易达计算技术有限公司 Management and monitoring system based on cluster computing
CN113945684A (en) * 2021-10-14 2022-01-18 中国计量科学研究院 Big data-based micro air station self-calibration method
CN114911843A (en) * 2022-05-11 2022-08-16 中国平安人寿保险股份有限公司 Service index reporting method, device and computer readable storage medium
CN117056182B (en) * 2023-07-13 2024-05-03 北京新数科技有限公司 SQL SERVER database performance evaluation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356550B1 (en) * 2001-06-25 2008-04-08 Taiwan Semiconductor Manufacturing Company Method for real time data replication
CN104036025A (en) * 2014-06-27 2014-09-10 蓝盾信息安全技术有限公司 Distribution-base mass log collection system
CN106570151A (en) * 2016-10-28 2017-04-19 上海斐讯数据通信技术有限公司 Data collection processing method and system for mass files
CN106845799A (en) * 2016-12-29 2017-06-13 中国电力科学研究院 A kind of appraisal procedure of battery energy storage system typical condition
CN107092592A (en) * 2017-04-10 2017-08-25 浙江鸿程计算机系统有限公司 A kind of personalized method for recognizing semantics in the place based on type multiple-situation data and cost-sensitive integrated model
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9904893B2 (en) * 2013-04-02 2018-02-27 Patternex, Inc. Method and system for training a big data machine to defend
CN105353644B (en) * 2015-09-29 2018-06-15 中国人民解放军63892部队 Radar Target Track flavor and method based on real-equipment data information excavating
US10394868B2 (en) * 2015-10-23 2019-08-27 International Business Machines Corporation Generating important values from a variety of server log files
CN106769032B (en) * 2016-11-28 2018-11-02 南京工业大学 Method for predicting service life of slewing bearing
CN108399199A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of collection of the application software running log based on Spark and service processing system and method
CN109032910A (en) * 2018-07-24 2018-12-18 北京百度网讯科技有限公司 Log collection method, device and storage medium
CN109033404B (en) * 2018-08-03 2022-03-11 北京百度网讯科技有限公司 Log data processing method, device and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356550B1 (en) * 2001-06-25 2008-04-08 Taiwan Semiconductor Manufacturing Company Method for real time data replication
CN104036025A (en) * 2014-06-27 2014-09-10 蓝盾信息安全技术有限公司 Distribution-base mass log collection system
CN106570151A (en) * 2016-10-28 2017-04-19 上海斐讯数据通信技术有限公司 Data collection processing method and system for mass files
CN106845799A (en) * 2016-12-29 2017-06-13 中国电力科学研究院 A kind of appraisal procedure of battery energy storage system typical condition
CN107092592A (en) * 2017-04-10 2017-08-25 浙江鸿程计算机系统有限公司 A kind of personalized method for recognizing semantics in the place based on type multiple-situation data and cost-sensitive integrated model
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867142A (en) * 2021-09-09 2021-12-31 北京小米移动软件有限公司 Sensor control method, device, electronic device and storage medium
CN114201304A (en) * 2021-12-15 2022-03-18 平安科技(深圳)有限公司 Application running method, device, device and storage medium
CN114201304B (en) * 2021-12-15 2024-11-12 平安科技(深圳)有限公司 Application program operation method, device, equipment and storage medium
CN114689916A (en) * 2022-03-31 2022-07-01 国网河北省电力有限公司营销服务中心 Intelligent electric energy meter metering error analysis system
CN114840566A (en) * 2022-04-06 2022-08-02 西人马(深圳)科技有限责任公司 A method and system for wind vibration data management based on two-level architecture
CN114840566B (en) * 2022-04-06 2025-05-30 合肥新理科技有限公司 A wind vibration data management method and system based on two-level architecture
CN117171594A (en) * 2023-09-04 2023-12-05 中国建设银行股份有限公司 Indicator configuration method, device, electronic equipment and computer-readable medium
CN119249121A (en) * 2024-08-26 2025-01-03 中国建设银行股份有限公司 Feature screening method, device, equipment and medium

Also Published As

Publication number Publication date
CN109992569A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
WO2020168756A1 (en) Cluster log feature extraction method, and apparatus, device and storage medium
US12155693B1 (en) Rapid predictive analysis of very large data sets using the distributed computational graph
US9590880B2 (en) Dynamic collection analysis and reporting of telemetry data
JP2022160405A (en) ALARM LOG COMPRESSION METHOD, APPARATUS AND SYSTEM, AND STORAGE MEDIUM
WO2021051529A1 (en) Method, apparatus and device for estimating cloud host resources, and storage medium
US20210092160A1 (en) Data set creation with crowd-based reinforcement
US10860962B2 (en) System for fully integrated capture, and analysis of business information resulting in predictive decision making and simulation
KR102301946B1 (en) Visual tools for failure analysis in distributed systems
CN104598495A (en) Hierarchical storage method and system based on distributed file system
US11636549B2 (en) Cybersecurity profile generated using a simulation engine
US11372904B2 (en) Automatic feature extraction from unstructured log data utilizing term frequency scores
CA3167981C (en) Offloading statistics collection
CN104917836A (en) Method and device for monitoring and analyzing availability of computing equipment based on cluster
EP3440569A1 (en) System for fully integrated capture, and analysis of business information resulting in predictive decision making and simulation
CN112015995B (en) Method, device, equipment and storage medium for data analysis
CN110519263A (en) Anti- brush amount method, apparatus, equipment and computer readable storage medium
US20150281037A1 (en) Monitoring omission specifying program, monitoring omission specifying method, and monitoring omission specifying device
Lee et al. Detecting anomaly teletraffic using stochastic self-similarity based on Hadoop
CN112597490A (en) Security threat arrangement response method and device, electronic equipment and readable storage medium
CN119621386A (en) Fault diagnosis method, device, electronic equipment and storage medium
CN118820026A (en) Cloud service cluster status monitoring method, device, equipment and storage medium
US20120233224A1 (en) Data processing
CN116132111B (en) Attack identification method and device based on mouse track data in network traffic
JP6508202B2 (en) INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND PROGRAM
WO2025027655A1 (en) Method and system for data management in a network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19915659

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19915659

Country of ref document: EP

Kind code of ref document: A1