CN106130809B

CN106130809B - Method and system for locating network faults in an IaaS cloud platform based on log analysis

Info

Publication number: CN106130809B
Application number: CN201610808973.7A
Authority: CN
Inventors: 张竞慧; 罗军舟; 董坚
Original assignee: Southeast University; Focus Technology Co Ltd
Current assignee: Southeast University; Focus Technology Co Ltd
Priority date: 2016-09-07
Filing date: 2016-09-07
Publication date: 2019-06-25
Anticipated expiration: 2036-09-07
Also published as: CN106130809A

Abstract

The invention discloses a method and system for locating network faults in an IaaS cloud platform based on log analysis. The system comprises a fault injection module, a log collection and analysis module, a knowledge generation module, and a fault detection and location module. First, various typical network faults are injected to produce the corresponding fault logs. Then, for each type of fault, log information related to network faults is collected at the physical resource, operating system, virtual machine, and OpenStack levels, and the Apriori algorithm is used to mine fault features from the collected logs. On this basis, using the maximal frequent itemsets together with parameters such as support and confidence, the Bayesian formula is applied to generate association rules and knowledge corresponding to specific network faults. Finally, when a network fault occurs in the system again, the collected fault logs are compared against the association rules in the knowledge base to locate the level at which the fault occurred.

Description

Method and system for locating network faults in an IaaS cloud platform based on log analysis

Technical Field

The invention relates to the fields of cloud computing, computer networks, and data mining, in particular to network fault detection, and specifically to a log-analysis-based method for locating network faults in a cloud platform.

Background

In today's cloud era, with the rapid growth of the Internet and big data applications, new network applications built on cloud computing keep emerging, and cloud computing has gradually become the mainstream computing paradigm for new information systems. Cloud computing is the product of merging a series of network and computing technologies, including parallel computing, distributed computing, utility computing, and virtualization. Cloud computing platforms are usually divided by service level into IaaS, PaaS, and SaaS. IaaS (Infrastructure as a Service) provides virtualization services, that is, virtual machines and the corresponding virtual computing, virtual storage, and virtual network resources. Users typically care about the virtual machine type and its configuration (CPU, memory, disk, network, and so on), and deploy the middleware and applications above the virtual machine themselves. PaaS (Platform as a Service) provides the runtime environment and middleware services for application software; users typically care only about developing the application and deploying the related data and applications in the PaaS. SaaS (Software as a Service) provides application software services.

As the supporting infrastructure of cloud computing, an IaaS cloud platform provides elastic, scalable infrastructure services and can offer upper-layer applications large-scale, on-demand computing, storage, and network services. Among these, the network service is the core service of the IaaS cloud platform and is key to the service quality of cloud applications. As shown in Figure 1, OpenStack, currently the most popular cloud management platform, is deployed on top of the underlying physical computing, storage, and network resources of the cloud platform; it manages those resources in a unified way and provides unified cloud infrastructure services at the IaaS layer. In particular, the Nova and Neutron service components of OpenStack have a crucial impact on the virtual machine service and the network service of the IaaS cloud platform. Nova, the core service of OpenStack, manages the entire life cycle of virtual machines in the IaaS cloud platform; Neutron provides the network service, creating virtual networks for virtual machines and interconnecting them with physical networks. OpenStack has become the de facto deployment standard for IaaS cloud platforms in both industry and academia.

However, as data center IaaS cloud platforms keep growing, their overall network topology becomes more complex, the network services of the platform nodes become more fragile, and network faults occur more frequently. With the OpenStack cloud management platform deployed, the root cause of a network fault may lie at any level of the IaaS cloud platform: physical resources (for example, a physical machine going down), the operating system (for example, an operating system failure), virtual machines (for example, a virtual machine failure or a configuration file error), or OpenStack (for example, a Nova or Neutron service failure). The root cause is hard to identify directly from the surface content of the fault logs, and every time a network fault occurs the logs of every level and every component have to be checked, which consumes a great deal of manpower and resources and still may not give good results. It is therefore important and valuable to locate the cause of a network fault on an IaaS cloud platform quickly and accurately so that the fault can be repaired quickly.

Summary of the Invention

Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a method and system for locating network faults in an IaaS cloud platform based on log analysis. The invention can diagnose network faults and locate where they occur.

Technical solution: to achieve the above purpose, the invention adopts the following technical solution.

A method for locating network faults in an IaaS cloud platform based on log analysis, comprising the following steps:

Step 1: inject various typical network faults to produce the corresponding fault log information.

Step 2: for each type of typical network fault injected in step 1, collect the log information related to network faults from the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service component layer; preprocess the collected fault log information into Boolean fault log data; and use the Apriori algorithm to mine fault features.

Step 3: according to the support and confidence parameters, use the Bayesian formula to turn the fault features into association rules and knowledge corresponding to specific network faults, and add the resulting knowledge to the fault knowledge base.

Step 4: network fault location. When a network fault occurs, compare and analyze the collected fault logs against the association rules in the fault knowledge base to locate the level of the cloud platform at which the fault occurred.

The fault log information produced by the typical network faults in step 1 includes log information related to network faults at the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service component layer. The OpenStack network service component layer includes the fault logs of Nova, Neutron, Open vSwitch, and Libvirt.

In step 2, log information is collected mainly by centrally aggregating the fault logs produced in step 1 on the node that performs data mining on the logs. Preprocessing of the fault log information includes data cleaning, data reduction, data selection, and data integration, and produces Boolean transaction data that the Apriori algorithm can mine. The main steps are as follows:

Step 201, data cleaning: remove logs that are irrelevant to data mining, and fill missing values in the logs with a global constant or with the sample mean.

Step 202, data reduction: use regular expressions for pattern matching. A regular expression describing the log format separates the attributes of each log entry; the timestamp and the log content keywords are then generalized, and the key data related to network faults are extracted.

Step 203, data selection: select the log attributes relevant to data mining.

Step 204, data integration: using a time window, merge logs separated by very small time intervals, unify the fault log information, and convert the integrated relational logs into Boolean transaction data through format conversion.

Step 205, data mining: use the Apriori algorithm to mine the logs by fault type and generate the maximal frequent itemset corresponding to each fault. The input of the Apriori algorithm is Boolean transaction data, and the output is the maximal frequent itemset.

In step 3, the fault features are turned into association rules and knowledge for specific network faults as follows: from the maximal frequent itemsets mined in step 2, the Bayesian formula is applied according to the support and confidence parameters to generate the association rules and knowledge corresponding to specific network faults, and this knowledge is added to the fault knowledge base. The main steps are as follows:

Step 301, parameter setting: set the support and confidence parameters.

Step 302, knowledge generation: using the Bayesian formula and the configured support and confidence parameters, generate the corresponding network fault knowledge and add it to the fault knowledge base.

In step 4, the level of the cloud platform at which the network fault occurred is located as follows: based on the fault-specific knowledge stored in the knowledge base in step 3, when a network fault occurs again, the fault logs it produces are collected and processed by the method of step 2 into Boolean transaction data, which identifies the modules that produced fault logs; the fault is then located using the knowledge in the fault knowledge base, in descending order of confidence.

Fault location in step 4 includes the following steps:

Step 401, log collection: centrally aggregate the logs produced by the fault on the node that performs data mining on the logs, and process them (data cleaning, data reduction, data selection, and data integration) into Boolean transaction data, which identifies the modules that produced fault logs.

Step 402, fault location: using the knowledge in the fault knowledge base together with the information about which modules produced fault logs, locate the fault in descending order of confidence.
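Step 402 can be sketched as a lookup against stored rules. The snippet below is a minimal illustration, not the patented implementation: the rule structure (`modules`, `layer`, `confidence`), the function name `locate_fault`, and the rule values are assumptions chosen for the example. A rule matches when every module in its antecedent has produced fault logs, and matches are returned in descending order of confidence.

```python
def locate_fault(observed_modules, knowledge_base):
    """Return the matching rules, highest confidence first."""
    observed = set(observed_modules)
    # A rule applies when all of its antecedent modules logged errors.
    matches = [rule for rule in knowledge_base if rule["modules"] <= observed]
    return sorted(matches, key=lambda rule: rule["confidence"], reverse=True)


# Illustrative knowledge base (hypothetical values).
knowledge_base = [
    {"modules": {"openvswitch"}, "layer": "virtual machine", "confidence": 0.714},
    {"modules": {"os", "openvswitch"}, "layer": "virtual machine", "confidence": 0.833},
    {"modules": {"n-cpu", "libvirt"}, "layer": "virtual machine", "confidence": 0.833},
]

# Modules that produced fault logs during the current incident (hypothetical).
ranked = locate_fault({"os", "openvswitch", "n-sch"}, knowledge_base)
for rule in ranked:
    print(sorted(rule["modules"]), "->", rule["layer"], rule["confidence"])
```

Checking the most confident rule first means the operator inspects the most likely fault level before the less likely ones.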

A system for locating network faults in an IaaS cloud platform based on log analysis, comprising a fault injection module, a log collection and analysis module, a knowledge generation module, and a fault detection and location module, wherein:

The fault injection module injects various typical network faults to produce the corresponding fault log information. The fault log information includes physical resource layer fault logs, operating system layer fault logs, virtual machine layer fault logs, and OpenStack network service component fault logs.

The log collection and analysis module collects the first fault log information, produced by the fault injection module, and, on a control signal from the fault detection and location module, collects the second fault log information, produced when a network fault actually occurs. It preprocesses the first and second fault log information into the corresponding Boolean first and second fault log data, and uses the Apriori algorithm to mine fault features from the first fault log data.

The knowledge generation module uses the Bayesian formula, according to the support and confidence parameters, to turn the fault features obtained by the log collection and analysis module into association rules and knowledge corresponding to specific network faults, and adds the resulting knowledge to the fault knowledge base.

The fault detection and location module, when a network fault occurs, directs the log collection and analysis module to collect the fault log information for that fault, and compares and analyzes the resulting Boolean second fault log data against the association rules in the fault knowledge base to locate the level of the cloud platform at which the fault occurred.

Compared with the prior art, the invention has the following beneficial effects:

(1) The logs of the physical resources, virtual machines, and OpenStack network components in the IaaS cloud platform are collected in real time, and preliminary training mines the knowledge corresponding to specific network faults in the cloud platform.

(2) When an unknown network fault occurs on the IaaS cloud platform, fault-specific knowledge can be formed through fault injection, providing knowledge for fault location.

(3) When a network fault occurs on the IaaS cloud platform, comparing the fault-related logs with the known fault types makes it possible to diagnose the fault and locate where it occurred.

(4) The modular design keeps the coupling between the components of the fault location system low, so the system can adapt to new requirements and extensions.

Description of the Drawings

Figure 1 is a hierarchy diagram of the network components in an IaaS cloud platform.

Figure 2 is an interaction diagram of the IaaS cloud platform network fault location modules implemented by the invention.

Figure 3 is a flowchart of IaaS cloud platform network fault location as implemented by the invention.

Figure 4 is a schematic diagram of the process for locating network components in an IaaS cloud platform.

Detailed Description

The invention is further explained below with reference to the drawings and specific embodiments. It should be understood that these examples only illustrate the invention and do not limit its scope; after reading the invention, modifications of equivalent form made by those skilled in the art all fall within the scope defined by the claims appended to this application.

A method for locating network faults in an IaaS cloud platform based on log analysis, as shown in Figures 1, 3, and 4, comprises steps 1 to 4 described above, with the following details.

For step 1, the fault log information produced by the typical network faults includes log information related to network faults at the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service component layer. The OpenStack network service component layer includes the fault logs of Nova, Neutron, Open vSwitch, and Libvirt.

For step 2, the fault logs produced in step 1 are centrally aggregated on the node that performs data mining on the logs. Preprocessing of the fault log information includes data cleaning, data reduction, data selection, and data integration, and produces Boolean transaction data that the Apriori algorithm can mine. The main steps are steps 201 to 205 described above.

For step 3, from the maximal frequent itemsets mined in step 2, the Bayesian formula is applied according to the support and confidence parameters to generate the association rules and knowledge corresponding to specific network faults, and this knowledge is added to the fault knowledge base. The main steps are steps 301 and 302 described above.

For step 4, based on the fault-specific knowledge stored in the knowledge base in step 3, when a network fault occurs again, the fault logs it produces are collected and processed by the method of step 2 into Boolean transaction data, which identifies the modules that produced fault logs; the fault is then located using the knowledge in the fault knowledge base, in descending order of confidence. The main steps are steps 401 and 402 described above.

A system for locating network faults in an IaaS cloud platform based on log analysis is shown in Figures 2 and 4. In the architecture of the IaaS cloud platform, the bottom layer consists of physical nodes connected by Ethernet. The KVM virtualization software is installed on the physical nodes to virtualize them, providing virtualized management of the hardware and consolidating the scattered server computing resources into a centrally managed resource pool. Above the resource pool is the IaaS platform layer, which mainly runs the OpenStack software and provides virtual machine management such as start, stop, restart, and snapshot. The top layer is the user interface layer, through which users access the IaaS services. On top of this basic IaaS structure, a fault injection module, a log collection and analysis module, a knowledge generation module, and a fault detection and location module are added, so that a network fault can be located at the physical resource layer, the operating system layer, the virtual machine layer, or the OpenStack service components (for example, Nova, Neutron, Open vSwitch). Specifically:

In the fault injection module, the typical network fault types of an IaaS cloud platform are summarized from experience, and the various faults that may cause network problems are injected one by one. The typical network faults that can be injected include physical resource faults, operating system faults, virtual machine faults, and OpenStack faults. Injecting them produces the fault log information corresponding to each typical network fault, covering the logs of the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service components (for example, Nova, Neutron, Open vSwitch). Taking DevStack, a simplified version of OpenStack, as an example, the logs of Nova, Neutron, and most other components are stored together under the /opt/stack/logs directory.

The log collection and analysis module collects the first fault log information, produced by the fault injection module, and, on a control signal from the fault detection and location module, collects the second fault log information, produced when a network fault actually occurs. It preprocesses the first and second fault log information into the corresponding Boolean first and second fault log data, and uses the Apriori algorithm to mine fault features from the first fault log data.

In the log collection and analysis module, all network fault logs in the data center are collected in real time from the specific file locations at which they are stored in the file system, and the collected logs are transferred to the log analysis node. The network fault logs are copied to the analysis node using SCP, the Linux remote file copy tool. After data cleaning, data reduction, data selection, and data integration, the logs become Boolean transaction data that the Apriori algorithm can mine.

Using regular expressions, the operating system logs, OpenStack logs, Libvirt logs, Open vSwitch logs, and so on are all normalized to the following basic format: <timestamp><log level><code module><Request ID><log content><source code location>. On this basis, data cleaning removes logs that are irrelevant to mining network fault knowledge and fills missing values in the logs with a global constant or with the sample mean. Data reduction generalizes the concrete value of each attribute to a level suitable for data mining. From the result of data reduction, the fields useful to the subsequent modules are selected and the rest are discarded.

Suppose the original network fault log entry is:

2015-12-10 20:46:49.671 ERROR nova.compute.manager [req-5c973fff-e9ba-4317-bfd9-76678cc96584 None None] No compute node record for host devstack-controller

Processing this entry according to the steps above gives:

2015-12-10 20:46 nova.compute [5c973fff-e9ba-4317-bfd9-76678cc96584]
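The reduction from the raw entry to the generalized entry can be sketched with a regular expression. This is a minimal illustration under stated assumptions: the pattern `LOG_PATTERN` and the function `generalize` are hypothetical names, the pattern is written only for the sample entry (with the usual spaces between fields assumed), and the generalization rules (timestamp truncated to the minute, module name truncated to its first two dotted components, `req-` prefix dropped) are inferred from the sample input and output rather than stated in the text.

```python
import re

# Hypothetical pattern for the sample entry; real log formats vary by deployment.
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+\s*"  # timestamp, kept to the minute
    r"(?P<level>[A-Z]+)\s+"                                  # log level, e.g. ERROR
    r"(?P<module>[\w.]+)\s*"                                 # dotted code module
    r"\[req-(?P<req>[0-9a-f-]{36})[^\]]*\]\s*"               # request ID inside brackets
    r"(?P<msg>.*)"                                           # free-text log content
)


def generalize(line):
    """Return the generalized form of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Generalize the module name to its first two components.
    module = ".".join(match.group("module").split(".")[:2])
    return f"{match.group('ts')} {module} [{match.group('req')}]"


sample = (
    "2015-12-10 20:46:49.671 ERROR nova.compute.manager "
    "[req-5c973fff-e9ba-4317-bfd9-76678cc96584 None None] "
    "No compute node record for host devstack-controller"
)
result = generalize(sample)
print(result)
```

Lines that do not match the pattern return `None`, which gives the data cleaning step a simple way to drop entries in an unexpected format.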

During data integration, fault logs separated by very small time intervals are treated as belonging to the same fault, so all the code modules that produced fault logs because of one fault can be merged into a single record. For a specific network fault, the final log format is: <ordinal><code module 1><code module 2>...<code module x>.
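The time-window integration can be sketched as follows. This is an illustrative reading, not the patented procedure: the function name `integrate`, the 60-second window, and the rule that a new fault record starts whenever the gap to the previous log entry exceeds the window are all assumptions, since the text does not specify the window size or the exact grouping rule.

```python
from datetime import datetime, timedelta


def integrate(records, window=timedelta(seconds=60)):
    """Group (timestamp, module) records into one module set per fault.

    A new group starts when the gap to the previous record exceeds `window`.
    """
    transactions = []
    previous = None
    for timestamp, module in sorted(records):
        if previous is None or timestamp - previous > window:
            transactions.append(set())
        transactions[-1].add(module)
        previous = timestamp
    return transactions


# Hypothetical records: three entries close together, then two more ten minutes later.
t0 = datetime(2015, 12, 10, 20, 46, 49)
records = [
    (t0, "os"),
    (t0 + timedelta(seconds=10), "n-sch"),
    (t0 + timedelta(seconds=20), "libvirt"),
    (t0 + timedelta(minutes=10), "os"),
    (t0 + timedelta(minutes=10, seconds=5), "n-cpu"),
]
transactions = integrate(records)
for ordinal, modules in enumerate(transactions, start=1):
    print(f"<{ordinal}>" + "".join(f"<{m}>" for m in sorted(modules)))
```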

The following table shows the result of all the above processing for 10 network faults injected at the virtual machine level:

<1><os><n-sch><q-dhcp><openvswitch><libvirt>
<2><os><n-cpu><n-net>
<3><os><n-cpu><n-sch><q-dhcp><q-l3><openvswitch>
<4><n-cpu><n-net><q-dhcp><q-l3><openvswitch><libvirt>
<5><os><n-cpu><openvswitch><libvirt>
<6><os><n-sch><libvirt>
<7><os><n-cpu><n-net><q-dhcp><q-l3><openvswitch><libvirt>
<8><os><n-cpu><n-sch><libvirt>
<9><os><n-cpu><n-sch><n-net><openvswitch><libvirt>
<10><os><n-cpu><n-sch><n-net><q-dhcp><q-l3><openvswitch><libvirt>

The first fault, for example, means that five modules produced error logs: the operating system, nova-scheduler (n-sch), neutron-dhcp (q-dhcp), openvswitch, and libvirt. The other entries read the same way. Because the Apriori algorithm requires Boolean input, the data above need a simple format conversion. The resulting Boolean data are shown in the following table:
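The format conversion can be sketched as below, using the ten fault records listed above. The column order and the letters A to H are assumptions: the modules are taken in the order in which they appear in the records (os, n-cpu, n-sch, n-net, q-dhcp, q-l3, openvswitch, libvirt), which agrees with how the letters A, B, G, and H are used later in the text. The names `MODULES`, `faults`, and `to_boolean` are hypothetical.

```python
# Assumed column order; the letters A..H are an inferred labeling.
MODULES = ["os", "n-cpu", "n-sch", "n-net", "q-dhcp", "q-l3", "openvswitch", "libvirt"]
LETTERS = "ABCDEFGH"

# The ten integrated fault records from the table above.
faults = [
    {"os", "n-sch", "q-dhcp", "openvswitch", "libvirt"},
    {"os", "n-cpu", "n-net"},
    {"os", "n-cpu", "n-sch", "q-dhcp", "q-l3", "openvswitch"},
    {"n-cpu", "n-net", "q-dhcp", "q-l3", "openvswitch", "libvirt"},
    {"os", "n-cpu", "openvswitch", "libvirt"},
    {"os", "n-sch", "libvirt"},
    {"os", "n-cpu", "n-net", "q-dhcp", "q-l3", "openvswitch", "libvirt"},
    {"os", "n-cpu", "n-sch", "libvirt"},
    {"os", "n-cpu", "n-sch", "n-net", "openvswitch", "libvirt"},
    {"os", "n-cpu", "n-sch", "n-net", "q-dhcp", "q-l3", "openvswitch", "libvirt"},
]


def to_boolean(transaction):
    """Map a set of module names to a 0/1 row in MODULES order."""
    return [int(module in transaction) for module in MODULES]


rows = [to_boolean(fault) for fault in faults]
print("   " + " ".join(LETTERS))
for ordinal, row in enumerate(rows, start=1):
    print(f"{ordinal:>2} " + " ".join(str(bit) for bit in row))
```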

The input of the Apriori algorithm is Boolean transaction data, and the output is the maximal frequent itemset. For the example data above, the maximal frequent itemset is ABGH.
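For readers unfamiliar with Apriori, the following is a compact sketch of the level-wise search and of how maximal frequent itemsets are picked out afterwards. It is a textbook-style illustration run on a small hypothetical dataset, not on the table above and not the patent's implementation; the function names `apriori` and `maximal` and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support_count} for every frequent itemset."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the minimum support count.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def maximal(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Toy transactions (hypothetical), minimum support count of 3.
transactions = [
    {"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C"},
]
frequent = apriori(transactions, min_count=3)
print(maximal(frequent))
```

On this toy data the item D is dropped at the first level, and the only maximal frequent itemset is {A, B, C}, which occurs in three transactions.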

To summarize the preprocessing above: data cleaning removes logs that are irrelevant to mining network fault knowledge and fills missing values in the logs with a global constant or with the sample mean. Data reduction uses regular expressions for pattern matching; a regular expression describing the log format separates the attributes of each log entry, the timestamp and the log content keywords are generalized, and the key data related to network faults are extracted. Data selection picks the log attributes relevant to data mining. Data integration uses a time window to merge logs separated by very small time intervals, unifies the fault log information, and converts the integrated relational logs into Boolean transaction data through format conversion.

The knowledge generation module uses the maximal frequent itemsets together with parameters such as support and confidence, and applies the Bayesian formula to turn the fault features obtained by the log collection and analysis module into association rules and knowledge corresponding to specific network faults; the resulting knowledge is added to the fault knowledge base. It mainly performs the following steps:

Step 1, parameter setting: set parameters such as support and confidence.

Step 2, knowledge generation: using the Bayesian formula, generate the corresponding network fault knowledge and add it to the knowledge base.

In the knowledge generation module, the Bayesian conditional probability formula is used:

$$P(B \mid A) = \frac{P(AB)}{P(A)} = \frac{N(AB)}{N(A)}$$

where N denotes a count. Replacing the count N with the support count gives:

$$P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}$$

Association rules are then generated from the maximal frequent itemset and the related parameters (such as confidence). The generation rule is as follows: for a frequent itemset B and each of its non-empty subsets A, if P(B|A) > min_conf, the association rule A->B is generated. The value of P(B|A) is the confidence of the rule, and min_conf is the minimum confidence.
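The rule generation step can be sketched as follows. This is an illustration under assumptions: the function name `generate_rules` and the support counts are hypothetical toy values, and only proper subsets are enumerated because a rule from B to itself carries no information. Since A is a subset of B, the confidence reduces to the support count of B divided by the support count of A.

```python
from itertools import combinations


def generate_rules(itemset, support_count, min_conf):
    """Return (antecedent, confidence) pairs for rules A -> itemset, best first."""
    full = frozenset(itemset)
    rules = []
    for size in range(1, len(full)):
        for subset in combinations(sorted(full), size):
            antecedent = frozenset(subset)
            # P(B|A) = support_count(B) / support_count(A), because A is a subset of B.
            confidence = support_count[full] / support_count[antecedent]
            if confidence > min_conf:
                rules.append((antecedent, confidence))
    return sorted(rules, key=lambda rule: rule[1], reverse=True)


# Toy support counts (hypothetical) for the frequent itemset {A, B, C}.
support_count = {
    frozenset("A"): 5, frozenset("B"): 5, frozenset("C"): 5,
    frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("ABC"): 3,
}
rules = generate_rules("ABC", support_count, min_conf=0.7)
for antecedent, confidence in rules:
    print("".join(sorted(antecedent)), "-> ABC", f"{confidence:.1%}")
```

With these toy counts the single-item antecedents reach only 60% and are rejected, while the three two-item antecedents reach 75% and become rules.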

For the example, the support is set to 0.5 (50%), so the support count is 10 * 0.5 = 5. For the maximal frequent itemset ABGH obtained earlier, enumerate all of its non-empty subsets and compute the conditional probability of each, for example:
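As a worked instance of the support-count formula, take the single-item subset A. Two assumptions are made here: that A denotes the operating system column, which appears in 9 of the 10 example faults, and that the support count of ABGH equals 5, which is the value the reported percentages imply rather than a figure stated explicitly for ABGH. Under those assumptions:

$$P(ABGH \mid A) = \frac{\mathrm{support\_count}(ABGH)}{\mathrm{support\_count}(A)} = \frac{5}{9} \approx 55.6\%$$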

The other calculations are similar and are not listed one by one. The resulting probabilities are:

Subset       A      B      G      H      AB     AG     AH
Probability  55.6%  62.5%  71.4%  62.5%  71.4%  83.3%  71.4%

Subset       BG     BH     GH     ABG    ABH    AGH    BGH
Probability  100%   83.3%  83.3%  100%   100%   100%   100%

If the confidence is set to 0.7, the following knowledge is obtained:

1) If the Open vSwitch component, which manages the virtual network, is detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 71.4%.

2) If the Linux operating system and Nova-compute, the compute module of the Nova component, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 71.4%.

3) If the Linux operating system and the Open vSwitch component, which manages the virtual network, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 83.3%.

4) If the Linux operating system and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 71.4%.

5) If Nova-compute, the compute module of the Nova component, and the Open vSwitch component, which manages the virtual network, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is very high.

6) If Nova-compute, the compute module of the Nova component, and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 83.3%.

7) If the Open vSwitch component, which manages the virtual network, and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is 83.3%.

8) If the Linux operating system, Nova-compute (the compute module of the Nova component), and the Open vSwitch component, which manages the virtual network, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is very high.

9) If the Linux operating system, Nova-compute (the compute module of the Nova component), and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is very high.

10) If the Linux operating system, the Open vSwitch component, which manages the virtual network, and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is very high.

11) If Nova-compute (the compute module of the Nova component), the Open vSwitch component, which manages the virtual network, and the Libvirt component, which manages virtual machines, are detected to produce error logs, we can infer that a network fault has occurred at the virtual machine level. The confidence is very high.
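The eleven rules above follow from the probability table once the 0.7 threshold is applied. The sketch below reproduces that filtering. The support counts are not stated in the text: they are inferred from the table percentages under the assumption that ABGH occurs in exactly 5 of the 10 transactions (for instance, 5/9 ≈ 55.6% for subset A), and the variable names are illustrative.

```python
# Support counts inferred from the table percentages (an assumption, not
# source data), taking ABGH to occur in exactly 5 of the 10 transactions.
ABGH_COUNT = 5
inferred_counts = {
    "A": 9, "B": 8, "G": 7, "H": 8,
    "AB": 7, "AG": 6, "AH": 7, "BG": 5, "BH": 6, "GH": 6,
    "ABG": 5, "ABH": 5, "AGH": 5, "BGH": 5,
}
MIN_CONF = 0.7

# Confidence of "subset -> ABGH" is support_count(ABGH) / support_count(subset).
confidence = {s: ABGH_COUNT / n for s, n in inferred_counts.items()}

# Keep only the subsets whose confidence exceeds the threshold.
kept = [s for s, c in confidence.items() if c > MIN_CONF]

for s in kept:
    print(f"{s:>3} -> ABGH  confidence {confidence[s]:.1%}")
```

Under these inferred counts, subsets A, B, and H fall below 0.7 and are discarded, leaving eleven subsets, the same number as the rules listed.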

In the fault detection and location module, when a network fault occurs again, the fault logs are collected and passed through the same log processing procedure to identify the network components and modules that produced them. Fault location is then performed against the network fault knowledge in the generated knowledge base, in descending order of confidence.
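One way to picture this lookup is sketched below. It is an assumption-laden illustration: the text only says that logs are compared against the knowledge base and that results are ordered by confidence, so the `Rule` type, the `locate_fault` function, and the subset-matching criterion are hypothetical choices. The four knowledge base entries are transcribed from the rules listed earlier.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    components: frozenset   # components whose error logs trigger the rule
    fault_level: str        # level at which the fault is located
    confidence: float

# A few entries transcribed from the knowledge listed in the description.
knowledge_base = [
    Rule(frozenset({"Open vSwitch"}), "virtual machine", 0.714),
    Rule(frozenset({"Linux", "Nova-compute"}), "virtual machine", 0.714),
    Rule(frozenset({"Linux", "Open vSwitch"}), "virtual machine", 0.833),
    Rule(frozenset({"Nova-compute", "Libvirt"}), "virtual machine", 0.833),
]

def locate_fault(observed, rules):
    """Return the rules whose components all appear in `observed`, best first.

    Matching by subset is an assumed criterion for this sketch.
    """
    matched = [r for r in rules if r.components <= observed]
    return sorted(matched, key=lambda r: r.confidence, reverse=True)

# Components found to have produced error logs after preprocessing.
observed = frozenset({"Linux", "Open vSwitch"})
ranked = locate_fault(observed, knowledge_base)
for rule in ranked:
    print(sorted(rule.components), "->", rule.fault_level, rule.confidence)
```

Sorting in descending order means the operator examines the most strongly supported hypothesis first.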

For IaaS cloud platforms on which OpenStack is deployed, the present invention provides a cloud platform network fault location method based on log analysis, whose flow is shown in Fig. 3. It effectively addresses the problem of locating network faults that occur at any level of the IaaS cloud platform, including physical resources, the operating system, virtual machines, and OpenStack. The invention injects various typical network faults to form the corresponding fault logs. For each type of fault, it collects the network-fault-related log information at each level (physical resources, operating system, virtual machines, and OpenStack) and mines fault features from the collected log information with the Apriori algorithm. On this basis, according to the maximal frequent itemsets and parameters such as support and confidence, it uses the Bayesian formula to generate association rules and knowledge corresponding to specific network faults. When a network fault occurs in the system again, the collected fault logs are compared and analyzed against the association rules in the knowledge base, so as to locate the level at which the network fault occurred.
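The Apriori mining step summarized above can be sketched as follows. This is a compact textbook-style Apriori, not the patent's implementation; the function names `apriori` and `maximal` and the six toy transactions are hypothetical, with the letters standing in for log-producing components.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    min_count = min_support * len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count the size-k candidates and keep those meeting the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def maximal(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Hypothetical Boolean transactions, written as sets of component letters.
transactions = [frozenset(t) for t in ["ABGH", "ABGH", "ABG", "AG", "BH", "A"]]
frequent = apriori(transactions, min_support=0.5)
print(sorted("".join(sorted(s)) for s in maximal(frequent)))
```

For this toy data the maximal frequent itemsets are ABG and BH; every other frequent itemset is contained in one of them.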

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An IaaS cloud platform network fault location method based on log analysis, characterized by comprising the following steps:

Step 1, inject various typical network faults to form the corresponding fault log information;

Step 2, separately collect the network-fault-related log information generated at the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service component layer by the typical network faults injected in step 1; preprocess the collected fault log information to form Boolean fault log data; and mine fault features using the Apriori algorithm;

Step 3, according to the support and confidence parameters, generate from the fault features, by means of the Bayesian formula, the association rules and knowledge corresponding to specific network faults, and add the obtained knowledge to the fault knowledge base;

Step 4, network fault location: when a network fault occurs, compare and analyze the collected fault logs against the association rules in the fault knowledge base, so as to locate the level at which the cloud platform network fault occurred.

2. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: the fault log information formed by the typical network faults in step 1 includes network-fault-related log information of the physical resource layer, the operating system layer, the virtual machine layer, and the OpenStack network service component layer; the OpenStack network service component layer includes the fault log information of Nova, Neutron, Open vSwitch, and Libvirt.

3. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: the log information in step 2 is collected mainly by centrally aggregating the fault logs formed in step 1 onto the node that performs data mining on the logs; the preprocessing of the fault log information includes data cleaning, reduction and transformation, data selection, and data integration, thereby generating Boolean transaction data that the Apriori algorithm can mine.

4. The IaaS cloud platform network fault location method based on log analysis according to claim 3, characterized in that: generating the Boolean transaction data in step 2 mainly comprises the following steps:

Step 201, data cleaning: remove the parts of the logs that are irrelevant to data mining, and fill vacant values in the logs either with a global variable or with the sample mean;

Step 202, reduction and transformation: perform pattern matching with regular expressions; using a regular expression that describes the log format, separate the attributes of each log entry, generalize the timestamp and the log content keywords respectively, and extract the key data related to the same network fault;

Step 203, data selection: select the log attributes relevant to data mining;

Step 204, data integration: using the idea of a time window, merge logs separated by very small time intervals so as to unify the fault log information, and convert the integrated relational logs, after format conversion, into Boolean transaction data;

Step 205, data mining: mine the logs by fault type using the Apriori algorithm to generate the maximal frequent itemsets corresponding to each fault; the input of the Apriori algorithm is the Boolean transaction data, and its output is the maximal frequent itemsets.

5. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: in step 3, the association rules and knowledge corresponding to specific network faults are generated from the fault features by the Bayesian formula as follows: from the maximal frequent itemsets mined in step 2, and according to the support and confidence parameters, the Bayesian formula is used to generate the association rules and knowledge corresponding to specific network faults, and this knowledge is added to the fault knowledge base.

6. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: adding knowledge to the network fault knowledge base in step 3 mainly comprises the following steps:

Step 301, parameter setting: set the support and confidence parameters;

Step 302, knowledge generation: according to the Bayesian formula and the set support and confidence parameters, generate the corresponding network fault knowledge and add it to the fault knowledge base.

7. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: the level at which the cloud platform network fault occurred is located in step 4 as follows: based on the specific fault knowledge formed in the knowledge base in step 3, when a network fault occurs again, the fault logs formed by the network fault are collected and converted into Boolean transaction data by the method of step 2, from which the modules that produced the fault logs are obtained; fault location is then performed according to the knowledge in the fault knowledge base, in descending order of confidence.

8. The IaaS cloud platform network fault location method based on log analysis according to claim 1, characterized in that: the fault location in step 4 comprises the following steps:

Step 401, log collection: centrally aggregate the logs formed by the fault onto the node that performs data mining on the logs; process the logs through data cleaning, reduction and transformation, data selection, and data integration to generate Boolean transaction data, and thereby obtain the modules that produced the fault logs;

Step 402, fault location: according to the knowledge in the fault knowledge base, combined with the information on the modules that produced the fault logs, perform fault location in descending order of confidence.

9. An IaaS cloud platform network fault location system based on log analysis, characterized by comprising a fault injection module, a log collection and analysis module, a knowledge generation module, and a fault detection and location module, wherein:

the fault injection module is configured to inject various typical network faults to form the corresponding fault log information; the fault log information includes physical resource layer fault log information, operating system layer fault log information, virtual machine layer fault log information, and OpenStack network service component fault log information;

the log collection and analysis module is configured to collect the first fault log information formed by the fault injection module and, under a control signal from the fault detection and location module, to collect the second fault log information when a network fault occurs; to preprocess the collected first and second fault log information into the corresponding Boolean first fault log data and second fault log data; and to mine fault features from the first fault log data using the Apriori algorithm;

the knowledge generation module is configured to generate, according to the support and confidence parameters and by means of the Bayesian formula, the association rules and knowledge corresponding to specific network faults from the fault features obtained by the log collection and analysis module, and to add the obtained knowledge to the fault knowledge base;

the fault detection and location module is configured, when a network fault occurs, to control the log collection and analysis module to collect the fault log information at the time of the fault, and to compare and analyze the Boolean second fault log data obtained by the log collection and analysis module against the association rules in the fault knowledge base, so as to locate the level at which the cloud platform network fault occurred.
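The preprocessing described in steps 201 to 204 of claim 4 (cleaning, regular-expression parsing, time-window integration, and conversion to Boolean transaction data) can be sketched as below. Everything specific in it is an assumption made for illustration: the log line format, the component names in `COMPONENTS`, the 60-second window, the sample log messages, and the function name `to_boolean_transactions` are hypothetical and do not come from the claims.

```python
import re
from datetime import datetime

# Hypothetical log format: "<date> <time> <component> <LEVEL> <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<component>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$"
)
COMPONENTS = ["linux", "nova-compute", "openvswitch", "libvirt"]
EPOCH = datetime(1970, 1, 1)

def to_boolean_transactions(lines, window_seconds=60):
    """Group ERROR lines into time windows; emit one Boolean row per window."""
    windows = {}
    for line in lines:
        m = LINE_RE.match(line)
        # Data cleaning: drop lines that do not parse or are not errors.
        if not m or m["level"] != "ERROR":
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        # Data integration: logs in the same time window share one bucket.
        bucket = int((ts - EPOCH).total_seconds()) // window_seconds
        windows.setdefault(bucket, set()).add(m["component"])
    # One row per window, one Boolean column per known component.
    return [[c in seen for c in COMPONENTS] for _, seen in sorted(windows.items())]

# Invented sample lines, for illustration only.
sample = [
    "2016-09-07 12:00:01 openvswitch ERROR bridge br-int not found",
    "2016-09-07 12:00:20 nova-compute ERROR instance failed to spawn",
    "2016-09-07 12:00:30 libvirt INFO domain started",
    "2016-09-07 12:05:10 linux ERROR eth0 link down",
]
rows = to_boolean_transactions(sample)
for row in rows:
    print(row)
```

Each resulting row is a Boolean transaction of the kind the Apriori step takes as input: a column is true when that component produced an error log within the window.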