CN100456687C - Method and system for real-time correlation analysis of network faults - Google Patents

Info

Publication number: CN100456687C
Authority: CN (China)
Prior art keywords: network, fault, event, events, real
Legal status (an assumption, not a legal conclusion): Expired - Fee Related
Application number: CNB031347290A
Other languages: Chinese (zh)
Other versions: CN1529455A (en)
Inventor: 谭俊
Current Assignee (listing may be inaccurate): Huawei Technologies Service Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by: Huawei Technologies Co Ltd

Landscapes

  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method and system for real-time correlation analysis of network faults, belonging to the field of computer network communication. Fault event information from various network devices and business objects is written into an original event list. An analysis control engine selectively reads events from this list, according to event level and type, for correlation analysis. Its dynamic analysis algorithm combines several kinds of domain information: historical fault analysis scenarios, dynamic network performance parameters, dynamic topology information, and the temporal characteristics of events. This overcomes shortcomings of existing fault correlation methods, namely neglect of dynamic network state information, excessive reliance on preset rules in the reasoning process, and lack of automatic learning ability. The invention can perform effective correlation analysis on the set of original events caused by a fault, and it addresses real-time root cause analysis and fault location during a network fault storm.

Description

Method and system for real-time correlation analysis of network faults

Technical field

The invention belongs to the field of computer network communication. It relates in particular to a method and system that, in network management, perform real-time correlation analysis of network fault events based on comprehensive domain information.

Background art

In computer and communication networks, when a device or service fails, the tight coupling among devices, services, and business applications triggers a series of network events. The network management system that monitors the network detects a large number of abnormal events, either through event notifications sent by devices or through its own polling, and reports them to the administrator's management interface via SNMP Trap, Syslog, or Indication. The result is a "network fault storm". Because such a storm typically produces a large number of events within a very short time, the root fault event is buried and the administrator can hardly identify the real cause. Resolving the fault requires extracting the root cause, that is, analyzing the correlation among these events to find the root event. For event correlation analysis the industry has developed several typical methods: rule-based reasoning (Rule-Based Reasoning), model-based reasoning (Model-Based Reasoning), analysis based on state transition graphs (State Transition Graph), codebook-based analysis (CodeBook), and case-based reasoning (Case-Based Reasoning). Each of these methods solves the fault correlation problem to some extent and has its own advantages, but none of them fully solves the following problems:

(1) Network topology connectivity information cannot be taken into account dynamically;

(2) All input events are processed indiscriminately, which makes efficiency hard to improve and consumes substantial resources;

(3) The reasoning process relies too heavily on preset rules, feature tables, or models, lacks automatic learning ability, and cannot adapt to or handle new situations outside the knowledge base;

(4) Event sequences are observed within a fixed time range, and the time range of the correlation analysis cannot be changed dynamically;

(5) Conditional probability and time factors are not considered during the analysis;

(6) Network operating parameters obtained in real time cannot be incorporated into an analysis based on static information.

Summary of the invention

The invention provides a method and system for real-time correlation analysis of network fault events based on comprehensive domain information. It overcomes shortcomings of existing fault correlation methods, namely neglect of dynamic network state information, excessive reliance on preset rules in the reasoning process, and lack of automatic learning ability, and it can effectively identify the key event at the source of a fault and locate it in the network.

Technical content of the invention: a method for real-time correlation analysis of network faults, comprising:

(1) An event extraction interface collects the various fault events generated in the network and writes them into an original event list;

(2) One event is read from the original event list and matched against historical fault scenario information, and the operating parameters of network devices and services are checked in real time;

(3) If no scenario matches, network objects related to the event being processed are selected on the basis of the information model and the topology dependencies and are checked in real time, and the results of the real-time checks are fed back into the reasoning process as conditions;

(4) The original event list is searched again for events related to the event being processed, or events consistent with the real-time check results, and those events are added to a work list;

(5) When the original event list contains no further events that can be added to the work list, a new fault scenario is constructed from the events in the work list and added to the historical fault scenario information, and the work list is cleared;

(6) The next event that satisfies the selection policy is read from the original event list and processing returns to step (2); if the list is empty, the process suspends and waits for new events.

The information model comprises:

(1) Object-oriented abstraction of the various managed objects in the managed network;

(2) Organization of the abstracted managed classes into a hierarchical information model according to their inheritance relationships;

(3) Association classes in the information model that define the relationships between managed classes.

The topology dependencies involve:

(1) Keeping the topology dependencies consistent with the actual network topology while the network is running;

(2) Setting the network node on which the fault correlation analysis program runs as the reference point;

(3) Computing, from the reference point, the reachability dependencies to every other node;

(4) Using topology change notifications from devices to trigger the topology synchronization program, which recomputes the topology dependencies from the latest topology.

The reasoning process comprises:

(1) Assigning a confidence probability to each reasoning step and deriving the probability of the final analysis result from the per-step probabilities;

(2) Defining time constraint functions when a fault scenario is created, to describe the temporal characteristics of events and the temporal relationships between associated events;

(3) Representing and matching alarm content with a formal method.

The historical fault scenario information is organized as a fault scenario table that supports fast lookup.

The collection of original fault events further comprises:

(1) Dynamically changing the length of the original event queue according to predetermined rules when different event types are processed;

(2) Deciding, according to event level and user-defined rules, which events serve as starting points for correlation analysis;

(3) Preprocessing the original events: an extensible event acquisition interface is provided for fault events of different protocols, and the events are converted into a unified internal format and filtered.
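The collection steps above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the `Event` fields, the severity scale, the shape of the decoded trap dictionary, and the one-line syslog layout are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the patent only says that events have a "level".
SEVERITY = {"info": 0, "warning": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class Event:
    """Unified internal format (field names are assumptions)."""
    source: str       # node or object that raised the event
    kind: str         # normalized event type, e.g. "link_down"
    severity: int     # value from SEVERITY
    timestamp: float  # seconds


def from_snmp_trap(trap: dict) -> Event:
    """Convert an already-decoded trap (hypothetical dict layout)."""
    return Event(trap["agent"], trap["trap_name"].lower(),
                 SEVERITY[trap["level"]], float(trap["time"]))


def from_syslog(line: str) -> Event:
    """Convert a 'time host level message' line (hypothetical layout)."""
    time_s, host, level, message = line.split(" ", 3)
    kind = message.strip().lower().replace(" ", "_")
    return Event(host, kind, SEVERITY[level], float(time_s))


def preprocess(events, min_severity=1):
    """Filter out low-severity events and exact duplicates, keeping order."""
    seen, kept = set(), []
    for ev in events:
        if ev.severity >= min_severity and ev not in seen:
            seen.add(ev)
            kept.append(ev)
    return kept


def pick_start_events(events, start_severity=2):
    """Only events at or above start_severity may start a correlation pass."""
    return [ev for ev in events if ev.severity >= start_severity]
```

Adding support for another protocol then means adding one more `from_...` converter; the filtering and start-point selection operate only on the unified `Event` type.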

Constructing a new fault scenario comprises:

(1) Extracting the fault feature parameters;

(2) Extracting the fault propagation path;

(3) Constructing a new fault resolution scenario from the fault feature parameters and the propagation path.
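A sketch of scenario construction follows. The patent does not specify which feature parameters are extracted or how the propagation path is derived, so this example assumes that the path is the order in which sources first appear by timestamp and that the features are a simple count, duration, and list of event kinds. The `Scenario` class and `scenario_table` dictionary are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    trigger: tuple   # (source, kind) of the earliest event, used as lookup key
    features: dict   # illustrative feature parameters (an assumption)
    path: tuple      # sources in the order the fault appeared to propagate


def build_scenario(work_list):
    """work_list: correlated events as (timestamp, source, kind) tuples."""
    ordered = sorted(work_list)                 # sort by timestamp
    first_ts, first_src, first_kind = ordered[0]
    path = []
    for _, src, _ in ordered:
        if src not in path:                     # keep first appearance only
            path.append(src)
    features = {
        "event_count": len(ordered),
        "duration": ordered[-1][0] - first_ts,
        "kinds": sorted({kind for _, _, kind in ordered}),
    }
    return Scenario((first_src, first_kind), features, tuple(path))


scenario_table = {}  # trigger -> Scenario, a hash table for fast lookup


def remember(work_list):
    """Store a scenario built from the work list, then clear the work list."""
    scenario = build_scenario(work_list)
    scenario_table[scenario.trigger] = scenario
    work_list.clear()
    return scenario
```

Keying the table by the trigger event is one way to get the fast lookup the text asks for: a later event with the same source and kind finds its scenario in a single dictionary access.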

A system for real-time correlation analysis of network faults, comprising:

Analysis control engine: invokes the other modules and interfaces according to the analysis control engine algorithm to carry out fault correlation analysis;

Event extraction interface: receives the various network events sent by network devices, converts them into a unified format, and writes them into the original event list for use by the analysis control engine;

Real-time network parameter detection interface: checks real-time information such as the attributes, performance, and reachability of devices and services in the network. It is called by the analysis control engine, accepts parameters from the fault analysis engine that determine which network device to check, and returns the result to the analysis control engine;

Information model: describes a set of management classes corresponding to network protocol objects and device objects, together with their interdependencies;

Information model query interface: functions that query management classes, their attributes, and the relationships between them from the information model, supplying information from the model to the analysis control engine at run time;

Topology synchronization module: triggered by network topology change events, it runs the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network connectivity, and stores them in the topology dependency library, which supplies this information to the analysis control engine;

Fault scenario table generation module: builds a fault scenario from a group of events whose correlation has been established and stores it in the fault scenario table, against which subsequent events are matched.

The information model is stored as a hash table file, and the analysis control engine retrieves information from it through the model query interface during analysis.

The system further comprises a preprocessing module, which preprocesses received original events according to predetermined preprocessing rules.

Technical effect of the invention: the invention makes full use of dynamic and static information, and of real-time and historical information, in the network. When the network fails, it effectively identifies the key event at the source of the fault among complex fault symptoms and the event storm they cause, and locates it in the network. Because the analysis applies topology dependencies that are synchronized with the actual network topology, together with network operating parameters obtained in real time, fault location becomes more accurate. Preprocessing the original input events (protocol format conversion, filtering, and selection) avoids starting correlation analysis from every input event and improves processing efficiency. Building a table of historical fault handling scenarios gives the method the ability to learn from past experience; because the scenario table allows fast matching, some events are matched directly in the table, which avoids a full correlation analysis for every event and further improves efficiency. Finally, applying probabilistic logic, time constraint functions, and fuzzy matching with regular expressions in the analysis algorithm allows complex relationships between events to be handled more flexibly and broadens the applicability of the correlation analysis.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the system for real-time correlation analysis of network faults according to the invention;

Fig. 2 is a flowchart of the method for real-time correlation analysis of network faults according to the invention;

Fig. 3 is a flowchart of the topology dependency generation algorithm of the method;

Fig. 4 is a network diagram of a specific embodiment of the method;

Fig. 5 is a schematic diagram of the information model in a specific embodiment of the method.

Detailed description of the embodiments

Referring to Fig. 1, the invention uses the analysis control engine as the control module and performs real-time correlation analysis of network faults through its interaction with the information model query interface, the event extraction interface and preprocessing module, the real-time network parameter detection interface, the fault scenario table generation module, and the topology synchronization module. The specific steps are:

1. The event extraction interface extracts fault event information from various network devices and business objects over different protocols (SNMP, SYSLOG, and so on) and converts it into a unified internal format. The event preprocessing module then compresses and filters this event information (according to preset filters) and writes it into the original event list. Preprocessing the original events effectively improves processing performance;

2. The analysis control engine selectively reads one event from the original event list, according to event level and type, for correlation analysis. The analysis combines the fault scenario table, information model information, real-time detection information, and topology information. During the analysis the engine continues to read events from the original event list as needed to construct the event propagation path, until no further matching event can be found;

(1) The historical fault scenario information is organized as a fault scenario table that supports fast lookup, so that events can be matched quickly against it;

(2) An object-oriented, hierarchical network information model is constructed: managed objects such as hardware, links, software, and network services in the network are abstracted in an object-oriented manner, and the resulting management classes are organized into a hierarchical information model according to their inheritance relationships. The model also uses association classes to define relationships between managed classes such as containment, dependency, and connection. The model is stored as a hash table (Hash) file and accessed through the model object management interface; inference uses the hierarchy and interdependencies of the management classes defined in the model. The information model describes a set of management classes corresponding to network protocol objects and device objects, and the various relationships between them. The management classes defined in the information model fall into three broad categories: the topology submodel, the open service submodel, and the network communication submodel.

The open service system submodel is used below as an example to introduce the definition of management classes. It mainly describes the node devices in a data communication network and their internal modules. It abstracts every network node that provides data transmission or data processing services as an open service system, in which software and hardware are combined in an extensible and tailorable way to form different systems. Its management classes are:

a. Open service system (Open Service System): represents any system that provides data services at any layer on a data communication network, including routers, switches, and servers;

b. Software (software): functional modules of an open service system implemented in software;

c. Hardware (hardware): functional modules of an open service system implemented in hardware and firmware;

d. Application (application): application programs of all kinds, such as mail clients;

e. Operating system (os): real-time and time-sharing operating systems of all kinds, such as VxWorks, Windows, Unix, and Linux;

f. Resource (resource): basic shared objects in the system, such as memory, disk, CPU, and interrupts;

g. Device (device): the individual modules that make up the hardware;

h. Service (service):

i. Protocol stack (protocol stack):

j. Kernel (kernel):

k. Driver (driver):

l. Memory (memory):

m. Hard disk (harddisk):

n. Central processing unit (cpu):

o. Bus (bus):

p. Adapter (adapter):

q. Network adapter (network adapter):

u. Controller (controller):

The information model contains various dependencies between management classes, such as protocol dependencies and open service dependencies.
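The hierarchy and association classes can be sketched as ordinary Python classes plus a hash table of associations. The text lists the class names but not which class inherits from which, so the inheritance edges below (for example `Application` under `Software`) are assumptions, as are the `ASSOCIATIONS` table and the `related_classes` query function. Only the `Application -> Service` and `Service -> OS` dependencies come from the embodiment described later in the text.

```python
class ManagedObject:
    """Root of the hypothetical management-class hierarchy."""
    def __init__(self, name):
        self.name = name


class OpenServiceSystem(ManagedObject): pass
class Software(ManagedObject): pass
class Hardware(ManagedObject): pass
class Application(Software): pass        # assumed to be a kind of Software
class OperatingSystem(Software): pass    # assumed to be a kind of Software
class Service(Software): pass            # assumed to be a kind of Software
class Device(Hardware): pass             # assumed to be a kind of Hardware
class NetworkAdapter(Device): pass       # assumed to be a kind of Device

# Associations stored in a hash table keyed by relation name.
# Each entry is a (source class, target class) pair.
ASSOCIATIONS = {
    "contains": {(OpenServiceSystem, Software), (OpenServiceSystem, Hardware)},
    "depends_on": {(Application, Service), (Service, OperatingSystem)},
}


def related_classes(cls, relation):
    """Query interface: classes that cls, or any ancestor of cls, relates to."""
    return {dst for src, dst in ASSOCIATIONS[relation] if issubclass(cls, src)}
```

Because the query uses `issubclass`, an association declared on a general class is inherited by every specialized class beneath it, which is the point of organizing the model as a hierarchy.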

(3) Real-time detection: the reasoning process is combined with real-time checks of the operating parameters of network devices and services.

(4) Real-time computation of topology dependencies from a designated reference point: the network node on which the fault correlation analysis program runs is set as the reference point, the reachability dependencies to all other nodes are computed from it, and they are kept synchronized with the network topology while the network is running. Topology dependencies describe the physical connections between nodes and are the basis of protocol interoperability and service availability. The reference point is the node taken as the starting point when the reachability of some node in the topology graph is considered; in a real managed network it is usually the node hosting the network management platform, or the node where a network probe (software or hardware) is located. Referring to Fig. 3, building the dependencies is a recursive algorithm. Every topology change triggers an automatic run of the algorithm, which updates the dependencies, keeps fault location and correlation accurate, and yields the set of possibly related network instance objects to be checked in the next step.

(5) The core logic of the correlation analysis method is carried out inside the analysis control engine. Referring to Fig. 2:

a. Read an event Ei (i = 1 to n) from the list and match it against the scenario table to see whether a historical fault scenario is related to it (that is, the characteristic event of the scenario matches the event). Process each matching scenario according to step (b);

b. Call the real-time detection module to check the real-time state of the relevant instances of the relevant object classes in the scenario (also taking into account the nodes that are topologically dependent on the node that generated the event), and see whether the returned results fall within the feature range described by the scenario. Then search the original event list for subsequent events generated by the relevant instances and see whether they conform to the features defined by the scenario. If these checks pass, mark the related events and call the output module to format and output the analysis result;

c. If the checks in (b) fail, call the model query interface to look up, in the network information model, the management class corresponding to the object that generated the event. Also taking into account the nodes that are topologically dependent on the node that generated the event, obtain the set of possibly related network instance objects to be checked next;

d. Call the real-time detection module to check whether the current state of these objects falls within the feature range described by the relationships defined in the information model, then check whether the original event list contains related events issued by these objects. If it does, add those events to the work event list and go to step (e). If the checks fail, check whether the work event list is empty. If it is empty, go to step (e). If it is not empty, call the fault scenario construction module to construct a new fault scenario for these events and add it to the fault scenario table, and clear the work event list; then mark and remove these events, format and output the analysis result, and go to step (e);

e. Read the next event that satisfies the selection policy from the original event list and go to step (a). If the list is empty, suspend and wait for new events;
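Steps (a) to (e) can be condensed into a short control loop. The sketch below is a deliberate simplification rather than the flow of Fig. 2: events are plain `(source, kind)` tuples, each trigger has at most one stored scenario, confidence probabilities and time constraints are left out, and `related_objects` and `probe` are hypothetical callbacks standing in for the model/topology query and the real-time detection interface.

```python
def correlate(raw_events, scenario_table, related_objects, probe):
    """
    raw_events:      list of (source, kind) tuples in arrival order
    scenario_table:  dict mapping a trigger event to a set of follow-up events
    related_objects: callable, source -> sources possibly affected by it
    probe:           callable, source -> True if a live check finds it abnormal
    Returns groups of correlated events, root event first in each group.
    """
    pending = list(raw_events)
    groups = []
    while pending:
        root = pending.pop(0)                        # steps (a)/(e): next event
        known = scenario_table.get(root)
        followers = []
        if known is not None:                        # step (b): reuse a scenario
            followers = [e for e in pending if e in known and probe(e[0])]
        if not followers:                            # steps (c)-(d): explore
            seen, frontier = {root[0]}, [root[0]]
            while frontier:
                obj = frontier.pop()
                for cand in related_objects(obj):
                    if cand in seen or not probe(cand):
                        continue
                    seen.add(cand)
                    hits = [e for e in pending if e[0] == cand]
                    if hits:                         # add to the work list
                        followers.extend(hits)
                        frontier.append(cand)
            if followers:                            # learn a new scenario
                scenario_table[root] = set(followers)
        for e in followers:                          # mark and remove
            pending.remove(e)
        groups.append([root] + followers)
    return groups
```

On a first run the loop has to explore related objects; once it has stored a scenario for a trigger, a later occurrence of the same trigger is resolved from the table, which mirrors the learning behavior described for the scenario table.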

The reasoning used for the matching and real-time state checks in the steps above includes: probability-based rule reasoning, in which each reasoning step is assigned a confidence probability and the probability of the final analysis result is derived from the per-step probabilities; handling of time constraints, in which time constraint functions are defined when a fault scenario is created to describe the temporal characteristics of events and the temporal relationships between associated events; and fuzzy matching of alarm content with regular expressions.

3. When one pass of correlation analysis is complete (all events in the current event list have been scanned), a fault scenario is constructed for the events correlated in this pass and added to the fault scenario table; these events are then removed from the original event list and the analysis result is constructed and output;

4. While the analysis control engine performs the work above, the event collection module (comprising the event collection interface and the event preprocessing module) concurrently writes newly received events into the original event list, and the topology synchronization module monitors changes in the network topology and refreshes the topology dependency library as needed. If the original event list is empty, the analysis control engine suspends and waits for new events to be written. When the event preprocessing module writes a new event into the original event list and finds the analysis control engine suspended, it wakes the process.
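The three reasoning ingredients named above can each be illustrated in a few lines. The text does not say how per-step probabilities are combined, so the product rule below assumes independent steps; the time constraint is assumed to be a simple maximum delay between cause and effect; and the `LINK_DOWN` pattern is a made-up example of an alarm-text expression.

```python
import re


def chain_confidence(step_probabilities):
    """Combine per-step confidences, assuming independent steps (product rule)."""
    result = 1.0
    for p in step_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        result *= p
    return result


def within_window(cause_time, effect_time, max_delay):
    """Time constraint: the effect follows the cause by at most max_delay seconds."""
    return 0.0 <= effect_time - cause_time <= max_delay


# Hypothetical pattern: any message that names an interface or link as down.
LINK_DOWN = re.compile(r"\b(interface|link)\b.*\bdown\b", re.IGNORECASE)


def matches_alarm(pattern, text):
    """Fuzzy match of free-form alarm text against a compiled pattern."""
    return pattern.search(text) is not None
```

A scenario could store one pattern and one `max_delay` per expected follow-up event, and a match would then contribute its step confidence to `chain_confidence`.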
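The suspend-and-wake behavior in step 4 maps naturally onto a blocking queue. In this sketch Python's `queue.Queue` stands in for the original event list, a thread stands in for the analysis control engine, and `None` is a hypothetical shutdown marker added so the example terminates; none of these choices come from the patent.

```python
import queue
import threading


def run(incoming):
    """Feed `incoming` events to an engine thread and return what it analysed."""
    raw_events = queue.Queue()     # stands in for the original event list
    analysed = []

    def analysis_engine():
        while True:
            event = raw_events.get()   # blocks (suspends) while the list is empty
            if event is None:          # hypothetical shutdown marker
                return
            analysed.append(event)     # placeholder for the correlation pass

    engine = threading.Thread(target=analysis_engine)
    engine.start()                     # list is still empty, so the engine waits
    for event in incoming:
        raw_events.put(event)          # each put wakes the engine if it is waiting
    raw_events.put(None)
    engine.join()
    return analysed
```

With a blocking queue the writer does not need to check whether the engine is suspended; the wake-up is handled inside `put`.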

A local area network is used as a concrete example. Referring to Fig. 4, A, C, and D are hosts running the Linux operating system in the LAN, S is a layer-3 switch, and R is a router that connects the LAN to the Web server and also serves as the LAN's gateway. A and C are connected directly to S, and D is connected directly to R. RP is a PC running Windows and is the reference point for the correlation analysis; the correlation analysis system runs on this host.

First, referring to Fig. 5, this embodiment uses a simplified information model. In this network the hosts A, C, D, and RP, the router R, and the switch S can all be regarded as open service systems. Each open service system contains a protocol stack, which handles communication between an application and its peer entities in other open service systems on the network. Data flows down through the application, operating system, protocols, and interface, enters the physical network, reaches another open service system via layer-2 forwarding and layer-3 routing, and then travels up through the interface, protocols, and operating system to the application at the other end.

1) Information model instantiation

In an actual network environment, the model above yields instances of its entities. For example, the application on router R is named Application_R and the operating system on R is named OS_R. By analogy, the remaining instances on R are Protocols_R and Interface_R.

Likewise:

For host A: Application_A, Service_A, OS_A, Protocols_A, Interface_A;

For host C: Application_C, Service_C, OS_C, Protocols_C, Interface_C;

For host D: Application_D, Service_D, OS_D, Protocols_D, Interface_D;

The following dependencies hold:

Application->Service;

Service->OS;

OS->Protocols;

Protocols->Interface; (note: this is a simplified model)

Assume the model also defines web_browse_in_url->DNS service;

X.interface.fail is equivalent to X.down;
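As an illustration only, the instantiation and the layer dependencies above can be sketched in Python. The function names (instantiate, dependencies, failed_layers) are hypothetical and are not part of the disclosed system. The sketch also assumes every node carries all five layers, whereas the text lists no Service instance for router R.

```python
# Layer order of the simplified model: each layer depends on the one after it.
LAYERS = ["Application", "Service", "OS", "Protocols", "Interface"]


def instantiate(node):
    """Return instance names such as 'Application_A' for one node."""
    return [f"{layer}_{node}" for layer in LAYERS]


def dependencies(node):
    """Return (dependent, dependency) pairs along the layer chain."""
    names = instantiate(node)
    return list(zip(names, names[1:]))


def failed_layers(node, failed_layer):
    """Instances that fail when `failed_layer` fails.

    A failure propagates upward: the failed layer itself and every
    layer that depends on it, directly or indirectly.
    """
    index = LAYERS.index(failed_layer)
    return instantiate(node)[: index + 1]
```

Under these assumptions, an interface failure on a node takes down every instance on that node, which corresponds to the rule that X.interface.fail is equivalent to X.down.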

2) Topology dependency generation

For the network shown in Figure 4, the network management platform obtains the topology data through automatic discovery and then runs the topology dependency generation algorithm, with RP as the reference point, to obtain the following set of topology dependencies:

RD={A->S, C->S, S->R, D->R, Internet->R, R->RP}

Here 'X->Y' means "to reach X, traffic must first pass through Y";

R->RP indicates that R is the network node directly connected to the reference point RP;

When the network topology or the reference point changes, the algorithm updates the dependencies automatically, so that they continue to reflect the actual state of the network.
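The text does not spell out the topology dependency generation algorithm. One way to obtain the set RD above is a breadth-first traversal from the reference point, sketched below. The link list is inferred from the description of Figure 4, and the function name topology_dependencies is hypothetical.

```python
from collections import deque


def topology_dependencies(links, reference):
    """Map each node to the next hop that must be passed to reach it.

    `links` is a list of undirected (node, node) pairs. The result maps
    X to Y for every dependency 'X->Y' as seen from `reference`.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    deps = {}
    visited = {reference}
    queue = deque([reference])
    while queue:
        current = queue.popleft()
        for neighbor in sorted(adjacency.get(current, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                deps[neighbor] = current  # neighbor is reached via current
                queue.append(neighbor)
    return deps


# Links inferred from the description of Figure 4 (an assumption).
LINKS = [("A", "S"), ("C", "S"), ("S", "R"), ("D", "R"), ("Internet", "R"), ("R", "RP")]
RD = topology_dependencies(LINKS, "RP")
```

Rerunning the function with a new link list or a different reference node corresponds to the automatic update described above.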

3) The event extraction interface starts receiving the events generated in the network.

Assume that a DNS service runs on host A (it can be regarded as a Service), and that a program on host D repeatedly accesses the home page www.harbournetworks.com on the Web server. This program can be regarded as an Application, which we name web_browse_in_url.

Assume that at some moment the event extraction interface receives the following events from the SNMP agents on the hosts. After formatting, the events are:

{

E0=RP.ping.S.fail:t0, meaning switch S cannot be pinged from RP at time t0;

E1=RP.ping.C.fail:t1, meaning host C cannot be pinged from RP at time t1;

E2=RP.ping.C.fail:t2, meaning host C cannot be pinged from RP at time t2;

E3=D.web_browse_in_url.Web_Server.fail:t3, meaning host D cannot access the Web server at time t3;

E4=RP.ping.A.fail:t4, meaning host A cannot be pinged from RP at time t4;

E5=RP.ping.A.fail:t5, meaning host A cannot be pinged from RP at time t5;

E6=R.down:t6, meaning R fails at time t6;

E7=RP.web_browse_in_url.Web_Server.fail:t7, meaning host RP cannot access the Web server at time t7;

E8=R.up:t8, meaning R resumes operation at time t8.

}

4) E0…E8 are then passed to the preprocessing module, which produces the compressed original event set below. Duplicate events (E2, E5) and paired events whose fault state has already cleared (E6, E8) are filtered out:

{

E0=RP.ping.S.fail:t0, meaning switch S cannot be pinged from RP at time t0;

E1=RP.ping.C.fail:t1, meaning host C cannot be pinged from RP at time t1;

E3=D.web_browse_in_url.Web_Server.fail:t3, meaning host D cannot access the Web server at time t3;

E4=RP.ping.A.fail:t4, meaning host A cannot be pinged from RP at time t4;

E7=RP.web_browse_in_url.Web_Server.fail:t7, meaning host RP cannot access the Web server at time t7.

}
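The compression step can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed preprocessing module: events are (id, body) pairs with the timestamp omitted, a repeated body counts as a duplicate, and an X.up event cancels an earlier X.down event. The names RAW and preprocess are hypothetical.

```python
# The nine formatted events of the example, without timestamps.
RAW = [
    ("E0", "RP.ping.S.fail"),
    ("E1", "RP.ping.C.fail"),
    ("E2", "RP.ping.C.fail"),
    ("E3", "D.web_browse_in_url.Web_Server.fail"),
    ("E4", "RP.ping.A.fail"),
    ("E5", "RP.ping.A.fail"),
    ("E6", "R.down"),
    ("E7", "RP.web_browse_in_url.Web_Server.fail"),
    ("E8", "R.up"),
]


def preprocess(events):
    """Drop duplicate events and down/up pairs whose fault has cleared."""
    kept = []
    seen = set()
    for event_id, body in events:
        node, _, state = body.rpartition(".")
        if state == "up":
            # A recovery cancels the earlier 'down' event of the same node.
            down = f"{node}.down"
            kept = [(i, b) for i, b in kept if b != down]
            seen.discard(down)
            continue
        if body in seen:
            continue  # duplicate of an event already kept
        seen.add(body)
        kept.append((event_id, body))
    return kept
```

Applied to RAW, the function keeps E0, E1, E3, E4, and E7, which matches the compressed set listed above.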

5) Real-time correlation analysis of the fault events in the communication network using comprehensive domain information:

(a) The analysis control engine reads one event from the original event list: E0=RP.ping.S.fail:t0;

From it, the engine parses:

Node objects: source node RP, destination node S;

Application object: RP.ping (ping belongs to Applications);

Application object status: fail;

E0 is marked and added to the working event list;
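The parsing in step (a) can be illustrated with a small parser for the formatted event strings used in this example. The ParsedEvent type and the parse_event function are hypothetical names. The sketch assumes only the two shapes that appear in the example: source.application.destination.status:time and node.status:time.

```python
from typing import NamedTuple, Optional


class ParsedEvent(NamedTuple):
    source: str
    application: Optional[str]
    destination: Optional[str]
    status: str
    time: str


def parse_event(text):
    """Split a formatted event such as 'RP.ping.S.fail:t0' into its parts."""
    body, _, time = text.rpartition(":")
    parts = body.split(".")
    if len(parts) == 4:
        source, application, destination, status = parts
        return ParsedEvent(source, application, destination, status, time)
    if len(parts) == 2:
        node, status = parts
        return ParsedEvent(node, None, None, status, time)
    raise ValueError(f"unrecognized event format: {text!r}")
```

For E0 this yields source node RP, application ping, destination node S, and status fail.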

(b) The engine opens the scenario table and queries it for scenarios related to RP, S, and ping. The table is empty (the system has just been initialized and no scenario has been added yet), so the engine closes it;

(c) The engine calls the information model query interface to look up ping (an Application) and obtains the relations Applications->Services, Services->Protocols, Protocols->Interface. It then queries the topology dependency library and obtains R->RP and S->R;

(d) The engine calls the real-time network status detection interface to check S.Interface and finds that its status is fail. From the dependencies, the following can be inferred:

S.Interface.fail==S.down;

S.down=>A.down and C.down;

A.down==A.Interface.fail=>A.application.fail and A.services.fail;

C.down==C.Interface.fail=>C.application.fail and C.services.fail;

A.services.fail=>A.DNS.fail=>*.web_browse_in_url.fail
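The inference chain in step (d) can be sketched as a walk over the topology dependencies. The code below is an illustrative simplification: the names unreachable_nodes, explains, and DNS_HOST are hypothetical, the dependency set RD is copied from section 2), and the DNS rule is hard-coded for host A as in this example.

```python
# Topology dependencies from section 2): node -> next hop toward RP.
RD = {"A": "S", "C": "S", "S": "R", "D": "R", "Internet": "R", "R": "RP"}
DNS_HOST = "A"  # the example places the DNS service on host A


def unreachable_nodes(failed, deps):
    """Nodes whose path to the reference point passes through `failed`."""
    result = set()
    for node in deps:
        hop = node
        while hop is not None:
            if hop == failed:
                result.add(node)
                break
            hop = deps.get(hop)  # None once the reference point is passed
    return result


def explains(failed, event_body, deps=RD):
    """Return True if `failed` being down implies the given failure event."""
    parts = event_body.split(".")
    if len(parts) != 4 or parts[3] != "fail":
        return False
    _source, application, destination, _status = parts
    down = unreachable_nodes(failed, deps)
    if application == "ping":
        return destination in down
    if application == "web_browse_in_url":
        # Browsing by URL needs DNS; it fails once the DNS host is cut off.
        return DNS_HOST in down
    return False
```

With S as the failed node, the sketch marks S, A, and C as unreachable and therefore accounts for the ping failures toward C and A and for the web_browse_in_url failures.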

(e) The engine scans the original event list starting from E1 and reads E1:

E1=RP.ping.C.fail:t1, from which it parses:

Node objects: source node RP, destination node C;

Application object: RP.ping (ping belongs to Applications);

Application object status: fail;

ping is an application, and it requires the applications, services, protocols, and interfaces on RP and C, and on the topologically required nodes S and R, to be working. Either S.down or C.down therefore implies E1, so E1 is correlated; the analysis engine marks E1 and adds it to the working event list;

The engine then reads E3:

E3=D.web_browse_in_url.Web_Server.fail:t3, which parses to:

Node objects: D, Web_Server;

Application object: web_browse_in_url;

Application object status: fail;

From the earlier result A.services.fail=>A.DNS.fail=>*.web_browse_in_url.fail, E3 is also an event related to E1, so E3 is marked and added to the working event list.

In the same way, E4 and E7 are found to be related to E1, so they are marked and added to the working event list.

(f) Once no unmarked events remain in the original event list, the engine calls the output module to format and output the result:

Output alarm:

Alarm1=

{

Cause:RP.ping.S.fail:t0

Affects:

[

RP.ping.C.fail:t1

D.web_browse_in_url.Web_Server.fail:t3

RP.ping.A.fail:t4

RP.web_browse_in_url.Web_Server.fail:t7

]

}

(g) Using the fault characteristic parameters and the fault propagation path, the engine constructs a new fault resolution scenario for these events, Scene1: S.down=>{A.down and C.down and *.web_browse_in_url.fail}, and adds it to the fault scenario table.

(h) The working event list is cleared, and these events are removed from the original event list.

(j) If new events have been added to the original event list by this point, go to step 3); otherwise the engine suspends and waits for new events;

(k) Suppose new events arrive:

E9=D.web_browse_in_url.Web_Server.fail:t9

E10=A.down:t10;

(l) The event analysis engine reads E9 and queries the scenario table, where the event feature pattern *.web_browse_in_url.fail in Scene1 matches it, so E9 is added to the working event list. The engine then checks the original event list for the remaining feature events A.down and C.down. It reads E10, which satisfies A.down, and adds E10 to the working event list. No other events remain in the list, and one feature, C.down, still has to be confirmed, so the engine calls the real-time detection interface, which reports C.down=true. The scenario is therefore matched, and the result S.down is obtained directly. The subsequent steps are the same as those described above.

In the previous step, if the real-time check on C had returned C.down=false, the scenario could not be fully trusted. It can instead be assigned a confidence probability, indicating that other causes are possible.
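Scenario matching with a confidence value might look like the sketch below. The text does not define how the confidence probability is computed; the fraction of confirmed features used here is purely an assumption for illustration. The names SCENE1, signature, match_scenario, and the probe callback (standing in for the real-time detection interface) are hypothetical.

```python
from fnmatch import fnmatchcase

SCENE1 = {
    "cause": "S.down",
    "features": ["A.down", "C.down", "*.web_browse_in_url.fail"],
}


def signature(body):
    """Reduce 'source.app.destination.status' to 'source.app.status'."""
    parts = body.split(".")
    if len(parts) == 4:
        source, application, _destination, status = parts
        return f"{source}.{application}.{status}"
    return body


def match_scenario(scenario, event_bodies, probe):
    """Return (cause, confidence) for a stored scenario.

    A feature counts as confirmed if a received event matches it or,
    failing that, if the real-time check `probe(feature)` returns True.
    """
    signatures = [signature(body) for body in event_bodies]
    confirmed = 0
    for feature in scenario["features"]:
        if any(fnmatchcase(s, feature) for s in signatures):
            confirmed += 1
        elif probe(feature):
            confirmed += 1
    return scenario["cause"], confirmed / len(scenario["features"])
```

With E9 and E10 as input and a probe that confirms C.down, all three features of Scene1 are confirmed. If the probe reports C.down=false, only two of the three features are confirmed and the result S.down carries a lower confidence.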

By drawing on comprehensive domain information, including the hierarchy of and relations between managed objects from the network information model, automatically learned fault-handling history, network operating parameters collected in real time, dynamic network topology information, and event timing characteristics, and by applying dynamic analysis during inference, the invention handles fault correlation analysis in complex network environments well.

Referring to Figure 1, the real-time network fault correlation analysis system of the present invention comprises:

Analysis control engine: the main executor of the control logic of the analysis process; it calls the other modules and interfaces according to the analysis control engine algorithm to complete the fault correlation analysis;

Information model: describes a set of management classes corresponding to network protocol objects and device objects, together with the relations between them. The management classes defined in the information model fall into three categories: the topology sub-model, the open service sub-model, and the network communication sub-model;

Information model query interface: functions for querying management classes, their attributes, and the relations between them from the information model; at run time it supplies the analysis control engine with information from the information model;

Event extraction interface: receives the network events sent by network devices, including event notifications over protocols such as SNMP Trap, Syslog, and CMIP Event Report, converts them into a unified format, and passes them to the preprocessing module;

Preprocessing module: performs simple preprocessing on the received original events to aid correlation analysis: filtering (removing, according to configured rules, events that administrators need not care about), compression (removing duplicate events), and redefinition (redefining one or more events as a new event);

Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of devices and services in the network. It is called by the fault analysis engine, accepts parameters from it that determine which network device to probe, and returns the result to it;

Fault scenario table generation module: builds a fault scenario from a group of events that have been found to be correlated and stores it in the fault scenario table, where subsequent analyses can find and reuse it quickly;

Topology synchronization module: triggered by network topology change events, it runs the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network topology connections, and stores them in the topology dependency library for use by fault correlation analysis.
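How the analysis control engine ties these modules together can be sketched as a simple grouping loop. This is a heavily simplified illustration, not the disclosed algorithm: the function name correlate is hypothetical, and the caller-supplied predicate is_related stands in for the scenario table lookup, the information model queries, and the real-time checks.

```python
def correlate(original_events, is_related):
    """Group events into alarms, one seed event per group.

    `is_related(seed, event)` decides whether `event` belongs to the
    fault started by `seed`.
    """
    pending = list(original_events)
    alarms = []
    while pending:
        seed = pending.pop(0)      # read one event as the starting point
        working = [seed]           # working event list
        remaining = []
        for event in pending:
            if is_related(seed, event):
                working.append(event)   # mark and move to the working list
            else:
                remaining.append(event)
        alarms.append({"cause": seed, "affects": working[1:]})
        pending = remaining        # correlated events leave the original list
    return alarms
```

Each pass consumes the seed event and every event correlated with it, so the loop ends once the original event list is empty, at which point a real engine would suspend and wait for new input.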

Claims (10)

1. A method for real-time correlation analysis of network faults, comprising: (1) an event extraction interface collects the fault events generated in the network and writes them into an original event list; (2) one event is read from the original event list, event matching is performed against historical fault scenario information, and the operating parameters of network devices and services are detected in real time; (3) if no event matches, network objects related to the event currently being processed are selected on the basis of the information model and the topology dependencies and are detected in real time, and the real-time detection results are fed back into the inference process as conditions; (4) the original event list is searched again for events related to the event currently being processed or consistent with the real-time detection results, and such events are added to a working list; (5) when no further events in the original event list can be added to the working list, a new fault scenario is constructed from the events in the working list and added to the historical fault scenario information, and the working list is cleared; (6) the next event that satisfies the selection policy is read from the original event list and the method returns to step (2); if the list contains no events, the method suspends and waits for event input.

2. The method for real-time correlation analysis of network faults according to claim 1, wherein the information model comprises: (1) object-oriented abstraction of the managed objects in the managed network; (2) a hierarchical information model formed according to the inheritance relations between the abstracted managed classes; (3) association classes in the information model that define the relations between managed classes.

3. The method for real-time correlation analysis of network faults according to claim 1 or 2, wherein the topology dependencies comprise: (1) keeping the topology dependencies consistent with the actual network topology during network operation; (2) setting the network node on which the fault correlation analysis program runs as the reference point; (3) computing, from the reference point, the reachability dependencies for reaching each of the other nodes; (4) using topology change notifications from devices to trigger the topology synchronization program to recompute the topology dependencies from the latest topology.

4. The method for real-time correlation analysis of network faults according to claim 1, wherein the inference process comprises: (1) assigning a confidence probability to each inference step and deriving the probability of the final analysis result from the per-step probabilities; (2) defining, when a fault scenario is created, a time constraint function that describes the timing characteristics of events and the temporal relations between correlated events; (3) representing and matching alarm content using a formal method.

5. The method for real-time correlation analysis of network faults according to claim 1, wherein the historical fault scenario information is organized as a fault scenario table that supports fast lookup.

6. The method for real-time correlation analysis of network faults according to claim 1, wherein step (1) further comprises: (1-1) when handling different event types, dynamically changing the length of the original event queue according to predetermined rules; (1-2) determining, according to event level and user-defined rules, which events serve as starting points for correlation analysis; (1-3) preprocessing the original events: providing an extensible event acquisition interface for fault events of different protocols, converting them into a unified internal format, and filtering them.

7. The method for real-time correlation analysis of network faults according to claim 1, wherein constructing the new fault scenario comprises: (1) extracting fault characteristic parameters; (2) extracting the fault propagation path; (3) constructing a new fault resolution scenario from the fault characteristic parameters and the propagation path.

8. A system for real-time correlation analysis of network faults, comprising: an analysis control engine, which calls the other modules and interfaces according to the analysis control engine algorithm to complete the fault correlation analysis; an event extraction interface, which receives the network events sent by network devices, converts them into a unified format, and writes them into an original event list for use by the analysis control engine; a real-time network parameter detection interface, which detects real-time information such as the attributes, performance, and reachability of devices and services in the network, is called by the analysis control engine, accepts parameters from the fault analysis engine that determine which network device to probe in real time, and returns the result to the analysis control engine; an information model, which describes a set of management classes corresponding to network protocol objects and device objects, together with the interdependencies between them; an information model query interface, comprising functions for querying management classes, their attributes, and the relations between them from the information model, which supplies the analysis control engine with information from the information model at run time; a topology synchronization module, which is triggered by network topology change events to run the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network topology connections, and stores them in a topology dependency library, the topology dependency library providing the relevant information to the analysis control engine; and a fault scenario table generation module, which builds a fault scenario from a group of events that have been found to be correlated, stores it in a fault scenario table, and matches subsequent events against the fault scenario table.

9. The system for real-time correlation analysis of network faults according to claim 8, wherein the information model is stored as a hash table file, and the analysis control engine retrieves information from the information model through the model query interface during analysis.

10. The system for real-time correlation analysis of network faults according to claim 8 or 9, further comprising a preprocessing module, which preprocesses the received original events according to predetermined preprocessing rules.
CNB031347290A 2003-09-29 2003-09-29 Method and system for real-time correlation analysis of network faults Expired - Fee Related CN100456687C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB031347290A CN100456687C (en) 2003-09-29 2003-09-29 Method and system for real-time correlation analysis of network faults

Publications (2)

Publication Number Publication Date
CN1529455A CN1529455A (en) 2004-09-15
CN100456687C true CN100456687C (en) 2009-01-28

Family

ID=34286184

Country Status (1)

Country Link
CN (1) CN100456687C (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997050209A1 (en) * 1996-06-27 1997-12-31 Telefonaktiebolaget Lm Ericsson (Publ) A method for fault control of a telecommunications network and a telecommunications system
WO2003036914A1 (en) * 2001-10-25 2003-05-01 General Dynamics Government Systems Corporation A method and system for modeling, analysis and display of network security events

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: GANGWAN NETWORK CO., LTD.

Effective date: 20061013

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20061013

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant after: Huawei Technologies Co., Ltd.

Address before: 100089, No. 21 West Third Ring Road, Beijing, Haidian District, Long Ling Building, 13th floor

Applicant before: Harbour Networks Holdings Limited

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGIES SERVICE GMBH

Free format text: FORMER OWNER: HUAWEI TECHNOLOGY CO LTD

Effective date: 20120217

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518129 SHENZHEN, GUANGDONG PROVINCE TO: 065000 LANGFANG, HEBEI PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20120217

Address after: 065000 west of Wangjing Road, Langfang Economic and Technological Development Zone, Hebei

Patentee after: Huawei Technologies Service Co., Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: Huawei Technologies Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090128

Termination date: 20150929

EXPY Termination of patent right or utility model