CN110493025A - Method and device for fault root cause diagnosis based on multi-layer directed graph - Google Patents

Info

Publication number
CN110493025A
Authority
CN
China
Prior art keywords
node
service
business
root cause
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810461456.6A
Other languages
Chinese (zh)
Other versions
CN110493025B (en)
Inventor
乔柏林
叶晓龙
任赣
唐涛
蒋通通
胡林熙
蒋健
竺士杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Zhejiang Co Ltd
Priority to CN201810461456.6A
Publication of CN110493025A
Application granted
Publication of CN110493025B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0677 Localisation of faults
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Monitoring And Testing Of Exchanges (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the invention discloses a method and device for fault root cause diagnosis based on a multi-layer directed graph. The method determines the call relationships among the service nodes jointly from the original service data and the attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node is added to the multi-layer directed graph model when the model is built from the call relationships, which lays the foundation for accurately and quickly locating the root cause node that produces abnormal service data. Because the service nodes in the model are generated from the actual service nodes, the completeness of the node set avoids the situation in which the root cause of a newly emerging fault cannot be queried. In addition, the data is analyzed not merely on the basis of call relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.

Description

Method and device for fault root cause diagnosis based on a multi-layer directed graph

Technical field

Embodiments of the present invention relate to the technical field of computer software, and in particular to a method and device for fault root cause diagnosis based on a multi-layer directed graph.

Background

With the spread of cloud computing and container clouds, a large number of IT application systems are gradually being deployed in virtualized and containerized environments. As business scenarios multiply and business volume grows explosively, the maintainability of systems and applications faces a major challenge. In the telecommunications industry in particular, operators have built a great many application systems to provide consumers with a variety of services, and some system functions span sub-functions of several business systems and work only when those systems cooperate. Architectural evolution further increases the complexity of such business systems and places higher demands on the ability to locate and resolve operation and maintenance (O&M) faults.

Current fault diagnosis methods fall into three types: scheme 1 is fault diagnosis based on a library of contingency plans such as alarms, scheme 2 is fault diagnosis based on offline indicator analysis tools, and scheme 3 is fault diagnosis and repair based on a decision tree model.

Fault diagnosis based on alarms and contingency plan libraries: most O&M departments compile fault handling manuals from fault symptoms and handling records, and some equipment suppliers provide similar simple fault location capabilities, so that faults can be initially located and resolved. Besides historical fault experience, other dimensions such as QoE (quality of experience) are also used for diagnosis. When a fault occurs, key alarm information is collected and the corresponding diagnostic manual is searched to produce a diagnosis. The alarm-based approach can therefore locate and repair routine faults simply and quickly, but it is helpless when an unknown fault does not match any known alarm information.

Fault diagnosis based on offline indicator analysis tools: these tools cover business indicators and system operation indicators. The former mainly reflect business volume through the business data written to storage; the latter are mainly analyzed after external data such as logs is imported into a database. Analyzing system operation indicators such as performance, success rate, and failure distribution makes it possible to judge the health of the system. The database-based approach makes it easy to extract key system indicators and to monitor the running state of each stage of a program, but its latency is comparatively high, which affects the timeliness of system monitoring.

Fault diagnosis and repair based on a decision tree model: most systems are designed with a multi-layer topology. Following the principle of layered calls, a tree-shaped topology graph is built, and on top of it a decision tree for business and system faults. When a fault occurs, key fault information is collected and the corresponding decision tree is searched to produce a diagnosis. The decision-tree approach can therefore locate and repair routine faults simply and quickly, but it is helpless when the call relationships are not tree-structured.

However, in virtualized and containerized environments such as big data platforms, DCOS platforms, modular systems, and microservice systems, existing schemes cannot support the fast response and efficient analysis required to diagnose and repair cluster node faults or anomalies. The main shortcomings are as follows.

(1) Narrow applicability; unknown scenarios cannot be handled. The alarm and contingency plan library approach of scheme 1 relies mainly on accumulated experience with known faults and places heavy demands on the fault scenario. The same symptom may call for different handling in different fault scenarios, which is beyond the scope of a simple plan library. When the fault information is unknown, existing manuals are of no use at all, and staff must troubleshoot step by step to locate the fault and fix the problem, so O&M efficiency is low.

(2) Poor timeliness of indicators; information cannot be fed back in time. Existing schemes improve fault location only by collecting more fault information, while the final location and repair still depend on the judgment and actions of O&M personnel. Massive monitoring indicator data greatly widens the sources of fault information, but the indicators are collected with high delay, so the data cannot be processed and analyzed automatically well enough to reveal the key information and root cause of a problem in time.

(3) Massive historical data is required, which does not suit agile operation. Existing schemes mainly train decision trees to improve analysis capability, but training requires a large amount of historical data. Given the business characteristics of the applicant's systems, a large share of problems are new, so not enough valid training data is available. The resulting decision tree model is inaccurate, its root cause analysis is weak, and it cannot provide effective support.

In the course of implementing the embodiments of the present invention, the inventors found that existing methods for finding the root cause of a fault adapt poorly to changing environments and cannot query the root cause of newly emerging faults. Moreover, they search only along the call relationships of service nodes, so the analysis of the data is one-dimensional and the data analysis capability is weak.

Summary of the invention

The technical problem to be solved by the present invention is that existing methods for finding the root cause of a fault adapt poorly to changing environments, cannot query the root cause of newly emerging faults, and search only along the call relationships of service nodes, so that the analysis of the data is one-dimensional and the data analysis capability is weak.

To address the above technical problem, an embodiment of the present invention provides a method for fault root cause diagnosis based on a multi-layer directed graph, comprising:

obtaining the original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;

establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;

obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service to be abnormal.

An embodiment of the present invention provides a device for fault root cause diagnosis based on a multi-layer directed graph, comprising:

an acquisition module, configured to obtain the original service data generated at each service node of a preset service, and to determine the call relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;

an establishing module, configured to establish a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;

a root cause determination module, configured to obtain the abnormal service data in the original service data, to determine, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and to determine, from the root cause nodes, the target root cause node that causes the preset service to be abnormal.

This embodiment provides an electronic device, comprising:

at least one processor, at least one memory, a communication interface, and a bus; wherein

the processor, the memory, and the communication interface communicate with one another through the bus;

the communication interface is used for information transmission between the electronic device and a communication device of a terminal device;

the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method described above.

This embodiment provides a computer program product. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, and the computer program comprises program instructions which, when executed by a computer, cause the computer to execute the method described above.

Embodiments of the present invention provide a method and device for fault root cause diagnosis based on a multi-layer directed graph. The method determines the call relationships among the service nodes jointly from the original service data and the attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node is added to the multi-layer directed graph model when the model is built from the call relationships, which lays the foundation for accurately and quickly locating the root cause node that produces abnormal service data. Because the service nodes in the model are generated from the actual service nodes, the completeness of the node set avoids the situation in which the root cause of a newly emerging fault cannot be queried. In addition, the data is analyzed not merely on the basis of call relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.

Description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a method for fault root cause diagnosis based on a multi-layer directed graph provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the architecture of fault root cause diagnosis with a multi-layer directed graph provided by another embodiment of the present invention;

Fig. 3 is a schematic flow chart of a fault root cause query provided by another embodiment of the present invention;

Fig. 4 is a structural block diagram of a device for fault root cause diagnosis based on a multi-layer directed graph provided by another embodiment of the present invention;

Fig. 5 is a structural block diagram of an electronic device provided by another embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of the method for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment. Referring to Fig. 1, the method comprises:

101: obtaining the original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;

102: establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;

103: obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service to be abnormal.

The method provided in this embodiment is typically executed by a device that diagnoses and repairs faults in the operation of a service, for example a server; this embodiment places no specific limit on this. The method is used to query the root cause of a service in which a fault has occurred. A service node is a node involved in running the preset service, and the data collected at each service node is the original service data of that service. The attribute information is predefined for each service node and reflects the call relationships among the service nodes. The service nodes can also be assigned to layers according to the attribute information, for example nodes at the application layer and nodes at the transport layer. The multi-layer directed graph model is created with reference to these pre-divided layers. When the multi-layer directed graph is used to find the root cause node that makes the preset service abnormal, the search proceeds layer by layer along the call relationships of the service nodes. The target root cause node is usually obtained by calculation; the specific calculation method can be configured and is not specifically limited in this embodiment.

This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method determines the call relationships among the service nodes jointly from the original service data and the attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node is added to the multi-layer directed graph model when the model is built from the call relationships, which lays the foundation for accurately and quickly locating the root cause node that produces abnormal service data. Because the service nodes in the model are generated from the actual service nodes, the completeness of the node set avoids the situation in which the root cause of a newly emerging fault cannot be queried. In addition, the data is analyzed not merely on the basis of call relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.

Further, on the basis of the above embodiment, obtaining the original service data generated at each service node of the preset service and determining the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node comprises:

obtaining the original service data generated at each service node of the preset service and the attribute information of each service node stored in the CMDB database, and deriving the original call relationships among the service nodes from the attribute information;

analyzing the actual call relationships of the service nodes from the original service data, and adjusting the original call relationships according to the actual call relationships to obtain the call relationships of the service nodes determined from the original service data and the attribute information.

The CMDB database stores and manages the configuration information of the devices in an enterprise IT architecture. For the service nodes of the preset service, the call relationships are first derived from the attribute information defined in the CMDB database. In practice, however, service nodes may have been added to the preset service without being recorded in the CMDB database. After the original call relationships have been determined, the call relationships between the new nodes and the other service nodes therefore need to be supplemented from the original service data, which finally yields call relationships that match reality and are determined from both the original service data and the attribute information.
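The adjustment described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the CMDB-derived relationships and the relationships observed in the original service data are available as `(caller, callee)` pairs; the function name `merge_call_relations` and the sample node names are hypothetical and do not come from the patent. The sketch only supplements missing relationships, since the text does not say whether stale CMDB entries are removed.

```python
def merge_call_relations(cmdb_edges, observed_calls):
    """Supplement CMDB-declared call edges with edges seen in service data.

    Both arguments are iterables of (caller, callee) pairs. This pair format
    is an assumption made for the sketch, not something the patent specifies.
    Returns the merged edge set and the nodes that were missing from the CMDB.
    """
    declared = set(cmdb_edges)
    observed = set(observed_calls)
    merged = declared | observed  # only adds edges; nothing is removed

    declared_nodes = {node for edge in declared for node in edge}
    merged_nodes = {node for edge in merged for node in edge}
    new_nodes = merged_nodes - declared_nodes
    return merged, new_nodes


# Hypothetical example: the "cache" node was deployed but never recorded.
cmdb_edges = [("web", "order"), ("order", "db")]
observed_calls = [("web", "order"), ("order", "cache"), ("cache", "db")]
call_relations, new_nodes = merge_call_relations(cmdb_edges, observed_calls)
```

The returned `new_nodes` set is the list of nodes that would have to be written back to the CMDB before the layers are built.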

This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method adjusts the original call relationships obtained from the CMDB database so that the final call relationships include every call relationship that occurs while the service is actually running, which guarantees that the root cause of newly emerging faults can also be queried.

Further, on the basis of the above embodiments, establishing the multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs, comprises:

correcting the service nodes stored in the CMDB database according to the call relationships of the service nodes determined from the original service data and the attribute information;

obtaining the service nodes of the pre-divided i-th layer in the corrected CMDB database and, for each service node $v_n$ of the i-th layer in the CMDB database, obtaining, according to the call relationships determined from the original service data and the attribute information, the target service nodes that are reachable from $v_n$ and that can also reach $v_n$;

adding the target service nodes corresponding to each service node of the i-th layer in the CMDB database to an i-th layer node set, the nodes in the i-th layer node set being the nodes of the i-th layer of the multi-layer directed graph model.

For example, if a service node is added while the service is actually running, that node needs to be added to the CMDB database so that the database is updated in time. In the CMDB database the service nodes are divided into layers in advance according to their attributes; for example, service nodes belonging to the application layer form one layer, and service nodes belonging to the transport layer form another.

In the multi-layer directed graph model, the i-th layer node set can be expressed as $L_i = \{R_1 \cap A_1, R_2 \cap A_2, \dots, R_n \cap A_n\}$, where $R_n$ denotes the set of all nodes reachable from $v_n$ and $A_n$ denotes the set of all nodes that can reach $v_n$. The i-th layer in the CMDB database contains n service nodes in total, namely $v_1, v_2, \dots, v_n$.
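As an illustration of the formula, the following sketch computes the i-th layer node set by taking, for each CMDB node $v_n$ of the layer, the intersection of its reachable set $R_n$ and its antecedent set $A_n$, and collecting the results. The function names, the edge-list input format, and the convention that a node counts as reaching itself are assumptions made for this sketch rather than details stated in the patent.

```python
from collections import defaultdict, deque


def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def layer_node_set(layer_nodes, edges):
    """Collect R_n ∩ A_n over every CMDB node v_n of one layer.

    layer_nodes: nodes the CMDB assigns to layer i (hypothetical input).
    edges: (caller, callee) pairs of the adjusted call relationships.
    """
    forward = defaultdict(set)   # caller -> callees, used for R_n
    backward = defaultdict(set)  # callee -> callers, used for A_n
    for caller, callee in edges:
        forward[caller].add(callee)
        backward[callee].add(caller)

    layer = set()
    for v in layer_nodes:
        r_n = reachable(v, forward)   # nodes reachable from v
        a_n = reachable(v, backward)  # nodes that can reach v
        layer |= r_n & a_n
    return layer


# Hypothetical call graph: a and b call each other, b also calls c.
edges = [("a", "b"), ("b", "a"), ("b", "c")]
layer_i = layer_node_set(["a"], edges)
```

Here node `c` is reachable from `a` but cannot reach it, so it is left out of the layer, while `b` joins it even if the CMDB listed only `a`.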

本实施例提供了一种基于多层有向图的故障根因诊断的方法,该方法根据CMDB数据库中各业务节点所属层得到多层有向图模型中各层的业务节点。This embodiment provides a method for diagnosing the root cause of a fault based on a multi-layer directed graph. The method obtains the service nodes of each layer in the multi-layer directed graph model according to the layer to which each service node belongs in the CMDB database.

Further, on the basis of the above embodiments, obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal, includes:

Judging, according to a preset threshold interval, whether the original service data generated at each service node is abnormal, and obtaining all abnormal service data in the original service data;

Mapping each item of abnormal service data to the service node in the multi-layer directed graph model that generated it, and searching for at least one root cause node that causes the preset service to be abnormal, according to the call relationships of the service nodes in the multi-layer directed graph model and the layers to which the service nodes belong in the model;

Constructing the time series data $\langle m, k, T, E_{m\times k}\rangle$; with $x_i(t)$ as the independent variable and $E_{m\times k} - x_i(t)$ as the dependent variable, constructing the function $f[x_i(t)] = E_{m\times k} - x_i(t)$; perturbing the values $x_i(t) \sim x_i(t-k)$ of each root cause node over all time series to obtain the fluctuation value $y[\delta, f[x_i(t)]]$ of each root cause node; and taking the root cause nodes whose fluctuation value is smaller than a preset fluctuation value as the target root cause nodes;

where m is the number of service nodes in the multi-layer directed graph model, k is the number of time lags of each service node, T is the length of the time series, $E_{m\times k}$ is the set of all service nodes in the model over all time lags, $\delta$ is a parameter related to the multi-layer directed graph model, j is the total number of root cause nodes, and $x_i(t)$ is the service data of the i-th service node when the time series length is t.

Whether service data is abnormal can be judged against a preset threshold range, or the service data can first be processed by a computation and then judged from the result; this embodiment does not specifically limit this. During the root cause search, only the abnormal service data needs to be mapped into the multi-layer directed graph model.
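The threshold-based judgment can be illustrated with a short sketch. The record layout (node name, metric value), the per-node `(low, high)` intervals, and the sample values are assumptions made for the example; the source does not specify them.

```python
def find_abnormal(records, thresholds):
    """Group the values that fall outside each node's threshold interval.

    records: iterable of (node, value) pairs (hypothetical layout).
    thresholds: dict mapping node -> (low, high) closed interval.
    Returns a dict node -> list of abnormal values, i.e. the abnormal
    data already mapped to the node that generated it.
    """
    abnormal = {}
    for node, value in records:
        low, high = thresholds[node]
        if not (low <= value <= high):
            abnormal.setdefault(node, []).append(value)
    return abnormal


# Hypothetical response times in milliseconds.
thresholds = {"web": (0.0, 500.0), "db": (0.0, 50.0)}
records = [("web", 120.0), ("db", 75.0), ("web", 900.0)]
abnormal = find_abnormal(records, thresholds)
```

Only the nodes that appear as keys of `abnormal` need to be marked in the multi-layer directed graph model before the root cause search.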

The root cause nodes are searched for according to the layer to which each service node belongs and the call relationships between the service nodes. For example, if a group of service nodes linked by call relationships has an abnormal node in every layer, the service node in the lowest layer is usually the root cause node. If such a group has no abnormal service node in some layer, the service nodes above that layer and the service nodes below it should be treated as two independent parts, and the root cause search is carried out on each part separately.
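One way to read the layer rule is sketched below. The representation of a call chain as a top-to-bottom list of layers, each mapping node names to an abnormal flag, is an assumption made for illustration, as are the node names.

```python
def root_cause_candidates(chain_layers):
    """Apply the layer rule to one call chain.

    chain_layers: list of layers ordered from top to bottom; each layer is
    a dict mapping node name -> True if the node is abnormal.
    A layer without any abnormal node splits the chain into independent
    parts; in each part, the abnormal nodes of the lowest layer are
    returned as root cause candidates.
    """
    candidates = []
    lowest_abnormal = None  # abnormal nodes of the lowest layer seen in the current part
    for layer in chain_layers:
        abnormal = [node for node, bad in layer.items() if bad]
        if abnormal:
            lowest_abnormal = abnormal
        else:
            # Gap layer: close the current part and start a new one.
            if lowest_abnormal:
                candidates.extend(lowest_abnormal)
            lowest_abnormal = None
    if lowest_abnormal:
        candidates.extend(lowest_abnormal)
    return candidates


# Hypothetical chain: the middleware layer is healthy, so it splits the chain.
chain = [
    {"portal": True},
    {"order": True},
    {"mq": False},
    {"oracle": True},
]
candidates = root_cause_candidates(chain)
```

Here the part above the healthy middleware layer yields `order`, and the part below it yields `oracle`.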

After the root cause nodes are found, they are sorted by the fluctuation value calculated for each of them. The smaller the fluctuation value, the more likely it is that the root cause node caused the service abnormality; the several most likely root cause nodes are taken as the target root cause nodes.
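The perturbation and ranking step can be sketched as follows. This is only an illustration under stated assumptions: the fitted function is supplied by the caller (a hand-written linear function stands in for whatever fitting method is actually used), the perturbation is a relative change of size `delta`, and the fluctuation value is taken as the mean absolute change of the function output. Following the text above, candidates are sorted in ascending order of fluctuation value.

```python
def fluctuation(f, samples, index, delta):
    """Mean absolute change of f when input `index` is scaled by (1 + delta)."""
    total = 0.0
    for row in samples:
        perturbed = list(row)
        perturbed[index] = row[index] * (1.0 + delta)
        total += abs(f(perturbed) - f(row))
    return total / len(samples)


def rank_by_fluctuation(f, samples, names, delta=0.1):
    """Return (name, fluctuation) pairs, smallest fluctuation first."""
    scores = {
        name: fluctuation(f, samples, i, delta) for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1])


def fitted(row):
    """Stand-in for a fitted function (hypothetical coefficients)."""
    return 2.0 * row[0] + 0.5 * row[1]


# Hypothetical time series samples for two candidate nodes.
samples = [[10.0, 10.0], [20.0, 20.0]]
ranking = rank_by_fluctuation(fitted, samples, ["db_latency", "cache_latency"])
```

The first entries of `ranking` would then be compared against the preset fluctuation value to select the target root cause nodes.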

This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method searches for root causes through the multi-layer directed graph model and analyzes the data along multiple dimensions, which improves the accuracy of the root cause search. Determining the target root cause nodes from among the root cause nodes narrows the range of nodes that must be considered when repairing the service and improves the efficiency of the repair.

Further, on the basis of the above embodiments, before obtaining the original service data generated at each service node of the preset service, the method further includes:

Performing a KEI indicator evaluation on each service to judge whether the service is in a healthy state; if the service is not in a healthy state, taking the service as the preset service and obtaining the original service data generated at each service node of the preset service.

The KEI (key performance indicator) is used to evaluate whether a service is in a healthy state. The method provided in this embodiment performs root cause diagnosis only on services in an unhealthy state.
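The health gate can be pictured with a minimal sketch. The source does not say how a KEI value is computed or compared, so the numeric score (higher meaning healthier), the threshold, and the service names below are all assumptions.

```python
def select_unhealthy(kei_scores, healthy_threshold):
    """Return the services whose KEI score is below the healthy threshold.

    kei_scores: dict mapping service name -> KEI score (hypothetical scale).
    Only the returned services become "preset services" for diagnosis.
    """
    return [
        service
        for service, score in kei_scores.items()
        if score < healthy_threshold
    ]


# Hypothetical scores on a 0 to 100 scale.
kei_scores = {"recharge": 97.5, "order_query": 62.0, "login": 99.1}
preset_services = select_unhealthy(kei_scores, healthy_threshold=90.0)
```

Healthy services are skipped entirely, so no service data is collected or analyzed for them.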

This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method uses the KEI indicator to screen out the services in an unhealthy state and performs root cause diagnosis on them, which avoids unnecessary diagnosis of healthy services.

Further, on the basis of the above embodiments, after determining from the root cause nodes the target root cause node that causes the preset service to be abnormal, the method further includes:

Judging whether a fault handling plan for repairing the target root cause node is stored; if so, repairing the target root cause node according to the fault handling plan and sending first prompt information indicating that the target root cause node has been repaired; otherwise, sending the node information of the target root cause node together with second prompt information indicating that the target root cause node has not been repaired.

After the target root cause node is determined, it needs to be repaired to ensure the normal operation of the system. The first prompt information and the second prompt information may be sent by email or by short message; this embodiment does not specifically limit this.
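The branch between automatic repair and notification can be sketched as below. The plan registry (a dict of callables), the `notify` callback, and the message wording are hypothetical; the source only states that a stored plan is executed when one exists and that a prompt is sent in either case.

```python
def handle_root_cause(node, plans, notify):
    """Repair `node` if a stored plan exists, otherwise ask for manual action.

    plans: dict mapping node name -> callable that performs the repair.
    notify: callable taking one message string (e.g. an email or SMS sender).
    Returns True if a plan was executed.
    """
    plan = plans.get(node)
    if plan is not None:
        plan(node)
        notify(f"first prompt: {node} has been repaired")
        return True
    notify(f"second prompt: {node} has not been repaired")
    return False


# Hypothetical usage: messages and repairs are captured in lists.
messages = []
restarted = []
plans = {"order": restarted.append}
handled_order = handle_root_cause("order", plans, messages.append)
handled_db = handle_root_cause("oracle", plans, messages.append)
```

Passing the notification channel in as a callable keeps the sketch independent of whether email or short message is used.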

This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. When the fault can be repaired automatically, the method repairs it promptly; when it cannot, the method promptly sends prompt information so that staff can apply a repair plan, ensuring the normal operation of the service.

As a more specific embodiment, Fig. 2 is a schematic diagram of the architecture for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment. Referring to Fig. 2, the architecture mainly involves the CMDB database, application topology relationship management, a directed graph model converter, a model library, an indicator management device, a fault root cause analysis device, and an automated fault handling device. The directed graph converter analyzes the existing asset data to generate the fault multi-layer directed graph model (FSDG), and the fault root cause analysis device uses the FSDG model to evaluate the real-time KEI indicators and ultimately identify the root cause of the fault.

Among the parts shown in Fig. 2: (1) The application production system processes user operations in real time; when service processing becomes abnormal, there must be an abnormal point in the application production system. The application production system is connected to the application topology management system: when call relationships arise between application services, the topology management system obtains the call relationship data.

(2) Application topology relationship management mainly consists of six components: call data collection, data cleaning, rule conversion, call relationship analysis, call behavior analysis, and continuous rule learning. It collects call data and analyzes the call relationships between the nodes in the system, providing data support for the subsequent directed graph; this data is submitted together with the CMDB data to the model converter to produce the multi-layer directed graph model.

(3) The CMDB database stores the attributes of each CI item in the application system, as well as the definitions of the various relationships between CI items. From the CMDB data, the FSDG layering model of the multi-layer directed graph can be defined and submitted to the model converter to produce the multi-layer directed graph model.

(4) The model converter processes and converts the input data, encoding it according to the data attributes. Using the application topology relationship data and the CMDB data, it converts the complex call relationships of the system into a multi-layer directed graph model. The model converter is connected to the model library: after encoding and converting the data, it submits the result to the FSDG model library;

From the CMDB data, the node set $V = \{v_i \mid v_i \text{ is an asset node managed in the CMDB}\}$ is obtained;

From the application topology relationship data, the edge set $E = \{e_{i,j} \mid e_{i,j} \text{ is a directed edge from node } v_i \text{ to node } v_j\}$ is obtained;

In the multi-layer directed graph model, all service nodes of the i-th layer are represented by the set $L_i = \{R_1 \cap A_1, R_2 \cap A_2, \ldots, R_n \cap A_n\}$.

(5) The model library contains the known system topology models, classified by service and system, for example CRM, channel, and CBOSS models; the topology levels and call relationships differ from system to system. Connection to the fault root cause analysis device: the model library feeds its information into the fault root cause analysis device, where the analysis module uses it together with the indicator data from the indicator management device to analyze the root cause of the fault.

(6) The indicator management device manages the indicator data of the services and systems, including the indicator data of each node in the multi-layer directed graph model and key indicators such as health. Connection to the fault root cause analysis device: the indicators are pushed to the analysis device and used together with the models from the model library to analyze the root cause of the fault.

(7) The fault root cause analysis device is based on the Storm big data stream computing architecture; through real-time data computation, it shortens the time needed to calculate the root cause of a fault to the order of seconds. Based on the multi-layer directed graph model and the node indicator data, it judges whether the system is abnormal; if so, it calculates the root cause node with the multi-layer directed graph algorithm, that is, it identifies the root cause of the system fault. Connection to the automated fault handling device: once the root cause of the fault has been identified, it is sent to the handling device for fault handling.

Fig. 3 is a schematic flow chart of the fault root cause search provided in this embodiment. Referring to Fig. 3, the process includes:

The KEI model is used to evaluate the indicator data of the top layer of the FSDG model. If the evaluation result is healthy, the system performs no further analysis; if the evaluation result is unhealthy, the FSDG fault root cause analysis process is triggered to calculate the fault source.

The FSDG fault node set is processed with the naive causality mining algorithm to construct the fault causality mining object FCS. The FCS is all the time series data generated by the elements in the system, formally expressed as the four-tuple $\langle m, k, T, E_{m\times k}\rangle$, where m is the number of elements in the FSDG, k is the number of time lags of each element, T is the length of the time series, and $E_{m\times k}$ is the set of all elements in the system over all time lags. The FSDG graph may contain several links of service nodes that require fault root cause diagnosis; $C_1, \ldots, C_n$ denote the FSDG graphs obtained by splitting the graph into one graph per link of associated service nodes.

During the fluctuation calculation, let target $= x_i(t)$ and variables $= E_{m\times k} - x_i(t)$. With target as the dependent variable and variables as the independent variables, GEP-based function fitting yields the function $f_{x_i(t)}$. Each element in the independent variable set variables of $f_{x_i(t)}$ is then perturbed in turn. Because the time lag of the system is k, the values $x_j(t) \sim x_j(t-k)$ of each element $x_j$ over all time series are perturbed. From the perturbations, the fluctuation value $\delta f_{x_i(t)}(x_i, \delta)$ of each element is calculated, and causality is then judged from the magnitude of the fluctuation: the element with the smaller fluctuation value is the root cause of the fault.

(8) The automated fault handling device automatically handles the root cause of the fault. If a corresponding fault handling plan exists, the device executes it automatically, repairs the system promptly, and notifies the person responsible for the system.

Existing solutions are limited to analyzing the root causes of known faults, cannot respond flexibly to newly discovered faults, and cannot provide real-time computing capability. To address these shortcomings, the method for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment is built on Storm stream computing technology and combines the fault directed graph algorithm FSDG with the naive causality mining algorithm NCM, providing real-time, efficient, and flexible fault root cause analysis. In addition, because current modeling methods for IT operation and maintenance systems struggle to obtain massive data for training, the method provided in this embodiment proposes a fast modeling approach that generates the FSDG model from the CMDB data and the application topology relationship management module, which makes model building more convenient and avoids the large model errors caused by insufficient training data.

The method for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment is not limited to locating and handling known faults; for new faults it can perform root cause analysis automatically according to the model. It strengthens fault data analysis: real-time data computation avoids the data backlog caused by information explosion. It also improves automatic fault handling: by introducing an automated handling device, it achieves closed-loop management of faults from automatic discovery and location to final handling.

Fig. 4 is a structural block diagram of the device for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment. Referring to Fig. 4, the device includes an acquisition module 401, an establishment module 402, and a root cause determination module 403, wherein:

The acquisition module 401 is configured to obtain the original service data generated at each service node of a preset service, and to determine the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node;

The establishment module 402 is configured to establish a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong;

The root cause determination module 403 is configured to obtain the abnormal service data in the original service data, to determine, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and to determine from the root cause nodes the target root cause node that causes the preset service to be abnormal.

The device for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment is applicable to the method for fault root cause diagnosis based on a multi-layer directed graph provided in the above embodiments, and the details are not repeated here.

This embodiment provides a device for fault root cause diagnosis based on a multi-layer directed graph. The device determines the call relationships of the service nodes from both the original service data and the attribute information, so that service nodes and call relationships newly added in practice are fully taken into account. This ensures that every service node can be added to the multi-layer directed graph model when the model is established from the call relationships, and lays the foundation for accurately and quickly finding, based on the model, the root cause nodes that produce abnormal service data. Because the service nodes in the multi-layer directed graph model of this embodiment are generated from the actual service nodes, the completeness of the nodes avoids situations in which no root cause search can be performed for a newly emerging fault. In addition, the data is not analyzed on the basis of the call relationships alone, but comprehensively on the basis of the multi-layer directed graph model that has been created.

Fig. 5 is a structural block diagram of the electronic device provided in this embodiment.

Referring to Fig. 5, the electronic device includes a processor 501, a memory 502, a communications interface 503, and a bus 504;

wherein:

The processor 501, the memory 502, and the communications interface 503 communicate with one another through the bus 504;

The communications interface 503 is used for information transmission between this electronic device and the communication devices of other electronic devices;

The processor 501 is configured to call the program instructions in the memory 502 to execute the methods provided by the above method embodiments, for example including: obtaining the original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node; establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong; obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal.

This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing a computer to execute the methods provided by the above method embodiments, for example including: obtaining the original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node; establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong; obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal.

This embodiment discloses a computer program product that includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example including: obtaining the original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node; establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong; obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The embodiments described above, such as the electronic device, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the implementations, those skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for fault root cause diagnosis based on a multi-layer directed graph, characterized by comprising:

obtaining original service data generated at each service node of a preset service, and determining the call relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;

establishing a multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong;

obtaining abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal.

2. The method according to claim 1, characterized in that obtaining the original service data generated at each service node of the preset service, and determining the call relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node, comprises:

obtaining the original service data generated at each service node of the preset service and the attribute information of each service node stored in a CMDB database, and obtaining the original call relationships between the service nodes according to the attribute information of each service node;

analyzing the actual call relationships of the service nodes according to the original service data, and adjusting the original call relationships according to the actual call relationships, to obtain the call relationships of the service nodes determined from the original service data and the attribute information.

3. The method according to claim 2, characterized in that establishing the multi-layer directed graph model of the service nodes according to the call relationships determined from the original service data and the attribute information, and according to the pre-divided layers to which the service nodes belong, comprises:

correcting the service nodes stored in the CMDB database according to the call relationships of the service nodes determined from the original service data and the attribute information;

obtaining the service nodes of the i-th layer from the corrected CMDB database, in which the layers have been divided in advance, and, for each service node $v_n$ of the i-th layer in the CMDB database, obtaining, according to the call relationships determined from the original service data and the attribute information, the target service nodes that are reachable from the service node $v_n$ and can also reach the service node $v_n$;

adding the target service nodes corresponding to each service node of the i-th layer in the CMDB database to the i-th layer node set, the nodes in the i-th layer node set being the nodes of the i-th layer in the multi-layer directed graph model.

4. The method according to claim 3, characterized in that obtaining the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining from the root cause nodes the target root cause node that causes the preset service to be abnormal, comprises:

judging, according to a preset threshold interval, whether the original service data generated at each service node is abnormal, and obtaining all abnormal service data in the original service data;

mapping each item of abnormal service data to the service node in the multi-layer directed graph model that generated it, and searching for at least one root cause node that causes the preset service to be abnormal, according to the call relationships of the service nodes in the multi-layer directed graph model and the layers to which the service nodes belong in the model;

constructing the time series data $\langle m, k, T, E_{m\times k}\rangle$; with $x_i(t)$ as the independent variable and $E_{m\times k} - x_i(t)$ as the dependent variable, constructing the function $f[x_i(t)] = E_{m\times k} - x_i(t)$; perturbing the values $x_i(t) \sim x_i(t-k)$ of each root cause node over all time series to obtain the fluctuation value $y[\delta, f[x_i(t)]]$ of each root cause node; and taking the root cause nodes whose fluctuation value is smaller than a preset fluctuation value as the target root cause nodes;

Wherein, m is the number of business 
nodes in the multi-layer directed graph model, k is the number of time lags that each business node exists, T is the length of the time series, and E m × k is the multi-layer directed graph The set of all business nodes in the model on all time lags, δ is a parameter related to the multi-layer directed graph model, the total number of root cause nodes is j, x i (t) is the ith business node in The corresponding business data when the time series length is t. 5.根据权利要求1所述的方法,其特征在于,所述获取在预设业务的各业务节点处生成的原始业务数据之前,还包括:5. The method according to claim 1, characterized in that, before the acquisition of the original service data generated at each service node of the preset service, further comprising: 对每一业务进行KEI指标评估,判断该业务是否处于健康状态,若该业务未处于健康状态,则将该业务作为所述预设业务,获取在所述预设业务的各业务节点处生成的原始业务数据。Carry out KEI index evaluation for each business to determine whether the business is in a healthy state, if the business is not in a healthy state, then use this business as the preset business, and obtain the Raw business data. 6.根据权利要求1所述的方法,其特征在于,所述从根因节点中确定导致所述预设业务异常的目标根因节点之后,还包括:6. The method according to claim 1, characterized in that, after determining the target root cause node that causes the preset service exception from the root cause node, further comprising: 判断是否存储有对所述目标根因节点进行修复的故障处理预案,若是,根据故障处理预案修复所述目标根因节点,并发送已经对目标根因节点进行修复的第一提示信息,否则,发送所述目标根因节点的节点信息和未对目标根因节点进行修复的第二提示信息。Judging whether there is a fault handling plan for repairing the target root cause node, if yes, repairing the target root cause node according to the fault handling plan, and sending the first prompt message that the target root cause node has been repaired, otherwise, Sending the node information of the target root cause node and the second prompt information that the target root cause node has not been repaired. 7.一种基于多层有向图的故障根因诊断的装置,其特征在于,包括:7. 
A device for fault root cause diagnosis based on multilayer directed graph, characterized in that, comprising: 获取模块,用于获取在预设业务的各业务节点处生成的原始业务数据,根据所述原始业务数据和预先存储的各业务节点的属性信息确定各业务节点的调用关系;An acquisition module, configured to acquire original service data generated at each service node of a preset service, and determine the calling relationship of each service node according to the original service data and pre-stored attribute information of each service node; 建立模块,用于根据由所述原始业务数据和所述属性信息确定的各业务节点的调用关系和与预先划分的各业务节点所属层建立各业务节点的多层有向图模型;Establishing a module for establishing a multi-layer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the pre-divided layer to which each service node belongs; 根因确定模块,用于获取所述原始业务数据中的异常业务数据,根据所述多层有向图模型确定导致所述业务业务生成所述异常业务数据的至少一个根因节点,从根因节点中确定导致所述预设业务异常的目标根因节点。A root cause determination module, configured to obtain abnormal business data in the original business data, determine at least one root cause node that causes the business business to generate the abnormal business data according to the multi-layer directed graph model, and determine from the root cause The node determines the target root cause node that causes the preset service exception. 8.一种电子设备,其特征在于,包括:8. 
An electronic device, characterized in that it comprises: 至少一个处理器、至少一个存储器、通信接口和总线;其中,at least one processor, at least one memory, a communication interface, and a bus; wherein, 所述处理器、存储器、通信接口通过所述总线完成相互间的通信;The processor, the memory, and the communication interface complete mutual communication through the bus; 所述通信接口用于该电子设备和其它电子设备的通信设备之间的信息传输;The communication interface is used for information transmission between the electronic device and communication devices of other electronic devices; 所述存储器存储有可被所述处理器执行的程序指令,所述处理器调用所述程序指令能够执行如权利要求1至6中任一项所述的方法。The memory stores program instructions executable by the processor, and the processor can execute the method according to any one of claims 1 to 6 by invoking the program instructions. 9.一种计算机程序产品,其特征在于,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行如权利要求1至6任一项所述的方法。9. A computer program product, characterized in that the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer , causing the computer to execute the method according to any one of claims 1 to 6.
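The layer construction of claim 3 and the threshold and search steps of claim 4 can be illustrated with a small sketch. This is not the patent's implementation: the node names, metric values, threshold intervals, and function names below are all hypothetical, and the rule used to pick root cause candidates (an abnormal node none of whose callees is abnormal) is an assumption standing in for the search step, which the claims do not spell out. The perturbation-based selection of the target root cause node is not shown.

```python
from collections import deque

# Hypothetical calling relationship: caller -> list of callees.
CALLS = {
    "portal": ["order-svc"],
    "order-svc": ["billing-svc", "order-db"],
    "billing-svc": ["order-svc", "billing-db"],
    "order-db": [],
    "billing-db": [],
}
# Hypothetical pre-divided layers for the nodes known to the CMDB.
# "billing-svc" is deliberately missing to show how claim 3 pulls it in.
CMDB_LAYER = {"portal": 0, "order-svc": 1, "order-db": 2, "billing-db": 2}

# Hypothetical latency readings (ms) and allowed threshold intervals.
METRICS = {"portal": 900, "order-svc": 700, "billing-svc": 650,
           "order-db": 20, "billing-db": 400}
THRESHOLDS = {"portal": (0, 500), "order-svc": (0, 300), "billing-svc": (0, 300),
              "order-db": (0, 50), "billing-db": (0, 50)}


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from `start`, including itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def build_layers(calls, cmdb_layer):
    """For each CMDB node v_n of layer i, add every node that v_n can reach
    and that can also reach v_n to the layer-i node set (v_n included)."""
    rev = reverse(calls)
    layers = {}
    for node, layer in cmdb_layer.items():
        mutual = reachable(calls, node) & reachable(rev, node)
        layers.setdefault(layer, set()).update(mutual)
    return layers


def abnormal_nodes(metrics, thresholds):
    """Nodes whose reading falls outside its preset threshold interval."""
    return {n for n, v in metrics.items()
            if not (thresholds[n][0] <= v <= thresholds[n][1])}


def root_cause_candidates(calls, abnormal):
    """Assumed rule: an abnormal node is a candidate if no callee is abnormal."""
    return {n for n in abnormal
            if not any(c in abnormal for c in calls.get(n, []))}


layers = build_layers(CALLS, CMDB_LAYER)
abnormal = abnormal_nodes(METRICS, THRESHOLDS)
candidates = root_cause_candidates(CALLS, abnormal)
print(sorted(layers[1]))   # ['billing-svc', 'order-svc']
print(sorted(candidates))  # ['billing-db']
```

In this sample, `billing-svc` joins layer 1 because it and `order-svc` call each other, and `billing-db` is the only abnormal node with no abnormal callee. The assumed candidate rule would return nothing for a cycle of mutually calling abnormal nodes with healthy dependencies, so a real search would also need the layer information that claim 4 mentions.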
CN201810461456.6A 2018-05-15 2018-05-15 A method and device for fault root cause diagnosis based on multi-layer directed graph Active CN110493025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810461456.6A CN110493025B (en) 2018-05-15 2018-05-15 A method and device for fault root cause diagnosis based on multi-layer directed graph

Publications (2)

Publication Number Publication Date
CN110493025A true CN110493025A (en) 2019-11-22
CN110493025B CN110493025B (en) 2022-06-14

Family

ID=68545155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810461456.6A Active CN110493025B (en) 2018-05-15 2018-05-15 A method and device for fault root cause diagnosis based on multi-layer directed graph

Country Status (1)

Country Link
CN (1) CN110493025B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111107158A (en) * 2019-12-26 2020-05-05 远景智能国际私人投资有限公司 Alarm method, device, equipment and medium for Internet of things equipment cluster
CN111639115A (en) * 2020-04-29 2020-09-08 国家电网有限公司客户服务中心 Five-dimensional model-based analysis method for operation and maintenance data abnormity of power grid information system
CN111858123A (en) * 2020-07-29 2020-10-30 中国工商银行股份有限公司 Fault root cause analysis method and device based on directed graph network
CN111913824A (en) * 2020-06-23 2020-11-10 中国建设银行股份有限公司 Method for determining data link fault reason and related equipment
CN112506763A (en) * 2020-11-30 2021-03-16 清华大学 Automatic positioning method and device for database system fault root
CN112541098A (en) * 2020-12-17 2021-03-23 杉数科技(北京)有限公司 Directed graph drawing method and chemical material planning method
CN112580810A (en) * 2020-12-22 2021-03-30 济南中科成水质净化有限公司 Sewage treatment process analysis and diagnosis method based on directed acyclic graph
CN112711493A (en) * 2020-12-25 2021-04-27 上海精鲲计算机科技有限公司 Scenario root cause analysis application
CN112887108A (en) * 2019-11-29 2021-06-01 中兴通讯股份有限公司 Fault positioning method, device, equipment and storage medium
CN113282884A (en) * 2021-04-28 2021-08-20 沈阳航空航天大学 General root cause analysis method
CN113793128A (en) * 2021-09-18 2021-12-14 北京京东振世信息技术有限公司 Method, apparatus, device and computer-readable medium for generating service failure cause information
CN113970913A (en) * 2020-07-24 2022-01-25 华为技术有限公司 A fault diagnosis method and device
CN114356859A (en) * 2021-12-30 2022-04-15 中国电信股份有限公司 Data import method and apparatus, device, and computer-readable storage medium
CN114371950A (en) * 2020-10-15 2022-04-19 中国移动通信集团浙江有限公司 Root cause positioning method and device for application service abnormity
CN114461434A (en) * 2022-02-11 2022-05-10 中国工商银行股份有限公司 Failure root cause analysis method, device, electronic equipment and medium
CN114629776A (en) * 2020-12-11 2022-06-14 中国联合网络通信集团有限公司 Fault analysis method and device based on graph model
CN115265710A (en) * 2022-08-25 2022-11-01 华北电力大学 Desulfurization system absorption tower liquid level measurement system and method based on Gaussian process regression
CN117061332A (en) * 2023-10-11 2023-11-14 中国人民解放军国防科技大学 Fault diagnosis method and system based on probability directed graph deep learning
WO2025081616A1 (en) * 2023-10-19 2025-04-24 天翼电子商务有限公司 Service risk determination method and apparatus, and storage medium and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157723A1 (en) * 2007-12-14 2009-06-18 Bmc Software, Inc. Impact Propagation in a Directed Acyclic Graph
CN106330501A (en) * 2015-06-26 2017-01-11 中兴通讯股份有限公司 A fault correlation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
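赵靓: "A method for fault root cause diagnosis based on multi-layer directed graph" (一种基于多层有向图的故障根因诊断的方法), 《中国优秀硕士学位论文期刊网》 *
郑皎凌: "Perturbation-based causality mining in sub-complex dynamical systems" (基于扰动的亚复杂动力系统因果关系挖掘), 《计算机学报》 (Chinese Journal of Computers) *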

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887108A (en) * 2019-11-29 2021-06-01 中兴通讯股份有限公司 Fault positioning method, device, equipment and storage medium
CN111107158B (en) * 2019-12-26 2023-02-17 远景智能国际私人投资有限公司 Alarm method, device, equipment and medium for Internet of things equipment cluster
CN111107158A (en) * 2019-12-26 2020-05-05 远景智能国际私人投资有限公司 Alarm method, device, equipment and medium for Internet of things equipment cluster
CN111639115A (en) * 2020-04-29 2020-09-08 国家电网有限公司客户服务中心 Five-dimensional model-based analysis method for operation and maintenance data abnormity of power grid information system
CN111913824B (en) * 2020-06-23 2024-03-05 中国建设银行股份有限公司 Method for determining data link fault cause and related equipment
CN111913824A (en) * 2020-06-23 2020-11-10 中国建设银行股份有限公司 Method for determining data link fault reason and related equipment
CN113970913A (en) * 2020-07-24 2022-01-25 华为技术有限公司 A fault diagnosis method and device
CN111858123B (en) * 2020-07-29 2023-09-26 中国工商银行股份有限公司 Fault root cause analysis method and device based on directed graph network
CN111858123A (en) * 2020-07-29 2020-10-30 中国工商银行股份有限公司 Fault root cause analysis method and device based on directed graph network
CN114371950A (en) * 2020-10-15 2022-04-19 中国移动通信集团浙江有限公司 Root cause positioning method and device for application service abnormity
CN112506763A (en) * 2020-11-30 2021-03-16 清华大学 Automatic positioning method and device for database system fault root
CN114629776A (en) * 2020-12-11 2022-06-14 中国联合网络通信集团有限公司 Fault analysis method and device based on graph model
CN112541098A (en) * 2020-12-17 2021-03-23 杉数科技(北京)有限公司 Directed graph drawing method and chemical material planning method
CN112580810A (en) * 2020-12-22 2021-03-30 济南中科成水质净化有限公司 Sewage treatment process analysis and diagnosis method based on directed acyclic graph
CN112711493A (en) * 2020-12-25 2021-04-27 上海精鲲计算机科技有限公司 Scenario root cause analysis application
CN113282884B (en) * 2021-04-28 2023-09-26 沈阳航空航天大学 Universal root cause analysis method
CN113282884A (en) * 2021-04-28 2021-08-20 沈阳航空航天大学 General root cause analysis method
CN113793128A (en) * 2021-09-18 2021-12-14 北京京东振世信息技术有限公司 Method, apparatus, device and computer-readable medium for generating service failure cause information
CN114356859A (en) * 2021-12-30 2022-04-15 中国电信股份有限公司 Data import method and apparatus, device, and computer-readable storage medium
CN114461434A (en) * 2022-02-11 2022-05-10 中国工商银行股份有限公司 Failure root cause analysis method, device, electronic equipment and medium
CN115265710A (en) * 2022-08-25 2022-11-01 华北电力大学 Desulfurization system absorption tower liquid level measurement system and method based on Gaussian process regression
CN117061332A (en) * 2023-10-11 2023-11-14 中国人民解放军国防科技大学 Fault diagnosis method and system based on probability directed graph deep learning
CN117061332B (en) * 2023-10-11 2023-12-29 中国人民解放军国防科技大学 Fault diagnosis method and system based on probability directed graph deep learning
WO2025081616A1 (en) * 2023-10-19 2025-04-24 天翼电子商务有限公司 Service risk determination method and apparatus, and storage medium and electronic device

Also Published As

Publication number Publication date
CN110493025B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN110493025B (en) A method and device for fault root cause diagnosis based on multi-layer directed graph
CN113935497B (en) Intelligent operation and maintenance fault processing method, device, equipment and storage medium thereof
KR102483025B1 (en) Operational maintenance systems and methods
CN107770797A (en) Correlation analysis method and system for wireless network alarm management
CN118967147B (en) After-sales trigger management method and system based on multi-field analysis and fusion
CN106533754A (en) Fault diagnosis method and expert system for college teaching servers
CN112769605A (en) Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform
NL2034766B1 (en) Alarming method for micro-service index prediction based on causality test
CN117640342B (en) A method, device, equipment and medium for detecting abnormality of a power monitoring system
CN118365311A (en) Intelligent operation and maintenance management system, server and method based on large language model
CN117041029A (en) Network equipment fault processing method and device, electronic equipment and storage medium
CN111708654A (en) A method and device for repairing virtual machine faults
CN117390529A (en) Multi-factor traceable data center information management method
CN119204730A (en) A chemical park intelligent management method and system
WO2024021603A1 (en) Fault handling method, device, and storage medium
CN117851122A (en) A disaster recovery backup system for power information system in cloud environment
CN120110893A (en) A network operation and maintenance method and system based on digital twin technology
CN112579402A (en) Method and device for positioning faults of application system
CN117591887A (en) Prediction model training method and hazardous waste monitoring method
CN117785530A (en) Intelligent data analysis method and system
CN117528596A (en) Fault detection maintenance method and data processing equipment
CN115544682A (en) Physical model and predictive maintenance method of automatic packaging machine
CN116186616A (en) Inspection management system and inspection management method
Peng et al. Research on data quality detection technology based on ubiquitous state grid internet of things platform
CN119579147B (en) Substation network equipment fault elimination and verification method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant