CN120196528B

CN120196528B - Operation and maintenance fault positioning method, device, equipment and storage medium based on large model

Info

Publication number: CN120196528B
Application number: CN202510678067.9A
Authority: CN
Inventors: 赵兴业; 李廷; 韩同
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2025-05-26
Filing date: 2025-05-26
Publication date: 2025-08-19
Anticipated expiration: 2045-05-26
Also published as: CN120196528A

Abstract

The present application discloses a method, device, equipment and storage medium for locating operation and maintenance faults based on a large model, which relates to the field of computer technology, including: if a system fault is detected, the current system operation and maintenance data is collected in real time, and the current system operation and maintenance data is preprocessed to obtain the target preprocessed system operation and maintenance data; the data features corresponding to the target preprocessed system operation and maintenance data are extracted, and the data features are input into a preset large model to analyze the data features through the preset large model to obtain the fault analysis results; based on the fault tree analysis method, a fault tree corresponding to the fault analysis results is constructed to determine the cause of the system fault through the fault tree, and the cause of the fault is analyzed to complete the fault location. Thus, when a fault occurs, various information related to the fault can be quickly collected to quickly and accurately locate the root cause of the fault, thereby reducing the time for troubleshooting and improving the efficiency of fault handling.

Description

Operation and maintenance fault positioning method, device, equipment and storage medium based on large model

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for positioning an operation and maintenance fault based on a large model.

Background

In the present digital age, the scale and complexity of various types of information systems, network systems, and industrial control systems are increasing. These systems are typically made up of a large number of hardware devices, software components, and network connections, the operating state of which is affected by a number of factors, such as hardware aging, software vulnerabilities, network fluctuations, human operational errors, and the like. Once the system fails, serious consequences such as service interruption, data loss, service quality degradation and the like can be caused, and huge losses are brought to enterprises and users.

Traditional operation and maintenance management methods rely mainly on manual experience and simple monitoring tools. Operation and maintenance personnel set a number of fixed thresholds to monitor key system indicators, such as processor utilization, memory occupancy, and network traffic. When an indicator exceeds its threshold, the system raises an alarm to notify the operator. However, this approach has two problems: rigid threshold settings give it little ability to predict faults, and the large volume of complex alarm information makes the root cause of a fault difficult to locate.

Disclosure of Invention

Therefore, the invention aims to provide an operation and maintenance fault positioning method, device, equipment and storage medium based on a large model, which can rapidly collect various information related to faults to rapidly and accurately position the root cause of the faults when the faults occur, reduce the fault troubleshooting time and improve the fault processing efficiency. The specific scheme is as follows:

In a first aspect, the application discloses an operation and maintenance fault positioning method based on a large model, which comprises the following steps:

if a system fault is detected, collecting current system operation and maintenance data in real time, and preprocessing the current system operation and maintenance data to obtain target preprocessed system operation and maintenance data;

extracting data features corresponding to the target preprocessed system operation and maintenance data, and inputting the data features into a preset large model to analyze the data features through the preset large model so as to obtain a fault analysis result;

constructing a fault tree corresponding to the fault analysis result based on a fault tree analysis method, determining a fault cause of the system fault through the fault tree, and analyzing the fault cause to complete fault positioning.

Optionally, if the system fault is detected, current system operation and maintenance data are collected in real time, and the current system operation and maintenance data are preprocessed to obtain target preprocessed system operation and maintenance data, including:

If a system fault is detected, collecting current system operation data of a local system in real time, and carrying out data marking and data classification on the current system operation data to obtain first preprocessed system operation data;

Determining invalid data, repeated data and abnormal data in the first preprocessed system operation and maintenance data, and removing the invalid data, the repeated data and the abnormal data from the first preprocessed system operation and maintenance data to obtain second preprocessed system operation and maintenance data;

And carrying out normalization processing on the second preprocessed system operation data to convert the second preprocessed system operation data into a preset data format so as to obtain target preprocessed system operation data.

Optionally, the collecting the current system operation data of the local system in real time, and performing data labeling and data classification on the current system operation data to obtain first preprocessed system operation data, including:

Collecting current system logs, alarm information, performance index data, network flow data and hardware state data of a system in real time;

adding a time stamp to the system log, and sorting the system log based on the time stamp to obtain a sorted system log;

and generating an alarm event chain based on the alarm information, classifying the performance index data based on index types, classifying the hardware state data based on component types, and classifying the network flow data based on flow types.

Optionally, the analyzing the data features through the preset large model to obtain a fault analysis result includes:

Extracting information from the data features through a preset large model, and matching the extracted target information with a preset fault case to determine a target fault case matched with the target information in the preset fault case;

constructing a fault causal relationship graph based on the alarm event chain in the data characteristic;

Comparing the performance index data with historical performance index data to determine abnormal changes in the performance index;

analyzing the network traffic data to determine abnormal network behavior existing in the network traffic data;

Comparing the hardware state data with a preset hardware data threshold value to determine an abnormal hardware state in the hardware state data;

And taking the fault causal relationship graph, the abnormal change of the performance index, the abnormal network behavior, the abnormal hardware state and the target fault case as fault analysis results.

Optionally, the constructing a fault tree corresponding to the fault analysis result based on the fault tree analysis method to determine a fault cause of the system fault through the fault tree includes:

Identifying a fault top event, a fault middle event and a fault bottom event corresponding to the system fault based on the fault analysis result, and constructing a fault tree corresponding to the fault analysis result based on the fault top event, the fault middle event and the fault bottom event;

And determining a target fault bottom event with the highest probability of influencing the fault top event in the fault bottom events through the fault tree, and taking the target fault bottom event as a fault reason of the system fault.

Optionally, the analyzing the fault cause to complete fault localization includes:

And analyzing the fault reasons to determine fault occurrence time, fault occurrence components and fault abnormal data corresponding to the fault reasons so as to complete fault positioning.

Optionally, the operation and maintenance fault positioning method based on the large model further includes:

constructing a fault prediction model based on pre-trained models through an ensemble learning method;

collecting current target system operation data in real time based on a preset time interval, and inputting the target system operation data into the fault prediction model to perform system fault prediction in real time through the fault prediction model so as to obtain a real-time fault prediction result;

And adjusting system parameters based on the fault prediction result so as to prevent system faults.

In a second aspect, the application discloses an operation and maintenance fault positioning device based on a large model, which comprises:

The data preprocessing module is used for acquiring current system operation and maintenance data in real time if the system fault is monitored, and preprocessing the current system operation and maintenance data to obtain target preprocessed system operation and maintenance data;

The fault analysis module is used for extracting data features corresponding to the system operation data after the target pretreatment, inputting the data features into a preset large model, and analyzing the data features through the preset large model to obtain a fault analysis result;

The fault positioning module is used for constructing a fault tree corresponding to the fault analysis result based on a fault tree analysis method so as to determine the fault cause of the system fault through the fault tree and analyzing the fault cause to finish fault positioning.

In a third aspect, the present application discloses an electronic device, comprising:

A memory for storing a computer program;

And the processor is used for executing the computer program to realize the operation and maintenance fault positioning method based on the large model.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements an operation and maintenance fault localization method based on a large model as described above.

In the present application, if a system fault is detected, current system operation and maintenance data is collected in real time and preprocessed to obtain target preprocessed system operation and maintenance data; data features corresponding to the target preprocessed system operation and maintenance data are extracted and input into a preset large model, which analyzes them to obtain a fault analysis result; and a fault tree corresponding to the fault analysis result is constructed based on the fault tree analysis method, the cause of the system fault is determined through the fault tree, and the fault cause is analyzed to complete fault location. In other words, after a system fault is detected, the operation and maintenance data of the system is collected in real time and preprocessed, and features are extracted from the preprocessed data. The extracted features are then analyzed by the preset large model, and a fault tree is constructed from the analysis result, so that the cause of the system fault is determined through the resulting fault tree. As a result, when a fault occurs, the various kinds of information related to it can be collected quickly and its root cause located quickly and accurately, which shortens troubleshooting time, improves fault handling efficiency, and ensures rapid recovery and normal operation of the system.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an operation and maintenance fault positioning method based on a large model;

FIG. 2 is a timing diagram of an operation and maintenance fault locating method based on a large model according to the present application;

FIG. 3 is a schematic diagram of an operation and maintenance fault locating device based on a large model;

FIG. 4 is a block diagram of an electronic device according to the present disclosure.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the prior art, traditional operation and maintenance management relies mainly on manual experience and simple monitoring tools; for example, thresholds are set manually to monitor key system indicators and raise alarms. However, this approach has two problems: rigid threshold settings give it little ability to predict faults, and the large volume of complex alarm information makes the root cause of a fault difficult to locate.

In order to overcome the technical problems, the application discloses an operation and maintenance fault positioning method, device, equipment and storage medium based on a large model, which can rapidly collect various information related to faults to rapidly and accurately position root causes of the faults when the faults occur, reduce fault troubleshooting time and improve fault processing efficiency.

Referring to fig. 1, the embodiment of the invention discloses an operation and maintenance fault positioning method based on a large model, which comprises the following steps:

Step S11: if a system fault is detected, collect current system operation and maintenance data in real time, and preprocess the current system operation and maintenance data to obtain the target preprocessed system operation and maintenance data.

In this embodiment, as shown in FIG. 2, if a system fault is detected, the current system operation and maintenance data of the local system is collected in real time, and data labeling and data classification are performed on it to obtain the first preprocessed system operation and maintenance data. The data collected in real time comprises the current system logs, alarm information, performance index data, network traffic data, and hardware state data of the system. The system logs include, for example, the logs of servers, applications, and databases; once log collection is complete, a timestamp is added to each log, and the logs are sorted by timestamp to obtain the sorted system logs. The alarm information is obtained from hardware, network, and application performance monitoring systems; it is integrated and classified, repeated alarms are removed, and the remaining alarms are correlated to form an alarm event chain. Performance index data, such as processor utilization and memory occupancy, is classified by index type, and its historical data and trends are displayed as charts. Hardware state data, such as temperature, fan speed, and power supply state, is classified by the component type of the corresponding hardware. Network traffic data is obtained through a network traffic collection tool and classified by traffic type, so that the direction, volume, and packet characteristics of the traffic can be analyzed.
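The labeling and classification described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the record fields (`ts`, `source`, `code`, `type`) and the function names are hypothetical, timestamps are assumed to be ISO 8601 strings, and "repeated alarm" is assumed to mean the same source and alarm code.

```python
from collections import defaultdict
from datetime import datetime


def sort_logs(logs):
    """Return log records ordered by their ISO 8601 timestamp."""
    return sorted(logs, key=lambda r: datetime.fromisoformat(r["ts"]))


def build_alarm_chain(alarms):
    """Order alarms by time and keep only the first alarm per (source, code).

    Treating (source, code) as the identity of an alarm is an assumption
    made for this sketch; the source text only says repeats are removed.
    """
    seen = set()
    chain = []
    for alarm in sorted(alarms, key=lambda a: datetime.fromisoformat(a["ts"])):
        key = (alarm["source"], alarm["code"])
        if key not in seen:
            seen.add(key)
            chain.append(alarm)
    return chain


def classify(records, field):
    """Group records by the value of one field (index, component, or traffic type)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    return dict(groups)
```

The same `classify` helper covers performance indexes, hardware state, and network traffic, because each is grouped by a single type field.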

The first preprocessed system operation and maintenance data is then processed further. Specifically, invalid data, repeated data, and abnormal data in the first preprocessed system operation and maintenance data are identified and removed, yielding the second preprocessed system operation and maintenance data. Removing invalid, repeated, and abnormal data effectively reduces noise in the data, so that the data features reflect the actual state of the system more faithfully.
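A minimal sketch of this cleaning step is shown below. The source text does not say how invalid or abnormal data is recognized, so the rules here are assumptions: a record is invalid if its `value` is missing or non-numeric, abnormal if it falls outside a plausible range (0 to 100, as for a percentage metric), and repeated if its timestamp, metric name, and value have all been seen before. The field names are hypothetical.

```python
def clean(records, low=0.0, high=100.0):
    """Drop invalid, out-of-range, and repeated records, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        value = record.get("value")
        # Invalid: missing or non-numeric value (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Abnormal: outside the plausible range; NaN also fails this check.
        if not (low <= value <= high):
            continue
        # Repeated: same timestamp, metric, and value already kept.
        key = (record.get("ts"), record.get("metric"), value)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```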

Next, the second preprocessed system operation and maintenance data is normalized to convert it into a preset data format, thereby obtaining the target preprocessed system operation and maintenance data. Here, normalization means converting the data into a unified data format, namely the JSON (JavaScript Object Notation) format. Converting the data into a unified format ensures data consistency and avoids analysis errors caused by contradictory data.
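The conversion to a unified JSON format might look like the sketch below. The target schema (`source`, `timestamp`, `type`, `payload`) and the input field names are assumptions for illustration; the source text specifies only that the output format is JSON.

```python
import json


def to_unified_json(record, source):
    """Map one heterogeneous record onto a common schema and serialize it.

    Different collectors are assumed to name the timestamp either "ts" or
    "time", and to carry the record kind in "metric" or "level".
    """
    unified = {
        "source": source,
        "timestamp": record.get("ts") or record.get("time"),
        "type": record.get("metric") or record.get("level") or "unknown",
        "payload": {k: v for k, v in record.items() if k not in ("ts", "time")},
    }
    return json.dumps(unified, sort_keys=True, ensure_ascii=False)
```

Sorting the keys makes the serialized form deterministic, which helps when repeated records are compared after conversion.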

Step S12: extract data features corresponding to the target preprocessed system operation and maintenance data, and input the data features into a preset large model, which analyzes them to obtain a fault analysis result.

In this embodiment, as shown in FIG. 2, features are extracted from the target preprocessed system operation and maintenance data, and the extracted data features are analyzed by the preset large model to obtain the corresponding fault analysis result. Specifically, the preset large model extracts information from the data features, and the extracted target information is matched against preset fault cases to determine the target fault case that matches the target information. The extracted features fall into four groups. Time-domain features comprise the mean, variance, maximum, and median of the performance indexes, as well as the frequency of occurrence and time interval of specific events in the logs. Frequency-domain features are the frequency components of the data, such as the dominant frequency and harmonics, obtained through methods such as the Fourier transform, and reflect the periodicity and stability of system operation. Trend features capture whether a performance index is rising, falling, or stable, and are extracted with time series analysis methods such as the moving average and exponential smoothing. Text features are extracted from text data such as the system logs, so that abnormal system behavior can be discovered in advance. The extracted data features are then input into the preset large model. Drawing on operation and maintenance domain knowledge and its analysis of the data, the large model judges which features are important for fault prediction, which prevents redundant features from degrading model performance, and it matches the feature information against fault cases in the operation and maintenance field to determine the corresponding target fault case.
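The numeric feature groups named above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, the input is an evenly sampled series of one performance index, and a naive discrete Fourier transform stands in for whatever transform a real implementation would use.

```python
import cmath
import statistics


def time_domain_features(values):
    """Mean, population variance, maximum, and median of one series."""
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "max": max(values),
        "median": statistics.median(values),
    }


def dominant_frequency_bin(values):
    """Index of the strongest non-DC component of a naive DFT (O(n^2))."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the DC offset
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin


def moving_average(values, window):
    """Simple moving average over a sliding window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def exponential_smoothing(values, alpha):
    """Single exponential smoothing with smoothing factor alpha."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def trend(values, window=3, tolerance=1e-9):
    """Label a series as rising, falling, or stable from its moving average."""
    smoothed = moving_average(values, window)
    delta = smoothed[-1] - smoothed[0]
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "stable"
```

A bin index `k` corresponds to `k` cycles over the length of the series, so converting it to a physical frequency requires the sampling interval, which the source text does not specify.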

Further, a fault causal relationship graph is constructed from the alarm event chain in the data features, in order to judge which alarms triggered other alarms and to identify the root cause of the fault. The performance index data is compared with historical performance index data to determine abnormal changes in the performance indexes; fault correlation can then be judged against the normal and fault modes of the system, in order to analyze the process and causes that led to the abnormal performance indexes. The network traffic data is analyzed to determine abnormal network behavior in it, and to judge whether there is a network attack, congestion, or abnormal network behavior by an application. The hardware state data is compared with preset hardware data thresholds to determine abnormal hardware states, for example to judge the cause of an excessively high server temperature and its impact on other components of the system. Finally, the fault causal relationship graph, the abnormal changes in the performance indexes, the abnormal network behavior, the abnormal hardware states, and the target fault case are taken together as the fault analysis result.
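Two of these analyses lend themselves to a short sketch. The source text does not state how causal edges are derived from the alarm event chain, so the rule below is an assumption: an earlier alarm is treated as a possible cause of any alarm that follows it within a fixed time window. The hardware check likewise assumes that every threshold is an upper limit. All names are hypothetical.

```python
def build_causal_graph(chain, window=30.0):
    """Build a directed graph from a time-ordered alarm chain.

    chain is a list of (alarm_id, time_in_seconds) pairs sorted by time.
    An edge cause -> effect is added when the effect occurs within
    `window` seconds after the cause (a heuristic, not the patent's rule).
    """
    graph = {alarm_id: [] for alarm_id, _ in chain}
    for i, (cause, t_cause) in enumerate(chain):
        for effect, t_effect in chain[i + 1:]:
            if 0 < t_effect - t_cause <= window:
                graph[cause].append(effect)
    return graph


def root_candidates(graph):
    """Alarms with no incoming edge, i.e. not explained by an earlier alarm."""
    explained = {effect for effects in graph.values() for effect in effects}
    return [node for node in graph if node not in explained]


def abnormal_hardware(state, thresholds):
    """Return the hardware readings that exceed their preset upper limit."""
    return {name: value for name, value in state.items()
            if name in thresholds and value > thresholds[name]}
```

In practice a lower limit would also be needed for readings such as fan speed, where a low value is the abnormal one.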

Step S13: construct a fault tree corresponding to the fault analysis result based on the fault tree analysis method, determine the cause of the system fault through the fault tree, and analyze the fault cause to complete fault location.

In this embodiment, the cause of the system fault is determined through the constructed fault tree, and the corresponding fault location is completed. Specifically, a fault top event, fault intermediate events, and fault bottom events corresponding to the system fault are identified based on the fault analysis result, and a fault tree corresponding to the fault analysis result is constructed from them. That is, the fault tree is constructed from the large model's analysis result using the fault tree analysis method: the fault phenomenon is taken as the top event, and the intermediate events and bottom events are expanded step by step according to the fault causes and causal relationships analyzed by the large model. For example, if the top event is that the system service is unavailable, the intermediate events may be a network failure, a server hardware failure, or an application error, and the network failure may be further expanded into a network device failure, a line interruption, or a configuration error.
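The example tree above can be represented with a small data structure. The sketch below is illustrative only: the `Event` class and `bottom_events` function are hypothetical names, and every gate is assumed to be an OR gate because the example lists alternatives.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node of a fault tree; a node without children is a bottom event."""
    name: str
    gate: str = "OR"  # "OR" or "AND"; unused for bottom events
    children: list = field(default_factory=list)

    def is_bottom(self):
        return not self.children


def bottom_events(event):
    """Collect the names of all bottom events under a node, left to right."""
    if event.is_bottom():
        return [event.name]
    names = []
    for child in event.children:
        names.extend(bottom_events(child))
    return names


# Top event and expansion taken from the example in the text. The last two
# intermediate events are left unexpanded here, so they act as leaves.
tree = Event("system service unavailable", "OR", [
    Event("network failure", "OR", [
        Event("network device failure"),
        Event("line interruption"),
        Event("configuration error"),
    ]),
    Event("server hardware failure"),
    Event("application error"),
])
```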

Then, the fault tree is used to determine the target fault bottom event, that is, the fault bottom event with the highest probability of influencing the fault top event, and the target fault bottom event is taken as the cause of the system fault. Specifically, parameters such as the fault tree structure, the logical relationships between events, and the occurrence probabilities of the bottom events are input into fault tree analysis software, which calculates the contribution of each bottom event to the occurrence of the top event; the bottom event with the greatest contribution is the target fault bottom event. Further, the fault cause is analyzed to determine the fault occurrence time, the component in which the fault occurred, and the abnormal fault data corresponding to the fault cause, which completes the fault location. After fault location is completed, a fault report may be generated. The report may include the fault occurrence time, location, phenomenon, root cause analysis process, fault tree structure, and suggested solutions. It describes the large model's line of analysis and the fault tree construction process in detail, giving operation and maintenance personnel a clear basis for handling the fault. Specific operating steps and suggestions are then provided according to the root cause of the fault. For a network device port failure, for example, the report suggests replacing the port, checking the line connections, and reconfiguring the parameters, and it provides related technical documents and reference links. Finally, the fault analysis report is provided to the operation and maintenance personnel.
Therefore, when a fault occurs, the various kinds of information related to it, including system logs, alarm information, and performance indexes, can be collected quickly, and the root cause of the fault can be located quickly and accurately with the help of the knowledge reasoning and semantic understanding capabilities of the large model, which shortens troubleshooting time, improves fault handling efficiency, and ensures rapid recovery and normal operation of the system.
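The contribution calculation performed by the fault tree analysis software can be approximated as below. The source text does not name a specific importance measure, so this sketch assumes one: the contribution of a bottom event is the drop in top-event probability when that event is made impossible. Bottom events are assumed to be statistically independent and to appear only once in the tree, and the probabilities in the usage note are made-up values.

```python
def top_probability(node, probs):
    """Probability of a node; node is a bottom-event name or (gate, children)."""
    if isinstance(node, str):
        return probs[node]
    gate, children = node
    child_probs = [top_probability(child, probs) for child in children]
    product = 1.0
    if gate == "AND":
        for p in child_probs:
            product *= p
        return product
    # OR gate: one minus the probability that no child event occurs.
    for p in child_probs:
        product *= (1.0 - p)
    return 1.0 - product


def rank_bottom_events(tree, probs):
    """Rank bottom events by how much removing each lowers the top probability."""
    base = top_probability(tree, probs)
    contributions = {}
    for name in probs:
        without = {**probs, name: 0.0}  # this bottom event can no longer occur
        contributions[name] = base - top_probability(tree, without)
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)
```

With a tree such as `("OR", [("OR", ["network device failure", "line interruption", "configuration error"]), "server hardware failure", "application error"])` and a dictionary of per-event probabilities, the first entry returned by `rank_bottom_events` is the candidate target fault bottom event.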

It should further be noted that a fault prediction model may be constructed and used to predict system faults. Specifically, the fault prediction model may be constructed from pre-trained models by an ensemble learning method. The pre-trained models may be selected as required, for example from algorithms such as the support vector machine (SVM), random forest (RF), long short-term memory (LSTM) network, and convolutional neural network (CNN), and a suitable algorithm or combination of algorithms may be chosen according to the characteristics of the system and its data. After the fault prediction model is constructed, current target system operation and maintenance data is collected in real time at a preset time interval and input into the fault prediction model, which performs system fault prediction in real time to obtain a real-time fault prediction result. In addition, based on the fault prediction results combined with operation and maintenance cases and data, the preset large model can learn how the model performs under different parameter settings and suggest optimal parameters for the fault prediction model, for example adjustments to the kernel function parameters and penalty factor of an SVM model. It can also suggest optimizations to the model architecture, for example adjusting the number of network layers and neurons of an LSTM or CNN model according to the complexity of the system data and the fault modes, thereby improving the model's prediction capability. In this way, faults can be predicted in real time and parameters can be tuned through the large model, which effectively improves the efficiency and accuracy of fault prediction.
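One common ensemble method is majority voting, sketched below. The source text does not say which ensemble method is used, so voting is an assumption. The three rule functions are hypothetical stand-ins for trained base models such as an SVM, a random forest, or an LSTM; a real system would plug in the prediction functions of those models instead.

```python
from collections import Counter


class VotingEnsemble:
    """Combine base models by majority vote over their predicted labels."""

    def __init__(self, models):
        self.models = list(models)

    def predict(self, sample):
        votes = Counter(model(sample) for model in self.models)
        return votes.most_common(1)[0][0]

    def predict_stream(self, samples):
        """Yield one prediction per sample, e.g. one per collection interval."""
        for sample in samples:
            yield self.predict(sample)


# Stand-in base models: each returns 1 (fault expected) or 0 (no fault).
def cpu_rule(sample):
    return int(sample["cpu"] > 90)


def memory_rule(sample):
    return int(sample["mem"] > 85)


def temperature_rule(sample):
    return int(sample["temp_c"] > 80)


ensemble = VotingEnsemble([cpu_rule, memory_rule, temperature_rule])
```

Using an odd number of binary base models avoids tied votes.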

In this embodiment, if a system fault is detected, current system operation and maintenance data is collected in real time and preprocessed to obtain target preprocessed system operation and maintenance data; data features corresponding to the target preprocessed system operation and maintenance data are extracted and input into a preset large model, which analyzes them to obtain a fault analysis result; and a fault tree corresponding to the fault analysis result is constructed based on the fault tree analysis method, the cause of the system fault is determined through the fault tree, and the fault cause is analyzed to complete fault location. In other words, after a system fault is detected, the operation and maintenance data of the system is collected in real time and preprocessed, and features are extracted from the preprocessed data. The extracted features are then analyzed by the preset large model, and a fault tree is constructed from the analysis result, so that the cause of the system fault is determined through the resulting fault tree. This brings several benefits. First, when a fault occurs, the various kinds of information related to it can be collected quickly and its root cause located quickly and accurately, which shortens troubleshooting time, improves fault handling efficiency, and ensures rapid recovery and normal operation of the system. Second, automated fault prediction and root cause location reduce the dependence on manual operation and maintenance and lower both the workload of operation and maintenance personnel and the expertise required of them, which reduces an enterprise's operation and maintenance costs. Third, by collecting and analyzing in real time the multi-source heterogeneous data generated during system operation and applying the constructed prediction model, potential faults in the system are discovered in advance and accurate fault prediction is achieved, which gives operation and maintenance personnel sufficient time for fault prevention and handling and reduces the probability and impact of faults.

Referring to fig. 3, the embodiment of the invention discloses an operation and maintenance fault positioning device based on a large model, which comprises:

The data preprocessing module 11 is configured to collect current system operation and maintenance data in real time if a system fault is detected, and preprocess the current system operation and maintenance data to obtain target preprocessed system operation and maintenance data;

the fault analysis module 12 is configured to extract data features corresponding to the system operation data after the target preprocessing, and input the data features to a preset large model, so as to analyze the data features through the preset large model, and obtain a fault analysis result;

And the fault positioning module 13 is used for constructing a fault tree corresponding to the fault analysis result based on a fault tree analysis method so as to determine the fault cause of the system fault through the fault tree and analyzing the fault cause to finish fault positioning.

In this embodiment, if a system fault is detected, current system operation and maintenance data is collected in real time and preprocessed to obtain target preprocessed system operation and maintenance data; data features corresponding to the target preprocessed system operation and maintenance data are extracted and input into a preset large model, which analyzes them to obtain a fault analysis result; and a fault tree corresponding to the fault analysis result is constructed based on the fault tree analysis method, the cause of the system fault is determined through the fault tree, and the fault cause is analyzed to complete fault location. In other words, after a system fault is detected, the operation and maintenance data of the system is collected in real time and preprocessed, and features are extracted from the preprocessed data. The extracted features are then analyzed by the preset large model, and a fault tree is constructed from the analysis result, so that the cause of the system fault is determined through the resulting fault tree. As a result, when a fault occurs, the various kinds of information related to it can be collected quickly and its root cause located quickly and accurately, which shortens troubleshooting time, improves fault handling efficiency, and ensures rapid recovery and normal operation of the system.

In some embodiments, the data preprocessing module 11 may specifically include:

The first preprocessing sub-module is configured to collect current system operation and maintenance data of the local system in real time if a system fault is detected, and to perform data marking and data classification on the current system operation and maintenance data to obtain first preprocessed system operation and maintenance data;

The second preprocessing sub-module is configured to determine invalid data, repeated data, and abnormal data in the first preprocessed system operation and maintenance data, and to remove the invalid data, the repeated data, and the abnormal data from the first preprocessed system operation and maintenance data to obtain second preprocessed system operation and maintenance data;

The data conversion sub-module is configured to normalize the second preprocessed system operation and maintenance data, so as to convert the second preprocessed system operation and maintenance data into a preset data format and obtain the target preprocessed system operation and maintenance data.
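As a minimal illustration of the cleaning and conversion sub-modules described above, the following Python sketch removes invalid, repeated, and abnormal records and then normalizes what remains into one fixed format. The `Record` type, the field names, the valid range of 0 to 100, and the use of min-max scaling are assumptions made for this example; the embodiment does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical operation and maintenance sample (not defined in the source)."""
    source: str              # e.g. "cpu" or "disk"
    timestamp: float         # seconds since epoch
    value: Optional[float]   # None models a missing (invalid) reading


def clean(records: List[Record], lower: float = 0.0, upper: float = 100.0) -> List[Record]:
    """Drop invalid (missing), repeated, and abnormal (out-of-range) records."""
    seen = set()
    kept = []
    for r in records:
        if r.value is None:                  # invalid data
            continue
        if r in seen:                        # repeated data
            continue
        if not (lower <= r.value <= upper):  # abnormal data (assumed range check)
            continue
        seen.add(r)
        kept.append(r)
    return kept


def normalize(records: List[Record], lower: float = 0.0, upper: float = 100.0) -> List[Dict]:
    """Min-max scale values to [0, 1] and emit dicts in one assumed preset format."""
    span = upper - lower
    return [
        {"source": r.source, "ts": r.timestamp, "value": (r.value - lower) / span}
        for r in records
    ]
```

Calling `normalize(clean(raw))` on a raw list keeps the order of the surviving records, so later steps that depend on time ordering are not disturbed.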

In some embodiments, the first preprocessing sub-module may specifically include:

The data acquisition unit is configured to collect the current system logs, alarm information, performance index data, network traffic data, and hardware state data of the system in real time;

The log sorting unit is used for adding a time stamp to the system log and sorting the system log based on the time stamp to obtain a sorted system log;

The data classification unit is configured to generate an alarm event chain based on the alarm information, classify the performance index data based on index type, classify the hardware state data based on component type, and classify the network traffic data based on traffic type.
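The marking and classification steps above can be sketched in Python as follows. This is only one possible reading: the input shapes (logs as `(epoch_seconds, message)` pairs, alarms and metrics as dictionaries), the field names, and the representation of an alarm event chain as consecutive alarm pairs ordered by time are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def stamp_and_sort(raw_logs: List[Tuple[float, str]]) -> List[Dict]:
    """Attach an ISO-8601 UTC timestamp to each log entry and sort by time."""
    stamped = [
        {
            "ts": datetime.fromtimestamp(sec, tz=timezone.utc).isoformat(),
            "epoch": sec,
            "msg": msg,
        }
        for sec, msg in raw_logs
    ]
    return sorted(stamped, key=lambda entry: entry["epoch"])


def build_alarm_chain(alarms: List[Dict]) -> List[Tuple[str, str]]:
    """Order alarms by time and link each alarm to the one that follows it."""
    ordered = sorted(alarms, key=lambda a: a["ts"])
    return [(a["name"], b["name"]) for a, b in zip(ordered, ordered[1:])]


def classify(items: List[Dict], key: str) -> Dict[str, List[Dict]]:
    """Group items by the given field (index type, component type, or traffic type)."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)
```

The same `classify` helper can serve the performance index data, the hardware state data, and the network traffic data, since each is grouped by a single type field in this sketch.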

In some embodiments, the fault analysis module 12 may specifically include:

The information matching unit is configured to extract information from the data features through a preset large model, and to match the extracted target information against preset fault cases to determine, among the preset fault cases, a target fault case that matches the target information;

A relationship graph construction unit, configured to construct a fault causal relationship graph based on the alarm event chain in the data feature;

the first data comparison unit is used for comparing the performance index data with the historical performance index data so as to determine abnormal change of the performance index;

the second data comparison unit is used for analyzing the network traffic data to determine abnormal network behaviors in the network traffic data;

A third data comparing unit, configured to compare the hardware state data with a preset hardware data threshold value, so as to determine an abnormal hardware state in the hardware state data;

The data definition unit is configured to take the fault causal relationship graph, the abnormal change of the performance index, the abnormal network behavior, the abnormal hardware state, and the target fault case as the fault analysis result.
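Two of the comparison units above lend themselves to a short Python sketch: comparing current performance index data with historical data, and comparing hardware state data with preset thresholds. The embodiment does not say how "abnormal change" is measured, so the z-score test with a limit of 3 standard deviations, the metric names, and the "value above threshold is abnormal" rule are assumptions for illustration only.

```python
from statistics import mean, stdev
from typing import Dict, List


def abnormal_metric_changes(
    current: Dict[str, float],
    history: Dict[str, List[float]],
    z_limit: float = 3.0,
) -> Dict[str, float]:
    """Return the z-score of each metric that deviates strongly from its history."""
    flagged = {}
    for name, value in current.items():
        past = history.get(name, [])
        if len(past) < 2:          # not enough history to estimate spread
            continue
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:             # constant history: z-score is undefined, skip
            continue
        z = (value - mu) / sigma
        if abs(z) > z_limit:
            flagged[name] = z
    return flagged


def abnormal_hardware(
    status: Dict[str, float],
    thresholds: Dict[str, float],
) -> Dict[str, float]:
    """Return hardware readings that exceed their preset threshold."""
    return {
        name: value
        for name, value in status.items()
        if name in thresholds and value > thresholds[name]
    }
```

Both functions return plain dictionaries, so their outputs can be combined with the fault causal relationship graph and the matched fault case into a single fault analysis result.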

In some embodiments, the fault location module 13 may specifically include:

The fault tree construction unit is used for identifying a fault top event, a fault middle event and a fault bottom event corresponding to the system fault based on the fault analysis result, and constructing a fault tree corresponding to the fault analysis result based on the fault top event, the fault middle event and the fault bottom event;

The fault cause analysis unit is configured to determine, through the fault tree, the target fault bottom event that has the highest probability of influencing the fault top event among the fault bottom events, and to take the target fault bottom event as the fault cause of the system fault.
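To make the fault tree step concrete, the following Python sketch builds a small tree of top, middle, and bottom events and selects the bottom event that contributes most to the top event. The embodiment does not define how "probability of influencing the top event" is computed; this sketch assumes independent bottom events that each appear once in the tree, AND/OR gates, and criticality importance as the ranking measure. The event names and probabilities are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    gate: str = "BASIC"      # "AND", "OR", or "BASIC" for a bottom event
    prob: float = 0.0        # occurrence probability (bottom events only)
    children: List["Node"] = field(default_factory=list)


def probability(node: Node, override: Optional[Dict[str, float]] = None) -> float:
    """Probability that the event occurs, optionally forcing some bottom events."""
    override = override or {}
    if node.gate == "BASIC":
        return override.get(node.name, node.prob)
    child_probs = [probability(c, override) for c in node.children]
    if node.gate == "AND":
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    none_occur = 1.0         # OR gate: 1 - P(no child occurs)
    for p in child_probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def bottom_events(node: Node) -> List[Node]:
    if node.gate == "BASIC":
        return [node]
    return [b for c in node.children for b in bottom_events(c)]


def most_influential(top: Node) -> Node:
    """Bottom event with the highest criticality importance for the top event."""
    p_top = probability(top)
    if p_top == 0:
        raise ValueError("top event has zero probability")

    def criticality(event: Node) -> float:
        birnbaum = probability(top, {event.name: 1.0}) - probability(top, {event.name: 0.0})
        return birnbaum * event.prob / p_top

    return max(bottom_events(top), key=criticality)


# Hypothetical tree: top event, two middle events, four bottom events.
tree = Node("service_unavailable", "OR", children=[
    Node("storage_failure", "OR", children=[
        Node("disk_full", prob=0.30),
        Node("raid_degraded", prob=0.05),
    ]),
    Node("network_failure", "AND", children=[
        Node("primary_link_down", prob=0.10),
        Node("backup_link_down", prob=0.20),
    ]),
])
```

With these example numbers, `most_influential(tree)` selects `disk_full`. Other importance measures could be substituted in `criticality` without changing the rest of the sketch.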

In some embodiments, the fault location module 13 may specifically include:

The fault locating unit is configured to analyze the fault cause to determine the fault occurrence time, the faulty component, and the abnormal fault data corresponding to the fault cause, so as to complete fault locating.

In some embodiments, the operation and maintenance fault positioning device based on the large model may further include:

The model building unit is configured to build a fault prediction model based on a pre-trained model through an ensemble learning method;

The fault real-time prediction unit is configured to collect current target system operation and maintenance data in real time based on a preset time interval, input the target system operation and maintenance data into the fault prediction model, and perform system fault prediction in real time through the fault prediction model to obtain a real-time fault prediction result;

The parameter adjustment unit is configured to adjust system parameters based on the real-time fault prediction result so as to prevent system faults.
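The prediction flow above can be sketched in Python with a very small ensemble. Everything specific here is an assumption: simple threshold rules stand in for pre-trained base models, majority voting stands in for the unspecified ensemble learning method, the sample fields are hypothetical, and halving a `max_connections` parameter is only a placeholder for whatever parameter adjustment a real system would make. Iterating over a list of samples replaces polling at a preset time interval.

```python
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, float]
Predictor = Callable[[Sample], bool]


def cpu_rule(sample: Sample) -> bool:
    return sample["cpu"] > 0.90


def mem_rule(sample: Sample) -> bool:
    return sample["mem"] > 0.85


def latency_rule(sample: Sample) -> bool:
    return sample["latency_ms"] > 500


def majority_vote(models: List[Predictor], sample: Sample) -> bool:
    """Predict a fault when more than half of the base models predict one."""
    votes = sum(1 for model in models if model(sample))
    return votes * 2 > len(models)


def monitor(samples: Iterable[Sample], models: List[Predictor], params: Dict[str, int]) -> Dict[str, int]:
    """Run the ensemble on each collected sample and adjust parameters on a predicted fault."""
    for sample in samples:
        if majority_vote(models, sample):
            # Placeholder mitigation: reduce load by halving the connection limit.
            params = dict(params, max_connections=max(1, params["max_connections"] // 2))
    return params
```

Because `monitor` returns a new dictionary rather than mutating its input, the caller keeps the original parameters and can roll back if a prediction turns out to be a false alarm.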

Further, an embodiment of the present application also discloses an electronic device. Fig. 4 is a block diagram of an electronic device 20 according to an exemplary embodiment, and the content of the figure is not to be considered as any limitation on the scope of use of the present application.

Fig. 4 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the operation and maintenance fault positioning method based on the large model disclosed in any one of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.

In this embodiment, the power supply 23 is configured to provide working voltages for the hardware devices on the electronic device 20. The communication interface 24 creates a data transmission channel between the electronic device 20 and an external device; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited herein. The input-output interface 25 is configured to obtain external input data or to output data to the external device; its specific interface type may be selected according to the needs of the specific application and is not specifically limited herein.

The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.

The operating system 221, which may be Windows Server, NetWare, Unix, Linux, or the like, is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222. In addition to the computer program that performs the large-model-based operation and maintenance fault positioning method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.

Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to realize the operation and maintenance fault positioning method based on the large model. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.

In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for relevant details reference may be made to the description of the method.

Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps have been described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing describes the principles and embodiments of the present application. The specific examples provided herein are intended to assist in understanding those principles and embodiments and are in no way limiting; those of ordinary skill in the art will appreciate, in light of the above teachings, that the principles and embodiments of the present application may be varied.

Claims

1. A large-model-based operation and maintenance fault location method, characterized by comprising:

If a system failure is detected, the current system operation and maintenance data is collected in real time, and the current system operation and maintenance data is preprocessed to obtain the target preprocessed system operation and maintenance data;

Extracting data features corresponding to the target pre-processed system operation and maintenance data, and inputting the data features into a preset large model, so as to analyze the data features through the preset large model to obtain a fault analysis result;

Constructing a fault tree corresponding to the fault analysis result based on a fault tree analysis method, determining the cause of the system fault through the fault tree, and analyzing the cause of the fault to complete fault location;

The analyzing the data features by the preset large model to obtain the fault analysis results includes:

Extracting information from the data features through a preset large model, and matching the extracted target information with preset fault cases to determine a target fault case that matches the target information in the preset fault cases;

Constructing a fault causal relationship diagram based on the alarm event chain in the data feature; the alarm event chain is an event chain generated based on the alarm information;

Comparing performance indicator data with historical performance indicator data to identify abnormal changes in performance indicators;

Analyzing network traffic data to determine abnormal network behavior in the network traffic data;

Comparing the hardware status data with a preset hardware data threshold to determine an abnormal hardware status in the hardware status data;

Taking the fault causal relationship diagram, the abnormal change of the performance indicator, the abnormal network behavior, the abnormal hardware status, and the target fault case as the fault analysis result;

Wherein the alarm information, the performance indicator data, the network traffic data, and the hardware status data are data obtained by real-time collection from the system.

2. The large-model-based operation and maintenance fault location method according to claim 1, wherein if a system fault is detected, current system operation and maintenance data is collected in real time and preprocessed to obtain target preprocessed system operation and maintenance data, including:

If a system failure is detected, the current system operation and maintenance data of the local system is collected in real time, and the current system operation and maintenance data is labeled and classified to obtain the first pre-processed system operation and maintenance data;

Determining invalid data, duplicate data, and abnormal data in the first preprocessed system operation and maintenance data, and removing the invalid data, the duplicate data, and the abnormal data from the first preprocessed system operation and maintenance data to obtain second preprocessed system operation and maintenance data;

Normalization processing is performed on the second preprocessed system operation and maintenance data to convert the second preprocessed system operation and maintenance data into a preset data format to obtain target preprocessed system operation and maintenance data.

3. The large-model-based operation and maintenance fault location method according to claim 2, wherein the real-time acquisition of local current system operation and maintenance data and the data labeling and data classification of the current system operation and maintenance data to obtain first pre-processed system operation and maintenance data include:

Real-time collection of the system's current system logs, alarm information, performance indicator data, network traffic data, and hardware status data;

adding a timestamp to the system log, and sorting the system log based on the timestamp to obtain sorted system logs;

An alarm event chain is generated based on the alarm information, and the performance indicator data is classified based on the indicator type, the hardware status data is classified based on the component type, and the network traffic data is classified based on the traffic type.

4. The large-model-based operation and maintenance fault location method according to claim 1, wherein constructing a fault tree corresponding to the fault analysis result based on a fault tree analysis method to determine the cause of the system fault through the fault tree comprises:

Identifying a top fault event, an intermediate fault event, and a bottom fault event corresponding to the system fault based on the fault analysis result, and constructing a fault tree corresponding to the fault analysis result based on the top fault event, the intermediate fault event, and the bottom fault event;

A target bottom fault event having the greatest probability of affecting the top fault event among the bottom fault events is determined through the fault tree, and the target bottom fault event is used as the fault cause of the system fault.

5. The large model-based operation and maintenance fault location method according to claim 1, wherein analyzing the cause of the fault to complete the fault location comprises:

The fault cause is analyzed to determine the fault occurrence time, fault-occurring component, and fault abnormality data corresponding to the fault cause, so as to complete the fault location.

6. The large model-based operation and maintenance fault location method according to claim 1, further comprising:

Building a fault prediction model based on a pre-trained model through an ensemble learning method;

Collecting current target system operation and maintenance data in real time based on a preset time interval, and inputting the target system operation and maintenance data into the fault prediction model, so as to perform system fault prediction in real time through the fault prediction model to obtain a real-time fault prediction result;

System parameters are adjusted based on the fault prediction results to prevent system faults.

7. A large-model-based operation and maintenance fault location device, comprising:

A data preprocessing module is used to collect current system operation and maintenance data in real time if a system fault is detected, and to preprocess the current system operation and maintenance data to obtain target preprocessed system operation and maintenance data;

a fault analysis module, configured to extract data features corresponding to the target pre-processed system operation and maintenance data, and input the data features into a preset large model, so as to analyze the data features through the preset large model and obtain a fault analysis result;

a fault location module, configured to construct a fault tree corresponding to the fault analysis result based on a fault tree analysis method, so as to determine the cause of the system fault through the fault tree, and to analyze the cause of the fault to complete fault location;

Wherein, the fault analysis module includes:

An information matching unit is used to extract information from the data features using a preset large model, and match the extracted target information with preset fault cases to determine a target fault case that matches the target information in the preset fault cases;

A relationship graph construction unit, configured to construct a fault causal relationship graph based on an alarm event chain in the data feature; the alarm event chain is an event chain generated based on the alarm information;

a first data comparison unit, configured to compare the performance indicator data with historical performance indicator data to determine abnormal changes in the performance indicator;

a second data comparison unit, configured to analyze the network traffic data to determine abnormal network behavior in the network traffic data;

a third data comparing unit, configured to compare the hardware status data with a preset hardware data threshold to determine an abnormal hardware status in the hardware status data;

a data definition unit, configured to use the fault causal relationship diagram, the abnormal change in the performance indicator, the abnormal network behavior, the abnormal hardware status, and the target fault case as a fault analysis result.

8. An electronic device, comprising:

Memory, used to store computer programs;

A processor is used to execute the computer program to implement the large model-based operation and maintenance fault location method according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein when the computer program is executed by a processor, it implements the large model-based operation and maintenance fault location method according to any one of claims 1 to 6.