CN107733702A - Method and apparatus for managing the running state of a host in a cluster system - Google Patents

Info

Publication number
CN107733702A
Authority
CN
China
Prior art keywords
host
hosts
response data
heartbeat response
communicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710911387.XA
Other languages
Chinese (zh)
Inventor
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710911387.XA
Publication of CN107733702A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method and apparatus for managing the running state of a host in a cluster system. The method includes: obtaining the management duration, that is, the time from when a host in a cluster file system (CFS) becomes unable to communicate with other hosts until that host is isolated; after detecting that the host cannot communicate with other hosts, starting a timer that takes the time at which the host failed as its starting point; and, before the recorded duration reaches the management duration, sending alarm information to the cloud platform.

Description

Method and apparatus for managing the running state of a host in a cluster system

Technical field

The invention relates to the field of information processing, and in particular to a method and apparatus for managing the running state of a host in a cluster system.

Background

A CFS (Cluster File System) is a system in which multiple host nodes mount the same file system at the same time. A cluster file system lets all nodes running in the cluster access storage devices concurrently through standard file system interfaces, which makes it easier to manage applications that span the entire cluster.

If a host node cannot read from or write to storage for a certain period of time, the host is isolated by the file system, that is, fenced. In cloud computing products, if a host is fenced without the user's knowledge, the user's business processing is bound to be interrupted, which affects normal business operation.

Summary of the invention

To solve the above technical problem, the invention provides a method and apparatus for managing the running state of a host in a cluster system, which can reduce the likelihood that a host is isolated by the file system.

To achieve this purpose, the invention provides a method for managing the running state of a host in a cluster system, including:

obtaining the management duration, that is, the time from when a host in the CFS cluster file system becomes unable to communicate with other hosts until that host is isolated;

after detecting that the host cannot communicate with other hosts, starting a timer that takes the time at which the host failed as its starting point;

before the recorded duration reaches the management duration, sending alarm information to the cloud platform.

The method further has the following feature: whether the host can communicate with other hosts is determined as follows:

detecting, on a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;

comparing the detection result with a preset management policy for heartbeat response data to obtain a comparison result;

determining, according to the comparison result, whether the host can communicate with other hosts.

The method further has the following feature: detecting, on the preset monitoring port, the heartbeat response data sent by other hosts to obtain the detection result includes:

receiving storage heartbeat packets sent by other hosts;

reading data from a preset field in the storage heartbeat packet;

parsing the content of the data to determine whether the data is heartbeat response data addressed to the host.

The method further has the following feature: comparing the detection result with the preset management policy for heartbeat response data to obtain the comparison result includes:

judging whether heartbeat response data sent by other hosts to the host has been received;

if no heartbeat response data sent by other hosts to the host has been received, determining that the host cannot communicate with other hosts;

if heartbeat response data sent by other hosts to the host has been received, counting the total number of hosts that sent heartbeat response data to the host; judging whether the total is greater than a preset total threshold; and, if the total is less than the total threshold, determining that the host cannot communicate with other hosts.

The method further has the following feature: after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, the method further includes:

detecting whether the host has received a fault-handling operation request;

if the host has not received the fault-handling operation request, sending alarm information to the cloud platform again.

An apparatus for managing the running state of a host in a cluster system, including:

an obtaining module, configured to obtain the management duration, that is, the time from when a host in the CFS cluster file system becomes unable to communicate with other hosts until that host is isolated;

a timing module, configured to start a timer, after detecting that the host cannot communicate with other hosts, that takes the time at which the host failed as its starting point;

an alarm module, configured to send alarm information to the cloud platform before the recorded duration reaches the management duration.

The apparatus further has the following feature: the apparatus further includes:

a first detection module, configured to detect, on a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;

a comparison module, configured to compare the detection result with a preset management policy for heartbeat response data to obtain a comparison result;

a determining module, configured to determine, according to the comparison result, whether the host can communicate with other hosts.

The apparatus further has the following feature: the first detection module includes:

a receiving unit, configured to receive storage heartbeat packets sent by other hosts;

a reading unit, configured to read data from a preset field in the storage heartbeat packet;

a determining unit, configured to parse the content of the data and determine whether the data is heartbeat response data addressed to the host.

The apparatus further has the following feature: the comparison module includes:

a first judging unit, configured to judge whether heartbeat response data sent by other hosts to the host has been received;

a first processing unit, configured to determine that the host cannot communicate with other hosts if no heartbeat response data sent by other hosts to the host has been received;

a counting unit, configured to count the total number of hosts that sent heartbeat response data to the host if heartbeat response data sent by other hosts to the host has been received;

a second judging unit, configured to judge whether the total is greater than a preset total threshold;

a second processing unit, configured to determine that the host cannot communicate with other hosts if the total is less than the total threshold.

The apparatus further has the following feature: the apparatus further includes:

a second detection module, configured to detect whether the host has received a fault-handling operation request after alarm information has been sent to the cloud platform before the recorded duration reaches the management duration;

the alarm module is further configured to send alarm information to the cloud platform again if the host has not received the fault-handling operation request.

In the embodiments provided by the invention, once the management duration after which a host is isolated has been obtained, the failed host is timed, and alarm information is sent to the cloud platform before the management duration is reached. This provides a timely alarm when the host is about to be isolated by the file system, which reduces the likelihood of isolation. Even if the host is eventually isolated by the file system, the user can learn from the notification why the host was fenced, which improves the user experience.

Other features and advantages of the invention will be set forth in the description that follows and will in part become apparent from the description, or may be learned by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the drawings.

Description of the drawings

The drawings provide a further understanding of the technical solution of the invention and form part of the description. Together with the embodiments of the application, they serve to explain the technical solution of the invention and do not limit it.

Fig. 1 is a flowchart of the method for managing the running state of a host in a cluster system provided by the invention;

Fig. 2 is a structural diagram of the apparatus for managing the running state of a host in a cluster system provided by the invention.

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in detail below with reference to the drawings. Note that, where no conflict arises, the embodiments in this application and the features of those embodiments may be combined with one another arbitrarily.

The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

Fig. 1 is a flowchart of the method for managing the running state of a host in a cluster system provided by the invention. The method shown in Fig. 1 includes:

Step 101: obtain the management duration, that is, the time from when a host in the OCFS2 cluster system becomes unable to communicate with other hosts until that host is isolated.

Specifically, in a cluster file system a host is isolated only some time after it becomes unable to communicate with other hosts. This time must therefore be known in advance so that management can take place before it elapses.

Step 102: after detecting that the host cannot communicate with other hosts, start a timer that takes the time at which the host failed as its starting point.

Specifically, once the management duration is known, as soon as a host becomes unable to communicate with other hosts, the time for which that host has been in the failed state is measured.

Step 103: before the recorded duration reaches the management duration, send alarm information to the cloud platform.

Specifically, when the host is about to be isolated by the file system, alarm information is sent in time so that the user can repair the fault, which reduces the likelihood of isolation.
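
Steps 101 to 103 can be sketched roughly as follows. This is an illustrative sketch only, not the patent's implementation: the class name `FenceWarningTimer`, the `send_alarm` callback, the injectable `clock`, and the wording of the alarm message are all assumptions introduced here, and the management duration is simply passed in as a number of seconds.

```python
import time
from typing import Callable, Optional


class FenceWarningTimer:
    """Hypothetical helper that times a communication failure (steps 101-103)."""

    def __init__(self, management_duration_s: float,
                 send_alarm: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Step 101: the management duration is obtained up front.
        self.management_duration_s = management_duration_s
        self.send_alarm = send_alarm
        self.clock = clock
        self.failure_started_at: Optional[float] = None

    def update(self, can_communicate: bool) -> bool:
        """Feed one detection result; return True if an alarm was sent."""
        if can_communicate:
            self.failure_started_at = None  # host recovered, stop timing
            return False
        if self.failure_started_at is None:
            # Step 102: the failure time is the starting point of the timer.
            self.failure_started_at = self.clock()
        elapsed = self.clock() - self.failure_started_at
        if elapsed < self.management_duration_s:
            # Step 103: alarm while the management duration has not been reached.
            remaining = self.management_duration_s - elapsed
            self.send_alarm(
                f"host cannot communicate; about {remaining:.0f}s until isolation")
            return True
        return False
```

In this sketch the caller invokes `update` after each detection round; passing the clock in as a parameter keeps the timing logic easy to exercise without waiting in real time.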

In the method embodiment provided by the invention, once the management duration after which a host is isolated has been obtained, the failed host is timed, and alarm information is sent to the cloud platform before the management duration is reached. This provides a timely alarm when the host is about to be isolated by the file system, which reduces the likelihood of isolation. Even if the host is eventually isolated by the file system, the user can learn from the notification why the host was fenced, which improves the user experience.

The method provided by the invention is described further below.

In an OCFS2 system, each host uses a storage heartbeat process to check whether the connections between itself, the other nodes, and the storage device are working. Every 2 s the storage heartbeat process reads the storage heartbeats of the other nodes and writes the storage heartbeat of its own node. If a host node cannot read from or write to storage for a certain period of time, it is isolated by the file system. The invention therefore detects whether the host can communicate with other hosts as follows:

detecting, on a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;

comparing the detection result with a preset management policy for heartbeat response data to obtain a comparison result;

determining, according to the comparison result, whether the host can communicate with other hosts.

Communication detection relies on the existing transmission mechanism of the cluster system and reuses the existing protocol, so detection is achieved without modifying the protocol, which is simple and convenient to implement.

In the above detection method, detecting, on the preset monitoring port, the heartbeat response data sent by other hosts to obtain the detection result may further include:

receiving storage heartbeat packets sent by other hosts;

reading data from a preset field in the storage heartbeat packet;

parsing the content of the data to determine whether the data is heartbeat response data addressed to the host.

Specifically, the hosts in a cluster system usually agree that a fixed field of the storage heartbeat packet carries specific data identifying heartbeat response data. When the data is parsed, if its content is the preset content, the storage heartbeat packet contains heartbeat response data for the host; if the content is empty or is not the preset content, the storage heartbeat packet contains no heartbeat response data for the host.
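
The field check described above might look like the following sketch. The source states only that a fixed field carries agreed-upon data; it gives no offset, length, or value. The constants `FIELD_OFFSET`, `FIELD_FORMAT`, and `ACK_MARKER`, and the idea that the field also names the target node, are therefore hypothetical choices made purely for illustration and do not describe the real OCFS2 packet layout.

```python
import struct

# Hypothetical layout, for illustration only (not the real OCFS2 format).
FIELD_OFFSET = 8        # assumed byte offset of the preset field
FIELD_FORMAT = "!4sI"   # assumed: 4-byte marker followed by a 32-bit node id
ACK_MARKER = b"HBAK"    # assumed preset content marking a heartbeat response


def is_heartbeat_response(packet: bytes, my_node_id: int) -> bool:
    """Return True if the preset field marks a heartbeat response for this host."""
    size = struct.calcsize(FIELD_FORMAT)
    field = packet[FIELD_OFFSET:FIELD_OFFSET + size]
    if len(field) < size:
        return False  # field is empty or truncated: no response data
    marker, target_node = struct.unpack(FIELD_FORMAT, field)
    # Content other than the preset content means no response data for this host.
    return marker == ACK_MARKER and target_node == my_node_id
```

Keeping the layout in named constants means a real deployment would only need to change those values to match whatever field the cluster's hosts have actually agreed on.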

In the above detection method, comparing the detection result with the preset management policy for heartbeat response data to obtain the comparison result includes:

judging whether heartbeat response data sent by other hosts to the host has been received;

if no heartbeat response data sent by other hosts to the host has been received, determining that the host cannot communicate with other hosts;

if heartbeat response data sent by other hosts to the host has been received, counting the total number of hosts that sent heartbeat response data to the host; judging whether the total is greater than a preset total threshold; and, if the total is less than the total threshold, determining that the host cannot communicate with other hosts.

Note that the total threshold is usually at least half of the total number of hosts in the cluster.
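
The management policy above reduces to a small decision function, sketched below. The function name `can_communicate` and its parameters are hypothetical. Two points are assumptions rather than statements from the source: the default threshold is taken as half the cluster size rounded up, and a count exactly equal to the threshold is treated as still communicating, since the source only specifies the "less than" case.

```python
from typing import AbstractSet, Optional


def can_communicate(responders: AbstractSet[str], cluster_size: int,
                    total_threshold: Optional[int] = None) -> bool:
    """Apply the heartbeat-response policy to the set of hosts that responded."""
    if total_threshold is None:
        # Assumed default: half of the cluster's hosts, rounded up.
        total_threshold = (cluster_size + 1) // 2
    if not responders:
        # No heartbeat response data received at all.
        return False
    # Fewer responders than the threshold means the host cannot communicate.
    # Treating equality as "can communicate" is an assumption of this sketch.
    return len(responders) >= total_threshold
```

Using a set of host identifiers rather than a raw packet count ensures that repeated responses from the same host are counted once.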

In addition, after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, it is detected whether the host has received a fault-handling operation request; if the host has not received the fault-handling operation request, alarm information is sent to the cloud platform again.

Specifically, if no responding operation has been received some time after the alarm was issued, and the management duration has not yet been reached, the alarm is issued again to remind the user to act as soon as possible.
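
The repeat-alarm behaviour can be sketched as a simple loop. Everything named here is hypothetical: the function `alarm_until_handled`, the `time_left_s` and `repeat_interval_s` parameters, and the `request_received` callback standing in for the check on the fault-handling operation request. The source does not say how long to wait between alarms, so the interval is an assumed parameter.

```python
import time
from typing import Callable


def alarm_until_handled(time_left_s: float, repeat_interval_s: float,
                        send_alarm: Callable[[str], None],
                        request_received: Callable[[], bool],
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-send the alarm until a fault-handling request arrives or time runs out.

    time_left_s is the time remaining before the management duration is reached.
    Returns the number of alarms sent.
    """
    start = clock()
    sent = 0
    while clock() - start < time_left_s:
        send_alarm(f"alarm #{sent + 1}: host cannot communicate with other hosts")
        sent += 1
        sleep(repeat_interval_s)   # give the user time to respond
        if request_received():     # fault-handling operation request arrived
            break
    return sent
```

With 60 s left and a 20 s interval, for example, this sketch would send at most three alarms if the user never responds.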

An application example provided by the invention is described below.

To reduce the likelihood of a host being isolated, the invention proposes a method for monitoring the shared-storage heartbeat between hosts of an OCFS2 cluster file system. The host captures packets to obtain OCFS2 cluster file system heartbeat packets, then matches the information at a fixed position in each heartbeat packet, that is, the heartbeat response signal exchanged between hosts in the OCFS2 cluster, to determine whether its own node is running normally. When, within a certain period of time, the host receives response signals from only a minority of the other hosts in the cluster, or from none of them, the host concludes that it is in an abnormal state and, before its own node is fenced, sends a notification to the upper-layer platform, which warns the user through the user interface.

The method provided by this application example mainly includes the following steps:

Step 1: the host listens on its own fixed port 7777 and captures packets to obtain the heartbeat packets of the hosts in the cluster.

Step 2: the host matches the fixed position of each captured heartbeat packet to judge whether the packet is another host's response to itself. If so, it continues capturing packets; otherwise, it continues listening.

Step 3: after a certain period of time, if the host has still received no heartbeat response from the other hosts in the cluster, or the number of hosts from which it has received responses is less than half of the hosts in the cluster, the host sends alarm information to the platform.

Step 4: the platform listens for information sent by the underlying layer. If nothing has been sent, it continues listening; if information has been sent, the platform processes it and passes it to the UI, which informs the user of the state of the underlying storage.
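
Steps 1 to 3 can be illustrated over an already-captured batch of packets, as in the sketch below. Actual packet capture is out of scope here; the `CapturedPacket` record, its fields, and the `check_window` function are hypothetical, and the result of the step 2 fixed-position match is represented by a plain boolean. Whether "hosts in the cluster" counts the local host is not stated in the source, so `cluster_size` is simply whatever the caller supplies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

HEARTBEAT_PORT = 7777  # fixed port named in step 1


@dataclass(frozen=True)
class CapturedPacket:
    """Hypothetical record of one captured packet."""
    timestamp_s: float
    src_host: str
    dst_port: int
    acks_this_host: bool  # outcome of the fixed-position match in step 2


def check_window(packets: Iterable[CapturedPacket], window_start_s: float,
                 window_s: float, cluster_size: int) -> Optional[str]:
    """Return an alarm message if too few hosts responded in the window (step 3)."""
    window_end_s = window_start_s + window_s
    responders = {
        p.src_host for p in packets
        if p.dst_port == HEARTBEAT_PORT
        and window_start_s <= p.timestamp_s < window_end_s
        and p.acks_this_host
    }
    # Alarm when fewer than half of the hosts responded; this also covers zero.
    if len(responders) * 2 < cluster_size:
        return (f"storage heartbeat degraded: "
                f"{len(responders)} of {cluster_size} hosts responded")
    return None
```

A returned message corresponds to the alarm that the host sends to the platform; `None` means enough hosts responded within the window.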

The method provided by this application example can determine within a relatively short time whether a single host in the OCFS2 cluster file system is working normally. If the host is not working normally, the user is notified in advance through a warning message. Even if the host is eventually fenced, the user can learn the reason from the alarm information.

Fig. 2 is a structural diagram of the apparatus for managing the running state of a host in a cluster system provided by the invention. The apparatus shown in Fig. 2 includes:

an obtaining module 201, configured to obtain the management duration, that is, the time from when a host in the CFS cluster file system becomes unable to communicate with other hosts until that host is isolated;

a timing module 202, configured to start a timer, after detecting that the host cannot communicate with other hosts, that takes the time at which the host failed as its starting point;

an alarm module 203, configured to send alarm information to the cloud platform before the recorded duration reaches the management duration.

In one apparatus embodiment provided by the invention, the apparatus further includes:

a first detection module, configured to detect, on a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;

a comparison module, configured to compare the detection result with a preset management policy for heartbeat response data to obtain a comparison result;

a determining module, configured to determine, according to the comparison result, whether the host can communicate with other hosts.

In one apparatus embodiment provided by the invention, the first detection module includes:

a receiving unit, configured to receive storage heartbeat packets sent by other hosts;

a reading unit, configured to read data from a preset field in the storage heartbeat packet;

a determining unit, configured to parse the content of the data and determine whether the data is heartbeat response data addressed to the host.

In one apparatus embodiment provided by the invention, the comparison module includes:

a first judging unit, configured to judge whether heartbeat response data sent by other hosts to the host has been received;

a first processing unit, connected to the first judging unit and configured to determine that the host cannot communicate with other hosts if no heartbeat response data sent by other hosts to the host has been received;

a counting unit, connected to the first judging unit and configured to count the total number of hosts that sent heartbeat response data to the host if heartbeat response data sent by other hosts to the host has been received;

a second judging unit, connected to the counting unit and configured to judge whether the total is greater than a preset total threshold;

a second processing unit, connected to the second judging unit and configured to determine that the host cannot communicate with other hosts if the total is less than the total threshold.

In one apparatus embodiment provided by the invention, the apparatus further includes:

a second detection module, configured to detect whether the host has received a fault-handling operation request after alarm information has been sent to the cloud platform before the recorded duration reaches the management duration;

the alarm module is further configured to send alarm information to the cloud platform again if the host has not received the fault-handling operation request.

本领域普通技术人员可以理解上述实施例的全部或部分步骤可以使用计算机程序流程来实现,所述计算机程序可以存储于一计算机可读存储介质中,所述计算机程序在相应的硬件平台上(如系统、设备、装置、器件等)执行,在执行时,包括方法实施例的步骤之一或其组合。Those of ordinary skill in the art can understand that all or part of the steps of the above-mentioned embodiments can be implemented using a computer program flow, the computer program can be stored in a computer-readable storage medium, and the computer program can be run on a corresponding hardware platform (such as system, device, device, device, etc.), and when executed, includes one or a combination of the steps of the method embodiment.

可选地,上述实施例的全部或部分步骤也可以使用集成电路来实现,这些步骤可以被分别制作成一个个集成电路模块,或者将它们中的多个模块或步骤制作成单个集成电路模块来实现。这样,本发明不限制于任何特定的硬件和软件结合。Optionally, all or part of the steps in the above embodiments can also be implemented using integrated circuits, and these steps can be fabricated into individual integrated circuit modules, or multiple modules or steps among them can be fabricated into a single integrated circuit module accomplish. As such, the present invention is not limited to any specific combination of hardware and software.

The devices, functional modules, and functional units in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices.

When the devices, functional modules, and functional units in the above embodiments are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (10)

1. A method for managing the running state of a host in a cluster system, characterized by comprising:
obtaining the management duration, in a CFS cluster file system, from the time a host becomes unable to communicate with other hosts until that host is isolated;
after detecting that the host cannot communicate with other hosts, starting a timer with the time at which the host failed as the starting point;
sending alarm information to a cloud platform before the recorded duration reaches the management duration.

2. The method according to claim 1, characterized in that whether the host can communicate with other hosts is determined as follows:
detecting, on a preset monitoring port, heartbeat response data sent by other hosts, to obtain a detection result;
comparing the detection result with a preset management policy for heartbeat response data, to obtain a comparison result;
determining, according to the comparison result, whether the host can communicate with other hosts.

3. The method according to claim 2, characterized in that detecting, on the preset monitoring port, the heartbeat response data sent by other hosts to obtain the detection result comprises:
receiving storage heartbeat data packets sent by other hosts;
reading data from a preset field in said preset storage heartbeat data packet;
parsing the content of the data to determine whether the data is heartbeat response data addressed to the host.

4. The method according to claim 2 or 3, characterized in that comparing the detection result with the preset management policy for heartbeat response data to obtain the comparison result comprises:
judging whether heartbeat response data sent by other hosts to the host has been received;
if no heartbeat response data sent by other hosts to the host has been received, determining that the host cannot communicate with other hosts;
if heartbeat response data sent by other hosts to the host has been received, counting the total number of hosts that sent heartbeat response data to the host; judging whether the total number is greater than a preset total-number threshold; and, if the total number is less than the total-number threshold, determining that the host cannot communicate with other hosts.

5. The method according to claim 1, characterized in that, after the alarm information has been sent to the cloud platform before the recorded duration reaches the management duration, the method further comprises:
detecting whether the host has received an operation request for fault handling;
if the host has not received the operation request for fault handling, sending alarm information to the cloud platform again.

6. A device for managing the running state of a host in a cluster system, characterized by comprising:
an obtaining module, configured to obtain the management duration, in a CFS cluster file system, from the time a host becomes unable to communicate with other hosts until that host is isolated;
a timing module, configured to start a timer, after detecting that the host cannot communicate with other hosts, with the time at which the host failed as the starting point;
an alarm module, configured to send alarm information to a cloud platform before the recorded duration reaches the management duration.

7. The device according to claim 6, characterized in that the device further comprises:
a first detection module, configured to detect, on a preset monitoring port, heartbeat response data sent by other hosts, to obtain a detection result;
a comparison module, configured to compare the detection result with a preset management policy for heartbeat response data, to obtain a comparison result;
a determining module, configured to determine, according to the comparison result, whether the host can communicate with other hosts.

8. The device according to claim 7, characterized in that the first detection module comprises:
a receiving unit, configured to receive storage heartbeat data packets sent by other hosts;
a reading unit, configured to read data from a preset field in said preset storage heartbeat data packet;
a determining unit, configured to parse the content of the data and determine whether the data is heartbeat response data addressed to the host.

9. The device according to claim 7 or 8, characterized in that the comparison module comprises:
a first judging unit, configured to judge whether heartbeat response data sent by other hosts to the host has been received;
a first processing unit, configured to determine that the host cannot communicate with other hosts if no heartbeat response data sent by other hosts to the host has been received;
a counting unit, configured to count, if heartbeat response data sent by other hosts to the host has been received, the total number of hosts that sent heartbeat response data to the host;
a second judging unit, configured to judge whether the total number is greater than a preset total-number threshold;
a second processing unit, configured to determine that the host cannot communicate with other hosts if the total number is less than the total-number threshold.

10. The device according to claim 6, characterized in that the device further comprises:
a second detection module, configured to detect whether the host has received an operation request for fault handling, after alarm information has been sent to the cloud platform before the recorded duration reaches the management duration;
the alarm module is further configured to send alarm information to the cloud platform again if the host has not received the operation request for fault handling.
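For illustration only, the communication check of claims 2 to 4 and the alarm window of claim 1 can be sketched as below. The function names, the use of host identifiers as strings, and the treatment of a total exactly equal to the threshold (counted here as able to communicate, since the claims only name the "less than" case) are assumptions, not part of the claims.

```python
from typing import Iterable


def can_communicate(responding_hosts: Iterable[str], total_threshold: int) -> bool:
    """Decide whether the host can still communicate with other hosts.

    responding_hosts: identifiers of hosts whose heartbeat response
    data addressed to this host was detected (hypothetical format).
    """
    responders = set(responding_hosts)  # count each host once
    if not responders:
        # No heartbeat response data received at all.
        return False
    # Fewer responders than the preset threshold means the host is
    # treated as unable to communicate. Equality is an assumed case.
    return len(responders) >= total_threshold


def in_alarm_window(unreachable_seconds: float, management_duration: float) -> bool:
    """True while the timer started at the failure has not yet reached
    the management duration, i.e. while an alarm should still be sent."""
    return 0 <= unreachable_seconds < management_duration
```

For example, with a threshold of 2, responses from a single host leave the host classified as unable to communicate, which starts the timer whose value is checked by `in_alarm_window`.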
CN201710911387.XA 2017-09-29 2017-09-29 The method and apparatus that operational state of mainframe is managed in group system Pending CN107733702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710911387.XA CN107733702A (en) 2017-09-29 2017-09-29 The method and apparatus that operational state of mainframe is managed in group system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710911387.XA CN107733702A (en) 2017-09-29 2017-09-29 The method and apparatus that operational state of mainframe is managed in group system

Publications (1)

Publication Number Publication Date
CN107733702A true CN107733702A (en) 2018-02-23

Family

ID=61209261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710911387.XA Pending CN107733702A (en) 2017-09-29 2017-09-29 The method and apparatus that operational state of mainframe is managed in group system

Country Status (1)

Country Link
CN (1) CN107733702A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959024A (en) * 2018-06-26 2018-12-07 郑州云海信息技术有限公司 A kind of cluster monitoring method and apparatus
CN109445709A (en) * 2018-11-05 2019-03-08 郑州云海信息技术有限公司 The management method and device of storage resource in virtualization system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101753444A (en) * 2009-12-31 2010-06-23 卓望数码技术(深圳)有限公司 Method and device for load balancing
US20120117241A1 (en) * 2010-11-05 2012-05-10 Verizon Patent And Licensing Inc. Server clustering in a computing-on-demand system
CN103455395A (en) * 2013-08-08 2013-12-18 华为技术有限公司 Method and device for detecting hard disk failures
CN104219091A (en) * 2014-08-27 2014-12-17 中国科学院计算技术研究所 System and method for network operation fault detection
CN105872061A (en) * 2016-04-01 2016-08-17 浪潮电子信息产业股份有限公司 Server cluster management method, device and system

Similar Documents

Publication Publication Date Title
US10721135B1 (en) Edge computing system for monitoring and maintaining data center operations
CN101188527B (en) A heartbeat detection method and device
CN104065526B (en) A kind of method and apparatus of server failure alarm
CN104932978B (en) A kind of system operation automatic fault selftesting and the method and system of selfreparing
US11930292B2 (en) Device state monitoring method and apparatus
CN106789386A (en) Method for detecting error on communication bus and error detector for network system
CN101742540A (en) Method and device for online self-diagnosis
US20250254121A1 (en) Device management method, device, system, and storage medium
CN103414916A (en) Fault diagnosis system and method
CN116684256B (en) Node fault monitoring method, device and system, electronic equipment and storage medium
CN104090824B (en) Communication dispatch method, apparatus and system based on Tuxedo middlewares
CN101989933A (en) Method and system for failure detection
CN117221091A (en) Isolation method and device for sub-health nodes in storage cluster and electronic equipment
CN110858813A (en) Network camera safety detection method and device
CN101340567A (en) Reliability guarantee method of network video monitoring frontend
CN111130821B (en) Power failure alarm method, processing method and device
CN101227324A (en) Method for collecting fault information of communication equipment, communication equipment and system
WO2016187979A1 (en) Transmitting method and apparatus for bidirectional forwarding detection (bfd) message
CN107733702A (en) The method and apparatus that operational state of mainframe is managed in group system
CN106095638A (en) The method of a kind of server resource alarm, Apparatus and system
JP2012038257A (en) Os operating state confirmation system, confirmation object device, os operating state confirmation device, and os operating state confirmation method and program
CN100421381C (en) A method and device for acquiring network equipment operation and fault state information
CN112804115B (en) Method, device and equipment for detecting abnormity of virtual network function
CN110798660B (en) Integrated operation and maintenance system based on cloud federation audio and video fusion platform
CN101848474B (en) Communication link monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180223