CN107733702A - Method and apparatus for managing the running state of a host in a cluster system - Google Patents
- Publication number
- CN107733702A (application number CN201710911387.XA)
- Authority
- CN
- China
- Prior art keywords
- host
- hosts
- response data
- heartbeat response
- communicate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0893—Assignment of logical groups to network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Abstract
Description
Technical field
The invention relates to the field of information processing, and in particular to a method and apparatus for managing the running state of a host in a cluster system.
Background
A CFS (Cluster File System) is a system in which multiple host nodes mount the same file system at the same time. It lets all nodes running in the cluster access storage devices concurrently through standard file system interfaces, which simplifies the management of applications that span the entire cluster.
If a host node cannot read or write storage for a certain period of time, the file system isolates the host, that is, fences it. In cloud computing products, a host that is fenced without the user's knowledge is bound to interrupt the user's business processing and affect normal operation.
Summary of the invention
To solve the above technical problem, the present invention provides a method and apparatus for managing the running state of a host in a cluster system, which can reduce the likelihood that a host is isolated by the file system.
To achieve this purpose, the present invention provides a method for managing the running state of a host in a cluster system, including:
obtaining the management duration, that is, the time from when a host in the CFS cluster file system can no longer communicate with other hosts until that host is isolated;
after detecting that the host cannot communicate with other hosts, starting a timer with the time at which the host failed as the starting point;
sending alarm information to the cloud platform before the recorded duration reaches the management duration.
The method further has the following feature: whether the host can communicate with other hosts is determined as follows:
detecting, from a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;
comparing the detection result with a preset management policy for heartbeat response data to obtain a comparison result;
determining, according to the comparison result, whether the host can communicate with other hosts.
The method further has the following feature: detecting, from the preset monitoring port, the heartbeat response data sent by other hosts to obtain the detection result includes:
receiving storage heartbeat data packets sent by other hosts;
reading data from a preset field in the storage heartbeat data packet;
parsing the content of the data to determine whether the data is heartbeat response data addressed to the host.
The method further has the following feature: comparing the detection result with the preset management policy for heartbeat response data to obtain the comparison result includes:
judging whether heartbeat response data sent by other hosts to the host has been received;
if no heartbeat response data sent by other hosts to the host has been received, determining that the host cannot communicate with other hosts;
if heartbeat response data sent by other hosts to the host has been received, counting the total number of hosts that sent heartbeat response data to the host, judging whether the total is greater than a preset total threshold, and, if the total is less than the total threshold, determining that the host cannot communicate with other hosts.
The method further has the following feature: after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, the method further includes:
detecting whether the host has received an operation request for fault handling;
if the host has not received the operation request for fault handling, sending alarm information to the cloud platform again.
An apparatus for managing the running state of a host in a cluster system, including:
an obtaining module, configured to obtain the management duration, that is, the time from when a host in the CFS cluster file system can no longer communicate with other hosts until that host is isolated;
a timing module, configured to start a timer, after detecting that the host cannot communicate with other hosts, with the time at which the host failed as the starting point;
an alarm module, configured to send alarm information to the cloud platform before the recorded duration reaches the management duration.
The apparatus further has the following feature: the apparatus further includes:
a first detection module, configured to detect, from a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;
a comparison module, configured to compare the detection result with a preset management policy for heartbeat response data to obtain a comparison result;
a determining module, configured to determine, according to the comparison result, whether the host can communicate with other hosts.
The apparatus further has the following feature: the first detection module includes:
a receiving unit, configured to receive storage heartbeat data packets sent by other hosts;
a reading unit, configured to read data from a preset field in the storage heartbeat data packet;
a determining unit, configured to parse the content of the data and determine whether the data is heartbeat response data addressed to the host.
The apparatus further has the following feature: the comparison module includes:
a first judging unit, configured to judge whether heartbeat response data sent by other hosts to the host has been received;
a first processing unit, configured to determine that the host cannot communicate with other hosts if no heartbeat response data sent by other hosts to the host has been received;
a counting unit, configured to count the total number of hosts that sent heartbeat response data to the host if such data has been received;
a second judging unit, configured to judge whether the total is greater than a preset total threshold;
a second processing unit, configured to determine that the host cannot communicate with other hosts if the total is less than the total threshold.
The apparatus further has the following feature: the apparatus further includes:
a second detection module, configured to detect, after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, whether the host has received an operation request for fault handling;
the alarm module is further configured to send alarm information to the cloud platform again if the host has not received the operation request for fault handling.
In the embodiments provided by the present invention, once the management duration for host isolation has been obtained, the failed host is timed, and alarm information is sent to the cloud platform before the management duration is reached. The alarm is therefore raised in time, when the host is about to be isolated by the file system, which reduces the likelihood of isolation. Even if the host is eventually isolated by the file system, the user can learn from the notification why the host was fenced, which improves the user experience.
Other features and advantages of the invention will be set forth in the description that follows and will in part be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the present invention and form part of the description. Together with the embodiments of this application they serve to explain the technical solution and do not limit it.
Fig. 1 is a flowchart of the method for managing the running state of a host in a cluster system provided by the present invention;
Fig. 2 is a structural diagram of the apparatus for managing the running state of a host in a cluster system provided by the present invention.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. Where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions. Although the flowcharts show a logical order, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a flowchart of the method for managing the running state of a host in a cluster system provided by the present invention. The method shown in Fig. 1 includes:
Step 101: obtain the management duration, that is, the time from when a host in the OCFS2 cluster system can no longer communicate with other hosts until that host is isolated.
Specifically, a host in a cluster file system is isolated only some time after it loses communication with other hosts. That time must therefore be known in advance so that management actions can be taken before it elapses.
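The text does not say how the management duration is obtained. As a minimal sketch, assuming it is derived from the OCFS2 O2CB disk heartbeat threshold (whose documented relation is timeout = (threshold - 1) × 2 seconds), the value could be computed as follows. The function name and the choice of this particular source are assumptions, not part of the patent.

```python
# Sketch only: derives the isolation (fence) timeout from an O2CB-style
# heartbeat threshold. The patent does not specify this source; it is
# an assumption made for illustration.

HEARTBEAT_INTERVAL_S = 2  # the description states a 2 s heartbeat period


def management_duration_s(heartbeat_threshold: int,
                          interval_s: int = HEARTBEAT_INTERVAL_S) -> int:
    """Return the seconds between loss of heartbeat and isolation."""
    if heartbeat_threshold < 2:
        raise ValueError("heartbeat threshold must be at least 2")
    # A node is declared dead after (threshold - 1) missed heartbeat periods.
    return (heartbeat_threshold - 1) * interval_s
```

With a threshold of 31 this yields 60 seconds, which would then serve as the deadline for the timer started in step 102.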
Step 102: after detecting that the host cannot communicate with other hosts, start a timer with the time at which the host failed as the starting point.
Specifically, once the management duration is known, the failure time of a host is timed as soon as that host can no longer communicate with other hosts.
Step 103: send alarm information to the cloud platform before the recorded duration reaches the management duration.
Specifically, alarm information is sent in time, when the host is about to be isolated by the file system, so that the user can repair the fault and the likelihood of isolation is reduced.
In the method embodiment provided by the present invention, once the management duration for host isolation has been obtained, the failed host is timed, and alarm information is sent to the cloud platform before the management duration is reached. The alarm is therefore raised in time, when the host is about to be isolated by the file system, which reduces the likelihood of isolation. Even if the host is eventually isolated by the file system, the user can learn from the notification why the host was fenced, which improves the user experience.
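Steps 102 and 103 can be sketched as a small countdown object. This is an illustration under stated assumptions, not the patented implementation: the class name, the `warn_margin_s` parameter (how long before the deadline the alarm becomes due), and the injected `now` timestamps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FenceCountdown:
    """Tracks how long a host has been unable to communicate (hypothetical)."""
    management_duration_s: float          # time from failure to isolation
    warn_margin_s: float                  # assumed lead time for the alarm
    failure_started_at: Optional[float] = None

    def on_communication_lost(self, now: float) -> None:
        # Step 102: the failure time is the starting point; do not restart
        # the timer if the failure is already being timed.
        if self.failure_started_at is None:
            self.failure_started_at = now

    def on_communication_restored(self) -> None:
        self.failure_started_at = None

    def should_alert(self, now: float) -> bool:
        # Step 103: alarm only while the recorded duration is still below
        # the management duration, and only once the margin is reached.
        if self.failure_started_at is None:
            return False
        elapsed = now - self.failure_started_at
        deadline = self.management_duration_s
        return deadline - self.warn_margin_s <= elapsed < deadline
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.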
The method provided by the present invention is further described below.
In an OCFS2 system, each host uses a storage heartbeat process to check whether the connections between itself, the other nodes, and the storage device are normal. Every 2 s the storage heartbeat process reads the storage heartbeats of the other nodes and writes the storage heartbeat of its own node. If a host node cannot read or write storage for a certain period of time, it is isolated by the file system. The present invention therefore detects whether the host can communicate with other hosts as follows:
detecting, from a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;
comparing the detection result with a preset management policy for heartbeat response data to obtain a comparison result;
determining, according to the comparison result, whether the host can communicate with other hosts.
Communication detection relies on the existing transmission mechanism of the cluster system and reuses the existing protocol. No protocol modification is required, so the detection is simple and convenient to implement.
In the above detection method, detecting, from the preset monitoring port, the heartbeat response data sent by other hosts to obtain the detection result may further include:
receiving storage heartbeat data packets sent by other hosts;
reading data from a preset field in the storage heartbeat data packet;
parsing the content of the data to determine whether the data is heartbeat response data addressed to the host.
Specifically, the hosts in a cluster system usually agree that a fixed field of the storage heartbeat data packet carries specific data identifying heartbeat response data. When the data is parsed, if its content matches the preset content, the storage heartbeat data packet contains heartbeat response data for the host; if the content is empty or does not match the preset content, the storage heartbeat data packet contains no heartbeat response data for the host.
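The field check described above might look like the following sketch. The packet layout here (a 4-byte marker, then one byte each for the sender and target node IDs) and the marker value are invented for illustration; the patent does not give the real field position or content, and this is not the actual OCFS2 wire format.

```python
import struct
from typing import Optional

# Hypothetical layout: 4-byte marker, 1-byte sender ID, 1-byte target ID.
ACK_MARKER = b"HBAK"                   # assumed preset content
_ACK_LAYOUT = struct.Struct("!4sBB")   # network byte order, 6 bytes total


def ack_sender(packet: bytes, my_node_id: int) -> Optional[int]:
    """Return the sender's node ID if `packet` is a heartbeat response
    addressed to `my_node_id`, otherwise None."""
    if len(packet) < _ACK_LAYOUT.size:
        return None  # empty or truncated field: no response data
    marker, sender, target = _ACK_LAYOUT.unpack_from(packet)
    if marker != ACK_MARKER or target != my_node_id:
        return None  # content is not the preset content, or not for us
    return sender
```

Returning the sender ID rather than a bare boolean makes it straightforward to count distinct responding hosts in the comparison step.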
In the above detection method, comparing the detection result with the preset management policy for heartbeat response data to obtain the comparison result includes:
judging whether heartbeat response data sent by other hosts to the host has been received;
if no heartbeat response data sent by other hosts to the host has been received, determining that the host cannot communicate with other hosts;
if heartbeat response data sent by other hosts to the host has been received, counting the total number of hosts that sent heartbeat response data to the host, judging whether the total is greater than a preset total threshold, and, if the total is less than the total threshold, determining that the host cannot communicate with other hosts.
It should be noted that the total threshold is usually at least half of the total number of hosts in the cluster.
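The management policy above reduces to a count-and-compare check, sketched below. The function names are hypothetical. The text leaves the case where the total equals the threshold unspecified; this sketch assumes that equality counts as able to communicate.

```python
from typing import Iterable


def default_total_threshold(cluster_size: int) -> int:
    """At least half of the hosts in the cluster, rounded up (assumed)."""
    return (cluster_size + 1) // 2


def can_communicate(ack_senders: Iterable[int], total_threshold: int) -> bool:
    """Apply the policy: no responses means no communication; otherwise
    compare the number of distinct responding hosts with the threshold."""
    responders = len(set(ack_senders))  # a host is counted once
    if responders == 0:
        return False
    # Assumption: a total equal to the threshold is treated as healthy.
    return responders >= total_threshold
```

For a five-host cluster the default threshold is 3, so responses from only two distinct hosts would mark the host as unable to communicate.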
In addition, after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, it is detected whether the host has received an operation request for fault handling; if the host has not received such a request, alarm information is sent to the cloud platform again.
Specifically, if no response operation has been received some time after the alarm was issued and the management duration has not yet been reached, the alarm is issued again to remind the user to act as soon as possible.
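The repeat-alarm rule can be sketched as a single predicate. The `repeat_interval_s` parameter stands in for the unspecified "period of time" between alarms and is an assumption, as is the function name.

```python
def should_resend_alert(elapsed_s: float,
                        last_alert_at_s: float,
                        handled: bool,
                        management_duration_s: float,
                        repeat_interval_s: float) -> bool:
    """Decide whether to alarm again.

    elapsed_s and last_alert_at_s are both measured from the failure time.
    """
    if handled:
        return False  # a fault-handling operation request was received
    if elapsed_s >= management_duration_s:
        return False  # the management duration has already been reached
    return elapsed_s - last_alert_at_s >= repeat_interval_s
```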
An application example provided by the present invention is described below.
To reduce the likelihood that a host is isolated, the present invention proposes a method for monitoring the storage heartbeat shared among the hosts of an OCFS2 cluster file system. A host captures packets to obtain the OCFS2 cluster file system heartbeat packets, then matches the information at a fixed position in each heartbeat packet, that is, the heartbeat response signal exchanged between hosts in the OCFS2 cluster, to determine whether its own node is running normally. If, within a certain period of time, the host receives response signals from only a minority of the other hosts in the cluster, or from none of them, the host concludes that it is running abnormally. Before its own node is fenced, it sends a notification to the upper-layer platform, which warns the user through the user interface.
The method provided by this application example mainly includes the following steps:
Step 1: the host listens on its own fixed port 7777 and captures packets to obtain the heartbeat packets of the hosts in the cluster.
Step 2: the host matches the fixed position of each captured heartbeat packet to judge whether the packet is a response from another host to itself. If so, it continues capturing packets; otherwise, it continues listening.
Step 3: after a certain period of time, if the host still has not received heartbeat response information from other hosts in the cluster, or the number of hosts from which it has received responses is less than half of the hosts in the cluster, the host sends alarm information to the platform.
Step 4: the platform listens for information sent by the underlying layer. If nothing has been sent, it continues listening; if information has been sent, the platform processes it and feeds it back to the UI, which informs the user of the state of the underlying storage.
The method provided by this application example can determine within a relatively short time whether a single host in the OCFS2 cluster file system is working normally. If the host is not working normally, the user is notified in advance through a warning message. Even if the host is eventually fenced, the user learns the reason from the alarm information.
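Steps 3 and 4 of the application example can be sketched with an in-process queue standing in for the channel between the host and the platform. Packet capture on port 7777 is omitted; the function names, the message text, and the use of `queue.Queue` as the transport are illustrative assumptions only.

```python
import queue
from typing import Callable, Iterable


def host_report(ack_senders: Iterable[int], cluster_size: int,
                inbox: "queue.Queue[str]") -> bool:
    """Step 3: alarm when fewer than half of the cluster's hosts responded.
    Returns True if an alarm was sent."""
    responders = len(set(ack_senders))
    if 2 * responders < cluster_size:
        inbox.put(f"heartbeat responses from {responders} of "
                  f"{cluster_size} hosts; fence risk")
        return True
    return False


def platform_poll(inbox: "queue.Queue[str]",
                  notify_ui: Callable[[str], None]) -> int:
    """Step 4: forward any pending messages to the UI and return how many
    were forwarded; with an empty inbox the platform just keeps listening."""
    forwarded = 0
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return forwarded
        notify_ui(f"storage status: {message}")
        forwarded += 1
```

In a real deployment the queue would be replaced by whatever messaging mechanism the cloud platform exposes, which the text does not specify.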
Fig. 2 is a structural diagram of the apparatus for managing the running state of a host in a cluster system provided by the present invention. The apparatus shown in Fig. 2 includes:
an obtaining module 201, configured to obtain the management duration, that is, the time from when a host in the CFS cluster file system can no longer communicate with other hosts until that host is isolated;
a timing module 202, configured to start a timer, after detecting that the host cannot communicate with other hosts, with the time at which the host failed as the starting point;
an alarm module 203, configured to send alarm information to the cloud platform before the recorded duration reaches the management duration.
In one apparatus embodiment provided by the present invention, the apparatus further includes:
a first detection module, configured to detect, from a preset monitoring port, the heartbeat response data sent by other hosts to obtain a detection result;
a comparison module, configured to compare the detection result with a preset management policy for heartbeat response data to obtain a comparison result;
a determining module, configured to determine, according to the comparison result, whether the host can communicate with other hosts.
In one apparatus embodiment provided by the present invention, the first detection module includes:
a receiving unit, configured to receive storage heartbeat data packets sent by other hosts;
a reading unit, configured to read data from a preset field in the storage heartbeat data packet;
a determining unit, configured to parse the content of the data and determine whether the data is heartbeat response data addressed to the host.
In one apparatus embodiment provided by the present invention, the comparison module includes:
a first judging unit, configured to judge whether heartbeat response data sent by other hosts to the host has been received;
a first processing unit, connected to the first judging unit and configured to determine that the host cannot communicate with other hosts if no heartbeat response data sent by other hosts to the host has been received;
a counting unit, connected to the first judging unit and configured to count the total number of hosts that sent heartbeat response data to the host if such data has been received;
a second judging unit, connected to the counting unit and configured to judge whether the total is greater than a preset total threshold;
a second processing unit, connected to the second judging unit and configured to determine that the host cannot communicate with other hosts if the total is less than the total threshold.
In one apparatus embodiment provided by the present invention, the apparatus further includes:
a second detection module, configured to detect, after alarm information is sent to the cloud platform before the recorded duration reaches the management duration, whether the host has received an operation request for fault handling;
the alarm module is further configured to send alarm information to the cloud platform again if the host has not received the operation request for fault handling.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, device, apparatus, or component); when executed, it performs one or a combination of the steps of the method embodiment.
Optionally, all or some of the steps of the above embodiments may also be implemented with integrated circuits. The steps may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The apparatuses, functional modules, and functional units in the above embodiments may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network of multiple computing devices.
When the apparatuses, functional modules, and functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above is only a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710911387.XA CN107733702A (en) | 2017-09-29 | 2017-09-29 | The method and apparatus that operational state of mainframe is managed in group system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710911387.XA CN107733702A (en) | 2017-09-29 | 2017-09-29 | The method and apparatus that operational state of mainframe is managed in group system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN107733702A true CN107733702A (en) | 2018-02-23 |
Family
ID=61209261
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710911387.XA Pending CN107733702A (en) | 2017-09-29 | 2017-09-29 | The method and apparatus that operational state of mainframe is managed in group system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107733702A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108959024A (en) * | 2018-06-26 | 2018-12-07 | 郑州云海信息技术有限公司 | A kind of cluster monitoring method and apparatus |
| CN109445709A (en) * | 2018-11-05 | 2019-03-08 | 郑州云海信息技术有限公司 | The management method and device of storage resource in virtualization system |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101753444A (en) * | 2009-12-31 | 2010-06-23 | 卓望数码技术(深圳)有限公司 | Method and device for load balancing |
| US20120117241A1 (en) * | 2010-11-05 | 2012-05-10 | Verizon Patent And Licensing Inc. | Server clustering in a computing-on-demand system |
| CN103455395A (en) * | 2013-08-08 | 2013-12-18 | 华为技术有限公司 | Method and device for detecting hard disk failures |
| CN104219091A (en) * | 2014-08-27 | 2014-12-17 | 中国科学院计算技术研究所 | System and method for network operation fault detection |
| CN105872061A (en) * | 2016-04-01 | 2016-08-17 | 浪潮电子信息产业股份有限公司 | Server cluster management method, device and system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10721135B1 (en) | | Edge computing system for monitoring and maintaining data center operations |
| CN101188527B (en) | | A heartbeat detection method and device |
| CN104065526B (en) | | A kind of method and apparatus of server failure alarm |
| CN104932978B (en) | | A kind of system operation automatic fault selftesting and the method and system of selfreparing |
| US11930292B2 (en) | | Device state monitoring method and apparatus |
| CN106789386A (en) | | Method for detecting error on communication bus and error detector for network system |
| CN101742540A (en) | | Method and device for online self-diagnosis |
| US20250254121A1 (en) | | Device management method, device, system, and storage medium |
| CN103414916A (en) | | Fault diagnosis system and method |
| CN116684256B (en) | | Node fault monitoring method, device and system, electronic equipment and storage medium |
| CN104090824B (en) | | Communication dispatch method, apparatus and system based on Tuxedo middlewares |
| CN101989933A (en) | | Method and system for failure detection |
| CN117221091A (en) | | Isolation method and device for sub-health nodes in storage cluster and electronic equipment |
| CN110858813A (en) | | Network camera safety detection method and device |
| CN101340567A (en) | | Reliability guarantee method of network video monitoring frontend |
| CN111130821B (en) | | Power failure alarm method, processing method and device |
| CN101227324A (en) | | Method for collecting fault information of communication equipment, communication equipment and system |
| WO2016187979A1 (en) | | Transmitting method and apparatus for bidirectional forwarding detection (bfd) message |
| CN107733702A (en) | | The method and apparatus that operational state of mainframe is managed in group system |
| CN106095638A (en) | | The method of a kind of server resource alarm, Apparatus and system |
| JP2012038257A (en) | | Os operating state confirmation system, confirmation object device, os operating state confirmation device, and os operating state confirmation method and program |
| CN100421381C (en) | | A method and device for acquiring network equipment operation and fault state information |
| CN112804115B (en) | | Method, device and equipment for detecting abnormity of virtual network function |
| CN110798660B (en) | | Integrated operation and maintenance system based on cloud federation audio and video fusion platform |
| CN101848474B (en) | | Communication link monitoring method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180223 |