WO2024239569A1 - Cluster service processing method, server, and system - Google Patents

Cluster service processing method, server, and system

Info

Publication number
WO2024239569A1
WO2024239569A1 PCT/CN2023/134453 CN2023134453W WO2024239569A1 WO 2024239569 A1 WO2024239569 A1 WO 2024239569A1 CN 2023134453 W CN2023134453 W CN 2023134453W WO 2024239569 A1 WO2024239569 A1 WO 2024239569A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
local
flow detection
detection
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2023/134453
Other languages
French (fr)
Chinese (zh)
Inventor
田苗
薛居征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Publication of WO2024239569A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of database technology, and in particular to a cluster service processing method, server and system.
  • a high-availability cluster is usually used to build a database. When a high-availability cluster detects that one or more nodes in the cluster have failed, it will switch the business from the failed node to the normal working node, thereby avoiding business interruption.
  • node switching in a high-availability cluster usually relies on heartbeat network detection.
  • Heartbeat network detection detects whether a node has failed by monitoring the heartbeat signals of the nodes in the cluster. When the heartbeat signal of a node is not detected within a specified time, the node is determined to have failed, and the service running on it is switched to a normal node. However, a node that has not itself failed may still have a failed redundant array of independent disks (RAID) card, which controls data storage in the node. In that case the IO streams of the disks managed by the RAID card cannot be read and written normally, which affects the database service, so node switching is also required. Because the node has not failed, its heartbeat signal can still be monitored, so the cluster service system will not switch the service from the node with the failed RAID card to a normal node.
  • RAID: redundant array of independent disks.
  • the embodiments of the present application provide a cluster service processing method, server and system for solving the problem in the prior art that the heartbeat network detection cannot detect the RAID card failure in the node, resulting in the cluster service system not switching the service from the node with the RAID card failure to the normal node for operation.
  • an embodiment of the present application provides a cluster service processing method, which is applied to a cluster service system that includes a first node and a second node; the method includes: when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtaining the detection result of the local input/output (IO) flow detection; when the detection result of the local IO flow detection is abnormal, obtaining the detection result of the peer IO flow detection; based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, determining the cause of the fault, and processing the cluster service according to the cause of the fault; wherein the local IO flow detection is a first local IO flow detection, by the first node, of the IO flow between the first redundant array of independent disks (RAID) card and the first disk in the first node, and a second local IO flow detection, by the second node, of the IO flow between the second RAID card and the second disk in the second node.
  • each node performs local IO flow detection, and when the local detection result is abnormal, the node also detects the IO flow of the peer node. An abnormal local IO flow detection result may be caused either by an abnormality of the local IO flow detection service itself or by a soft failure of the local RAID card. Therefore, when local IO flow detection is abnormal, detecting the IO flow of the peer node makes it possible to further determine whether the local RAID card has a soft failure. This solves the problem of system pseudo-death, in which the master and standby nodes of the cluster service system are not switched even though the disk cannot be read and written normally, and thereby improves the reliability of the cluster service system.
  • obtaining a detection result of a local IO flow detection includes: the first node and the second node respectively initiate a first data read instruction to a disk of the local node; wherein the first data read instruction is used to read first data in the disk of the local node; obtaining first data returned by the disk of the local node based on the first data read instruction; and the detection result of the local IO flow detection is abnormal, including: the first data returned by the disk of the local node is not obtained.
  • the first node and the second node respectively send a data read instruction to their respective local disks, and return data based on the data read instruction. If the data in the local disk can be obtained, the local IO flow detection result is normal. If the data in the local disk is not obtained, the local IO flow detection result is abnormal.
  • the above method is used to realize the detection of the local IO flow by each node.
  • obtaining the first data returned by the disk of the local node based on the first read data instruction includes: obtaining the first data returned by the disk of the local node based on the first read data instruction at a preset time interval within a preset first time period; if the detection result of the local IO flow detection is abnormal, it also includes: if within the preset first time period, at the preset time interval, the total number of times the first data returned by the disk of the local node is obtained is less than a first number threshold, then determining that the detection result of the local IO flow detection is abnormal.
  • each node reads the data in the disk of the local node at a certain time interval within a preset first time period. If the total number of times the first data returned by the disk of the local node is obtained within the preset first time period is greater than or equal to the first number threshold, it can be determined that the detection result of the local IO flow detection is normal, otherwise it is abnormal. By judging the number of times the first data is successfully obtained within a preset time period, the judgment of the detection result of the IO flow detection is more accurate and reliable.
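  • The count-based judgment described above can be sketched as follows. This is a minimal illustration and not the application's implementation: the function name `local_io_check`, the `read_probe` callable (a stand-in for issuing the first data read instruction to a local disk) and all parameter names are assumptions introduced here.

```python
import time
from typing import Callable


def local_io_check(read_probe: Callable[[], bool],
                   attempts: int,
                   min_successes: int,
                   interval_s: float = 1.0) -> str:
    """Probe the local disk `attempts` times, `interval_s` seconds apart.

    `read_probe` is a hypothetical callable that returns True when the
    first data was returned by the local disk. The result is "normal"
    when the number of successful reads reaches `min_successes` (the
    first number threshold), otherwise "abnormal".
    """
    successes = 0
    for i in range(attempts):
        try:
            if read_probe():
                successes += 1
        except OSError:
            # Assumption: a read error is treated as "no data obtained".
            pass
        if i < attempts - 1:
            time.sleep(interval_s)
    return "normal" if successes >= min_successes else "abnormal"
```

  • Counting successes over a window, rather than reacting to a single failed read, keeps one transient read error from being reported as an abnormal result.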
  • obtaining first data returned by a disk of a local node based on a first read data instruction includes: obtaining the first data returned by the disk of the local node based on the first read data instruction within a preset second time period; if a detection result of a local IO flow detection is abnormal, it also includes: if the difference between the acquisition time of the first data returned by the disk of the local node and the preset second time is greater than a first time threshold, determining that the detection result of the local IO flow detection is abnormal; wherein the preset second time is the maximum time for normally obtaining the first data returned by the disk of the local node.
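  • that is, if the time taken to obtain the first data returned by the disk of the local node exceeds the preset second time by more than the first time threshold, the local IO flow detection result is determined to be abnormal; otherwise it is normal.
  • by judging the acquisition time of the first data in this way, the detection result of the local IO flow detection can be made more reliable.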
  • obtaining the first data returned by the disk of the local node based on the first data read instruction includes: within a preset first time period, obtaining the first data returned by the disk of the local node based on the first data read instruction; the detection result of the local IO flow detection being abnormal also includes: within the preset first time period, the number of times that the difference between the acquisition time of the first data returned by the disk of the local node and a preset second time is greater than a first time threshold exceeds a second number threshold; the preset second time is the maximum time for normally obtaining the first data returned by the disk of the local node.
  • the detection result of the local IO flow is judged by judging the number of timeouts for each node to obtain the first data within the first time period, thereby further improving the reliability and accuracy of the detection.
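  • The timeout-count judgment can be illustrated with a short sketch. It assumes the acquisition times of the first data have already been measured in seconds; the function name `timeout_check` and its parameter names are hypothetical and chosen only to mirror the terms used above.

```python
from typing import Sequence


def timeout_check(acquisition_times: Sequence[float],
                  max_normal_time: float,
                  time_threshold: float,
                  count_threshold: int) -> str:
    """Classify local IO flow detection from measured read times.

    max_normal_time : the preset second time (longest normal read time)
    time_threshold  : the first time threshold (tolerated overrun)
    count_threshold : the second number threshold (tolerated overruns)
    """
    # A read counts as a timeout when it overruns the normal maximum
    # by more than the tolerated time threshold.
    overruns = sum(
        1 for t in acquisition_times
        if t - max_normal_time > time_threshold
    )
    return "abnormal" if overruns > count_threshold else "normal"
```

  • For example, with read times of 0.1, 0.9, 1.2 and 0.2 seconds, a normal maximum of 0.5 and a time threshold of 0.3, two reads count as timeouts, so the result is abnormal when the count threshold is 1 and normal when it is 2.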
  • the obtaining of the detection result of the peer IO flow detection includes: the first node or the second node initiating a second read data instruction to the disk of the peer node; wherein the second read data instruction is used to read second data in the disk of the peer node; obtaining second data returned by the disk of the peer node based on the second read data instruction; if the second data returned by the disk of the peer node is obtained, determining that the detection result of the peer IO flow detection is normal; if the second data returned by the disk of the peer node is not obtained, determining that the detection result of the peer IO flow detection is abnormal.
  • each node sends a read data instruction to the disk of the peer node. If the second data returned by the peer node can be obtained, the peer IO flow detection result is determined to be normal, otherwise, it is abnormal. By adding the detection of the peer IO flow, the detection result of the local IO flow can be further verified, thereby enhancing the accuracy and reliability of fault judgment.
  • the determining the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection includes: when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal, determining that the RAID state of the first node is abnormal; when the detection result of the second local IO flow detection is abnormal and the detection result of the second peer IO flow detection is normal, determining that the RAID state of the second node is abnormal; when the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, and the detection result of the second local IO flow detection is normal, determining that the IO flow detection service of the first node is abnormal; when the detection result of the second local IO flow detection is abnormal, the detection result of the second peer IO flow detection is abnormal, and the detection result of the first local IO flow detection is normal, determining that the IO flow detection service of the second node is abnormal.
  • the cluster service system fault can be accurately determined by obtaining the detection result of the peer IO flow and combining it with the local IO flow detection result to determine the cause of the fault.
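  • The combination rules listed above amount to a small decision table per node. The sketch below encodes only the combinations stated in the text; the function name `diagnose_node`, the boolean encoding (True for a normal result) and the returned labels are assumptions made for illustration, and combinations the text does not cover are reported as undetermined.

```python
def diagnose_node(local_ok: bool, peer_ok: bool, other_local_ok: bool) -> str:
    """Determine the fault cause for one node.

    local_ok       : this node's local IO flow detection result
    peer_ok        : this node's peer IO flow detection result
    other_local_ok : the other node's local IO flow detection result
    """
    if local_ok:
        # Peer detection is only consulted when local detection is abnormal.
        return "no_fault_detected"
    if peer_ok:
        # The detection service works (the peer read succeeded), yet the
        # local read fails: the local RAID state is abnormal.
        return "raid_abnormal"
    if other_local_ok:
        # Local and peer reads both fail while the other node reads its
        # own disk normally: this node's detection service is abnormal.
        return "io_detection_service_abnormal"
    # Not covered by the stated rules.
    return "undetermined"
```

  • The same function applies to either node: for the first node, pass the first local, first peer and second local results; for the second node, pass the second local, second peer and first local results.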
  • the method also includes: when the heartbeat detection result of the first node is abnormal and the heartbeat detection result of the second node is normal, determining that the heartbeat network of the first node is abnormal; when the heartbeat detection result of the first node is normal and the heartbeat detection result of the second node is abnormal, determining that the heartbeat network of the second node is abnormal; when the heartbeat detection results of the first node and the second node are both abnormal, determining that the heartbeat networks of the first node and the second node are both abnormal.
  • the processing of the cluster service according to the fault cause includes: the cluster service system is in a hot standby scenario, the first node is a master node, and the second node is a standby node;
  • when it is determined that the RAID card status of the first node is abnormal and the RAID card status of the second node is normal, the cluster service is switched from the first node to the second node for operation, and alarm processing for a soft fault of the RAID card of the first node is performed;
  • when it is determined that the RAID card status of the second node is abnormal and the RAID card status of the first node is normal, alarm processing for a soft fault of the RAID card of the second node is performed;
  • when it is determined that the RAID card status of the first node is abnormal and the RAID card status of the second node is abnormal, alarm processing for a soft fault of the RAID cards of the first node and the second node is performed; when it is determined that the IO flow detection service of the first node and/or
  • the cluster service system is processed accordingly based on each failure cause, thereby improving the reliability of system operation.
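  • A hedged sketch of how the hot standby handling might be expressed in code follows. The source text for this scenario is partly truncated, so the mapping below is an interpretation: switching happens only when the master's RAID card is abnormal and the standby's is normal, and every abnormal RAID card raises an alarm. The names `Action` and `hot_standby_action` are hypothetical, and handling of IO flow detection service faults is omitted because the source does not state it in full.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    # Node the cluster service should move to, or None for no switch.
    switch_to: Optional[str] = None
    # Nodes whose RAID card soft fault should raise an alarm.
    alarms: List[str] = field(default_factory=list)


def hot_standby_action(first_raid_ok: bool, second_raid_ok: bool) -> Action:
    """First node is the master, second node is the standby."""
    if not first_raid_ok and second_raid_ok:
        # Master RAID abnormal, standby healthy: fail over and alarm.
        return Action(switch_to="second", alarms=["first"])
    if first_raid_ok and not second_raid_ok:
        # Only the standby is affected: alarm, keep running on the master.
        return Action(alarms=["second"])
    if not first_raid_ok and not second_raid_ok:
        # No healthy target to switch to: alarm for both nodes.
        return Action(alarms=["first", "second"])
    return Action()
```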
  • the processing of the cluster service according to the fault cause includes: the cluster service system is in an active-active scenario, the first node and the second node are standby nodes for each other, the first node runs the first cluster service, and the second node runs the second cluster service; when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, the first cluster service is switched from the first node to the second node for operation, and an alarm processing of a soft fault of the RAID card of the first node is performed; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, the second cluster service is switched from the second node to the first node for operation, and an alarm processing of a soft fault of the RAID card of the second node is performed; when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, an alarm processing of a soft fault of the RAID card of the
  • the cluster service system is processed accordingly based on each fault cause, thereby improving the reliability of system operation.
  • an embodiment of the present application provides a server, comprising: a processor, a memory, and a communication interface; the memory is used to store executable instructions of the processor; wherein the processor is configured to execute the cluster service processing method described in the first aspect by executing the executable instructions.
  • an embodiment of the present application provides a cluster service system, comprising: at least one first node and at least one second node, the first node is a master node, and the second node is a backup node; wherein the first node executes the cluster service processing method described in the first aspect.
  • the embodiment of the present application provides a cluster service processing method, a server and a system, the method is applied to a cluster service system, the cluster service system includes a first node and a second node; the method includes: when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtaining the detection result of the local input and output IO flow detection; when the detection result of the local IO flow detection is abnormal, obtaining the detection result of the opposite end IO flow detection; based on the detection result of the opposite end IO flow detection and the detection result of the local IO flow detection, determining the cause of the fault, and processing the cluster service according to the cause of the fault; wherein the local input and output IO flow detection is a first local IO flow detection of the IO flow between the first independent redundant disk array RAID card in the first node and the first disk, and a second local IO flow detection of the IO flow between the second RAID card in the second node and the second disk; the opposite end IO flow detection is a first
  • the present application determines the cause of the failure of the cluster service system according to the local IO flow detection and the peer IO flow detection of the first node and the second node, combined with the heartbeat detection results of the first node and the second node, and processes the cluster service according to the cause of the failure. As a result, when the service needs to be switched because of a soft failure of the RAID card, the cluster service system can switch the service from the faulty node to the normal node in time. This solves the problem in the prior art that heartbeat network detection cannot detect a soft failure of the RAID card in a node, so that the cluster service system does not switch the service from the node with the RAID card failure to the normal node for operation.
  • FIG. 1 is a schematic diagram of the structure of a cluster service system.
  • FIG. 2 is a flow chart of a cluster service processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a cluster service system performing IO flow detection.
  • FIG. 4 is a flow chart of a second embodiment of a cluster service processing method provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of a third embodiment of a cluster service processing method provided in an embodiment of the present application.
  • FIG. 6 is a flow chart of a fourth embodiment of a cluster service processing method provided in an embodiment of the present application.
  • FIG. 7 is a flow chart of a fifth embodiment of a cluster service processing method provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of a server embodiment provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the structure of another server embodiment provided in an embodiment of the present application.
  • RAID card: a redundant array of independent disks (RAID) combines multiple independent disks into a large-capacity disk group.
  • the RAID card manages the multiple disks that make up the disk array.
  • the operating system reads and writes data on the disks managed by the RAID card through the RAID card.
  • Fig. 1 is a schematic diagram of the structure of a cluster service system.
  • the cluster service system includes a first node 11 and a second node 12.
  • the first node 11 is a master node
  • the second node 12 is a standby node.
  • a heartbeat network detection is performed between the first node 11 and the second node 12 via a heartbeat network link 13.
  • the heartbeat network detection mainly detects whether a failure occurs to the node by monitoring the heartbeat signal of the node in the cluster.
  • the first node 11 and the second node 12 send heartbeat messages to each other at a fixed frequency via the heartbeat network link 13, and receive the heartbeat message of the opposite node. If the second node 12 does not receive the heartbeat message of the first node 11 within a specified time, it is determined that the first node 11 fails, and the service running on the first node 11 is switched to the second node 12.
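  • The timeout rule for heartbeat messages can be sketched as below. This is an illustration only: the class name `HeartbeatMonitor`, its methods and the use of caller-supplied timestamps are assumptions, and a real deployment would feed it the arrival times of heartbeat messages received over the heartbeat network link.

```python
class HeartbeatMonitor:
    """Tracks heartbeat messages received from the peer node."""

    def __init__(self, timeout_s: float, started_at: float) -> None:
        # timeout_s is the "specified time" after which silence from
        # the peer is treated as a node failure.
        self.timeout_s = timeout_s
        self.last_seen = started_at

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a heartbeat message from the peer arrives."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """True when no heartbeat arrived within the specified time."""
        return now - self.last_seen > self.timeout_s
```

  • With a 3-second timeout and a last heartbeat at t = 1.0, the peer is still considered alive at t = 3.5 and considered failed at t = 4.5. Note that this check says nothing about the state of the peer's RAID card, which is the gap the IO flow detection is meant to close.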
  • the first node 11 and the second node 12 both include a processor running an operating system, a RAID card, and multiple disks managed by the RAID card.
  • the operating system needs to read and write data on the disks it manages through the RAID card. If a RAID card used to control data reading and writing in a node fails, such as an error in the RAID card program, the IO streams of the multiple disks managed by the RAID card cannot be read and written normally, thereby affecting the database service, and node switching is also required at this time. Since the node has not failed, its heartbeat signal can still be monitored, so the cluster service system will not switch the service from the node where the RAID card fails to a normal node.
  • the technical conception process of the present application is as follows: how to detect the RAID card failure of a node to switch the cluster service from the node with the RAID card failure to a normal node.
  • FIG. 2 is a flow chart of a cluster service processing method according to an embodiment of the present application. The method is applied to a cluster service system, which includes a first node and a second node. Referring to FIG. 2, the cluster service processing method specifically includes the following steps:
  • Step S201 When the heartbeat network of the first node and the heartbeat network of the second node are normal, the detection result of the local input and output IO flow detection is obtained.
  • a cluster service system is in a hot standby scenario, including a first node and a second node, wherein the first node is a primary node and the second node is a standby node.
  • a database service runs on the first node, and when a failure occurs in the first node, the database service switches from the first node to the second node.
  • the cluster service system may also include a management node.
  • the first node and the second node are external storage devices.
  • FIG3 is a schematic diagram of IO flow detection performed by the cluster service system.
  • the local IO flow detection is the first local IO flow detection, by the first node, of the IO flow between the first redundant array of independent disks (RAID) card and the first disk in the first node, and/or the second local IO flow detection, by the second node, of the IO flow between the second RAID card and the second disk in the second node.
  • the first local IO flow detection is the detection of the IO flow between the first RAID card and the first disk in the first node by the first node
  • the second local IO flow detection is the detection of the IO flow between the second RAID card and the second disk in the second node by the second node.
  • the local IO flow detection is mainly used to detect whether the local system disk has problems with normal reading and writing.
  • the first node is the main node and the second node is the standby node. Both the first node and the second node include a RAID card and multiple disks managed by it. Among them, the disk can be a system disk.
  • the operating system needs to read and write data to the disk it manages through the RAID card. In the process of reading and writing data, an IO flow is formed between the RAID card and the disk. Therefore, by detecting the IO flow between the RAID card and the disk, it can be determined whether the RAID card has a soft fault such as a program running error. When the detection result of the IO flow between the RAID card and the disk is abnormal, it can be determined that the RAID card has a soft fault such as a program running error.
  • the detection result of the local input and output IO flow detection is obtained.
  • the first node detects the IO flow between the first RAID card and the first disk in the first node to obtain the detection result of the first local IO flow detection
  • the second node detects the IO flow between the second RAID card and the second disk in the second node to obtain the detection result of the second local IO flow detection.
  • the first node as the master node, obtains the detection results of the first local IO flow detection and the second local IO flow detection
  • the management node can also obtain the detection results of the first local IO flow detection and the second local IO flow detection.
  • Step S202 when the detection result of the local IO flow detection is abnormal, the detection result of the peer IO flow detection is obtained.
  • the peer IO flow detection is the first peer IO flow detection of the IO flow between the second RAID card in the second node and the second disk, and/or the second peer IO flow detection of the IO flow between the first RAID card in the first node and the first disk.
  • the first peer IO flow detection is the detection of the IO flow between the second RAID card in the second node and the second disk by the first node;
  • the second peer IO flow detection is the detection of the IO flow between the first RAID card in the first node and the first disk by the second node.
  • the peer IO flow detection is used to verify the results, ensure the consistency and reliability of the results, and to verify whether there is an abnormality in the IO flow detection service itself.
  • the detection result of the peer IO flow detection is obtained.
  • the first node detects the IO flow between the second RAID card and the second disk in the second node to obtain the detection result of the first peer IO flow detection;
  • the second node also detects the IO flow between the first RAID card and the first disk in the first node to obtain the detection result of the second peer IO flow detection.
  • the first node as the master node, obtains the detection results of the first peer IO flow detection and the second peer IO flow detection.
  • the management node can also obtain the detection results of the first peer IO flow detection and the second peer IO flow detection.
  • the first node and the second node also perform heartbeat detection, send heartbeat messages to each other at a fixed frequency, and receive heartbeat messages from the opposite node to obtain heartbeat detection results.
  • the first node as the master node, obtains the heartbeat detection results of the first node and the heartbeat detection results of the second node.
  • the management node can also obtain the heartbeat detection results of the first node and the heartbeat detection results of the second node.
  • Step S203 based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, determine the cause of the fault, and process the cluster service according to the cause of the fault.
  • the first node or the management node determines the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection.
  • when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal, it can be preliminarily determined that the RAID card status of the first node is abnormal; if the detection result of the second local IO flow detection is normal, the IO detection service of the second node is normal.
  • if, in addition, the detection result of the second peer IO flow detection is abnormal, it can be further determined that the RAID status of the first node is abnormal and that the RAID status of the second node is normal.
  • when the first node is the master node, the second node is the standby node, the RAID status of the first node is abnormal and the RAID status of the second node is normal, the cluster service needs to be switched, and the cluster service running on the first node is switched to run on the second node.
  • the cluster service system is in an active-active scenario, including a first node and a second node, the first node and the second node are standby nodes for each other, database service A runs on the first node, and database service B runs on the second node.
  • the cluster service system may also include a management node.
  • the first node, the second node or the management node can obtain the detection result of the local input and output IO flow detection.
  • when the detection result of the local IO flow detection is abnormal, the detection result of the peer IO flow detection is obtained.
  • the first node, the second node or the management node determines the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and processes the cluster service according to the cause of the fault.
  • the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node are obtained.
  • based on the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node, the RAID card status of the first node and of the second node is determined respectively, in order to decide whether to switch the cluster service.
  • the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node can be re-obtained at preset intervals, and the RAID card status of the first node and the second node can be determined based on the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node, respectively, to determine whether to switch the cluster service.
  • the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node can be obtained multiple times within a preset time.
  • the number of abnormal RAID card statuses of the first node and the second node is counted, and whether to switch the cluster service is determined according to the number of abnormal RAID card statuses of the first node and the second node.
  • the first node is the master node
  • the second node is the standby node
  • the number of abnormal RAID card statuses of the first node exceeds the number threshold
  • the number of abnormal RAID card statuses of the second node is 0, that is, when the multiple detection results of the RAID card of the second node are all normal, the cluster service is switched.
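The counting rule above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `RaidSample` type, the `should_switch` function and the sample values are hypothetical names and data introduced only for this example, and the first node is assumed to be the master node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaidSample:
    """One round of RAID card status results (True means abnormal). Hypothetical type."""
    master_abnormal: bool
    standby_abnormal: bool


def should_switch(samples: List[RaidSample], count_threshold: int) -> bool:
    """Switch only if the master's abnormal count exceeds the threshold
    and every sample taken for the standby node was normal."""
    master_count = sum(s.master_abnormal for s in samples)
    standby_count = sum(s.standby_abnormal for s in samples)
    return master_count > count_threshold and standby_count == 0


# Illustrative data: the master is abnormal in 3 of 4 rounds, the standby never is.
samples = [
    RaidSample(True, False),
    RaidSample(True, False),
    RaidSample(False, False),
    RaidSample(True, False),
]
print(should_switch(samples, count_threshold=2))  # prints True (3 > 2, standby count is 0)
```

Requiring a standby count of 0 keeps the service from being moved onto a node whose own RAID card looked abnormal in any of the sampled rounds.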
  • the cluster service system includes a first node and a second node; when the heartbeat network of the first node and the heartbeat network of the second node are normal, a detection result of a local input/output IO flow detection is obtained; when the detection result of the local IO flow detection is abnormal, a detection result of a peer IO flow detection is obtained; based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, a fault cause is determined, and the cluster service is processed according to the fault cause; wherein the local IO flow detection is a first local IO flow detection of the IO flow between the first independent redundant disk array RAID card in the first node and the first disk, and/or a second local IO flow detection of the IO flow between the second RAID card in the second node and the second disk; the peer IO flow detection is a first peer IO flow detection of the IO flow between the second RAID card in the second node and the second disk, and/or a second peer IO
  • the present application uses the local IO flow detection and the peer IO flow detection of the first node and the second node, combined with the heartbeat detection results of the first node and the second node, to determine the cause of a failure of the service cluster, and processes the cluster service according to the cause of the failure. In this way, when a service needs to be switched because of a soft failure of the RAID card, the service cluster can switch the service from the faulty node to the normal node in time. This solves the problem in the prior art that heartbeat network detection cannot detect a soft failure of the RAID card in a node, so that the cluster service system does not switch the service from the node with the faulty RAID card to the normal node.
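The order of checks described above (heartbeat first, then local IO flow detection, and peer IO flow detection only when the local result is abnormal) can be sketched as below. The function name, the probe callables and the returned labels are assumptions made for illustration; the source does not define them.

```python
from typing import Callable


def diagnose_local_node(
    heartbeat_ok: bool,
    local_probe: Callable[[], bool],
    peer_probe: Callable[[], bool],
) -> str:
    """Return a coarse fault label for the node that runs the probes.

    Each probe returns True when data was read back (normal) and False otherwise.
    """
    if not heartbeat_ok:
        # Heartbeat failures are outside the scope of this sketch.
        return "heartbeat_abnormal"
    if local_probe():
        return "normal"
    # The peer probe is only run after the local probe has failed.
    if peer_probe():
        # This node can still read the peer's disk, so its detection service
        # works and the local RAID card is the likely cause.
        return "raid_soft_fault_suspected"
    # Both probes failed: the detection service itself may be at fault.
    # Confirming this needs the peer node's own local result.
    return "detection_service_suspected"


print(diagnose_local_node(True, lambda: False, lambda: True))  # prints raid_soft_fault_suspected
```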
  • FIG. 4 is a flow chart of a second embodiment of a cluster service processing method provided by an embodiment of the present application.
  • the above step S201 may include: the first node and the second node respectively initiate a first read data instruction to the disk of the local node; wherein the first read data instruction is used to read the first data in the disk in the local node; obtain the first data returned by the disk of the local node based on the first read data instruction; the detection result of the local IO flow detection is abnormal, including: the first data returned by the disk of the local node is not obtained.
  • step S201 may include the following steps:
  • Step S401 a first node initiates a first data read instruction to a first disk in the first node; the first data read instruction is used to read first data in the first disk.
  • Step S402 the first node obtains the first data returned by the first disk based on the first data read instruction; if the first data returned by the first disk is obtained, it is determined that the detection result of the first local IO stream detection is normal; if the first data returned by the first disk is not obtained, it is determined that the detection result of the first local IO stream detection is abnormal.
  • the RAID card manages multiple disks.
  • the operating system needs to read and write data on the disks it manages through the RAID card.
  • an IO flow is formed between the RAID card and the disk. Therefore, it is possible to determine whether a soft fault has occurred in the RAID card by detecting the IO flow between the RAID card and the disk.
  • when a soft fault such as a program running error occurs in the RAID card, the disk will not respond to the read instruction by returning data. Therefore, the IO flow between the RAID card and the system disk can be detected by sending a read data instruction to the disk, to determine whether a soft fault has occurred in the RAID card.
  • the underlying IO test tools such as the disk stress test (Flexible Input Output tester, referred to as FIO), IO test software (Input Output meter, referred to as IOmeter), etc., can be called to perform a read operation on the disk.
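As an illustration of calling an IO test tool for a read operation, the sketch below builds an `fio` command line for a single small read and runs a command with a timeout. The job name, block size, device path and helper names are assumptions chosen for this example, not values given in the source. The demonstration runs a harmless Python child process instead of `fio` so that it does not need a real device.

```python
import subprocess
import sys
from typing import List


def build_fio_read_cmd(device: str) -> List[str]:
    """Build an fio command that reads one 4 KiB block from `device`, bypassing the page cache."""
    return [
        "fio",
        "--name=raid_probe",      # hypothetical job name
        f"--filename={device}",
        "--rw=read",
        "--bs=4k",
        "--size=4k",
        "--direct=1",             # direct IO, so the read reaches the storage stack
        "--readonly",             # safety check: refuse any write workload
    ]


def run_probe(cmd: List[str], timeout_s: float) -> bool:
    """Return True if the command exits with status 0 within `timeout_s` seconds."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No response in time, or the tool could not be started.
        return False
    return done.returncode == 0


print(build_fio_read_cmd("/dev/sda"))
print(run_probe([sys.executable, "-c", "pass"], timeout_s=5.0))
```

One caveat of this approach: a child process blocked in uninterruptible IO may not exit promptly after being killed, so a production probe would likely need an additional watchdog outside the process.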
  • when the first node performs the first local IO flow detection, it initiates a first read data instruction to the first disk in the first node.
  • the first read data instruction is used to read the first data in the first disk managed by the first RAID card. If the RAID card in the first node is in normal status, the first disk will return the first data to the operating system of the first node based on the first read data instruction; if a soft failure occurs in the RAID card in the first node, the first disk will not return data.
  • the first node obtains the first data returned by the first disk based on the first data read instruction. If the first data returned by the first disk is successfully obtained, it is determined that the detection result of the first local IO stream detection is normal; if the first data returned by the first disk is not obtained, it is determined that the detection result of the first local IO stream detection is abnormal.
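A read-with-timeout check of this kind can be sketched directly in Python. The helper names and the temporary file standing in for the first disk are assumptions for illustration only; a real probe would read from the device managed by the RAID card.

```python
import concurrent.futures
import os
import tempfile
from typing import Callable


def read_block(path: str, size: int = 4096) -> bytes:
    """Read up to `size` bytes from the start of `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, size)
    finally:
        os.close(fd)


def local_io_probe(read_fn: Callable[[], bytes], timeout_s: float) -> bool:
    """Return True (normal) if `read_fn` returns data within `timeout_s`, else False (abnormal)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(read_fn)
    try:
        data = future.result(timeout=timeout_s)
        return len(data) > 0
    except (concurrent.futures.TimeoutError, OSError):
        # No data came back in time, or the read itself failed.
        return False
    finally:
        # Do not block on a read that may never return.
        executor.shutdown(wait=False)


# A temporary file stands in for the first disk in this demonstration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
print(local_io_probe(lambda: read_block(tmp.name), timeout_s=1.0))  # prints True
os.unlink(tmp.name)
```

A worker thread stuck in a hung read cannot be cancelled from Python and may delay interpreter exit, which is one reason a separate probe process can be preferable.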
  • step S202 may include: the first node or the second node initiates a second read data instruction to the disk in the opposite node; wherein the second read data instruction is used to read the second data in the disk in the opposite node; obtaining the second data returned by the disk in the opposite node based on the second read data instruction; if the second data returned by the disk in the opposite node is obtained, determining that the detection result of the opposite IO flow detection is normal; if the second data returned by the disk in the opposite node is not obtained, determining that the detection result of the opposite IO flow detection is abnormal.
  • step S202 may include the following steps:
  • Step S403 The first node initiates a second data read instruction to the second disk in the second node; the second data read instruction is used to read the second data in the second disk.
  • Step S404 the first node obtains the second data returned by the second disk based on the second data read instruction; if the second data returned by the second disk is obtained, it is determined that the detection result of the first peer IO stream detection is normal; if the second data returned by the second disk is not obtained, it is determined that the detection result of the first peer IO stream detection is abnormal.
  • the first node also performs IO flow detection on the peer second node, that is, the first peer IO flow detection. Specifically, a second read data instruction is initiated to the second disk in the second node, and the second read data instruction is used to read the second data in the second disk. If the RAID card in the second node is in a normal state, the second disk will return the second data to the operating system of the first node based on the second read data instruction; if a soft failure occurs in the RAID card in the second node, the second disk will not return data.
  • the first node obtains the second data returned by the second disk based on the second data read instruction; if the second data returned by the second disk is successfully obtained, it is determined that the detection result of the first peer IO flow detection is normal; if the second data returned by the second disk is not obtained, it is determined that the detection result of the first peer IO flow detection is abnormal.
  • the attribute parameters of the IO flow detection include: interval time, timeout time, timeout times, etc.
  • the first node may perform local IO flow detection and peer IO flow detection at preset intervals. For example, if the first node obtains the first data returned by the first disk at every preset time interval point within the preset first time period, the detection result of the first local IO flow detection is determined to be normal; or, if the number of times the first node obtains the first data returned by the first disk over all time interval points within the preset first time period is greater than or equal to the first number threshold, the detection result of the first local IO flow detection is determined to be normal; if the first node cannot obtain the first data returned by the first disk at all time interval points within the preset first time period, the detection result of the first local IO flow detection is determined to be abnormal; or, if the number of times the first node obtains the first data returned by the first disk over all time interval points within the preset first time period is less than the first number threshold, the detection result of the first local IO flow detection is determined to be abnormal; wherein the first number threshold is the minimum number of times the first node can obtain the first data returned by the first disk within the preset time period; the interval time can be 1s, 5s, 1min, 1h, etc.
  • if the difference between the time when the first node obtains the first data returned by the first disk and the preset second time is less than the first time threshold, it is determined that the detection result of the first local IO flow detection is normal; if the difference between the time when the first node obtains the first data returned by the first disk and the preset second time is greater than the first time threshold, it is determined that the detection result of the first local IO flow detection is abnormal, wherein the second time is the preset maximum time within which the first node can normally obtain the first data returned by the first disk;
  • when the number of timeouts for the first node to obtain the first data returned by the first disk is greater than the second number threshold, the detection result of the first local IO flow detection is determined to be abnormal; when the number of timeouts for the first node to obtain the first data returned by the first disk is less than or equal to the second number threshold, the detection result of the first local IO flow detection is determined to be normal.
  • the number of timeouts is the number of times, within the time period, that the difference between the acquisition time of the first data returned by the first disk obtained by the first node and the preset second time is greater than the first time threshold.
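The interval, timeout and timeout-count criteria can be combined into one evaluation of a detection window, as sketched below. The text presents these criteria as alternatives; applying them together is a choice made for this example. The `ProbePolicy` type, its field names and the sample latencies are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbePolicy:
    """Hypothetical container for the thresholds named in the text."""
    min_successes: int     # first number threshold
    expected_max_s: float  # preset second time: longest normal acquisition time
    slack_s: float         # first time threshold
    max_timeouts: int      # second number threshold


def evaluate_window(latencies: List[Optional[float]], policy: ProbePolicy) -> str:
    """Classify one detection window.

    `latencies` holds one entry per time interval point: the acquisition time
    in seconds, or None when no data was returned at that point.
    """
    successes = sum(1 for t in latencies if t is not None)
    timeouts = sum(
        1
        for t in latencies
        if t is not None and (t - policy.expected_max_s) > policy.slack_s
    )
    if successes < policy.min_successes:
        return "abnormal"
    if timeouts > policy.max_timeouts:
        return "abnormal"
    return "normal"


policy = ProbePolicy(min_successes=4, expected_max_s=0.1, slack_s=0.05, max_timeouts=1)
# Five interval points: one missing reply and one slow reply (0.5 s).
print(evaluate_window([0.01, 0.02, None, 0.5, 0.01], policy))  # prints normal
```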
  • the second node performs the second local IO flow detection and the second peer IO flow detection, which can be performed with reference to the above steps S401 to S404.
  • the first node performs the first local IO flow detection and the first peer IO flow detection by initiating read data instructions to the first disk in the first node and the second disk in the second node, respectively, and reading data from the first disk and the second disk.
  • the operation status of the RAID card of the local node and the RAID card of the opposite node are detected respectively.
  • the IO flow of the local node and the peer node can be detected without affecting the database service to determine whether a soft failure has occurred in the RAID card, which provides the prerequisite for the service cluster to switch the service from the faulty node to the normal node in time when the service needs to be switched due to a soft failure of the RAID card.
  • FIG. 5 is a flow chart of a third embodiment of a cluster service processing method provided by an embodiment of the present application. Based on the embodiments shown in FIG. 2 to FIG. 4 above, the above step S203 specifically includes the following steps:
  • Step S501 Determine the RAID card status of the first node and the second node respectively according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection and the second peer IO flow detection and the heartbeat detection results of the first node and the second node.
  • one of the first node and the second node is a main node, and the other node is a standby node.
  • the RAID card status of the first node and the second node is determined according to the detection results of the first local IO flow detection, the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node.
  • Local IO flow detection is mainly used to detect whether the local system disk has problems with normal reading and writing; peer IO flow detection is used to verify the local IO flow detection results of the peer node to ensure the consistency and reliability of the results, and to verify whether there are any abnormalities in the IO flow detection service itself.
  • the detection result of the first local IO flow detection is abnormal
  • the detection result of the first peer IO flow detection is normal
  • because the detection result of the first peer IO flow detection is normal, the possibility that the IO flow detection service of the first node is abnormal can be ruled out. Therefore, it can be determined that the RAID card state of the first node is abnormal.
  • when the heartbeat detection results of the first node and the second node are both normal and the detection result of the first local IO flow detection is normal, it can be preliminarily determined that the RAID card of the first node and the IO detection service of the first node are normal. To further verify this conclusion, the result of the second peer IO flow detection can be used. If the detection result of the second peer IO flow detection is normal, it is determined that the RAID card status of the first node is normal.
  • the heartbeat detection results of the first node and the second node are both normal, and the first local IO flow detection is normal, and the second peer IO flow detection is normal. If the detection result of the first peer IO flow detection is abnormal, it can be preliminarily determined that the RAID card of the second node is abnormal. At this time, if the detection result of the second local IO flow detection is abnormal, it is further determined that the RAID state of the second node is abnormal.
  • the heartbeat detection results of the first node and the second node are normal, which means that the network of the first node and the second node is normal; the first local IO flow detection is normal, and the second peer IO flow detection is normal, which can respectively indicate that the IO flow detection services of the first node and the second node are normal.
  • the detection result of the second local IO flow detection is abnormal, it can be preliminarily determined that the local disk of the second node has a problem of not being able to read and write normally. In order to further verify this conclusion, the detection result of the first peer IO flow detection is used for judgment.
  • the RAID card state of the second node is abnormal, which further verifies that the local disk of the second node has a problem of not being able to read and write normally. Therefore, it can be further determined that the RAID state of the second node is abnormal.
  • the heartbeat detection results of the first node and the second node are both normal. If the detection result of the second local IO flow detection is normal, it can be preliminarily determined that the local disk of the second node is reading and writing normally. To further verify this conclusion, the detection result of the first peer IO flow detection is checked. If that result is also normal, it is determined that the RAID card status of the second node is normal.
  • the detection result of the first local IO flow detection is abnormal
  • the detection result of the first peer IO flow detection is abnormal
  • the detection result of the second local IO flow detection is normal
  • the detection result of the second peer IO flow detection is normal
  • the heartbeat detection results of the first node and the second node are normal, it means that the network of the first node and the second node is normal; if the detection result of the first local IO flow detection is abnormal, it means that the local disk of the first node may not be able to read and write normally (it may also be that the IO flow detection service of the first node is abnormal), but the detection result of the second peer IO flow detection is normal, which means that the local system disk of the first node can be read and written normally.
  • the abnormal state of the RAID card of the first node can be ruled out, and it can be determined that the abnormal detection result of the first local IO flow detection is due to the abnormality of the IO flow detection service of the first node; if the detection result of the second local IO flow detection is normal, it means that the local disk of the second node can be read and written normally, but the detection result of the first peer IO flow detection is abnormal, which further verifies that the IO flow detection of the first node is abnormal.
  • the detection result of the first local IO flow detection is normal
  • the detection result of the first peer IO flow detection is normal
  • the detection result of the second local IO flow detection is abnormal
  • the detection result of the second peer IO flow detection is abnormal
  • the network conditions of the first node and the second node may also affect the detection result of the peer IO flow detection.
  • the detection result of the peer IO flow detection is also abnormal.
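The case analysis of step S501 can be condensed into a small classifier, sketched below under the assumption that the heartbeat detection results of both nodes are normal. The `NodeState` enum and the function names are hypothetical. The sketch also omits the extra cross-check in which a node's normal local result is confirmed by the other node's peer result, and it returns `UNDETERMINED` for combinations the text does not resolve.

```python
from enum import Enum
from typing import Tuple


class NodeState(Enum):
    NORMAL = "normal"
    RAID_ABNORMAL = "raid_abnormal"
    DETECTION_SERVICE_ABNORMAL = "detection_service_abnormal"
    UNDETERMINED = "undetermined"


def classify_node(own_local_ok: bool, own_peer_ok: bool, other_local_ok: bool) -> NodeState:
    """Classify one node from its own two probes and the other node's local probe."""
    if own_local_ok:
        return NodeState.NORMAL
    if own_peer_ok:
        # The node's detection service can still read the peer disk,
        # so the fault lies with its own RAID card.
        return NodeState.RAID_ABNORMAL
    if other_local_ok:
        # The peer disk is readable by its owner, yet this node cannot read
        # anything: its detection service is the likely culprit.
        return NodeState.DETECTION_SERVICE_ABNORMAL
    return NodeState.UNDETERMINED


def classify_cluster(l1: bool, p1: bool, l2: bool, p2: bool) -> Tuple[NodeState, NodeState]:
    """l1/p1: first local and first peer results; l2/p2: second local and second peer results."""
    return classify_node(l1, p1, l2), classify_node(l2, p2, l1)


# First local abnormal, first peer normal, second local normal, second peer abnormal:
# the first node is classified as RAID abnormal, the second node as normal.
print(classify_cluster(l1=False, p1=True, l2=True, p2=False))
```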
  • Step S502 when it is determined that the RAID card status of the master node is abnormal and the RAID card status of the standby node is normal, the cluster service is switched from the master node to the standby node for operation, and an alarm process for a soft failure of the master node RAID card is performed.
  • Step S503 When the cluster service system is in a hot standby scenario and it is determined that the RAID card status of the standby node is abnormal and the RAID card status of the master node is normal, an alarm processing of a soft fault of the RAID card of the standby node is performed;
  • the cluster service in the active-active scenario, when it is determined that the RAID card status of the standby node is abnormal and the RAID card status of the active node is normal, the cluster service is switched from the standby node to the active node for operation, and the alarm of the soft fault of the RAID card of the standby node is processed.
  • Step S504 when it is determined that the RAID card status of the master node is abnormal and the RAID card status of the standby node is abnormal, an alarm process of a soft failure of the RAID cards of the master node and the standby node is performed.
  • Step S505 when it is determined that the IO flow detection of the primary node and/or the backup node is abnormal, an alarm process of the IO flow detection failure of the primary node and/or the backup node is performed.
  • an IO flow detection anomaly indicates that an IO test tool, such as FIO, IOmeter, etc., has a fault, and an alarm needs to be issued for the IO flow detection fault so that the management personnel can handle the IO flow detection fault.
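Steps S502 to S505 amount to a mapping from the determined states to switching and alarm actions. The sketch below expresses that mapping; the mode strings, the action strings and the `plan_actions` function are hypothetical names chosen for illustration.

```python
from typing import List, Sequence


def plan_actions(
    mode: str,
    master_raid_ok: bool,
    standby_raid_ok: bool,
    detection_fault_nodes: Sequence[str] = (),
) -> List[str]:
    """Return the actions for one decision round.

    mode: "hot-standby" or "active-active" (labels assumed for this sketch).
    detection_fault_nodes: nodes whose IO flow detection was found abnormal.
    """
    if mode not in ("hot-standby", "active-active"):
        raise ValueError(f"unknown mode: {mode}")
    actions: List[str] = []
    if not master_raid_ok and standby_raid_ok:
        # Step S502
        actions.append("switch service: master -> standby")
        actions.append("alarm: master RAID card soft fault")
    elif master_raid_ok and not standby_raid_ok:
        # Step S503: only the active-active scenario moves a service.
        if mode == "active-active":
            actions.append("switch service: standby -> master")
        actions.append("alarm: standby RAID card soft fault")
    elif not master_raid_ok and not standby_raid_ok:
        # Step S504
        actions.append("alarm: master and standby RAID card soft fault")
    for node in detection_fault_nodes:
        # Step S505
        actions.append(f"alarm: IO flow detection fault on {node}")
    return actions


print(plan_actions("hot-standby", master_raid_ok=False, standby_raid_ok=True))
```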
  • a cluster service system is in a hot standby scenario, including a first node and a second node, the first node is a primary node, the second node is a backup node, and a database service runs on the first node. When a failure occurs in the first node, the database service switches from the first node to the second node.
  • the cluster service system may also include a management node.
  • the first node or the management node determines the RAID card status of the first node and the second node respectively based on the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection, and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card status of the first node and the second node.
  • the first node is the main node and the second node is the backup node. The specific method is shown in Table 1 below:
  • Table 1 IO flow detection results and processing solutions of the first node and the second node in the hot standby scenario
  • a cluster service system is in a dual-active scenario, including a first node and a second node, and the first node and the second node are standby nodes for each other.
  • the first node runs a first cluster service
  • the second node runs a second cluster service.
  • the first cluster service switches from the first node to the second node; when the second node fails, the second cluster service switches from the second node to the first node.
  • the cluster service system may also include a management node.
  • the first node, the second node or the management node obtains the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node.
  • the first node, the second node or the management node determines the RAID card status of the first node and the second node respectively according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card status of the first node and the second node.
  • The specific method is shown in Table 2 below:
  • FIG. 6 is a flow chart of a fourth embodiment of a cluster service processing method provided in an embodiment of the present application.
  • the first node and the second node are standby nodes for each other.
  • Database service A is running on the first node
  • database service B is running on the second node.
  • the heartbeat detection results of the first node and the second node are normal, the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is normal, the detection result of the second local IO flow detection is normal, and the detection result of the second peer IO flow detection is abnormal. Therefore, it is determined that the system disk read of the first node is abnormal, there is a system pseudo-death situation, and database business A needs to be switched from the first node to the second node.
  • FIG. 7 is a flow chart of a fifth embodiment of a cluster service processing method provided in an embodiment of the present application.
  • the embodiment of the present application performs IO flow detection by calling the underlying IO test tools, such as FIO, IOmeter, etc., to periodically read the system disk. For example, if the result is returned, it is normal; if there is no response after the timeout, it is determined that the system disk read is abnormal, and there is a system pseudo-death situation.
  • the result needs to be synchronized to the node where the cluster service is located to switch the cluster service to the normal node to ensure business continuity.
  • a network channel such as an IPMI channel, can be established through out-of-band management of normal nodes and faulty nodes to restart the faulty node and thus switch the cluster service.
  • the first node determines the RAID card status of the first node and the second node according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection and the second peer IO flow detection and the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card status of the first node and the second node, and can timely switch the service from the faulty node to the normal node when the service needs to be switched due to the soft failure of the RAID card, and realize the alarm of abnormal situation.
  • This further solves the problem in the prior art that the service cannot be switched from the node with the faulty RAID card to the normal node because the heartbeat network detection cannot detect the soft failure of the RAID card in the node.
  • FIG. 8 is a schematic diagram of the structure of a server embodiment provided in an embodiment of the present application. As shown in FIG. 8, the server 60 includes an acquisition module 61 and a processing module 62.
  • the acquisition module 61 is used to acquire the detection result of the local input and output IO flow detection when the heartbeat network of the first node and the heartbeat network of the second node are normal; the acquisition module 61 is also used to acquire the detection result of the peer IO flow detection when the detection result of the local IO flow detection is abnormal; the processing module 62 is used to determine the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and process the cluster service according to the cause of the fault; wherein the local IO flow detection is a first local IO flow detection of the IO flow between the first independent redundant disk array RAID card in the first node and the first disk, and/or a second local IO flow detection of the IO flow between the second RAID card in the second node and the
  • the server provided in the embodiment of the present application can execute the technical solution shown in the above method embodiment, and its implementation principle and beneficial effects are similar, which will not be repeated here.
  • the acquisition module 61 is specifically used for the first node and the second node to respectively initiate a first read data instruction to the disk of the local node; wherein the first read data instruction is used to read the first data in the disk in the local node; obtain the first data returned by the disk of the local node based on the first read data instruction; the detection result of the local IO flow detection being abnormal includes: the first data returned by the disk of the local node is not obtained.
  • the acquisition module 61 is specifically used to obtain the first data returned by the disk of the local node based on the first read data instruction at a preset time interval within a preset first time period; the detection result of the local IO flow detection is abnormal, and also includes: if within the preset first time period, at the preset time interval, the total number of times the first data returned by the disk of the local node is obtained is less than the first number threshold, then the detection result of the local IO flow detection is determined to be abnormal.
  • the acquisition module 61 is specifically used to acquire, within a preset second time period, the first data returned by the disk of the local node based on the first read data instruction; the detection result of the local IO flow detection is abnormal, and further includes: if the difference between the acquisition time of the first data returned by the disk of the local node and the preset second time is greater than the first time threshold, then determining that the detection result of the local IO flow detection is abnormal; wherein the preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
  • the acquisition module 61 is specifically used to acquire, within a preset first time period, the first data returned by the disk of the local node based on the first read data instruction; the detection result of the local IO flow detection is abnormal, and also includes: within the preset first time period, the number of times the difference between the acquisition time of the first data returned by the disk of the local node and the preset second time is greater than the first time threshold is greater than the second number threshold; the preset second time is the maximum time for normally acquiring the first data returned by the disk of the local node.
  • the server provided in the embodiment of the present application can execute the technical solution shown in the above method embodiment, and its implementation principle and beneficial effects are similar, which will not be repeated here.
  • the acquisition module 61 is specifically used for the first node or the second node to initiate a second read data instruction to the disk in the opposite node; wherein the second read data instruction is used to read the second data in the disk in the opposite node; obtain the second data returned by the disk in the opposite node based on the second read data instruction; if the second data returned by the disk in the opposite node is obtained, it is determined that the detection result of the opposite IO flow detection is normal; if the second data returned by the disk in the opposite node is not obtained, it is determined that the detection result of the opposite IO flow detection is abnormal.
  • the server provided in the embodiment of the present application can execute the technical solution shown in the above method embodiment, and its implementation principle and beneficial effects are similar, which will not be repeated here.
  • the processing module 62 is specifically used to determine that the RAID state of the first node is abnormal when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal; determine that the RAID state of the second node is abnormal when the detection result of the second local IO flow detection is abnormal and the detection result of the second peer IO flow detection is normal; determine that the IO flow detection service of the first node is abnormal when the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, and the detection result of the second local IO flow detection is normal; determine that the IO flow detection service of the second node is abnormal when the detection result of the second local IO flow detection is abnormal, the detection result of the second peer IO flow detection is abnormal, and the detection result of the first local IO flow detection is normal.
  • the server provided in the embodiment of the present application can execute the technical solution shown in the above method embodiment, and its implementation principle and beneficial effects are similar, which will not be repeated here.
  • the cluster service system is in a hot standby scenario, in which the first node is the master node and the second node is the standby node
  • the processing module 62 is specifically used to switch the cluster service from the first node to the second node for operation when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, and perform alarm processing of the soft fault of the RAID card of the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, perform alarm processing of the soft fault of the RAID card of the second node; when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, perform alarm processing of the soft fault of the RAID cards of the first node and the second node; when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, perform alarm processing of the IO flow detection
  • the cluster service system is in a dual-active scenario, the first node and the second node are backup nodes for each other, the first node runs the first cluster service, and the second node runs the second cluster service; the processing module 62 is specifically used to switch the first cluster service from the first node to the second node for operation when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, and perform alarm processing of the soft fault of the RAID card of the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switch the second cluster service from the second node to the first node for operation, and perform alarm processing of the soft fault of the RAID card of the second node; when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, perform alarm processing of the soft fault of the RAID cards of the first node and the second node
  • FIG9 is a schematic diagram of the structure of a server provided in an embodiment of the present application.
  • the server 70 includes: a processor 71, a memory 72, and a communication interface 73; wherein the memory 72 is used to store executable instructions of the processor 71; the processor 71 is configured to execute the technical solution in any of the aforementioned method embodiments by executing the executable instructions.
  • the memory 72 can be independent or integrated with the processor 71.
  • the server 70 may further include: a bus 74 for connecting the above devices.
  • the server is used to execute the technical solution in any of the aforementioned method embodiments, and its implementation principle and technical effect are similar and will not be repeated here.
  • the embodiment of the present application also provides a cluster service system.
  • the cluster service system includes at least one first node and at least one second node, wherein the first node is a master node and the second node is a backup node; wherein the first node executes the technical solution in any of the above method embodiments.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • when the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes: ROM, RAM, magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present application provides a cluster service processing method, a server, and a system. The method is applied to a cluster service system, and the cluster service system comprises a first node and a second node. The method comprises: when a heartbeat network of a first node and a heartbeat network of a second node are normal, acquiring a detection result of local input/output (IO) stream detection; when the detection result of the local IO stream detection is anomalous, acquiring a detection result of opposite-end IO stream detection; and determining a fault reason on the basis of the detection result of the opposite-end IO stream detection and the detection result of the local IO stream detection, and processing a cluster service according to the fault reason. The present application solves the problem in the prior art that a cluster service system cannot switch a service from a node having a RAID card fault to a normal node, because heartbeat network detection is unable to detect the RAID card fault in the node.

Description

A cluster service processing method, server and system

This application claims priority to the Chinese patent application filed with the China Patent Office on May 19, 2023, with application number 202310598620.9 and titled "A Cluster Service Processing Method, Server and System", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of database technology, and in particular to a cluster service processing method, server and system.

Background Art

The rapid development of digitalization and informatization has brought great convenience to production and daily life, but it has also generated large amounts of data. Existing technologies usually store and analyze this data in databases. To ensure service continuity, databases are usually built on high-availability clusters. When a high-availability cluster detects that one or more of its nodes have failed, it switches services from the failed node to a normally working node, thereby avoiding service interruption.

In the prior art, node switching in a high-availability cluster usually relies on heartbeat network detection. Heartbeat network detection determines whether a node has failed by monitoring the heartbeat signals of the nodes in the cluster. If the heartbeat signal of a node is not detected within a specified time, the node is determined to have failed, and the services running on it are switched to a normal node. However, even when a node has not failed, if the Redundant Array of Independent Disks (RAID) card that controls data storage in the node fails, the IO flows of the disks managed by the RAID card can no longer be read or written normally, which affects the database service; node switching is required in this case as well. Because the node itself has not failed, its heartbeat signal can still be detected, so the cluster service system does not switch services from the node with the failed RAID card to a normal node.

Summary of the Invention

The embodiments of the present application provide a cluster service processing method, server and system, which solve the problem in the prior art that heartbeat network detection cannot detect a RAID card fault in a node, so that the cluster service system does not switch services from the node with the faulty RAID card to a normal node.

In a first aspect, an embodiment of the present application provides a cluster service processing method applied to a cluster service system, the cluster service system including a first node and a second node. The method includes: when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtaining the detection result of local input/output (IO) flow detection; when the detection result of the local IO flow detection is abnormal, obtaining the detection result of peer IO flow detection; and determining the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and processing the cluster service according to the cause of the fault. The local IO flow detection comprises a first local IO flow detection, performed by the first node on the IO flow between the first Redundant Array of Independent Disks (RAID) card and the first disk in the first node, and a second local IO flow detection, performed by the second node on the IO flow between the second RAID card and the second disk in the second node. The peer IO flow detection comprises a first peer IO flow detection, performed by the first node on the IO flow between the second RAID card and the second disk in the second node, and/or a second peer IO flow detection, performed by the second node on the IO flow between the first RAID card and the first disk in the first node.

In the embodiments of the present application, an IO flow detection service is introduced: each node detects its local IO flow and, when the local detection result is abnormal, detects the IO flow of the peer node. An abnormal local IO flow detection result may be caused either by an abnormal local IO flow detection service or by a soft fault of the local RAID card. Therefore, when the local IO flow detection is abnormal, detecting the peer IO flow makes it possible to further determine whether the local RAID card has a soft fault. This solves the problem that, when a disk cannot be read or written normally, the system hangs (pseudo-death) without the master and standby nodes of the cluster service system being switched, and it improves the reliability of the cluster service system.

In a specific implementation, obtaining the detection result of the local IO flow detection includes: the first node and the second node each initiating a first read data instruction to the disk of its local node, where the first read data instruction is used to read first data from the disk of the local node; and obtaining the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO flow detection being abnormal includes: the first data returned by the disk of the local node is not obtained.

In the above implementation, the first node and the second node each send a read data instruction to their own local disk, and the disk returns data based on that instruction. If the data in the local disk is obtained, the local IO flow detection result is normal; if it is not obtained, the result is abnormal. In this way, each node detects its local IO flow.
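As a non-limiting illustration of the local read check described above, the following Python sketch issues one read and reports whether data came back. The function names `probe_read` and `local_io_result`, the `path` argument, and the 512-byte read size are assumptions made for this example and do not appear in the application; a regular file stands in for a disk behind the RAID card.

```python
def probe_read(path: str, nbytes: int = 512) -> bool:
    """Return True if at least one byte could be read from `path`.

    Hypothetical helper: `path` stands in for the local disk; the
    application does not name a device, an API, or a read size.
    """
    try:
        # buffering=0 avoids Python-level buffering of the read.
        with open(path, "rb", buffering=0) as f:
            data = f.read(nbytes)
    except OSError:
        # Missing device, permission error, or an IO error reported by the OS.
        return False
    return len(data) > 0


def local_io_result(path: str) -> str:
    """Map the read outcome onto the 'normal' / 'abnormal' detection result."""
    return "normal" if probe_read(path) else "abnormal"
```

A read that hangs on a faulty RAID card would block rather than fail, and a read served from the operating system's cache would not exercise the disk, so a practical check would likely also need a time limit and a cache-bypassing read.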

In a specific implementation, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: within a preset first time period, obtaining, at preset time intervals, the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO flow detection being abnormal further includes: if, within the preset first time period, the total number of times the first data returned by the disk of the local node is obtained at the preset time intervals is less than a first count threshold, determining that the detection result of the local IO flow detection is abnormal.

In the above implementation, within a preset first time period, each node reads data from the disk of its local node at every preset time interval. If the total number of times the first data is obtained within the preset first time period is greater than or equal to the first count threshold, the detection result of the local IO flow detection is determined to be normal; otherwise it is abnormal. Judging by the number of successful reads within a preset time period makes the IO flow detection result more accurate and reliable.
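The counting rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the name `local_io_normal_by_count`, its parameters, and the injectable `sleep` function are hypothetical, and `probe` is any callable that returns True when one read succeeds.

```python
import time
from typing import Callable


def local_io_normal_by_count(
    probe: Callable[[], bool],
    period_s: float,
    interval_s: float,
    min_successes: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe once per interval across the period and count successful reads."""
    attempts = int(period_s // interval_s)
    successes = 0
    for i in range(attempts):
        if probe():
            successes += 1
        if i < attempts - 1:
            sleep(interval_s)  # wait for the next sampling point
    # Normal only if the success count reaches the first count threshold.
    return successes >= min_successes
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the rule be exercised without waiting for the real time period.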

In a specific implementation, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: within a preset second time period, obtaining the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO flow detection being abnormal further includes: if the difference between the acquisition time of the first data returned by the disk of the local node and the preset second time is greater than a first time threshold, determining that the detection result of the local IO flow detection is abnormal; wherein the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.

In the above implementation, the timeout of each node in obtaining the first data is evaluated within the preset second time period. If the timeout is greater than the first time threshold, the local IO flow detection result is determined to be abnormal; otherwise it is normal. This makes the local IO flow detection result more reliable.
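As a non-limiting sketch of the timeout rule above, the following Python code measures how long one read takes and compares the overrun beyond the normal maximum with a threshold. The names `timed_probe` and `read_overran`, and the use of milliseconds, are assumptions for illustration; `max_normal_ms` plays the role of the preset second time and `threshold_ms` the role of the first time threshold.

```python
import time
from typing import Callable, Tuple


def timed_probe(probe: Callable[[], bool]) -> Tuple[bool, float]:
    """Run one read probe and return (succeeded, elapsed milliseconds)."""
    start = time.monotonic()
    ok = probe()
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ok, elapsed_ms


def read_overran(elapsed_ms: float, max_normal_ms: float, threshold_ms: float) -> bool:
    """True if the read exceeded the normal maximum by more than the threshold."""
    return (elapsed_ms - max_normal_ms) > threshold_ms
```

For example, with a normal maximum of 500 ms and a threshold of 200 ms, a read that takes 800 ms overruns by 300 ms and is treated as abnormal, while a read that takes 700 ms overruns by exactly 200 ms and is not.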

In a specific implementation, obtaining the first data returned by the disk of the local node based on the first read data instruction includes: within a preset first time period, obtaining the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO flow detection being abnormal further includes: within the preset first time period, the number of times that the difference between the acquisition time of the first data returned by the disk of the local node and a preset second time is greater than a first time threshold exceeds a second count threshold; the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.

In the above implementation, the local IO flow detection result is determined from the number of times each node times out when obtaining the first data within the first time period, which further improves the reliability and accuracy of the detection.
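The timeout-count rule above can be illustrated with the sketch below, which assumes the elapsed time of each read in the first time period has already been recorded. The function name and parameters are hypothetical; `max_overruns` plays the role of the second count threshold.

```python
from typing import Iterable


def local_io_abnormal_by_overruns(
    elapsed_ms_samples: Iterable[float],
    max_normal_ms: float,
    threshold_ms: float,
    max_overruns: int,
) -> bool:
    """True if too many reads in the period exceeded the allowed overrun."""
    overruns = sum(
        1 for t in elapsed_ms_samples if (t - max_normal_ms) > threshold_ms
    )
    # Abnormal only when the overrun count exceeds the second count threshold.
    return overruns > max_overruns
```

Counting overruns rather than reacting to a single slow read tolerates an occasional latency spike, which appears to be the reliability benefit the implementation is aiming for.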

In a specific implementation, obtaining the detection result of the peer IO flow detection includes: the first node or the second node initiating a second read data instruction to the disk of the peer node, where the second read data instruction is used to read second data from the disk of the peer node; obtaining the second data returned by the disk of the peer node based on the second read data instruction; if the second data returned by the disk of the peer node is obtained, determining that the detection result of the peer IO flow detection is normal; and if the second data returned by the disk of the peer node is not obtained, determining that the detection result of the peer IO flow detection is abnormal.

In the above implementation, each node sends a read data instruction to the disk of the peer node. If the second data returned by the peer node is obtained, the peer IO flow detection result is determined to be normal; otherwise it is abnormal. Adding peer IO flow detection allows the local IO flow detection result to be further verified, which improves the accuracy and reliability of fault determination.
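A minimal sketch of the peer check follows. This passage does not specify how a node reaches the peer node's disk, so the transport is left as a caller-supplied callable `read_peer`, which is purely hypothetical, as is the function name `peer_io_result`.

```python
from typing import Callable, Optional


def peer_io_result(read_peer: Callable[[], Optional[bytes]]) -> str:
    """Return 'normal' if the peer disk returned data, else 'abnormal'.

    `read_peer` is a hypothetical stand-in for whatever mechanism sends
    the second read data instruction to the peer node's disk.
    """
    try:
        data = read_peer()
    except OSError:
        # Connection failures and timeouts are subclasses of OSError.
        return "abnormal"
    return "normal" if data else "abnormal"
```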

In a specific implementation, determining the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection includes: when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal, determining that the RAID state of the first node is abnormal; when the detection result of the second local IO flow detection is abnormal and the detection result of the second peer IO flow detection is normal, determining that the RAID state of the second node is abnormal; when the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, and the detection result of the second local IO flow detection is normal, determining that the IO flow detection service of the first node is abnormal; and when the detection result of the second local IO flow detection is abnormal, the detection result of the second peer IO flow detection is abnormal, and the detection result of the first local IO flow detection is normal, determining that the IO flow detection service of the second node is abnormal.

In the above implementation, when the local IO flow detection result is abnormal, the peer IO flow detection result is obtained and combined with the local IO flow detection result to determine the cause of the fault, so that faults of the cluster service system can be determined accurately.
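The four determination rules above can be written as a small decision function. The following Python sketch is illustrative only: the `Fault` enumeration, the function `diagnose`, and the convention that `True` means "normal" and `None` means "peer detection was not performed" are assumptions introduced here.

```python
from enum import Enum
from typing import Optional, Set


class Fault(Enum):
    FIRST_NODE_RAID = "first node RAID state abnormal"
    SECOND_NODE_RAID = "second node RAID state abnormal"
    FIRST_NODE_DETECTION_SERVICE = "first node IO flow detection service abnormal"
    SECOND_NODE_DETECTION_SERVICE = "second node IO flow detection service abnormal"


def diagnose(
    first_local_ok: bool,
    first_peer_ok: Optional[bool],
    second_local_ok: bool,
    second_peer_ok: Optional[bool],
) -> Set[Fault]:
    """Apply the four rules; True = normal, None = peer detection not run."""
    faults: Set[Fault] = set()
    # Local abnormal, peer normal: the local RAID state is at fault.
    if not first_local_ok and first_peer_ok is True:
        faults.add(Fault.FIRST_NODE_RAID)
    if not second_local_ok and second_peer_ok is True:
        faults.add(Fault.SECOND_NODE_RAID)
    # Local and peer both abnormal while the other node's local check is
    # normal: the detection service on this node is at fault.
    if not first_local_ok and first_peer_ok is False and second_local_ok:
        faults.add(Fault.FIRST_NODE_DETECTION_SERVICE)
    if not second_local_ok and second_peer_ok is False and first_local_ok:
        faults.add(Fault.SECOND_NODE_DETECTION_SERVICE)
    return faults
```

Combinations not covered by the four rules, such as all four detections being abnormal, yield an empty set in this sketch; how such cases are handled is not stated in this passage.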

In a specific implementation, the method further includes: when the heartbeat detection result of the first node is abnormal and the heartbeat detection result of the second node is normal, determining that the heartbeat network of the first node is abnormal; when the heartbeat detection result of the first node is normal and the heartbeat detection result of the second node is abnormal, determining that the heartbeat network of the second node is abnormal; and when the heartbeat detection results of the first node and the second node are both abnormal, determining that the heartbeat networks of the first node and the second node are both abnormal.

In the above implementation, besides the system hang (pseudo-death) of a node, other faults may also occur in the cluster service system. Detecting the heartbeat network of each node and judging from the detection results whether each node's heartbeat network is normal makes fault determination for the cluster service system more accurate.
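As a short illustration of the heartbeat determination above and of its role as the precondition for IO flow detection, consider the sketch below. The function names and the string labels are assumptions made for this example.

```python
from typing import List


def heartbeat_faults(first_ok: bool, second_ok: bool) -> List[str]:
    """List which heartbeat networks are abnormal, per the rules above."""
    faults = []
    if not first_ok:
        faults.append("first node heartbeat network abnormal")
    if not second_ok:
        faults.append("second node heartbeat network abnormal")
    return faults


def should_run_io_flow_detection(first_ok: bool, second_ok: bool) -> bool:
    """IO flow detection results are obtained only when both heartbeats are normal."""
    return first_ok and second_ok
```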

In a specific implementation, processing the cluster service according to the cause of the fault includes the following, where the cluster service system is in a hot standby scenario, the first node is the master node, and the second node is the standby node: when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the cluster service from the first node to the second node and raising an alarm for a RAID card soft fault of the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, raising an alarm for a RAID card soft fault of the second node; when it is determined that the RAID card states of both the first node and the second node are abnormal, raising an alarm for RAID card soft faults of the first node and the second node; and when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raising an alarm for an IO flow detection fault of the first node and/or the second node.

In the above implementation, for a system in a hot standby scenario, the cluster service system is handled according to each cause of fault, which improves the reliability of system operation.

In a specific implementation, processing the cluster service according to the cause of the fault includes the following, where the cluster service system is in a dual-active scenario, the first node and the second node are standby nodes for each other, the first node runs a first cluster service, and the second node runs a second cluster service: when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the first cluster service from the first node to the second node and raising an alarm for a RAID card soft fault of the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switching the second cluster service from the second node to the first node and raising an alarm for a RAID card soft fault of the second node; when it is determined that the RAID card states of both the first node and the second node are abnormal, raising an alarm for RAID card soft faults of the first node and the second node; and when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raising an alarm for an IO flow detection fault of the first node and/or the second node.

In the above implementation, for a system in a dual-active scenario, the cluster service system is handled according to each cause of fault, which improves the reliability of system operation.
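The hot standby and dual-active handling rules described above differ only in what happens when the second node's RAID card alone is abnormal. The following Python sketch makes that difference explicit. The function `plan_actions`, the mode strings, and the action strings are hypothetical; the sketch only lists the actions and does not perform a switch or raise a real alarm.

```python
from typing import List


def plan_actions(
    mode: str,
    first_raid_abnormal: bool,
    second_raid_abnormal: bool,
    first_service_abnormal: bool = False,
    second_service_abnormal: bool = False,
) -> List[str]:
    """Return the handling steps for 'hot_standby' or 'dual_active' mode."""
    if mode not in ("hot_standby", "dual_active"):
        raise ValueError(f"unknown mode: {mode}")
    actions: List[str] = []
    if first_raid_abnormal and not second_raid_abnormal:
        actions.append("switch service: first node -> second node")
        actions.append("alarm: RAID card soft fault on first node")
    elif second_raid_abnormal and not first_raid_abnormal:
        # In hot standby the second node runs no service, so only an alarm is raised.
        if mode == "dual_active":
            actions.append("switch service: second node -> first node")
        actions.append("alarm: RAID card soft fault on second node")
    elif first_raid_abnormal and second_raid_abnormal:
        actions.append("alarm: RAID card soft fault on first node and second node")
    if first_service_abnormal:
        actions.append("alarm: IO flow detection fault on first node")
    if second_service_abnormal:
        actions.append("alarm: IO flow detection fault on second node")
    return actions
```

When both RAID cards are abnormal, neither mode switches the service, since no healthy node remains to take it over; only alarms are raised.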

In a second aspect, an embodiment of the present application provides a server, including: a processor, a memory, and a communication interface; the memory is used to store executable instructions of the processor; wherein the processor is configured to perform the cluster service processing method described in the first aspect by executing the executable instructions.

In a third aspect, an embodiment of the present application provides a cluster service system, including: at least one first node and at least one second node, where the first node is a master node and the second node is a standby node; wherein the first node performs the cluster service processing method described in the first aspect.

The embodiments of the present application provide a cluster service processing method, a server and a system. The method is applied to a cluster service system that includes a first node and a second node. The method includes: when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtaining the detection result of local input/output (IO) flow detection; when the detection result of the local IO flow detection is abnormal, obtaining the detection result of peer IO flow detection; and determining the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and processing the cluster service according to the cause of the fault. The local IO flow detection comprises a first local IO flow detection, performed by the first node on the IO flow between the first RAID card and the first disk in the first node, and a second local IO flow detection, performed by the second node on the IO flow between the second RAID card and the second disk in the second node. The peer IO flow detection comprises a first peer IO flow detection, performed by the first node on the IO flow between the second RAID card and the second disk in the second node, and/or a second peer IO flow detection, performed by the second node on the IO flow between the first RAID card and the first disk in the first node. Compared with the prior art, which relies on heartbeat network detection to switch services from a faulty node to a normal node, the present application determines the cause of a cluster fault from the local IO flow detection and peer IO flow detection of the first node and the second node, combined with the heartbeat detection results of the first node and the second node, and processes the cluster service according to that cause. As a result, when a RAID card soft fault requires a service switch, the cluster can switch the service from the faulty node to a normal node in time. This solves the problem in the prior art that heartbeat network detection cannot detect a RAID card soft fault in a node, so that the cluster service system does not switch services from the node with the faulty RAID card to a normal node.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the structure of a cluster service system;

FIG. 2 is a schematic flow chart of Embodiment 1 of a cluster service processing method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a cluster service system performing IO flow detection;

FIG. 4 is a schematic flow chart of Embodiment 2 of a cluster service processing method provided in an embodiment of the present application;

FIG. 5 is a schematic flow chart of Embodiment 3 of a cluster service processing method provided in an embodiment of the present application;

FIG. 6 is a schematic flow chart of Embodiment 4 of a cluster service processing method provided in an embodiment of the present application;

FIG. 7 is a schematic flow chart of Embodiment 5 of a cluster service processing method provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of the structure of a server embodiment provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of the structure of another server embodiment provided in an embodiment of the present application.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the scope of protection of the present application.

The terms "first", "second", "third", "fourth", and the like (if any) in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

First, the terms used in the present application are explained:

RAID card: a redundant array of independent disks (RAID) combines multiple independent disks into one large-capacity disk group. A RAID card manages the disks that make up the array. To implement a database service, the operating system reads and writes data on the disks managed by the RAID card through that card.

FIG. 1 is a schematic structural diagram of a cluster service system. The cluster service system includes a first node 11 and a second node 12, where the first node 11 is the primary node and the second node 12 is the standby node. Heartbeat network detection is performed between the first node 11 and the second node 12 over a heartbeat network link 13. Heartbeat network detection determines whether a node in the cluster has failed by monitoring its heartbeat signal. Specifically, the first node 11 and the second node 12 send heartbeat messages to each other at a fixed frequency over the heartbeat network link 13 and receive the heartbeat messages of the peer node. If the second node 12 does not receive a heartbeat message from the first node 11 within a specified time, the first node 11 is determined to have failed, and the services running on the first node 11 are switched to the second node 12.
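The heartbeat timeout rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are assumptions made for the example and do not come from the application.

```python
import time


class HeartbeatMonitor:
    """Tracks when the last heartbeat message arrived from the peer node (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # the "specified time" after which the peer counts as failed
        self.clock = clock              # injectable so the logic can be exercised without waiting
        self.last_seen = clock()

    def on_heartbeat(self):
        """Record that a heartbeat message was just received from the peer."""
        self.last_seen = self.clock()

    def peer_failed(self):
        """Return True if no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_seen > self.timeout_s
```

Note that a check of this kind only observes whether the peer still sends messages; it says nothing about whether the peer can still read from or write to its disks.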

As shown in FIG. 1, the first node 11 and the second node 12 each include a processor running an operating system, a RAID card, and multiple disks managed by the RAID card. To implement a database service, the operating system reads and writes data on those disks through the RAID card. If the RAID card that controls data reading and writing in a node fails, for example because of a runtime error in the RAID card program, the IO flows of the disks managed by the RAID card cannot be read or written normally. This affects the database service, and a node switchover is then also required. However, because the node itself has not failed, its heartbeat signal can still be detected, so the cluster service system does not switch services from the node with the faulty RAID card to a normal node.

Based on the above technical problem, the technical idea of the present application is as follows: how to detect a RAID card fault in a node, so that the cluster service can be switched from the node with the faulty RAID card to a normal node.

The technical solutions of the present application are described in detail below through specific embodiments. It should be noted that the following specific embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments.

FIG. 2 is a schematic flowchart of a first embodiment of a cluster service processing method provided in an embodiment of the present application. The method is applied to a cluster service system that includes a first node and a second node. Referring to FIG. 2, the cluster service processing method includes the following steps:

Step S201: when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtain the detection result of local input/output (IO) flow detection.

In one example, the cluster service system is in a hot standby scenario and includes a first node and a second node, where the first node is the primary node and the second node is the standby node. The database service runs on the first node; when the first node fails, the database service is switched from the first node to the second node. The cluster service system may further include a management node. External storage devices are attached to the first node and the second node. FIG. 3 is a schematic diagram of IO flow detection performed by the cluster service system.

The local IO flow detection is a first local IO flow detection, performed by the first node, of the IO flow between a first redundant array of independent disks (RAID) card and a first disk in the first node, and/or a second local IO flow detection, performed by the second node, of the IO flow between a second RAID card and a second disk in the second node. As shown in FIG. 3, the first local IO flow detection is the detection by the first node of the IO flow between the first RAID card and the first disk in the first node; the second local IO flow detection is the detection by the second node of the IO flow between the second RAID card and the second disk in the second node. Local IO flow detection is mainly used to detect whether the local system disk cannot be read or written normally.

In the database service cluster, the first node is the primary node and the second node is the standby node. Each node includes a RAID card and multiple disks managed by it; a disk may be a system disk. To implement a database service, the operating system reads and writes data on the disks through the RAID card, and during reading and writing an IO flow is formed between the RAID card and the disk. Therefore, whether the RAID card has a soft fault, such as a program runtime error, can be determined by detecting the IO flow between the RAID card and the disk. When the detection result of that IO flow is abnormal, it can be determined that the RAID card has such a soft fault.

In this embodiment, when the heartbeat network of the first node and the heartbeat network of the second node are normal, the detection result of the local IO flow detection is obtained. Specifically, the first node detects the IO flow between the first RAID card and the first disk in the first node to obtain the detection result of the first local IO flow detection; the second node detects the IO flow between the second RAID card and the second disk in the second node to obtain the detection result of the second local IO flow detection. The first node, as the primary node, obtains the detection results of the first local IO flow detection and the second local IO flow detection; alternatively, the management node may obtain them.

Step S202: when the detection result of the local IO flow detection is abnormal, obtain the detection result of peer IO flow detection.

The peer IO flow detection is a first peer IO flow detection, performed by the first node, of the IO flow between the second RAID card and the second disk in the second node, and/or a second peer IO flow detection, performed by the second node, of the IO flow between the first RAID card and the first disk in the first node. As shown in FIG. 3, the first peer IO flow detection is the detection by the first node of the IO flow between the second RAID card and the second disk in the second node; the second peer IO flow detection is the detection by the second node of the IO flow between the first RAID card and the first disk in the first node. Peer IO flow detection serves two purposes: first, to cross-check results and ensure their consistency and reliability; second, to check whether the IO flow detection service itself is abnormal.

In this embodiment, when the detection result of the local IO flow detection is abnormal, the detection result of the peer IO flow detection is obtained. Specifically, the first node detects the IO flow between the second RAID card and the second disk in the second node to obtain the detection result of the first peer IO flow detection; the second node also detects the IO flow between the first RAID card and the first disk in the first node to obtain the detection result of the second peer IO flow detection. The first node, as the primary node, obtains the detection results of the first peer IO flow detection and the second peer IO flow detection; alternatively, the management node may obtain them.

In this embodiment, the first node and the second node also perform heartbeat detection: they send heartbeat messages to each other at a fixed frequency and receive the heartbeat messages of the peer node to obtain heartbeat detection results. The first node, as the primary node, obtains the heartbeat detection results of the first node and the second node; alternatively, the management node may obtain them. When the heartbeat network of the first node and the heartbeat network of the second node are normal, steps S201 to S202 above are performed.

Step S203: based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, determine the cause of the fault, and process the cluster service according to the cause of the fault.

In this embodiment, the first node or the management node determines the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection.

For example, given that the heartbeat detection results of the first node and the second node are normal, if the detection result of the first local IO flow detection is abnormal and that of the first peer IO flow detection is normal, it can be preliminarily determined that the RAID card status of the first node is abnormal. If the detection result of the second local IO flow detection is normal, the IO detection service of the second node is working normally; if the detection result of the second peer IO flow detection is then abnormal, it can be further determined that the RAID status of the first node is abnormal and that the RAID status of the second node is normal.

The first node is the primary node and the second node is the standby node. If the RAID status of the first node is abnormal and that of the second node is normal, the cluster service needs to be switched: the cluster service running on the first node is switched to run on the second node.
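The worked example above, in which four detection results are cross-checked before a switchover, can be sketched as a small decision function. This is a minimal sketch assuming the heartbeat results are already known to be normal. The names `Detection` and `diagnose` and the returned strings are hypothetical. Only the combination described in the text maps to a switchover; treating every other mixed combination as "inconclusive" is an assumption of this sketch, not something the application states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    first_local_ok: bool   # first node probing its own RAID card and disk
    first_peer_ok: bool    # first node probing the second node's RAID card and disk
    second_local_ok: bool  # second node probing its own RAID card and disk
    second_peer_ok: bool   # second node probing the first node's RAID card and disk


def diagnose(d: Detection) -> str:
    """Classify the four results; assumes both heartbeat checks already passed."""
    # Both views of the first node's RAID card (its own and the peer's) report a failure.
    first_abnormal = not d.first_local_ok and not d.second_peer_ok
    # Both views of the second node's RAID card report success.
    second_normal = d.second_local_ok and d.first_peer_ok
    if first_abnormal and second_normal:
        return "switch_to_second"
    if all((d.first_local_ok, d.first_peer_ok, d.second_local_ok, d.second_peer_ok)):
        return "no_fault"
    return "inconclusive"  # assumption: other combinations need further checks
```

Requiring both views to agree reflects the stated purpose of peer detection: cross-checking the local result and ruling out a fault in the detection service itself.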

In one example, the cluster service system is in an active-active scenario and includes a first node and a second node that serve as standby nodes for each other. Database service A runs on the first node, and database service B runs on the second node. When the first node fails, database service A is switched from the first node to the second node; when the second node fails, database service B is switched from the second node to the first node. The cluster service system may further include a management node.
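The active-active placement rule can be illustrated as follows. The function name `failover_plan` and the node labels `"node1"` and `"node2"` are hypothetical and chosen only for this sketch.

```python
def failover_plan(placement, failed_node):
    """Return a new service-to-node mapping after `failed_node` is taken out of service.

    `placement` maps a service name to the node it currently runs on.
    Each node is the standby for the other, as in the active-active scenario.
    """
    peer = {"node1": "node2", "node2": "node1"}
    return {
        service: (peer[node] if node == failed_node else node)
        for service, node in placement.items()
    }
```

With `{"A": "node1", "B": "node2"}` as the starting placement, a fault on `node1` moves service A to `node2` while service B stays where it is.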

When the heartbeat network of the first node and the heartbeat network of the second node are normal, the detection result of the local IO flow detection may be obtained by the first node, the second node, or the management node. When the detection result of the local IO flow detection is abnormal, the detection result of the peer IO flow detection is obtained. The first node, the second node, or the management node determines the cause of the fault based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and processes the cluster service according to the cause of the fault.

In one example, when the database service starts, the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node may be obtained. Based on these four IO flow detection results and the two heartbeat detection results, the RAID card status of the first node and that of the second node are determined separately, in order to decide whether to switch the cluster service.

In one example, after the database service has started, the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node may also be obtained again at preset intervals. Based on the newly obtained IO flow detection results and heartbeat detection results, the RAID card status of the first node and that of the second node are determined again, in order to decide whether to switch the cluster service.

In one example, the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node may also be obtained multiple times within a preset time. After the RAID card status of the first node and that of the second node have been determined from each set of IO flow detection results and heartbeat detection results, the number of times the RAID card status of each node was abnormal is counted, and whether to switch the cluster service is decided from those counts. For example, where the first node is the primary node and the second node is the standby node, the cluster service is switched if the number of times the RAID card status of the first node was abnormal exceeds a count threshold and the number of times the RAID card status of the second node was abnormal is 0, that is, every detection of the second node's RAID card found it normal.
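The counting rule in this example reduces to a short check. The function name `should_switch` and its argument names are hypothetical; each list is assumed to hold one boolean per detection round, with `True` meaning that the node's RAID card status was judged abnormal in that round.

```python
def should_switch(first_abnormal_rounds, second_abnormal_rounds, count_threshold):
    """Decide on a switchover from repeated RAID card status determinations.

    Switch only if the primary (first) node was abnormal more than
    `count_threshold` times and the standby (second) node was never abnormal.
    """
    first_count = sum(first_abnormal_rounds)    # True counts as 1
    second_count = sum(second_abnormal_rounds)
    return first_count > count_threshold and second_count == 0
```

Counting over several rounds avoids switching on a single transient failure, and requiring a clean record for the standby node avoids moving the service onto a node that is itself suspect.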

In this embodiment, the cluster service system includes a first node and a second node. When the heartbeat network of the first node and the heartbeat network of the second node are normal, the detection result of the local IO flow detection is obtained; when the detection result of the local IO flow detection is abnormal, the detection result of the peer IO flow detection is obtained; and based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, the cause of the fault is determined and the cluster service is processed according to that cause. The local IO flow detection is a first local IO flow detection, performed by the first node, of the IO flow between the first RAID card and the first disk in the first node, and/or a second local IO flow detection, performed by the second node, of the IO flow between the second RAID card and the second disk in the second node. The peer IO flow detection is a first peer IO flow detection, performed by the first node, of the IO flow between the second RAID card and the second disk in the second node, and/or a second peer IO flow detection, performed by the second node, of the IO flow between the first RAID card and the first disk in the first node. Compared with the prior art, which relies on heartbeat network detection to switch services from a faulty node to a normal node, the present application determines the cause of a service cluster fault from the local IO flow detection and the peer IO flow detection of the first node and the second node, combined with the heartbeat detection results of the two nodes, and processes the cluster service according to that cause. The service cluster of the present application can therefore switch services from the faulty node to the normal node in time when a soft fault of the RAID card requires a switchover. This solves the problem in the prior art that heartbeat network detection cannot detect a soft fault of a RAID card in a node, so that the cluster service system does not switch services from the node with the faulty RAID card to a normal node.

FIG. 4 is a schematic flowchart of a second embodiment of a cluster service processing method provided in an embodiment of the present application. On the basis of the embodiment shown in FIG. 2, step S201 may include: the first node and the second node each initiate a first read data instruction to a disk of the local node, where the first read data instruction is used to read first data in the disk of the local node; and obtaining the first data returned by the disk of the local node based on the first read data instruction. The detection result of the local IO flow detection being abnormal includes: the first data returned by the disk of the local node is not obtained.

Specifically, step S201 may include the following steps:

Step S401: the first node initiates a first read data instruction to the first disk in the first node; the first read data instruction is used to read first data in the first disk.

Step S402: the first node obtains the first data returned by the first disk based on the first read data instruction; if the first data returned by the first disk is obtained, the detection result of the first local IO flow detection is determined to be normal; if the first data returned by the first disk is not obtained, the detection result of the first local IO flow detection is determined to be abnormal.

In this embodiment, the RAID card manages multiple disks. To implement a database service, the operating system reads and writes data on those disks through the RAID card, and during reading and writing an IO flow is formed between the RAID card and the disk. Therefore, whether the RAID card has a soft fault can be determined by detecting the IO flow between the RAID card and the disk. When the RAID card has a soft fault such as a program runtime error, the disk does not return data in response to a read instruction. The IO flow between the RAID card and the system disk can therefore be detected by issuing a read data instruction to the disk, which in turn shows whether the RAID card has a soft fault. For example, an underlying IO test tool, such as the disk stress-testing tool Flexible Input Output tester (FIO) or the IO test software Input Output meter (IOmeter), may be invoked to perform a read operation on the disk.

Specifically, when performing the first local IO flow detection, the first node initiates a first read data instruction to the first disk in the first node. The first read data instruction is used to read first data in the first disk managed by the first RAID card. If the status of the RAID card in the first node is normal, the first disk returns the first data to the operating system of the first node based on the first read data instruction; if the RAID card in the first node has a soft fault, the first disk does not return data.

The first node obtains the first data returned by the first disk based on the first read data instruction. If the first data returned by the first disk is successfully obtained, the detection result of the first local IO flow detection is determined to be normal; if the first data returned by the first disk is not obtained, the detection result of the first local IO flow detection is determined to be abnormal.
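Steps S401 and S402 can be sketched as a read probe with a deadline. The application mentions tools such as FIO or IOmeter for the read operation; the sketch below substitutes a plain file read from the Python standard library, which is an assumption made for illustration. The function name `read_probe`, the `path` argument, and the default sizes and timeouts are likewise hypothetical.

```python
import concurrent.futures
import os


def read_probe(path, size=4096, timeout_s=5.0):
    """Return True if data can be read from `path` within `timeout_s` seconds."""

    def _read():
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.read(fd, size)
        finally:
            os.close(fd)

    # Run the read in a worker thread so that a hung IO path cannot block the caller.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(_read)
    try:
        data = future.result(timeout=timeout_s)
        return len(data) > 0                      # data returned: result is "normal"
    except (concurrent.futures.TimeoutError, OSError):
        return False                              # no data obtained: result is "abnormal"
    finally:
        pool.shutdown(wait=False)                 # do not wait for a possibly hung read
```

Two caveats apply to this sketch. A read of a regular file may be served from the operating system's page cache without touching the RAID card, so a real probe would need to bypass the cache, which dedicated IO test tools can do. In addition, a worker thread stuck in a blocked read cannot be cancelled from Python and may delay interpreter shutdown.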

On the basis of the embodiment shown in FIG. 2, step S202 may include: the first node or the second node initiates a second read data instruction to a disk in the peer node, where the second read data instruction is used to read second data in the disk in the peer node; obtaining the second data returned by the disk in the peer node based on the second read data instruction; if the second data returned by the disk in the peer node is obtained, determining that the detection result of the peer IO flow detection is normal; and if the second data returned by the disk in the peer node is not obtained, determining that the detection result of the peer IO flow detection is abnormal.

Specifically, step S202 may include the following steps:

Step S403: the first node initiates a second read data instruction to the second disk in the second node; the second read data instruction is used to read second data in the second disk.

Step S404: the first node obtains the second data returned by the second disk based on the second read data instruction; if the second data returned by the second disk is obtained, the detection result of the first peer IO flow detection is determined to be normal; if the second data returned by the second disk is not obtained, the detection result of the first peer IO flow detection is determined to be abnormal.

It should be noted that the execution order of steps S402 to S404 is not specifically limited here.

In this embodiment, the first node also performs IO flow detection on the peer second node, that is, the first peer IO flow detection. Specifically, the first node initiates a second read data instruction to the second disk in the second node, where the second read data instruction is used to read second data in the second disk. If the status of the RAID card in the second node is normal, the second disk returns the second data to the operating system of the first node based on the second read data instruction; if the RAID card in the second node has a soft fault, the second disk does not return data.

The first node obtains the second data returned by the second disk based on the second read data instruction. If the second data returned by the second disk is successfully obtained, the detection result of the first peer IO flow detection is determined to be normal; if the second data returned by the second disk is not obtained, the detection result of the first peer IO flow detection is determined to be abnormal.

In this embodiment, the attribute parameters of the IO flow detection include the interval time, the timeout time, the number of timeouts, and so on. For example, the first node may perform the local IO flow detection and the peer IO flow detection once per preset interval. Within a preset first time period, the first node attempts to obtain the first data returned by the first disk at each preset time interval point. The detection result of the first local IO flow detection is determined to be normal if the first data is obtained at every time interval point in that period or, alternatively, if the number of times the first data is obtained across all time interval points is greater than or equal to a first count threshold. The detection result is determined to be abnormal if the first data cannot be obtained at any time interval point in that period or, alternatively, if the number of times the first data is obtained is less than the first count threshold. The first count threshold is the minimum number of times the first node must be able to obtain the first data returned by the first disk within the preset first time period. The interval time may be 1 s, 5 s, 1 min, 1 h, and so on, and is not specifically limited here.
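The count-threshold variant described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `classify_by_success_count` and the representation of probe outcomes as a list of booleans (one per time interval point) are assumptions introduced here.

```python
from typing import Sequence


def classify_by_success_count(probe_results: Sequence[bool], min_successes: int) -> str:
    """Classify one detection window of local IO flow probes.

    probe_results holds one entry per time interval point in the preset
    first time period: True if the first data was returned, else False.
    min_successes plays the role of the first count threshold.
    """
    successes = sum(1 for ok in probe_results if ok)
    return "normal" if successes >= min_successes else "abnormal"


if __name__ == "__main__":
    # Five probes in the window, at least three must succeed.
    print(classify_by_success_count([True, True, False, True, False], 3))
    # The strict variant (every point must succeed) sets the threshold
    # to the number of interval points.
    window = [True, True, False]
    print(classify_by_success_count(window, len(window)))
```

Setting `min_successes` to the number of interval points reproduces the strict rule in which every probe must return data.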

In some implementations, if the difference between the time at which the first node obtains the first data returned by the first disk and a preset first time is less than a first time threshold, the detection result of the first local IO flow detection is determined to be normal; if the difference between the time at which the first node obtains the first data returned by the first disk and a preset second time is greater than the first time threshold, the detection result of the first local IO flow detection is determined to be abnormal, where the second time is the preset maximum time within which the first node can normally obtain the first data returned by the first disk.

In some implementations, if, within the preset first time period, the number of timeouts when the first node obtains the first data returned by the first disk is greater than a second count threshold, the detection result of the first local IO flow detection is determined to be abnormal; if the number of timeouts is less than or equal to the second count threshold, the detection result of the first local IO flow detection is determined to be normal. The number of timeouts is the number of times within that period that the difference between the time at which the first node obtains the first data returned by the first disk and the preset second time is greater than the first time threshold.
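The timeout-count rule can be illustrated with a short sketch. The function and parameter names below are hypothetical; the sketch also assumes that each probe yields a measured acquisition time in seconds.

```python
from typing import Sequence


def count_timeouts(acquire_times_s: Sequence[float],
                   max_normal_s: float,
                   time_threshold_s: float) -> int:
    """Count probes whose acquisition time exceeds the preset second time
    (max_normal_s) by more than the first time threshold."""
    return sum(1 for t in acquire_times_s if t - max_normal_s > time_threshold_s)


def classify_by_timeouts(acquire_times_s: Sequence[float],
                         max_normal_s: float,
                         time_threshold_s: float,
                         max_timeouts: int) -> str:
    """Abnormal if the timeout count is greater than the second count
    threshold (max_timeouts); otherwise normal."""
    timeouts = count_timeouts(acquire_times_s, max_normal_s, time_threshold_s)
    return "abnormal" if timeouts > max_timeouts else "normal"


if __name__ == "__main__":
    samples = [0.1, 2.5, 3.0]  # seconds, illustrative values only
    print(classify_by_timeouts(samples, max_normal_s=1.0,
                               time_threshold_s=0.5, max_timeouts=1))
```

With these illustrative values, two of the three reads count as timeouts, so the window is classified as abnormal when at most one timeout is tolerated.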

Similarly, the second node may perform the second local IO flow detection and the second peer IO flow detection by following steps S401 to S404 above.

In this embodiment, the first node performs the first local IO flow detection and the first peer IO flow detection by issuing read-data instructions to the first disk and the second disk respectively and reading data from them, so as to check the running state of the RAID card of the local node and of the RAID card of the peer node respectively. In this way, the IO flows of the local node and the peer node can be checked without affecting the database service, to determine whether a RAID soft fault has occurred. This provides the precondition for the service cluster to switch the service from the faulty node to a normal node in time when a RAID card soft fault makes a switchover necessary.

FIG. 5 is a schematic flowchart of Embodiment 3 of a cluster service processing method provided by an embodiment of the present application. On the basis of the embodiments shown in FIG. 2 to FIG. 4, step S203 specifically includes the following steps:

Step S501: Determine the RAID card states of the first node and the second node respectively, according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection, and the second peer IO flow detection, together with the heartbeat detection results of the first node and the second node.

One of the first node and the second node is the primary node, and the other is the standby node.

In this embodiment, the RAID card states of the first node and the second node are determined according to the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node.

Local IO flow detection is mainly used to detect whether the local system disk cannot be read or written normally. Peer IO flow detection serves two purposes: first, to cross-check the local IO flow detection result of the peer node, ensuring that the results are consistent and reliable; second, to check whether the IO flow detection service itself is abnormal.

Specifically, when the heartbeat detection results of the first node and the second node are both normal, the detection result of the first local IO flow detection is abnormal, and the detection result of the first peer IO flow detection is normal, the RAID state of the first node is determined to be abnormal. Normal heartbeat detection results indicate that the network of the first node and the second node is normal. An abnormal first local IO flow detection result indicates that the local disk of the first node may not be readable or writable normally (or that the IO flow detection service of the first node may be abnormal). To verify whether the local disk of the first node really cannot be read or written normally, the first peer IO flow detection result is examined: if it is normal, an abnormality of the IO flow detection service of the first node can be ruled out, and it can therefore be determined that the RAID state of the first node is abnormal.

When the heartbeat detection results of the first node and the second node are both normal and the detection result of the first local IO flow detection is normal, it can be preliminarily concluded that both the RAID card and the IO flow detection service of the first node are normal. To verify this conclusion, the result of the second peer IO flow detection is examined: if it is normal, the RAID state of the first node is determined to be normal.

Similarly, when the heartbeat detection results of the first node and the second node are both normal, the first local IO flow detection is normal, and the second peer IO flow detection is normal, an abnormal first peer IO flow detection result gives a preliminary indication that the RAID card of the second node is abnormal; if, in addition, the detection result of the second local IO flow detection is abnormal, it is further determined that the RAID state of the second node is abnormal. Normal heartbeat detection results indicate that the network of the first node and the second node is normal. A normal first local IO flow detection and a normal second peer IO flow detection indicate, respectively, that the IO flow detection services of the first node and the second node are normal. In that case, an abnormal second local IO flow detection result gives a preliminary indication that the local disk of the second node cannot be read or written normally. To verify this conclusion, the first peer IO flow detection result is examined: if it is abnormal, the RAID card state of the second node is abnormal, which confirms that the local disk of the second node cannot be read or written normally. It can therefore be further determined that the RAID state of the second node is abnormal.

When the heartbeat detection results of the first node and the second node are both normal and the detection result of the second local IO flow detection is normal, it can be preliminarily concluded that the local disk of the second node is read and written normally. To verify this conclusion, the detection result of the first peer IO flow detection is examined: if it is normal, the RAID state of the second node is determined to be normal.

Specifically, when the heartbeat detection results of the first node and the second node are both normal, the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, the detection result of the second local IO flow detection is normal, and the detection result of the second peer IO flow detection is normal, the IO flow detection service of the first node is determined to be abnormal. Normal heartbeat detection results indicate that the network of the first node and the second node is normal. An abnormal first local IO flow detection result indicates that the local disk of the first node may not be readable or writable normally (or that the IO flow detection service of the first node may be abnormal). However, the normal second peer IO flow detection result shows that the local system disk of the first node can be read and written normally. An abnormal RAID card state on the first node can therefore be ruled out, and it can be determined that the abnormal first local IO flow detection result is caused by an abnormality of the IO flow detection service of the first node. The normal second local IO flow detection result shows that the local disk of the second node can be read and written normally, yet the first peer IO flow detection result is abnormal, which further confirms that the IO flow detection of the first node is abnormal.

Similarly, when the heartbeat detection results of the first node and the second node are both normal, the detection result of the first local IO flow detection is normal, the detection result of the first peer IO flow detection is normal, the detection result of the second local IO flow detection is abnormal, and the detection result of the second peer IO flow detection is abnormal, the IO flow detection service of the second node is determined to be abnormal.
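The decision rules above can be condensed into a small per-node classifier. The sketch below is one hedged reading of those rules, not the claimed method itself: the function name, the string labels, and the choice to return "undetermined" for combinations the text does not spell out (including any heartbeat abnormality) are assumptions made for illustration.

```python
def classify_node(heartbeat_ok: bool,
                  local_ok: bool,
                  own_peer_ok: bool,
                  other_local_ok: bool,
                  other_peer_ok: bool) -> str:
    """Classify one node from the four IO flow detection results.

    local_ok       -- this node's local IO flow detection result
    own_peer_ok    -- this node's peer detection (it reads the other node's disk)
    other_local_ok -- the other node's local IO flow detection result
    other_peer_ok  -- the other node's peer detection (it reads this node's disk)
    True means "normal", False means "abnormal".
    """
    if not heartbeat_ok:
        # Peer results are unreliable when the heartbeat network is abnormal.
        return "undetermined"
    if not local_ok and own_peer_ok:
        # The detection service works (peer read succeeded), so the local
        # read failure is attributed to the RAID card.
        return "raid_abnormal"
    if not local_ok and not own_peer_ok and other_local_ok and other_peer_ok:
        # The other node can read this node's disk, so the disk is fine and
        # this node's detection service is the faulty component.
        return "io_service_abnormal"
    if local_ok and other_peer_ok:
        # Local read and the other node's cross-check both succeeded.
        return "raid_normal"
    return "undetermined"  # assumption: combination not covered by the text


if __name__ == "__main__":
    # First node: local abnormal, its peer detection normal.
    print(classify_node(True, False, True, True, False))
```

For the first node, pass the first local, first peer, second local, and second peer results in that order; for the second node, swap the roles of the two nodes.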

In some implementations, the network conditions of the first node and the second node also affect the detection result of the peer IO flow detection. When the heartbeat detection result of the first node and/or the second node is abnormal, the detection result of the peer IO flow detection is also abnormal.

Step S502: When it is determined that the RAID card state of the primary node is abnormal and the RAID card state of the standby node is normal, switch the cluster service from the primary node to the standby node, and raise an alarm for a RAID card soft fault on the primary node.

Step S503: In a hot standby scenario of the cluster service system, when it is determined that the RAID card state of the standby node is abnormal and the RAID card state of the primary node is normal, raise an alarm for a RAID card soft fault on the standby node.

In an active-active scenario of the cluster service system, when it is determined that the RAID card state of the standby node is abnormal and the RAID card state of the primary node is normal, switch the cluster service from the standby node to the primary node, and raise an alarm for a RAID card soft fault on the standby node.

Step S504: When it is determined that the RAID card state of the primary node is abnormal and the RAID card state of the standby node is also abnormal, raise an alarm for RAID card soft faults on both the primary node and the standby node.

Step S505: When it is determined that the IO flow detection of the primary node and/or the standby node is abnormal, raise an alarm for an IO flow detection fault on the primary node and/or the standby node.

Specifically, an IO flow detection abnormality indicates that the IO test tool, for example FIO or IOmeter, has failed. An alarm needs to be raised for the IO flow detection fault so that administrators can handle it.
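Steps S502 to S504 amount to a small decision table keyed on the deployment mode and the two RAID card states. A minimal sketch follows; the function name `plan_actions`, the mode strings, and the action strings are hypothetical labels chosen for illustration, and the IO flow detection fault alarm of step S505 is omitted for brevity.

```python
from typing import List


def plan_actions(mode: str, primary_raid_ok: bool, standby_raid_ok: bool) -> List[str]:
    """Return the ordered actions for steps S502 to S504.

    mode is "hot_standby" or "active_active" (labels assumed here).
    """
    if mode not in ("hot_standby", "active_active"):
        raise ValueError(f"unknown mode: {mode}")
    actions: List[str] = []
    if not primary_raid_ok and standby_raid_ok:
        # S502: fail over to the healthy standby node and alarm.
        actions.append("switch primary->standby")
        actions.append("alarm: primary RAID card soft fault")
    elif primary_raid_ok and not standby_raid_ok:
        # S503: only the active-active case has a service to move back.
        if mode == "active_active":
            actions.append("switch standby->primary")
        actions.append("alarm: standby RAID card soft fault")
    elif not primary_raid_ok and not standby_raid_ok:
        # S504: no healthy target, so only raise alarms.
        actions.append("alarm: primary and standby RAID card soft fault")
    return actions


if __name__ == "__main__":
    print(plan_actions("hot_standby", False, True))
    print(plan_actions("active_active", True, False))
```

Returning a list of action labels rather than performing the switchover directly keeps the decision logic easy to test separately from the cluster manager that would execute it.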

In one example, the cluster service system is in a hot standby scenario and includes a first node and a second node. The first node is the primary node and the second node is the standby node. The database service runs on the first node; when the first node fails, the database service is switched from the first node to the second node. The cluster service system may further include a management node. The first node or the management node determines the RAID card states of the first node and the second node respectively, according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection, and the second peer IO flow detection, together with the heartbeat detection results of the first node and the second node, and processes the cluster service according to the RAID card states of the first node and the second node. The specific handling is shown in Table 1 below:

Table 1. IO flow detection results and handling schemes for the first node and the second node in the hot standby scenario

In one example, the cluster service system is in an active-active scenario and includes a first node and a second node, which serve as standby nodes for each other. A first cluster service runs on the first node, and a second cluster service runs on the second node. When the first node fails, the first cluster service is switched from the first node to the second node; when the second node fails, the second cluster service is switched from the second node to the first node. The cluster service system may further include a management node.

The first node, the second node, or the management node obtains the detection results of the first local IO flow detection and the first peer IO flow detection, the detection results of the second local IO flow detection and the second peer IO flow detection, and the heartbeat detection results of the first node and the second node. Based on these results, it determines the RAID card states of the first node and the second node respectively, and processes the cluster service according to those RAID card states. The specific handling is shown in Table 2 below:

Table 2. IO flow detection results and handling schemes for the first node and the second node in the active-active scenario

The following description takes the case in which, in the active-active scenario, the RAID state of the first node is abnormal and the RAID state of the second node is normal. FIG. 6 is a schematic flowchart of Embodiment 4 of a cluster service processing method provided by an embodiment of the present application. As shown in FIG. 6, the first node and the second node serve as standby nodes for each other; database service A runs on the first node, and database service B runs on the second node. The heartbeat detection results of the first node and the second node are both normal, the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is normal, the detection result of the second local IO flow detection is normal, and the detection result of the second peer IO flow detection is abnormal. It is therefore determined that reads of the system disk of the first node are abnormal and that the system is in a hung ("pseudo-dead") state, so database service A needs to be switched from the first node to the second node.

In addition, because a RAID card soft fault may cause problems in the running of the operating system, which in turn may prevent the cluster service from being switched normally, it is necessary to determine whether the cluster service has been switched successfully. If the switchover succeeds, the switchover procedure ends. If the switchover fails, the first node may restart the second node through an out-of-band management channel so that the cluster service is switched from the second node to the first node. For example, the out-of-band management channel may be an Intelligent Platform Management Interface (IPMI) channel. FIG. 7 is a schematic flowchart of Embodiment 5 of a cluster service processing method provided by an embodiment of the present application.

As shown in FIG. 7, this embodiment of the present application performs IO flow detection by calling an underlying IO test tool, such as FIO or IOmeter, to read the system disk periodically. For example, if a result is returned, the state is normal; if there is no response before the timeout, it is determined that reads of the system disk are abnormal and that the system is in a hung ("pseudo-dead") state. The result then needs to be synchronized to the node on which the cluster service runs so that the cluster service is switched to a normal node, ensuring service continuity. If the node on which the cluster service runs cannot complete the switchover because its system is hung, a network channel such as an IPMI channel can be established through the out-of-band management of the normal node and the faulty node to restart the faulty node and thereby complete the switchover of the cluster service.
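The flow of FIG. 7 (timed read, switchover, out-of-band restart as a fallback) can be sketched as below. This is an illustrative sketch under stated assumptions: `read_fn`, `switch_fn`, and `restart_via_oob_fn` are hypothetical callbacks standing in for the IO test tool invocation, the cluster switchover, and the IPMI-based restart, none of which is specified in detail by the text.

```python
import concurrent.futures
from typing import Callable


def read_probe(read_fn: Callable[[], object], timeout_s: float) -> bool:
    """Run one read attempt and report whether it returned before the timeout.

    Note: a read that is truly hung keeps its worker thread blocked; a
    production probe would more likely run the IO tool in a subprocess
    that can be killed.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_fn)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False  # no response before the timeout: treat as hung
    except OSError:
        return False  # the read itself failed
    finally:
        pool.shutdown(wait=False)  # do not block on a stuck worker


def handle_result(probe_ok: bool,
                  switch_fn: Callable[[], bool],
                  restart_via_oob_fn: Callable[[], None]) -> str:
    """Apply the FIG. 7 flow to one probe result."""
    if probe_ok:
        return "normal"
    if switch_fn():  # normal in-band switchover succeeded
        return "switched"
    restart_via_oob_fn()  # e.g. restart the faulty node over an IPMI channel
    return "restarted_faulty_node"


if __name__ == "__main__":
    ok = read_probe(lambda: b"data", timeout_s=1.0)
    print(handle_result(ok, lambda: True, lambda: None))
```

Passing the three operations in as callbacks keeps the sketch independent of any particular IO tool or baseboard management interface.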

In this embodiment, the first node determines the RAID card states of the first node and the second node according to the detection results of the first local IO flow detection, the first peer IO flow detection, the second local IO flow detection, and the second peer IO flow detection, together with the heartbeat detection results of the first node and the second node, and processes the cluster service according to those RAID card states. In addition to switching the service from the faulty node to a normal node in time when a RAID card soft fault makes a switchover necessary, this also provides alarms for abnormal conditions. It further solves the problem in the prior art that, because heartbeat network detection cannot detect a RAID card soft fault in a node, the service cannot be switched from the node whose RAID card has failed to a normal node.

The following are apparatus embodiments of the present application, which can be used to carry out the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.

FIG. 8 is a schematic structural diagram of a server embodiment provided by an embodiment of the present application. As shown in FIG. 8, the server 60 includes an acquisition module 61 and a processing module 62. The acquisition module 61 is configured to obtain the detection result of the local input/output (IO) flow detection when the heartbeat network of the first node and the heartbeat network of the second node are normal; the acquisition module 61 is further configured to obtain the detection result of the peer IO flow detection when the detection result of the local IO flow detection is abnormal. The processing module 62 is configured to determine the fault cause based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and to process the cluster service according to the fault cause. Here, the local IO flow detection is a first local IO flow detection performed by the first node on the IO flow between a first redundant array of independent disks (RAID) card and a first disk in the first node, and/or a second local IO flow detection performed by the second node on the IO flow between a second RAID card and a second disk in the second node. The peer IO flow detection is a first peer IO flow detection performed by the first node on the IO flow between the second RAID card and the second disk in the second node, and/or a second peer IO flow detection performed by the second node on the IO flow between the first RAID card and the first disk in the first node.

The server provided by this embodiment of the present application can carry out the technical solutions shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.

In a possible implementation, the acquisition module 61 is specifically configured so that the first node and the second node each issue a first read-data instruction to the disk of the local node, where the first read-data instruction is used to read first data from the disk of the local node, and to obtain the first data returned by the disk of the local node in response to the first read-data instruction. The detection result of the local IO flow detection being abnormal includes: the first data returned by the disk of the local node is not obtained.

In a possible implementation, the acquisition module 61 is specifically configured to obtain, within a preset first time period and at preset time interval points, the first data returned by the disk of the local node in response to the first read-data instruction. The detection result of the local IO flow detection being abnormal further includes: if, within the preset first time period and at the preset time interval points, the total number of times the first data returned by the disk of the local node is obtained is less than the first count threshold, the detection result of the local IO flow detection is determined to be abnormal.

In a possible implementation, the acquisition module 61 is specifically configured to obtain, within a preset second time period, the first data returned by the disk of the local node in response to the first read-data instruction. The detection result of the local IO flow detection being abnormal further includes: if the difference between the time at which the first data returned by the disk of the local node is obtained and the preset second time is greater than the first time threshold, the detection result of the local IO flow detection is determined to be abnormal, where the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.

In a possible implementation, the acquisition module 61 is specifically configured to obtain, within a preset first time period, the first data returned by the disk of the local node in response to the first read-data instruction. The detection result of the local IO flow detection being abnormal further includes: within the preset first time period, the number of times that the difference between the time at which the first data returned by the disk of the local node is obtained and the preset second time is greater than the first time threshold exceeds the second count threshold, where the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.

The server provided by this embodiment of the present application can carry out the technical solutions shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.

In a possible implementation, the acquisition module 61 is specifically configured so that the first node or the second node issues a second read-data instruction to the disk in the peer node, where the second read-data instruction is used to read second data from the disk in the peer node, and to obtain the second data returned by the disk in the peer node in response to the second read-data instruction. If the second data returned by the disk in the peer node is obtained, the detection result of the peer IO flow detection is determined to be normal; if the second data returned by the disk in the peer node is not obtained, the detection result of the peer IO flow detection is determined to be abnormal.

The server provided by this embodiment of the present application can carry out the technical solutions shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.

In a possible implementation, the processing module 62 is specifically configured to: determine that the RAID state of the first node is abnormal when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal; determine that the RAID state of the second node is abnormal when the detection result of the second local IO flow detection is abnormal and the detection result of the second peer IO flow detection is normal; determine that the IO flow detection service of the first node is abnormal when the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, and the detection result of the second local IO flow detection is normal; and determine that the IO flow detection service of the second node is abnormal when the detection result of the second local IO flow detection is abnormal, the detection result of the second peer IO flow detection is abnormal, and the detection result of the first local IO flow detection is normal.

The server provided by this embodiment of the present application can carry out the technical solutions shown in the foregoing method embodiments; its implementation principle and beneficial effects are similar and are not repeated here.

In a possible implementation, the cluster service system is in a hot standby scenario, the first node is the primary node, and the second node is the standby node. The processing module 62 is specifically configured to: when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switch the cluster service from the first node to the second node and raise an alarm for a RAID card soft fault on the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, raise an alarm for a RAID card soft fault on the second node; when it is determined that the RAID card states of both the first node and the second node are abnormal, raise an alarm for RAID card soft faults on the first node and the second node; and when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raise an alarm for an IO flow detection fault on the first node and/or the second node.

In a possible implementation, the cluster service system is in an active-active scenario, the first node and the second node serve as standby nodes for each other, a first cluster service runs on the first node, and a second cluster service runs on the second node. The processing module 62 is specifically configured to: when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switch the first cluster service from the first node to the second node and raise an alarm for a RAID card soft fault on the first node; when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switch the second cluster service from the second node to the first node and raise an alarm for a RAID card soft fault on the second node; when it is determined that the RAID card states of both the first node and the second node are abnormal, raise an alarm for RAID card soft faults on the first node and the second node; and when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raise an alarm for an IO flow detection fault on the first node and/or the second node.
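The handling rules for the hot standby and active-active scenarios differ only in whether a RAID card fault on the second node triggers a switchover. The sketch below makes that difference explicit. The function name `plan_actions`, the mode strings, and the action labels are hypothetical and chosen only for illustration; the sketch returns a plan and does not perform any switchover or alarm itself.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (action kind, target); both are illustrative labels


def plan_actions(mode: str,
                 raid1_abnormal: bool,
                 raid2_abnormal: bool,
                 io_service1_abnormal: bool = False,
                 io_service2_abnormal: bool = False) -> List[Action]:
    """Return the switchover and alarm actions for the given fault state."""
    if mode not in ("hot_standby", "active_active"):
        raise ValueError(f"unknown mode: {mode}")
    actions: List[Action] = []
    # A switchover is planned only when exactly one RAID card is abnormal.
    if raid1_abnormal and not raid2_abnormal:
        actions.append(("switch", "node1->node2"))
    elif raid2_abnormal and not raid1_abnormal and mode == "active_active":
        # The hot standby rules list only an alarm for this case.
        actions.append(("switch", "node2->node1"))
    if raid1_abnormal:
        actions.append(("alarm", "node1 RAID card soft fault"))
    if raid2_abnormal:
        actions.append(("alarm", "node2 RAID card soft fault"))
    if io_service1_abnormal:
        actions.append(("alarm", "node1 IO flow detection fault"))
    if io_service2_abnormal:
        actions.append(("alarm", "node2 IO flow detection fault"))
    return actions


if __name__ == "__main__":
    print(plan_actions("hot_standby", raid1_abnormal=True, raid2_abnormal=False))
    print(plan_actions("active_active", raid1_abnormal=False, raid2_abnormal=True))
```

When both RAID cards are abnormal, the plan contains only the two alarms, since neither node is a healthy switchover target under the rules above.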

The server provided in this embodiment of the present application can execute the technical solutions shown in the foregoing method embodiments. Its implementation principles and beneficial effects are similar and are not repeated here.

FIG. 9 is a schematic structural diagram of a server provided in an embodiment of the present application. As shown in FIG. 9, the server 70 includes a processor 71, a memory 72, and a communication interface 73. The memory 72 is configured to store instructions executable by the processor 71, and the processor 71 is configured to carry out the technical solution of any of the foregoing method embodiments by executing those instructions.

Optionally, the memory 72 may be a separate device or may be integrated with the processor 71.

Optionally, when the memory 72 is a device separate from the processor 71, the server 70 may further include a bus 74 for connecting the foregoing devices.

The server is configured to execute the technical solution of any of the foregoing method embodiments. Its implementation principles and technical effects are similar and are not repeated here.

An embodiment of the present application further provides a cluster service system. The cluster service system includes at least one first node and at least one second node, where the first node is a primary node and the second node is a standby node, and the first node executes the technical solution of any of the foregoing method embodiments.

A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A cluster service processing method, characterized in that the method is applied to a cluster service system comprising a first node and a second node, and the method comprises:
when the heartbeat network of the first node and the heartbeat network of the second node are normal, obtaining a detection result of local input/output (IO) flow detection;
when the detection result of the local IO flow detection is abnormal, obtaining a detection result of peer IO flow detection; and
determining a fault cause based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection, and processing the cluster service according to the fault cause;
wherein the local IO flow detection comprises first local IO flow detection, performed by the first node on the IO flow between a first redundant array of independent disks (RAID) card and a first disk in the first node, and second local IO flow detection, performed by the second node on the IO flow between a second RAID card and a second disk in the second node; and
the peer IO flow detection comprises first peer IO flow detection, performed by the first node on the IO flow between the second RAID card and the second disk in the second node, and/or second peer IO flow detection, performed by the second node on the IO flow between the first RAID card and the first disk in the first node.
2. The cluster service processing method according to claim 1, characterized in that obtaining the detection result of the local IO flow detection comprises:
the first node and the second node each issuing a first data read instruction to the disk of the local node, wherein the first data read instruction is used to read first data from the disk in the local node; and
obtaining the first data returned by the disk of the local node in response to the first data read instruction;
wherein the detection result of the local IO flow detection being abnormal comprises: the first data returned by the disk of the local node is not obtained.
3. The cluster service processing method according to claim 2, characterized in that obtaining the first data returned by the disk of the local node in response to the first data read instruction comprises:
obtaining, within a preset first time period and at preset time intervals, the first data returned by the disk of the local node in response to the first data read instruction;
wherein the detection result of the local IO flow detection being abnormal further comprises: if, within the preset first time period and at the preset time intervals, the total number of times the first data returned by the disk of the local node is obtained is less than a first count threshold, determining that the detection result of the local IO flow detection is abnormal.
4. The cluster service processing method according to claim 2, characterized in that obtaining the first data returned by the disk of the local node in response to the first data read instruction comprises:
obtaining, within a preset second time period, the first data returned by the disk of the local node in response to the first data read instruction;
wherein the detection result of the local IO flow detection being abnormal further comprises: if the difference between the acquisition time of the first data returned by the disk of the local node and the preset second time is greater than a first time threshold, determining that the detection result of the local IO flow detection is abnormal;
wherein the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.
5. The cluster service processing method according to claim 2, characterized in that obtaining the first data returned by the disk of the local node in response to the first data read instruction comprises:
obtaining, within a preset first time period, the first data returned by the disk of the local node in response to the first data read instruction;
wherein the detection result of the local IO flow detection being abnormal further comprises: within the preset first time period, the number of times that the difference between the acquisition time of the first data returned by the disk of the local node and a preset second time is greater than a first time threshold exceeds a second count threshold; wherein the preset second time is the maximum time within which the first data returned by the disk of the local node is normally obtained.
6. The cluster service processing method according to claim 1, characterized in that obtaining the detection result of the peer IO flow detection comprises:
the first node or the second node issuing a second data read instruction to the disk of the peer node, wherein the second data read instruction is used to read second data from the disk of the peer node; and
obtaining the second data returned by the disk of the peer node in response to the second data read instruction; if the second data returned by the disk of the peer node is obtained, determining that the detection result of the peer IO flow detection is normal; and if the second data returned by the disk of the peer node is not obtained, determining that the detection result of the peer IO flow detection is abnormal.
7. The cluster service processing method according to claim 1, characterized in that determining the fault cause based on the detection result of the peer IO flow detection and the detection result of the local IO flow detection comprises:
when the detection result of the first local IO flow detection is abnormal and the detection result of the first peer IO flow detection is normal, determining that the RAID state of the first node is abnormal;
when the detection result of the second local IO flow detection is abnormal and the detection result of the second peer IO flow detection is normal, determining that the RAID state of the second node is abnormal;
when the detection result of the first local IO flow detection is abnormal, the detection result of the first peer IO flow detection is abnormal, and the detection result of the second local IO flow detection is normal, determining that the IO flow detection service of the first node is abnormal; and
when the detection result of the second local IO flow detection is abnormal, the detection result of the second peer IO flow detection is abnormal, and the detection result of the first local IO flow detection is normal, determining that the IO flow detection service of the second node is abnormal.
8. The cluster service processing method according to claim 7, characterized in that processing the cluster service according to the fault cause comprises:
the cluster service system being in a hot standby scenario, the first node being a primary node, and the second node being a standby node;
when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the cluster service from the first node to the second node, and raising an alarm for a RAID card soft fault on the first node;
when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, raising an alarm for a RAID card soft fault on the second node;
when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, raising an alarm for RAID card soft faults on the first node and the second node; and
when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raising an alarm for an IO flow detection fault on the first node and/or the second node.
9. The cluster service processing method according to claim 7, characterized in that processing the cluster service according to the fault cause comprises:
the cluster service system being in an active-active scenario, the first node and the second node serving as standby nodes for each other, a first cluster service running on the first node, and a second cluster service running on the second node;
when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is normal, switching the first cluster service from the first node to the second node, and raising an alarm for a RAID card soft fault on the first node;
when it is determined that the RAID card state of the second node is abnormal and the RAID card state of the first node is normal, switching the second cluster service from the second node to the first node, and raising an alarm for a RAID card soft fault on the second node;
when it is determined that the RAID card state of the first node is abnormal and the RAID card state of the second node is abnormal, raising an alarm for RAID card soft faults on the first node and the second node; and
when it is determined that the IO flow detection service of the first node and/or the second node is abnormal, raising an alarm for an IO flow detection fault on the first node and/or the second node.
10. A cluster service system, characterized by comprising:
at least one first node and at least one second node, wherein the first node is a primary node and the second node is a standby node;
wherein the first node executes the cluster service processing method according to any one of claims 1 to 9.
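The count and latency criteria recited for the local IO flow detection (too few successful reads in a window, a single read that is too slow, or too many slow reads in a window) can be illustrated with the sketch below. All function and parameter names are hypothetical. The sketch assumes each probe is recorded as a latency value in arbitrary but consistent units, with `None` standing for a read that returned no data.

```python
from typing import Optional, Sequence


def abnormal_by_success_count(latencies: Sequence[Optional[float]],
                              min_successes: int) -> bool:
    """Abnormal if fewer reads than the first count threshold returned data."""
    successes = sum(1 for t in latencies if t is not None)
    return successes < min_successes


def abnormal_by_latency(latency: Optional[float],
                        max_normal: float,
                        time_threshold: float) -> bool:
    """Abnormal if no data came back, or if the read exceeded the normal
    maximum time by more than the first time threshold."""
    if latency is None:
        return True
    return latency - max_normal > time_threshold


def abnormal_by_slow_count(latencies: Sequence[Optional[float]],
                           max_normal: float,
                           time_threshold: float,
                           max_slow: int) -> bool:
    """Abnormal if the number of slow reads in the window exceeds the
    second count threshold."""
    slow = sum(1 for t in latencies
               if t is not None and t - max_normal > time_threshold)
    return slow > max_slow


if __name__ == "__main__":
    window = [5, None, 40, 60]  # one probe per preset interval
    print(abnormal_by_success_count(window, min_successes=4))
    print(abnormal_by_latency(40, max_normal=20, time_threshold=10))
    print(abnormal_by_slow_count(window, max_normal=20, time_threshold=10, max_slow=1))
```

Concrete threshold values, window lengths, and probe intervals are not given in the text and would have to be chosen per deployment.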
PCT/CN2023/134453 2023-05-19 2023-11-27 Cluster service processing method, server, and system Pending WO2024239569A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310598620.9A CN116668335A (en) 2023-05-19 2023-05-19 A cluster service processing method, server and system
CN202310598620.9 2023-05-19

Publications (1)

Publication Number Publication Date
WO2024239569A1 true WO2024239569A1 (en) 2024-11-28

Family

ID=87713022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/134453 Pending WO2024239569A1 (en) 2023-05-19 2023-11-27 Cluster service processing method, server, and system

Country Status (2)

Country Link
CN (1) CN116668335A (en)
WO (1) WO2024239569A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116668335A (en) * 2023-05-19 2023-08-29 超聚变数字技术有限公司 A cluster service processing method, server and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030182592A1 (en) * 2002-03-22 2003-09-25 Dieter Massa Failure detection and failure handling in cluster controller networks
JP2011076528A (en) * 2009-10-01 2011-04-14 Nec Corp Method and device for providing redundancy to raid card
CN107247564A (en) * 2017-07-17 2017-10-13 郑州云海信息技术有限公司 A kind of method and system of data processing
CN115686951A (en) * 2021-07-30 2023-02-03 网联清算有限公司 Method and device for troubleshooting a database server
CN116668335A (en) * 2023-05-19 2023-08-29 超聚变数字技术有限公司 A cluster service processing method, server and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629225B (en) * 2011-12-31 2014-05-07 华为技术有限公司 Dual-controller disk array, storage system and data storage path switching method
CN103354503A (en) * 2013-05-23 2013-10-16 浙江闪龙科技有限公司 Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof
CN104268038B (en) * 2014-10-09 2017-03-08 浪潮(北京)电子信息产业有限公司 The high-availability system of disk array
CN106407052B (en) * 2015-07-31 2019-09-13 华为技术有限公司 A method and device for detecting a magnetic disk
CN109358808B (en) * 2018-09-26 2021-06-29 郑州云海信息技术有限公司 A data processing method, system and related components
CN111209146B (en) * 2019-12-23 2023-08-22 曙光信息产业(北京)有限公司 RAID card aging test method and system
CN114625680A (en) * 2022-03-16 2022-06-14 长沙景嘉微电子股份有限公司 Disk array storage device and hot switching method

Also Published As

Publication number Publication date
CN116668335A (en) 2023-08-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23938236

Country of ref document: EP

Kind code of ref document: A1