CN119814800A - A cluster management method, device, system, storage medium and computer program product - Google Patents
A cluster management method, device, system, storage medium and computer program product
- Publication number: CN119814800A
- Application number: CN202411793243.5A
- Authority: CN (China)
- Prior art keywords: node, cluster, proxy, proxy node, information
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Hardware Redundancy (AREA)
Abstract
The embodiment of the application discloses a cluster management method applied to a server node. The method includes: receiving online event information reported by a first proxy node, where the first proxy node runs in a first cluster node and is used for performing node management on the first cluster node, and the online event information includes attribute characteristic parameters of the first cluster node perceived by the first proxy node when it runs; determining a working role of the first proxy node based on the online event information, where the working role is a master node or a slave node; and sending first indication information to the first proxy node, where the first indication information is used to instruct the first proxy node to execute an operation corresponding to the working role. The embodiment of the application also discloses a first cluster management device, a second cluster management device, a system, a storage medium and a computer program product.
Description
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a cluster management method, apparatus, system, storage medium, and computer program product.
Background
With the development of information technology, high availability clusters (High Availability Cluster, HA Cluster) play an increasingly important role in guaranteeing critical business continuity and data integrity. HA clusters reduce the risk of single points of failure through redundancy and failover mechanisms, and improve the availability of the system. However, existing HA cluster management schemes are generally designed for specific cluster types and topologies and lack generality, so that different failover strategies need to be developed and maintained in different environments, which increases the complexity and maintenance costs of the system.
Disclosure of Invention
In view of this, embodiments of the present application provide a cluster management method, apparatus, system, storage medium, and computer program product, which solve the problem of poor generality of current HA cluster management solutions and provide a reliable cluster management method. The method can be applied to different cluster types and topologies in different environments, and therefore has generality and reduces the complexity of the system.
In order to achieve the above purpose, the technical scheme of the application is realized as follows:
In a first aspect, a cluster management method, where the method is applied to a server node, the method includes:
The method comprises the steps of receiving online event information reported by a first proxy node, wherein the first proxy node operates in a first cluster node and is used for carrying out node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived when the first proxy node operates;
Determining a working role of the first proxy node based on the online event information, wherein the working role is a master node or a slave node;
and sending first indication information to the first proxy node, wherein the indication information is used for indicating the first proxy node to execute the operation corresponding to the working role.
In the above scheme, the method further comprises:
and recording node identification information of the first proxy node and the working role.
In the above solution, the determining, based on the online event information, the working role of the first proxy node includes:
And calculating the online event information by adopting a finite state machine FSM, and determining the working role of the first proxy node.
In the above solution, after the sending the indication information to the first proxy node, the method further includes:
If fault notification information sent by the first proxy node is received, determining a current working role of a fault proxy node indicated in the fault notification information based on the fault notification information, wherein the fault proxy node is the first proxy node or a second proxy node perceived by the first proxy node, and a second cluster node running the second proxy node and the first cluster node belong to the same cluster;
Based on the current working role, determining a management operation corresponding to the fault proxy node;
and executing the management operation.
In the above solution, the determining, based on the current working role, a management operation corresponding to the failed proxy node includes:
If the current working role is the master node, calculating, by using a finite state machine (FSM), at least one third proxy node that is currently managed and in a working state, and determining a fourth proxy node to be used as the master node, wherein the second cluster node and the cluster node running the fourth proxy node belong to the same cluster;
And sending second indication information to the fourth proxy node, wherein the second indication information is used for instructing the fourth proxy node to execute the master node function.
In the above solution, if the current working role is the master node, after the finite state machine FSM is used to calculate the at least one third proxy node that is currently managed and in a working state and the fourth proxy node to be used as the master node is determined, the method further includes:
Based on the network information of the fourth proxy node, the routing configuration of the reverse proxy is modified to subsequently forward access requests sent to the failed proxy node to the fourth proxy node.
In the above scheme, the method further comprises:
Generating fault prompt information of faults of fault cluster nodes corresponding to the fault proxy nodes;
And outputting the fault prompt information.
In the above scheme, the method further comprises:
Acquiring user characteristic configuration information of the first cluster node;
And sending the user characteristic configuration information to the first proxy node so that the first proxy node accesses the first cluster node based on the user characteristic configuration information.
In the above scheme, the method further comprises:
The server node comprises one or more management nodes, and the load managed by each management node is obtained by distributing the load of the cluster system by adopting a consistent hash algorithm.
In a second aspect, a cluster management method applied to a first proxy node running in a first cluster node, the method comprising:
If the first proxy node operates for the first time, determining online event information;
sending the online event information to a server node;
receiving first indication information sent by the server node, wherein the first indication information is generated by the server node based on the online event information;
And calling the resource information of the first cluster node to realize the function corresponding to the working role indicated by the first indication information based on the first indication information.
In the above scheme, the method further comprises:
receiving user characteristic configuration information of the first cluster node sent by the server node;
And configuring the user characteristic configuration information.
In the above scheme, the method further comprises:
performing health detection on the first cluster node according to a preset period to obtain a detection result;
If the detection result indicates that the health detection has failed for a preset number of consecutive times, generating fault notification information indicating that the first cluster node has a fault;
and sending the fault notification information to the server node.
In the above scheme, the method further comprises:
If the heartbeat information of the second proxy node is not sensed, generating fault notification information of faults of the second cluster node, wherein the second cluster node running the second proxy node and the first cluster node belong to the same cluster;
and sending the fault notification information to the server node.
In a third aspect, a first cluster management device is applied to a server node, and the device comprises a first receiving unit, a first determining unit and a first sending unit, wherein:
The first receiving unit is used for receiving the online event information reported by a first proxy node, wherein the first proxy node operates in a first cluster node and is used for carrying out node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived by the first proxy node during operation;
The first determining unit is used for determining the working role of the first proxy node based on the online event information, wherein the working role is a master node or a slave node;
the first sending unit is configured to send first indication information to the first proxy node, where the indication information is configured to instruct the first proxy node to execute an operation corresponding to the working role.
In a fourth aspect, a second cluster management device is provided, where the device is applied to a first proxy node, and the device includes a second determining unit, a second sending unit, a second receiving unit, and a calling unit, where:
The second determining unit is configured to determine online event information if the first proxy node runs for the first time;
the second sending unit is configured to send the online event information to a server node;
The second receiving unit is used for receiving first indication information sent by the server node, wherein the first indication information is generated by the server node based on the online event information;
And the calling unit is used for calling the resource information of the first cluster node based on the first indication information to realize the function corresponding to the working role indicated by the first indication information.
In a fifth aspect, a cluster management system at least includes a cluster, a server node, and one or more proxy nodes including a first proxy node, wherein:
the server node is configured to implement the step of the cluster management method described in any one of the foregoing cluster management methods;
the first proxy node is configured to implement the step of the cluster management method described in any one of the foregoing cluster management methods.
In a sixth aspect, a storage medium has stored thereon a cluster management program, which when executed by a processor, implements the steps of the cluster management method according to any of the preceding claims.
In a seventh aspect, a computer program product comprising a computer program which, when executed by a processor, implements the steps of the cluster management method as claimed in any one of the preceding claims.
According to the cluster management method, device, system, storage medium and computer program product provided by the embodiments of the application, if the first proxy node runs for the first time, it determines online event information and sends the online event information to the server node. After receiving the online event information reported by the first proxy node, the server node determines the working role of the first proxy node based on the online event information and sends first indication information to the first proxy node. The first proxy node receives the first indication information sent by the server node and, based on the first indication information, invokes resource information of the first cluster node to implement the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are achieved by controlling the proxy node. This solves the problem of poor generality of conventional HA cluster management schemes and provides a reliable cluster management method that can be applied to different cluster types and topologies in different environments, thereby achieving generality and reducing the complexity of the system.
Drawings
Fig. 1 is a first schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 2 is a second schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 3 is a third schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 4 is a fourth schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 5 is a fifth schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 6 is a sixth schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 7 is a seventh schematic flowchart of a cluster management method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a relationship structure between proxy nodes and cluster nodes according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an FSM according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a routing configuration according to an embodiment of the present application;
Fig. 11 is a schematic diagram of an alarm provided in an embodiment of the present application;
Fig. 12 is a schematic application structure diagram of multiple server nodes according to an embodiment of the present application;
Fig. 13 is a schematic structural diagram of a first cluster management device according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of a second cluster management device according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a cluster management system according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
An embodiment of the present application provides a cluster management method, which is applied to a server node, and is shown with reference to fig. 1, and the method includes the following steps:
And step 101, receiving the online event information reported by the first proxy node.
The first proxy node operates in the first cluster node, and is used for performing node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived by the first proxy node during operation.
In the embodiment of the application, the server node is a management node that manages at least one proxy node. The first proxy node runs in a first cluster node of the cluster system to be managed, so that the server node manages the first cluster node in which the first proxy node runs. The first proxy node may be manually installed into the first cluster node by a user, or may be automatically installed into the first cluster node under the control of the server node. After the first proxy node is installed in the first cluster node and starts to run, it acquires the attribute characteristic parameters of the first cluster node where it is located, generates the online event information of the first proxy node, and sends the online event information to the server node.
Step 102, determining the working role of the first proxy node based on the online event information.
The working role is a master node or a slave node.
In the embodiment of the application, the master node (Master) is a node capable of comprehensively managing the slave nodes (Slave). The server node performs information analysis and decision processing on the received online event information and determines the working role of the first proxy node. In general, among the proxy nodes corresponding to one cluster, only one master node is set, and the other proxy nodes are slave nodes.
Step 103, sending the first indication information to the first proxy node.
The indication information is used for indicating the first proxy node to execute the operation corresponding to the working role.
In the embodiment of the application, after the service end node determines the working role of the first proxy node, the service end node generates the first indication information and sends the first indication information to the first proxy node so as to control the first proxy node to execute the operation corresponding to the working role and realize the function corresponding to the working role.
According to the cluster management method provided by the embodiment of the application, after the server node receives the online event information reported by the first proxy node, it determines the working role of the first proxy node based on the online event information and sends the first indication information to the first proxy node, so that the first proxy node receives the first indication information sent by the server node and, based on the first indication information, invokes the resource information of the first cluster node to implement the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are achieved by controlling the proxy node. This solves the problem of poor generality of conventional HA cluster management schemes and provides a reliable cluster management method that can be applied to different cluster types and topologies in different environments, thereby achieving generality and reducing the complexity of the system.
Based on the foregoing embodiments, an embodiment of the present application provides a cluster management method, referring to fig. 2, the method is applied to a first proxy node, and the method includes the following steps:
Step 201, if the first proxy node operates for the first time, determining the online event information.
In the embodiment of the present application, the online event information includes at least the topology type of the cluster to which the first cluster node belongs and the current state of that cluster, and may further include information such as the remaining resources of the cluster. The current state of the cluster to which the first cluster node belongs may include information such as which cluster nodes the cluster contains, the role of each cluster node, and whether each cluster node is online. When the first proxy node starts to run after being installed, it determines the online event information corresponding to the first cluster node.
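To make the content of the online event information concrete, the following is a minimal sketch of how such an event might be represented; the Go structure and field names are illustrative assumptions and are not defined by this application.

```go
// Hypothetical shape of the online event information an agent reports on
// first start-up; field names are illustrative only.
package event

// NodeState describes one cluster node as perceived by the reporting agent.
type NodeState struct {
	NodeID string // identifier of the cluster node
	Role   string // "master", "slave", or "unknown"
	Online bool   // whether the node is currently reachable
}

// OnlineEvent carries the attribute characteristic parameters of the
// first cluster node perceived when the proxy node starts running.
type OnlineEvent struct {
	ClusterID    string      // cluster to which the first cluster node belongs
	TopologyType string      // e.g. "master-slave", "master-standby"
	ClusterState []NodeState // current view of all known cluster nodes
	FreeCPU      float64     // remaining resources perceived by the agent
	FreeMemoryMB int64
}
```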
Step 202, sending the online event information to the server node.
In the embodiment of the application, the first proxy node sends the determined online event information to the server node through a communication interface between the first proxy node and the server node, so that after the server node receives the online event information sent by the first proxy node, the online event information sent by the first proxy node is analyzed, the working role of the first proxy node is determined, the first indication information is generated based on the working role of the first proxy node, and then the first indication information is sent to the first proxy node.
Step 203, receiving first indication information sent by a server node.
The first indication information is generated by the server node based on the online event information.
In the embodiment of the application, the first proxy node receives, through the communication interface between the first proxy node and the server node, the first indication information sent by the server node.
And 204, calling the resource information of the first cluster node to realize the function corresponding to the working role indicated by the first indication information based on the first indication information.
In the embodiment of the application, the first proxy node responds to the received first indication information, and invokes the resource information corresponding to the first cluster node according to the first indication information, so as to realize the function corresponding to the working role indicated by the first indication information.
In the cluster management method provided by the embodiment of the application, if the first proxy node runs for the first time, it determines online event information and sends the online event information to the server node; the first proxy node then receives the first indication information sent by the server node and, based on the first indication information, invokes the resource information of the first cluster node to implement the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are achieved by controlling the proxy node. This solves the problem of poor generality of conventional HA cluster management schemes and provides a reliable cluster management method that can be applied to different cluster types and topologies in different environments, thereby achieving generality and reducing the complexity of the system.
Based on the foregoing embodiments, an embodiment of the present application provides a cluster management method, referring to fig. 3, including the steps of:
step 301, if the first proxy node operates for the first time, the first proxy node determines the online event information.
In the embodiment of the application, when the server node determines that a certain cluster needs to be managed, the user installs a first proxy node in each cluster node included in the cluster according to an instruction, and when each installed first proxy node runs for the first time, it acquires the online event information corresponding to the environment where it is located.
Step 302, the first proxy node sends the online event information to the server node.
In the embodiment of the application, the first proxy node sends the online event information to the server node, performs online registration at the server node, and requests the server node to determine the working role of the first proxy node according to the online event information.
Step 303, the server node receives the online event information reported by the first proxy node.
The first proxy node operates in the first cluster node, and is used for performing node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived by the first proxy node during operation.
Step 304, the service end node determines the working role of the first proxy node based on the online event information.
The working role is a master node or a slave node.
In the embodiment of the application, the server node analyzes the received online event information by adopting a preset analysis mode, and determines to obtain the working role of the first proxy node.
In some application scenarios, after the service end node receives the online event information of all the managed proxy nodes, the service end node performs working role analysis on all the online event information, so as to determine and obtain the working roles of the first proxy node.
Step 305, the server node sends first indication information to the first proxy node.
The indication information is used for indicating the first proxy node to execute the operation corresponding to the working role.
In the embodiment of the application, after the first proxy node is successfully registered, the server node sends the determined working role of the first proxy node to the first proxy node through the first indication information.
Step 306, the first proxy node receives the first indication information sent by the server node.
The first indication information is generated by the server node based on the online event information.
Step 307, the first proxy node invokes the resource information of the first cluster node to implement the function corresponding to the working role indicated by the first indication information based on the first indication information.
In the embodiment of the application, the first proxy node calls some resource information such as codes, function parameters and the like in the first cluster node according to the first indication information to realize the function corresponding to the working role indicated by the first indication information.
For example, when the working role of the first proxy node is determined to be the master node, the first proxy node invokes the resource information of the first cluster node to implement the master node function, for example, receiving service requests sent by users and providing services for them, receiving service data sent by other proxy nodes serving as slave nodes, or controlling the to-be-stored data sent by users to be stored via the other proxy nodes. If the working role of the first proxy node is determined to be a slave node, the first proxy node executes the tasks indicated by the master node and performs service communication with the master node.
Based on the foregoing embodiment, in other embodiments of the present application, after the server node performs step 304, referring to fig. 4, the server node is further configured to perform step 308:
step 308, the server node records node identification information and working roles of the first proxy node.
In the embodiment of the present application, the node identification information of the first proxy node may be information for uniquely identifying the first proxy node, for example, may be a node label, a name, a serial number, or the like of the first proxy node. After determining the working roles of the first proxy node, the server node also establishes a corresponding relation between the node identification information of the first proxy node and the working roles thereof, records the corresponding relation to obtain the record information, and then carries out corresponding adjustment operation according to the record information.
Based on the foregoing embodiment, in other embodiments of the present application, step 304 may be implemented by calculating the online event information using a finite state machine FSM to determine the working role of the first proxy node.
In the embodiment of the application, when determining the working role of the first proxy node, the server node may use a finite state machine (Finite State Machine, FSM) to analyze and calculate the online event information of the first proxy node.
Based on the foregoing embodiments, in other embodiments of the present application, referring to fig. 5, after the first proxy node performs step 302, the first proxy node is further configured to perform steps 309 to 311, or steps 312 to 313:
Step 309, the first proxy node performs health detection on the first cluster node according to a preset period, so as to obtain a detection result.
In the embodiment of the application, the preset period is an experience time period obtained according to a large number of experiments or an experience period set by a user according to actual requirements. And the first proxy node carries out health detection on the first cluster node where the first proxy node is located according to a preset period set in advance, and a detection result is obtained. The process of health detection may be, for example, whether the operation state of the first cluster node is normal, whether the service state of the provided service is normal, etc.
Step 310, if the detection result indicates that the health detection has failed for a preset number of consecutive times, the first proxy node generates fault notification information indicating that the first cluster node has a fault.
In the embodiment of the application, the preset number of times may be an empirical value obtained from a large number of experiments, or may be a value set by the user according to actual needs, for example, 1, 2 or 5 times. When, within a period of time, the first proxy node detects that the corresponding first cluster node has failed the health detection for the preset number of consecutive times, the first proxy node generates fault notification information indicating that the first cluster node has a fault.
Step 311, the first proxy node sends the fault notification information to the server node.
Step 312, if the heartbeat information of the second proxy node is not sensed, the first proxy node generates a fault notification message that the second cluster node has a fault.
Wherein the second cluster node running the second proxy node and the first cluster node belong to the same cluster.
In the embodiment of the application, the proxy node running in the cluster periodically generates heartbeat information to indicate the normal operation of the proxy node, so that when other proxy nodes cannot detect the heartbeat information sent by a certain proxy node, the fault of the cluster node corresponding to the proxy node can be determined, and thus, other proxy nodes can generate fault notification information of the fault of the cluster node corresponding to the proxy node and send the fault notification information to the server node.
Step 313, the first proxy node sends the fault notification information to the server node.
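As an illustration of steps 309 to 313, the following is a minimal agent-side sketch of periodic health detection with a consecutive-failure threshold; the function names checkHealth and reportFault are assumed helpers, not interfaces defined by this application.

```go
// Illustrative sketch of agent-side fault detection: run a health check
// every preset period and report a fault only after the check has failed
// a preset number of consecutive times.
package agent

import "time"

func monitorClusterNode(period time.Duration, maxFailures int,
	checkHealth func() bool, reportFault func(reason string)) {
	failures := 0
	ticker := time.NewTicker(period) // preset detection period
	defer ticker.Stop()
	for range ticker.C {
		if checkHealth() {
			failures = 0 // any successful check resets the counter
			continue
		}
		failures++
		if failures >= maxFailures { // preset number of consecutive failures
			reportFault("cluster node failed health detection")
			failures = 0
		}
	}
}
```

Detection of a lost gossip heartbeat (steps 312 to 313) could be handled by an analogous loop that reports a fault when no heartbeat from the second proxy node has been observed within a timeout.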
Correspondingly, after the first proxy node performs steps 309 to 311 or steps 312 to 313, the server node may perform steps 314 to 316:
Step 314, if the fault notification information sent by the first proxy node is received, the server node determines, based on the fault notification information, a current working role of the fault proxy node indicated in the fault notification information.
The fault proxy node is the first proxy node or the second proxy node perceived by the first proxy node, and the second cluster node running the second proxy node and the first cluster node belong to the same cluster.
In the embodiment of the application, after receiving the fault notification information sent by the first proxy node, the server node analyzes the fault notification information together with the recorded node identification information and working roles of all managed proxy nodes, and determines the current working role of the fault proxy node indicated in the fault notification information.
Step 315, the server node determines a management operation corresponding to the fault proxy node based on the current working role.
In the embodiment of the application, the server node determines, according to the working role of the fault proxy node, the management operation required to manage the fault proxy node.
Step 316, the server node performs management operations.
In the embodiment of the application, for example, when the current working role of the fault proxy node is the master node, the server node needs to determine a new master node from other proxy nodes to replace the work of the fault proxy node, and at the moment, the corresponding management operation is to indicate the new proxy node as the master node and update the state of the fault proxy node in the record information to be the fault state.
Based on the foregoing embodiments, in other embodiments of the present application, step 315 may be implemented by steps 315a to 315 b:
Step 315a, if the current working role is the master node, the server node calculates, by using the finite state machine FSM, at least one third proxy node that is currently managed and in a working state, and determines a fourth proxy node to be used as the master node.
Wherein the second cluster node and the cluster node running the fourth proxy node belong to the same cluster.
In the embodiment of the application, when the current working role of the fault proxy node is the master node, the server node uses the FSM to process the at least one third proxy node that is currently managed and in a working state, that is, working normally. Specifically, the FSM calculates and analyzes the information parameters, corresponding to the online event information, that are periodically reported by each third proxy node, and determines, from the at least one third proxy node, a fourth proxy node that can be used as the master node.
Step 315b, the server node sends the second indication information to the fourth proxy node.
The second indication information is used for indicating the fourth proxy node to execute the main node function.
Based on the foregoing embodiment, in other embodiments of the present application, after the server node performs step 315a, the server node is further configured to perform step 315c:
step 315c, the server node modifies the routing configuration of the reverse proxy based on the network information of the fourth proxy node, so as to forward the access request sent to the failed proxy node to the fourth proxy node subsequently.
In the embodiment of the present application, the network information of the fourth proxy node may be network parameter information of the fourth proxy node such as its Internet Protocol (IP) address. After the server node determines the fourth proxy node as the master node, it further modifies the routing configuration of the reverse proxy so that it corresponds to the fourth proxy node, so that the switch can be made completely transparent to the service.
Based on the foregoing embodiments, in other embodiments of the present application, referring to fig. 6, after the server node performs step 314, the server node is further configured to perform steps 317 to 318:
step 317, the server node generates fault prompt information of faults of the fault cluster nodes corresponding to the fault proxy nodes.
In the embodiment of the application, the server node can also generate the fault prompt information of the fault occurrence of the fault cluster node, so that the server node can be used for rapidly positioning the fault cluster node and repairing the fault of the fault cluster node.
And step 318, the server node outputs fault prompt information.
In the embodiment of the application, the server node outputs the fault prompt information to the display area of the display device of the server node for display, or the server node outputs the fault prompt information to the communication device with the communication link with the server node for displaying the fault prompt information on the side of the communication device. The communication device may be a mobile terminal device of a user.
Based on the foregoing embodiments, in other embodiments of the present application, referring to fig. 7, the server node is further configured to execute steps 319 to 322:
Step 319, the server node obtains user feature configuration information of the first cluster node.
In the embodiment of the application, the user characteristic configuration information refers to operating characteristic parameters configured by the user for the first cluster node when it runs, and may be parameters such as the user name and user password of the first cluster node and the port information invoked by the first cluster node. The server node has relatively high authority and can dynamically acquire the user characteristic configuration information of the cluster node to be managed.
Step 320, the server node sends the user feature configuration information to the first proxy node.
And sending the user characteristic configuration information to the first proxy node so that the first proxy node accesses the first cluster node based on the user characteristic configuration information.
In the embodiment of the application, the service end node sends the user characteristic information of the first cluster node where the first proxy node is located to the first proxy node, so that the first proxy node can access the first cluster node after receiving the user characteristic configuration information, the dynamic configuration of the first cluster node at the first proxy node is realized, and the first proxy node can be ensured to access the cluster node quickly.
Step 321, the first proxy node receives the user feature configuration information of the first cluster node sent by the server node.
Step 322, the first proxy node configures user feature configuration information.
Based on the foregoing embodiment, in other embodiments of the present application, the server node includes one or more management nodes, and the load managed by each management node is obtained by distributing the load of the cluster system by adopting a consistent hash algorithm.
In the embodiment of the application, the server node may consist of one or more preset management nodes. When the server node includes multiple management nodes, one management node may be selected from them to serve as the server node, or the multiple management nodes may serve simultaneously. This ensures that when a single management node fails, other management nodes can continue to provide service, thereby guaranteeing the management of the cluster.
For ease of understanding, some of the used noun interpretations may be as follows:
A high availability cluster (High Availability Cluster, HA cluster) is a clustered system consisting of multiple servers, aimed at improving the availability and reliability of the overall system through redundancy and failover mechanisms.
Failover refers to the ability of a system to switch to a standby component when one of the components in the system fails, to ensure continued operation of the system and availability of services. This is a common high availability (High Availability, HA) technology that is widely used in server, network, and database critical systems.
Automatic failover (Automatic Failover) refers to the ability of a system to automatically switch to a standby component or path to maintain continuity and availability of service when a computer system, network, or related device encounters a failure. This process generally does not require human intervention, and can respond quickly to faults, reducing system downtime.
The working principle of automatic failover generally includes the following steps: (1) Monitoring: the system constantly monitors the status of critical components, including servers, network connections, databases, applications, etc. (2) Detection: when the monitoring mechanism detects that a component is faulty or abnormal, the failover mechanism is triggered. (3) Switching: the system automatically transfers the workload from the failed component to a preconfigured standby component; this may involve changing network routes, reallocating resources, starting a standby server, etc. (4) Restoration: after the failed component resumes normal operation, the system may switch the workload back to the original component or continue to use the standby component, depending on the configuration of the system and the failure recovery policy.
Advantages of automatic failover include: (1) Reduced downtime: automated handling of failures reduces the time needed for human intervention, and thus reduces system downtime. (2) Improved reliability: the system can continue to operate when a failure occurs, which improves service reliability. (3) Reduced management burden: automated fault handling reduces the workload of information technology (IT) administrators, enabling them to concentrate on other tasks.
Cluster types: clusters can be divided into different types according to the cluster service, such as MySQL clusters, MongoDB clusters, Redis clusters, and so on.
Topology, refers to the physical or logical layout of how computer nodes in a cluster are connected, organized, and co-operative with each other. The cluster topology has an important influence on the performance, scalability, fault tolerance and communication efficiency of the clusters. The common basic topology structures in the HA cluster are a master-slave structure, a master-multiple-slave structure, a chained master-slave structure, a double-master structure and the like.
Based on the foregoing embodiment, the embodiment of the present application provides a system structure for managing cluster nodes, which mainly includes server, client and agents, wherein:
The server, corresponding to the server node of the present application, is configured to provide arbitration service. It can be installed and deployed independently and supports clustered deployment, in which multiple nodes cooperate, so that load balancing can be achieved and high availability can be ensured. Its functions include: providing arbitration service, electing nodes and sending notifications to agents; orchestrating reverse proxy (Ingress) routes to direct traffic; storing cluster configuration mappings and issuing them dynamically; implementing high-availability deployment based on the Raft protocol; and implementing load balancing among multiple server nodes based on a consistent hash (Hash) algorithm.
Client, a command line client for invoking the server, which may call the server's Hypertext Transfer Protocol (HTTP)/Google Remote Procedure Call (gRPC) application programming interfaces (APIs). A system user who wants to use the full functionality of the system does so through the client, for example to manage the cluster system or to query the managed cluster system.
The agent, corresponding to the proxy node of the present application, is installed on each cluster node of the managed cluster. The agent executes a corresponding callback when the server sends an event notification; for example, when the master node fails, a promote callback is executed on a slave node to switch it to the new master. The agent and the server may communicate using the xRPC protocol to support end-to-end encryption.
The operation process when the cluster is managed based on the system structure for managing the cluster nodes can be as follows:
In step a11, when a certain HA cluster needs to be managed, the user creates a xcluster by calling arbiter-server API through the client.
Step a12, after creating the xcluster, the user installs an agent on each cluster node of the HA cluster in turn according to the need.
The agent is used as a daemon to run on the cluster nodes of the HA cluster so as to manage the cluster nodes.
Illustratively, as shown in FIG. 8, taking MySQL master-slave clusters as an example, agents are installed on all cluster nodes of the cluster.
Step a13, the agent node registers online.
After the agent starts running, it joins a gossip network, and the server perceives an online event of the cluster node running the agent. The server then processes the online event through the FSM according to the online event information of that cluster node, namely the topology type of the cluster it belongs to and the current state of the cluster (including which nodes it contains, the role of each node, whether each node is online, and so on), calculates the correct role of the newly online cluster node, records the node information including this role into the database built into the server, and at the same time sends the node information and the corresponding callback instruction (for example, promote) to the agent, thereby completing node registration.
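A hedged sketch of this server-side registration flow follows: compute the role with the topology's FSM, persist it, then notify the agent. The Store and Notifier interfaces and the callback names are assumptions made for illustration, not interfaces defined by this application.

```go
// Illustrative server-side handling of an agent online event.
package arbiter

// onlineEvent carries only the fields this sketch needs.
type onlineEvent struct {
	NodeID       string
	TopologyType string
}

// Store persists node identification information and working roles.
type Store interface {
	SaveNode(nodeID, role string) error
}

// Notifier sends the decided role and callback instruction to an agent.
type Notifier interface {
	NotifyAgent(nodeID, role, callback string) error
}

func handleOnlineEvent(ev onlineEvent, fsmRole func(onlineEvent) string,
	store Store, notifier Notifier) error {
	role := fsmRole(ev) // the topology's FSM decides master or slave
	if err := store.SaveNode(ev.NodeID, role); err != nil {
		return err
	}
	callback := "demote-slave"
	if role == "master" {
		callback = "promote-master"
	}
	return notifier.NotifyAgent(ev.NodeID, role, callback)
}
```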
The FSM can ensure that when the node fails, the state of the current cluster is subjected to established transfer, and a new role is redistributed for each node, so that the high available state of the cluster is maintained. The finite state machine can effectively solve the problem of node fault recovery, and improves the reliability and usability of the system.
The underlying principles of FSM include the following:
State: the FSM state refers to the state of the system at a certain moment; it is an abstract concept that can be represented by an identifier;
Input: the FSM input refers to the external signal received by the system, which may be a character, a number, an event, etc.;
Transfer function: the FSM transfer function refers to the rule by which the system transitions from one state to another. It describes the process by which the system transitions from one state to another given an input;
Output: the FSM output refers to the response of the system to an input in a certain state. The output may be a character, a number, an event, etc.;
Because the state machines of different cluster topologies are different, a corresponding finite state machine needs to be implemented for each cluster topology; for example, in some topologies a standby node also serves as the master node of another node, whereas in the master-standby topology it does not.
Taking the master-slave topology as an example to describe the state transitions of the state machine, the master-slave topology state machine has the following 4 states:
StateInit # initial state, no normal node
StateOnlyMaster # only a Master node
StateOnlySlave # only a Slave node
StateMasterSlave # both Master and Slave nodes; the target state under the master-slave topology
Illustratively, as shown in FIG. 9, the events of node up/down are taken as inputs to the state machine, which will perform transfer functions, outputting new roles for the remaining nodes, thereby maintaining a high availability of clusters.
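As an illustration of these transitions, the following is a minimal sketch in Go of such a master-slave state machine; the transition logic is simplified to one master and one slave and is an assumption made for illustration, not the exact transfer function of this application.

```go
// Minimal master-slave FSM sketch: node up/down events are the inputs,
// the outputs are the next cluster state and, where applicable, the new
// role to assign to a node.
package fsm

type State int

const (
	StateInit        State = iota // initial state, no normal node
	StateOnlyMaster               // only a master node
	StateOnlySlave                // only a slave node
	StateMasterSlave              // target state: master and slave present
)

// Event is a node online/offline notification together with the role the
// affected node held before the event.
type Event struct {
	NodeUp bool
	Role   string // "master", "slave", or "" for a brand-new node
}

// Transition returns the next state and the role to assign as a result of
// the event ("" when no new role needs to be issued).
func Transition(s State, e Event) (State, string) {
	switch {
	case e.NodeUp && s == StateInit:
		return StateOnlyMaster, "master" // first node online becomes master
	case e.NodeUp && s == StateOnlyMaster:
		return StateMasterSlave, "slave" // next node online becomes slave
	case e.NodeUp && s == StateOnlySlave:
		return StateMasterSlave, "master" // restore a master for the lone slave
	case !e.NodeUp && s == StateMasterSlave && e.Role == "master":
		return StateOnlyMaster, "master" // master lost: surviving slave promoted
	case !e.NodeUp && s == StateMasterSlave:
		return StateOnlyMaster, "" // slave lost, master unchanged
	case !e.NodeUp && s != StateInit:
		return StateInit, "" // last normal node lost
	}
	return s, ""
}
```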
Step a14, the agent receives the notification from the server and executes the corresponding callback action according to the configuration file.
For example, assume a master-slave MySQL cluster. When the first node A comes online, the server elects it as the master and notifies it to execute the promote-master callback; when the second node comes online, the arbiter-server elects it as a slave associated with node A, notifies it that it is a slave node of node A, and notifies it to execute the demote-slave callback, thereby implementing the slave node function.
Step a15, fault detection and automatic fault transfer.
When the agent detects either of the following two conditions, it can determine that a cluster node currently has a fault. In the first condition, the agent periodically performs a health check on its cluster node; when the health check fails several times in a row, the agent determines that the cluster node has failed and reports the failure information to the server. In the second condition, when the node where an agent is located fails or its network is interrupted, that agent's gossip heartbeat stops and surrounding agents can no longer sense its heartbeat, so the surrounding agents perceive that the agent has failed; the agents that perceive the failure then transmit the failure information to the server.
In this way, after receiving the fault information about an agent, the server processes the offline event of the corresponding cluster node by using the FSM according to the offline event information such as the cluster topology type and the current state of the cluster node corresponding to that agent, and then updates the roles and associated node information of all current nodes in the recorded information.
And the agent receives the notification of the server and executes the corresponding callback action according to the configuration file, thereby completing automatic fault transfer.
Based on the master-slave MySQL cluster described above, assume that node A fails and node B is to become the new master node. The server notifies node B to execute the promote-master instruction; further, node A can later be selected as a new slave. After receiving the notification sent by the server, the agent of node B executes the callback action corresponding to the promote-master instruction, and node B is switched to the new master, completing the automatic failover.
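To make this concrete, the following is a hedged sketch of how an agent might map a server instruction such as promote-master to the callback action defined in its configuration file; the callback script paths and the exec-based dispatch are illustrative assumptions.

```go
// Illustrative agent-side dispatch of server notifications to the callback
// actions configured for this node.
package agent

import (
	"fmt"
	"os/exec"
)

// callbacks would normally be loaded from the agent's configuration file;
// the paths below are placeholders.
var callbacks = map[string]string{
	"promote-master": "/etc/xcluster/callbacks/promote_master.sh",
	"demote-slave":   "/etc/xcluster/callbacks/demote_slave.sh",
}

func handleNotification(instruction string) error {
	script, ok := callbacks[instruction]
	if !ok {
		return fmt.Errorf("no callback configured for %q", instruction)
	}
	// Run the configured callback so the node takes on its new role.
	return exec.Command("/bin/sh", script).Run()
}
```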
When the agent performs the health check or callback on the cluster node, some configuration information, such as the port, the user name, the password, etc. of MySQL of the current node, is generally required, otherwise the agent may not be able to connect to MySQL to complete the health check or callback. In the embodiment of the application, dynamic configuration can be carried out through server variable configuration management (configmap) so as to support dynamic creation and update of configuration in xcluster clusters, and the created and updated configuration can be pushed to agents on related cluster nodes in real time.
For MySQL clusters, exemplary, configuration information that typically needs to be used is as follows:
Node SSH information
SSH_USER
SSH_PORT
SSH_PASSWD
-MySQL port, username and password
MYSQL_USER
MYSQL_PORT
MYSQL_PASSWD
Then, the corresponding configmap can be created in the server with a command similar to the following:
arbiter-client configmap create --xcluster-id="<xclusterId>" --data="SSH_USER"="sysadm","SSH_PORT"="22345","SSH_PASSWD"="adminsangfornetwork","MYSQL_USER"="root","MYSQL_PORT"="3306","MYSQL_PASSWD"="admin"
Thus, after the command is successfully executed, the server stores the information into its built-in database and supports encrypted storage. If the configuration changes later, it can be dynamically modified by using the client configmap update command, and the modified configuration is immediately pushed to the agents.
In the foregoing embodiment, after the cluster node corresponding to an agent acting as the master node fails and the cluster node corresponding to a new agent is determined as the new master node, failover is achieved, and the switch can be made completely transparent to the service by modifying the routing configuration of the reverse proxy. Illustratively, ProxySQL may be used as the reverse proxy of a MySQL master-slave cluster, as shown in fig. 10, and transparent read-write separation can be achieved.
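As one possible illustration of such a routing update (not part of this application), the sketch below repoints a ProxySQL writer hostgroup at the newly promoted master through ProxySQL's admin interface; the admin port, credentials and hostgroup id are assumptions.

```go
// Illustrative failover step: update the writer entry in ProxySQL's
// mysql_servers table, then load and persist the new configuration.
package failover

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver for the admin port
)

func repointWriter(adminDSN, newMasterHost string, newMasterPort int) error {
	// adminDSN is e.g. "admin:admin@tcp(127.0.0.1:6032)/" (assumed defaults).
	db, err := sql.Open("mysql", adminDSN)
	if err != nil {
		return err
	}
	defer db.Close()

	stmts := []string{
		fmt.Sprintf("UPDATE mysql_servers SET hostname='%s', port=%d WHERE hostgroup_id=10",
			newMasterHost, newMasterPort), // hostgroup 10 assumed to be the writer group
		"LOAD MYSQL SERVERS TO RUNTIME", // apply without restarting ProxySQL
		"SAVE MYSQL SERVERS TO DISK",    // persist across restarts
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			return err
		}
	}
	return nil
}
```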
In the foregoing embodiment, when the cluster node managed by the agent fails, an alarm for the failure may also be implemented. The specific implementation process can be realized through xcluster alarm configuration, namely when determining that the cluster node has faults, the system user can be timely notified of the fault cluster node through the configured global alarm rule or the alarm rule configured at xcluster level. By way of example, one implementation of the alert configuration may be implemented by the following commands:
arbiter-client alert create webhook-alerter --class-name="webhook" --xcluster-id=<xcluster-id> -f=alert.yaml
correspondingly, an example of an alarm implemented may be seen with reference to fig. 11.
It should be noted that, for server applications, a single server node with a minimal deployment can manage a large number of HA clusters. Furthermore, to eliminate the potential single point of failure of the server, the server can also be deployed as a cluster, so that high availability of the server can be achieved and, at the same time, the server can be scaled horizontally, that is, the system load is evenly distributed to multiple server nodes. For example, as shown in fig. 12, when the server cluster has 3 servers, a consistent hash algorithm may be used to split the load: when a server receives an event of the cluster node corresponding to a certain agent, it calculates, according to the xcluster id, whether the event falls within its own management scope; if the event does not belong to it, the event is discarded, and if it does, the event is processed. Thus, for example, server 1 may be responsible for managing the clusters with xcluster ids 1, 4 and 7, server 2 for those with xcluster ids 2, 5 and 8, and server 3 for those with xcluster ids 3, 6 and 9.
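The following is a simplified sketch of such a consistent-hash split; the virtual-node count and CRC32 hashing are illustrative choices, not details specified by this application.

```go
// Each server node builds the same ring; it handles an event only when
// OwnerOf(event's xcluster id) returns its own name, otherwise it discards
// the event.
package arbiter

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type Ring struct {
	points []uint32          // sorted hash ring positions
	owner  map[uint32]string // ring position -> server node name
}

func NewRing(servers []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, s := range servers {
		for i := 0; i < vnodes; i++ { // virtual nodes smooth the distribution
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", s, i)))
			r.points = append(r.points, h)
			r.owner[h] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// OwnerOf returns the server node responsible for a given xcluster id.
func (r *Ring) OwnerOf(xclusterID string) string {
	h := crc32.ChecksumIEEE([]byte(xclusterID))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

Adding or removing a server node only remaps the xcluster ids that fall between the affected ring positions, which is why a consistent hash rather than a simple modulo split would be used here.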
Therefore, the universal HA cluster automatic failover scheme can adapt to HA cluster environments of different types and topologies, which improves the generality and flexibility of HA clusters and reduces the need to customize solutions for different environments. It simplifies HA cluster maintenance and management, allows one system to provide one-stop management for a large number of heterogeneous clusters, and reduces operation and maintenance costs, while further improving the failure recovery speed and data consistency of HA clusters and enhancing business continuity.
It should be noted that, in this embodiment, the explanation of the same steps or concepts as those in other embodiments may refer to the descriptions in other embodiments, and are not repeated here.
In the cluster management method provided by the embodiment of the application, if the first proxy node runs for the first time, it determines online event information and sends the online event information to the server node. After receiving the online event information reported by the first proxy node, the server node determines the working role of the first proxy node based on the online event information and sends first indication information to the first proxy node. The first proxy node receives the first indication information sent by the server node and, based on the first indication information, invokes the resource information of the first cluster node to implement the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are achieved by controlling the proxy node. This solves the problem of poor generality of conventional HA cluster management schemes and provides a reliable cluster management method that can be applied to different cluster types and topologies in different environments, thereby achieving generality and reducing the complexity of the system.
Based on the foregoing embodiments, the present application provides a first cluster management device 4, where the device is applied to a server node, and the first cluster management device 4 may be applied to the embodiments corresponding to fig. 1 to 7, and referring to fig. 13, the first cluster management device 4 includes a first receiving unit 41, a first determining unit 42, and a first sending unit 43, where:
The first receiving unit 41 is configured to receive online event information reported by a first proxy node, where the first proxy node is running in a first cluster node and is configured to perform node management on the first cluster node, and the online event information includes attribute feature parameters of the first cluster node perceived when the first proxy node runs;
A first determining unit 42, configured to determine a working role of the first proxy node based on the online event information, where the working role is a master node or a slave node;
and a first sending unit 43, configured to send first indication information to the first proxy node, where the first indication information is used to instruct the first proxy node to execute an operation corresponding to the working role.
In other embodiments of the present application, the first cluster management device further includes a recording unit, wherein:
and the recording unit is used for recording the node identification information and the working roles of the first proxy node.
In other embodiments of the present application, the first determining unit is specifically configured to implement the following steps:
and calculating the online event information by adopting a finite state machine FSM, and determining the working role of the first proxy node.
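As a rough illustration only, the role decision could be driven by a small state machine such as the Python sketch below. The transition rule used here (the first proxy node of a cluster to report an online event becomes the master, later ones become slaves) and the class name RoleFSM are assumptions for readability; the embodiment only requires that an FSM computes the working role from the online event information.

```python
class RoleFSM:
    """Toy finite state machine that assigns a working role per cluster (illustrative)."""

    def __init__(self):
        self.masters = {}  # cluster_id -> node_id of the proxy currently holding the master role

    def on_online_event(self, cluster_id: str, node_id: str, attributes: dict) -> str:
        # Transition: a cluster with no master promotes the reporting proxy node to MASTER;
        # every subsequent proxy node of that cluster enters the SLAVE state.
        if cluster_id not in self.masters:
            self.masters[cluster_id] = node_id
            return "MASTER"
        return "SLAVE"


fsm = RoleFSM()
print(fsm.on_online_event("cluster-1", "node-a", {"cpu_cores": 8}))  # MASTER
print(fsm.on_online_event("cluster-1", "node-b", {"cpu_cores": 8}))  # SLAVE
```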
In other embodiments of the present application, after the first sending unit sends the first indication information, the first cluster management device further includes a third determining unit, a fourth determining unit, and an execution unit, where:
The third determining unit is used for determining the current working role of the fault proxy node indicated in the fault notification information based on the fault notification information if the fault notification information sent by the first proxy node is received, wherein the fault proxy node is the first proxy node or a second proxy node perceived by the first proxy node, and the second cluster node running the second proxy node and the first cluster node belong to the same cluster;
The fourth determining unit is further used for determining management operation corresponding to the fault proxy node based on the current working role;
And the execution unit is used for executing the management operation.
In other embodiments of the present application, the fourth determining unit includes a determining module and a transmitting module, where:
The determining module is configured to, if the current working role is a master node, use a finite state machine FSM to compute at least one currently managed third proxy node that is in a working state, and determine a fourth proxy node to serve as the master node;
and the sending module is used for sending second instruction information to the fourth proxy node, wherein the second instruction information is used for instructing the fourth proxy node to execute the main node function.
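A hedged sketch of this failover step is given below: when the failed proxy node held the master role, a new master is chosen from the third proxy nodes of the same cluster that are still in a working state, and a second indication is sent to it. The selection rule (lowest node identifier wins) and the send_indication callable are assumptions; the embodiment only requires that the FSM computes the candidates and one of them becomes the fourth proxy node.

```python
def select_new_master(working_proxies):
    # working_proxies: node identifiers of the third proxy nodes still in a working state
    return min(working_proxies) if working_proxies else None


def on_master_failure(cluster_id, working_proxies, send_indication):
    new_master = select_new_master(working_proxies)
    if new_master is None:
        return  # no healthy candidate is available to promote
    # Second indication information: instruct the chosen (fourth) proxy node to run the master function.
    send_indication(new_master, {"cluster": cluster_id, "role": "MASTER"})
```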
In other embodiments of the present application, after the determining module, the fourth determining unit further includes a modification module, where:
And the modification module is used for modifying the routing configuration of the reverse proxy based on the network information of the fourth proxy node so as to forward the access request sent to the fault proxy node to the fourth proxy node later.
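The routing change can be pictured with the small sketch below, in which an in-memory routes dictionary stands in for whatever configuration store the real reverse proxy uses; this is an assumption for illustration, not the concrete mechanism of the embodiment.

```python
routes = {"cluster-1": "10.0.0.11:5432"}  # currently points at the failed master's network address


def reroute(cluster_id: str, new_master_addr: str) -> None:
    # Point the cluster's entry at the fourth proxy node's network information so that
    # later access requests aimed at the failed proxy node reach the new master instead.
    routes[cluster_id] = new_master_addr


reroute("cluster-1", "10.0.0.12:5432")  # assumed address of the fourth proxy node
```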
In other embodiments of the present application, after the fourth determining unit determines the management operation, the first cluster management device further includes a generating unit and an output unit, where:
The generating unit is configured to generate fault prompt information indicating that the fault cluster node corresponding to the fault proxy node has failed;
And the output unit is configured to output the fault prompt information.
In other embodiments of the present application, the first cluster management device further includes an obtaining unit, where:
The acquisition unit is used for acquiring the user characteristic configuration information of the first cluster node;
The first sending unit is further configured to send the user feature configuration information to the first proxy node, so that the first proxy node accesses the first cluster node based on the user feature configuration information.
In other embodiments of the present application, the server node includes one or more management nodes, and the load managed by each management node is obtained by distributing the load of the cluster system by adopting a consistent hash algorithm.
It should be noted that, in this embodiment, the information interaction process between the units and the modules may refer to the information interaction process described in the foregoing method embodiment, which is not described herein again.
According to the first cluster management device provided by the embodiment of the application, after the server node receives the online event information reported by the first proxy node, the working role of the first proxy node is determined based on the online event information, and the first indication information is sent to the first proxy node, so that the first proxy node receives the first indication information sent by the server node and invokes the resource information of the cluster node based on the first indication information to realize the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are realized through control of the proxy node, which solves the problem of poor universality of existing HA cluster management schemes and provides a reliable cluster management method, thereby realizing an HA cluster management method that can be adopted for different cluster types and topological structures in different environments, achieving universality and reducing the complexity of the system.
Based on the foregoing embodiments, the embodiment of the present application provides a second cluster management device 5, where the second cluster management device is applied to a first proxy node, and the second cluster management device 5 may be applied to the embodiments corresponding to fig. 2 to 7, and referring to fig. 14, the second cluster management device 5 includes a second determining unit 51, a second sending unit 52, a second receiving unit 53, and a calling unit 54, where:
a second determining unit 51, configured to determine online event information if the first proxy node runs for the first time;
a second sending unit 52, configured to send the online event information to the server node;
A second receiving unit 53, configured to receive first indication information sent by the server node, where the first indication information is generated by the server node based on the online event information;
And the calling unit 54 is configured to call the resource information of the first cluster node to implement a function corresponding to the work role indicated by the first indication information based on the first indication information.
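For orientation only, the agent-side first-run flow carried out by these units could look roughly like the sketch below. The report, wait_for_indication, start_master and start_slave callables, as well as the concrete attribute names, are assumptions standing in for the transport and resource-invocation details left open by the embodiment.

```python
def first_run(node_id, report, wait_for_indication, start_master, start_slave):
    # Second determining unit: collect the attribute feature parameters perceived at runtime.
    online_event = {
        "node_id": node_id,
        "attributes": {"cpu_cores": 8, "mem_gb": 32, "service": "postgres"},
    }
    report(online_event)                # second sending unit: report the online event
    indication = wait_for_indication()  # second receiving unit: wait for the first indication
    # Calling unit: invoke the cluster node's resources for the indicated working role.
    if indication.get("role") == "MASTER":
        start_master()
    else:
        start_slave()
```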
In other embodiments of the present application, the second cluster management device further includes a configuration unit, where:
the second receiving unit is further used for receiving user characteristic configuration information of the first cluster node sent by the server node;
and the configuration unit is used for configuring the user characteristic configuration information.
In other embodiments of the present application, the second cluster management device further includes a detecting unit and a generating unit, where:
The detection unit is used for carrying out health detection on the first cluster nodes according to a preset period to obtain a detection result;
the generating unit is used for generating fault notification information of faults of the first cluster node if the detection result is that the continuous preset times do not pass the health detection;
And the second sending unit is also used for sending the fault notification information to the server node.
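The periodic health detection can be sketched as below; the check and notify_server callables, the 5-second period and the threshold of 3 consecutive misses are illustrative assumptions standing in for the preset period and the preset number of times.

```python
import time


def health_loop(check, notify_server, period=5.0, max_failures=3):
    consecutive_failures = 0
    while True:
        if check():  # e.g. a TCP probe or a service-level query against the first cluster node
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                # Preset number of consecutive failed detections reached: report the failure.
                notify_server({"event": "node_failure", "source": "health_check"})
                consecutive_failures = 0  # reset to avoid re-reporting on every period
        time.sleep(period)
```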
In other embodiments of the present application, the generating unit is further configured to generate failure notification information that the second cluster node has a failure if the heartbeat information of the second proxy node is not sensed, where the second cluster node running the second proxy node and the first cluster node belong to the same cluster;
And the second sending unit is also used for sending the fault notification information to the server node.
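Peer monitoring between proxy nodes of the same cluster can be pictured with the sketch below: if no heartbeat from a peer is perceived within a timeout, that peer's cluster node is reported as failed. The last_heartbeat table, the 15-second timeout and the notify_server callable are assumptions for illustration.

```python
import time

last_heartbeat = {}  # peer proxy node id -> monotonic time of the last heartbeat received


def on_heartbeat(peer_id):
    last_heartbeat[peer_id] = time.monotonic()


def check_peers(notify_server, timeout=15.0):
    now = time.monotonic()
    for peer_id, seen in last_heartbeat.items():
        if now - seen > timeout:
            # The second proxy node's heartbeat is no longer perceived: report its cluster node as failed.
            notify_server({"event": "node_failure", "peer": peer_id, "source": "heartbeat"})
```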
It should be noted that, in the embodiment, the information interaction process between the units and the modules may refer to the information interaction process described in the foregoing method embodiment, which is not described herein again.
In the second cluster management device provided by the embodiment of the application, if the first proxy node runs for the first time, it determines online event information and sends the online event information to the server node; the first proxy node then receives the first indication information sent by the server node and, based on the first indication information, invokes the resource information of the first cluster node to realize the function corresponding to the working role indicated by the first indication information. In this way, the proxy node running in the cluster node is managed through the server node, so that control and management of the cluster node are realized through control of the proxy node, which solves the problem of poor universality of existing HA cluster management schemes and provides a reliable cluster management method, thereby realizing an HA cluster management method that can be adopted for different cluster types and topological structures in different environments, achieving universality and reducing the complexity of the system.
Based on the foregoing embodiments, an embodiment of the present application provides a cluster management system 6, where the cluster management system 6 may be applied to the embodiments corresponding to fig. 1 to 4, and referring to fig. 15, the cluster management system 6 includes at least a cluster 61, a server node 62, and one or more proxy nodes 63 including a first proxy node, where:
The server node 62 is configured to implement, for the cluster 61, a cluster management method provided in an embodiment corresponding to fig. 1, 3 to 4 or fig. 2 to 4, which is not described herein again;
The first proxy node is configured to implement, for the cluster 61, a cluster management method provided in the embodiments corresponding to fig. 1, 3-4 or fig. 2-4, which is not described herein again.
Based on the foregoing embodiments, the embodiments of the present application provide a computer readable storage medium, abbreviated as a storage medium, where one or more cluster management programs are stored, and the one or more programs may be executed by one or more processors to implement the cluster management method provided in the embodiments corresponding to fig. 1 to 7 or fig. 2 to 7, which is not described herein again.
Based on the foregoing embodiments, the embodiments of the present application provide a computer readable storage medium, simply referred to as a storage medium, where one or more programs are stored, and the one or more programs may be executed by one or more processors, so as to implement the implementation procedure in the method provided by the embodiments referring to fig. 1 to 7, or fig. 2 to 7, which are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, an air conditioner, a network communication device, or the like) to perform the methods described in the embodiments of the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
Claims (13)
1. A cluster management method, wherein the method is applied to a server node, the method comprising:
The method comprises the steps of receiving online event information reported by a first proxy node, wherein the first proxy node operates in a first cluster node and is used for carrying out node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived when the first proxy node operates;
Determining a working role of the first proxy node based on the online event information, wherein the working role is a master node or a slave node;
and sending first indication information to the first proxy node, wherein the first indication information is used for instructing the first proxy node to execute the operation corresponding to the working role.
2. The method of claim 1, wherein the determining the operational role of the first proxy node based on the online event information comprises:
And calculating the online event information by adopting a finite state machine FSM, and determining the working role of the first proxy node.
3. The method of claim 1, wherein after the sending the first indication information to the first proxy node, the method further comprises:
If fault notification information sent by the first proxy node is received, determining a current working role of a fault proxy node indicated in the fault notification information based on the fault notification information, wherein the fault proxy node is the first proxy node or a second proxy node perceived by the first proxy node, and a second cluster node running the second proxy node and the first cluster node belong to the same cluster;
Based on the current working role, determining a management operation corresponding to the fault proxy node;
and executing the management operation.
4. A method according to claim 3, wherein said determining a corresponding management operation for said failed proxy node based on said current work role comprises:
If the current working role is the master node, calculating, by using a finite state machine FSM, at least one currently managed third proxy node that is in a working state, and determining a fourth proxy node to serve as the master node, wherein the cluster node running the fourth proxy node and the second cluster node belong to the same cluster;
And sending second instruction information to the fourth proxy node, wherein the second instruction information is used for instructing the fourth proxy node to execute a main node function.
5. The method of claim 4, wherein after the calculating, by using a finite state machine FSM, at least one currently managed third proxy node that is in a working state and the determining a fourth proxy node to serve as the master node, the method further comprises:
Based on the network information of the fourth proxy node, the routing configuration of the reverse proxy is modified to subsequently forward access requests sent to the failed proxy node to the fourth proxy node.
6. A method according to claim 3, characterized in that the method further comprises:
Generating fault prompt information of faults of fault cluster nodes corresponding to the fault proxy nodes;
And outputting the fault prompt information.
7. The method according to claim 1, wherein the method further comprises:
Acquiring user characteristic configuration information of the first cluster node;
And sending the user characteristic configuration information to the first proxy node so that the first proxy node accesses the first cluster node based on the user characteristic configuration information.
8. A cluster management method, the method being applied to a first proxy node running in a first cluster node, the method comprising:
If the first proxy node operates for the first time, determining online event information;
sending the online event information to a server node;
receiving first indication information sent by the server node, wherein the first indication information is generated by the server node based on the online event information;
And calling the resource information of the first cluster node to realize the function corresponding to the working role indicated by the first indication information based on the first indication information.
9. The first cluster management device is characterized by being applied to a server node, and comprises a first receiving unit, a first determining unit and a first sending unit, wherein:
The first receiving unit is used for receiving the online event information reported by a first proxy node, wherein the first proxy node operates in a first cluster node and is used for carrying out node management on the first cluster node, and the online event information comprises attribute characteristic parameters of the first cluster node perceived by the first proxy node during operation;
The first determining unit is used for determining the working role of the first proxy node based on the online event information, wherein the working role is a master node or a slave node;
the first sending unit is configured to send first indication information to the first proxy node, where the first indication information is configured to instruct the first proxy node to execute an operation corresponding to the working role.
10. The second cluster management device is applied to a first proxy node, and is characterized by comprising a second determining unit, a second sending unit, a second receiving unit and a calling unit, wherein:
The second determining unit is configured to determine online event information if the first proxy node runs for the first time;
the second sending unit is configured to send the online event information to a server node;
The second receiving unit is used for receiving first indication information sent by the server node, wherein the first indication information is generated by the server node based on the online event information;
And the calling unit is used for calling the resource information of the first cluster node based on the first indication information to realize the function corresponding to the working role indicated by the first indication information.
11. A cluster management system is characterized in that the system at least comprises a cluster, a server node and one or more proxy nodes comprising a first proxy node, wherein:
the server node being configured to implement the method for cluster management according to any one of claims 1 to 7;
the first proxy node is configured to implement the steps of the cluster management method according to claim 8.
12. A storage medium having stored thereon a cluster management program which when executed by a processor implements the steps of the cluster management method according to any of claims 1 to 7, or claim 8.
13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the cluster management method of any of claims 1 to 7, or of claim 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411793243.5A CN119814800A (en) | 2024-12-05 | 2024-12-05 | A cluster management method, device, system, storage medium and computer program product |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411793243.5A CN119814800A (en) | 2024-12-05 | 2024-12-05 | A cluster management method, device, system, storage medium and computer program product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119814800A true CN119814800A (en) | 2025-04-11 |
Family
ID=95271245
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411793243.5A Pending CN119814800A (en) | 2024-12-05 | 2024-12-05 | A cluster management method, device, system, storage medium and computer program product |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119814800A (en) |
- 2024-12-05 CN CN202411793243.5A patent/CN119814800A/en active Pending
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109495312B (en) | Implementation method and system of high availability cluster based on arbitration disk and dual link | |
| US8661287B2 (en) | Automatically performing failover operations with a load balancer | |
| CN108270726B (en) | Application instance deployment method and device | |
| US7093013B1 (en) | High availability system for network elements | |
| CN107147540A (en) | Fault Handling Method and Fault Handling Cluster in High Availability System | |
| US8112518B2 (en) | Redundant systems management frameworks for network environments | |
| US9231779B2 (en) | Redundant automation system | |
| EP1323040A2 (en) | A system and method for managing clusters containing multiple nodes | |
| CN105095001A (en) | Virtual machine exception recovery method under distributed environment | |
| CN108173911A (en) | A microservice fault detection and processing method and device | |
| CN101237413B (en) | Method for Realizing High Availability of Control Components under the Architecture of Separating Forwarding and Control Components | |
| CN114116912A (en) | Method for realizing high availability of database based on Keepalived | |
| CN108347339A (en) | A kind of service restoration method and device | |
| CN111813605A (en) | Disaster recovery method, platform, electronic device and medium | |
| CN115549751A (en) | Remote sensing satellite ground station monitoring system and method | |
| JP5285045B2 (en) | Failure recovery method, server and program in virtual environment | |
| WO2019216210A1 (en) | Service continuation system and service continuation method | |
| JP5285044B2 (en) | Cluster system recovery method, server, and program | |
| CN119814800A (en) | A cluster management method, device, system, storage medium and computer program product | |
| CN120086072A (en) | Baseboard management function failure emergency system, method, server and storage medium | |
| Lee et al. | Fault localization in NFV framework | |
| US8595349B1 (en) | Method and apparatus for passive process monitoring | |
| CN116473311A (en) | Redis master-slave instance switching method, device, storage medium, computer equipment | |
| US12212459B2 (en) | Method to recommend failover and reliable connection for remote management of devices | |
| EP4621572A1 (en) | Orchestration device for a distributed processing system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |