CN108632057A - Fault recovery method, device and management system for a cloud computing server

Info

Publication number: CN108632057A
Application number: CN201710160761.7A
Authority: CN
Inventors: 欧亚聪
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-03-17
Filing date: 2017-03-17
Publication date: 2018-10-09

Abstract

The embodiments of the invention disclose a fault recovery method, device and management system for a cloud computing server. The method includes: obtaining hardware resource fault information sent by an IaaS management platform, obtaining operating system fault information of the cloud computing server, and obtaining application fault information of the cloud computing server; determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information; determining a fault handling strategy according to the root cause; and performing fault recovery according to the operation indicated by the fault handling strategy. Implementing the embodiments of the invention provides a high-reliability guarantee for an enterprise's legacy systems on a cloud computing platform and helps keep those legacy systems running reliably and stably.

Description

A cloud computing server failure recovery method, device and management system

Technical field

The invention relates to the technical field of cloud computing, and in particular to a fault recovery method, device and management system for a cloud computing server.

Background

Cloud computing is an emerging commercial computing model that distributes computing tasks across a resource pool made up of a large number of computers, so that application systems can obtain computing power, storage space and software services as needed. To obtain the benefits of cloud computing, including lower operation and maintenance complexity and reduced hardware costs, more and more enterprises choose to migrate their traditional IT systems onto cloud resource pools, so that the whole IT system can use cloud services for unified operation and maintenance. The operating environment of these IT systems changes greatly as a result. Because a cloud computing platform is less reliable than dedicated servers, a cloud deployment must fully consider how to keep the system running reliably when some computing resources fail. On a cloud computing platform, computing resources are allocated on demand from a resource pool; when a computing resource fails, the system has to wait for the cloud to reschedule and reallocate computing resources, for example triggered through elastic scaling.

In the prior art, to fit the cloud computing architecture, a traditional IT system that is to retain a high availability (HA) guarantee after migration to a cloud resource pool is usually required to be a cloud-ready system. A cloud-ready system should, first, be a distributed system with a high degree of cohesion and transparency; second, it should be redundant, able to handle server failures, and free of single points of failure.

However, an enterprise's internal IT estate often also contains legacy systems that lack these characteristics. Such legacy systems are built as siloed (chimney-style) vertical systems whose architecture does not fully account for dynamic resource allocation or resource failure in a cloud environment, so they are not cloud-ready. From the standpoint of architectural compatibility, vertical systems and distributed systems do not fit together, and cloud computing is currently designed mainly for distributed systems, so the HA solutions commonly offered by cloud computing platforms do not apply to enterprise legacy systems. When an enterprise migrates its entire IT system, legacy systems included, onto cloud resource pools, the legacy systems are merely redeployed onto cloud-allocated computing resources and do not gain the reliability guarantees of cloud computing; for example, they cannot scale elastically or have resources allocated on demand. They therefore face major reliability challenges.

Summary of the invention

Embodiments of the present invention provide a fault recovery method, device and management system for a cloud computing server, to solve the reliability problem that arises after a legacy system is migrated to the cloud.

In a first aspect, an embodiment of the present invention provides a fault recovery method for a cloud computing server, applied to a cloud computing server, including:

A PaaS management platform obtains hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and detects hardware resource fault information for those hardware resources, and the IaaS management platform is independent of the cloud computing server; obtains operating system fault information of the cloud computing server, where the operating system fault information indicates a fault in the operating system installed on the cloud computing server; and obtains application fault information of the cloud computing server, where the application fault information indicates a fault in an application installed on the operating system;

The PaaS management platform then determines the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information; determines a fault handling strategy according to the root cause; and performs fault recovery according to the operation indicated by the fault handling strategy.

The first aspect describes, from the side of the PaaS management platform, the fault recovery method for a cloud computing server provided by the embodiments of the present invention. By implementing this method, the PaaS management platform can comprehensively detect faults in the hardware resource layer, the operating system layer and the application layer of the cloud computing server, analyze those faults together to determine the root cause, and apply the corresponding fault handling strategy to recover. In the embodiments of the present invention, after an enterprise's legacy system has been migrated to the cloud computing server, the PaaS management platform provides an HA solution for the legacy system. When the cloud computing server fails, the PaaS management platform can accurately determine whether the fault occurred in the hardware resource layer, the operating system layer or the application layer, and perform the recovery appropriate to that layer. The HA solution provided by the embodiments of the present invention is therefore comprehensive.

With reference to the first aspect, in some possible implementations, the operating system further has a first agent application;

Obtaining the operating system fault information of the cloud computing server includes: determining the operating system fault information by detecting heartbeat information of the first agent application, where the heartbeat information indicates whether the operating system has failed.

That is, the PaaS management platform installs a first agent application (Agent) on the operating system of every virtual machine on the cloud computing server on which applications (including application programs, application systems, enterprise IT systems and so on) are deployed, and the first agent application exchanges heartbeats with the PaaS management platform. The PaaS management platform monitors the heartbeat of each first agent application; when the heartbeat of a first agent application disappears, the corresponding virtual machine (operating system) has suffered a disconnection fault, and the PaaS management platform obtains the operating system fault information accordingly.
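The heartbeat check described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `HeartbeatMonitor`, the 30-second timeout and the VM identifiers are all assumptions, since the text does not specify a heartbeat interval or message format.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from the agent on each virtual machine.

    Hypothetical helper: the name and the timeout value are assumptions.
    """

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # vm_id -> time of the last heartbeat

    def record(self, vm_id):
        """Called whenever a heartbeat arrives from the agent on vm_id."""
        self.last_seen[vm_id] = self.clock()

    def failed_vms(self):
        """Return the VMs whose agent has been silent longer than the timeout.

        A missing heartbeat is treated as an operating system fault.
        """
        now = self.clock()
        return sorted(
            vm_id
            for vm_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Passing the clock in as a parameter is only a convenience for testing; a real monitor would also need to decide how to treat agents that have never reported.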

With reference to the first aspect, in some possible implementations, the operating system further has a second agent application;

Obtaining the application fault information of the cloud computing server includes: invoking the application's status detection script through the second agent application, and determining the application fault information according to the return value of the status detection script.

The second agent application and the first agent application may be the same agent application or different agent applications.

The second agent application is likewise deployed at the application layer. It can be used to manage the applications on the cloud computing server and periodically monitors the running state of the applications on the virtual machine; for example, it monitors the running state through a status detection script provided by the application (application system). In a specific application scenario, the application (application system) dynamically provides a status detection script while it runs; the second agent application periodically invokes status.sh in the installation directory and obtains the return value. The second agent application judges the running state of the application (application system) from the return value of the script and sends that state to the PaaS. When it determines that the application has failed, the second agent application generates application fault information and sends it to the PaaS.
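A rough sketch of how an agent might invoke the status detection script and turn its return value into application fault information is shown below. The text does not define the meaning of the return values, so the convention that exit code 0 means healthy is an assumption, as are the function name `check_application`, the fault record fields and the example path.

```python
import subprocess


def check_application(command, timeout_s=10):
    """Run the application's status script and interpret its exit code.

    Returns None when the application looks healthy, otherwise a dict that
    stands in for the application fault information.
    Assumption: exit code 0 means healthy; the source does not say.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The script is missing, not executable, or hung.
        return {"layer": "application",
                "reason": f"status check could not complete: {exc}"}
    if result.returncode == 0:
        return None
    return {"layer": "application",
            "reason": f"status script exited with {result.returncode}"}


# Hypothetical usage; the installation path is illustrative only:
# fault = check_application(["/opt/legacy-app/status.sh"])
```

An agent would call this on a timer and forward any non-None result to the PaaS management platform.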

When the second agent application and the first agent application are the same agent application, the PaaS management platform can use that single agent application to monitor both the running state of the virtual machine (operating system) and the running state of the application, so the HA solution provided by the embodiments of the present invention can be deployed quickly and conveniently.

With reference to the first aspect, in some possible implementations, determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least:

when both hardware resource fault information and operating system fault information are detected within a preset time, determining that the root cause is a fault in the hardware resources; or when operating system fault information and application fault information are detected within the preset time and no hardware resource fault information is detected, determining that the root cause is a fault in the operating system; or when only application fault information is detected within the preset time, determining that the root cause is a fault in the application.

When the cloud computing server fails, the fault is no longer handled directly at the layer where it occurs; instead, fault information is aggregated at the PaaS management platform. Because the PaaS management platform holds state information for every layer of the cloud computing server, it has a global view. It therefore analyzes all fault information obtained within the preset time (for example, 3 minutes) together and accurately determines the root cause of the fault on the cloud computing server. This prevents false alarms and missed alarms, so the HA solution provided by the embodiments of the present invention is accurate.
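The three root-cause rules of this implementation can be written as a small decision function. This is a sketch under stated assumptions: the `Layer` enum and function name are illustrative, and combinations the text does not cover (for example, operating system fault information alone) return `None` rather than a guessed answer.

```python
from enum import Enum


class Layer(Enum):
    HARDWARE = "hardware"
    OS = "os"
    APPLICATION = "application"


def determine_root_cause(layers_seen):
    """Map the set of layers that reported faults in the preset window
    to a root-cause layer, following the three rules in the text."""
    hw = Layer.HARDWARE in layers_seen
    os_fault = Layer.OS in layers_seen
    app = Layer.APPLICATION in layers_seen

    if hw and os_fault:
        return Layer.HARDWARE        # hardware + OS faults seen
    if os_fault and app and not hw:
        return Layer.OS              # OS + application, no hardware
    if app and not hw and not os_fault:
        return Layer.APPLICATION     # application only
    return None                      # combination not covered by the text
```

The ordering reflects the idea that a lower-layer fault explains the faults reported above it, so the lowest layer consistent with the evidence is chosen.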

With reference to the first aspect, in some possible implementations, determining the fault handling strategy according to the root cause includes:

when the root cause is a fault in the hardware resources, the fault handling strategy includes restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or when the root cause is a fault in the operating system, the fault handling strategy includes at least restarting the virtual machine; or when the root cause is a fault in the application, the fault handling strategy includes at least restarting the virtual machine and restarting the application.

In one specific application scenario, the PaaS management platform determines the specific fault handling strategy based on the type of fault. For example, a fault diagnosis database can be preset in the PaaS. The database stores many kinds of fault information, and fault information belonging to the same layer is assigned different fault levels, such as fault level one, fault level two, fault level three and so on. For faults in the hardware resource layer, for example, the preset fault handling strategy for fault level one is restarting the virtual machine, for fault level two it is rebuilding the virtual machine locally, for fault level three it is migrating the virtual machine, and so on. After determining the root cause, the PaaS analyzes the hardware resource layer fault actually obtained, determines its fault level, and determines the fault handling strategy from that level.
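The level-based lookup in this scenario might be modelled with two small tables, as sketched below. The level-to-strategy mapping for the hardware resource layer comes from the text; the fault names and the levels assigned to them are invented purely for illustration, because the source does not list the contents of the fault diagnosis database.

```python
# Hypothetical fault names and levels. The source only says that faults in
# the same layer are assigned level one, level two, level three and so on.
FAULT_DIAGNOSIS_DB = {
    "hardware": {
        "nic_link_down": 1,
        "local_disk_error": 2,
        "host_power_failure": 3,
    },
}

# Level-to-strategy mapping for the hardware resource layer, as in the text.
HARDWARE_STRATEGY_BY_LEVEL = {
    1: "restart_vm",
    2: "rebuild_vm_locally",
    3: "migrate_vm",
}


def strategy_for_hardware_fault(fault_name):
    """Look up the fault level, then the strategy preset for that level.

    Returns None for a fault that is not in the (illustrative) database.
    """
    level = FAULT_DIAGNOSIS_DB["hardware"].get(fault_name)
    if level is None:
        return None
    return HARDWARE_STRATEGY_BY_LEVEL.get(level)
```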

In another specific application scenario, the PaaS assigns different priorities in advance to the different fault handling strategies of the same layer. Once it has first determined, from the fault information received, that the root cause is a fault in a given layer, it automatically selects that layer's highest-priority fault handling strategy as the one to execute. If the higher-priority strategy fails to achieve recovery, the PaaS selects the next lower-priority strategy and repeats the steps above.

For example, when the root cause is a fault in the hardware resources, the fault handling strategy is restarting the virtual machine; if restarting the virtual machine does not recover the hardware resource fault, the fault handling strategy is rebuilding the virtual machine locally; if neither restarting nor locally rebuilding the virtual machine recovers the hardware resource fault, the fault handling strategy is migrating the virtual machine.

As another example, when the root cause is a fault in the application, the fault handling strategy is restarting the application; if restarting the application does not recover the application fault, the fault handling strategy is restarting the virtual machine.

The embodiments of the present invention thus provide multiple recovery measures for different root causes; when one measure fails to achieve recovery, the remaining measures are tried in turn. This lets the cloud computing server recover from a fault as quickly as possible and so safeguards the high availability of the cloud computing platform.
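The priority-based escalation described in this scenario amounts to trying each strategy in order until one succeeds. The sketch below assumes two caller-supplied callbacks, `execute` and `is_recovered`, which are hypothetical stand-ins for the real recovery action and the follow-up health check; the strategy orderings for the hardware and application layers follow the examples in the text, and the single operating system strategy reflects the "at least restarting the virtual machine" wording.

```python
# Strategies per root-cause layer, highest priority first.
STRATEGIES_BY_PRIORITY = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm"],
    "os": ["restart_vm"],
    "application": ["restart_application", "restart_vm"],
}


def recover(root_cause, execute, is_recovered):
    """Try each strategy for the layer in priority order.

    execute(strategy) performs the action; is_recovered() reports whether
    the fault has cleared. Both are hypothetical callbacks.
    Returns (successful_strategy_or_None, list_of_strategies_attempted).
    """
    attempted = []
    for strategy in STRATEGIES_BY_PRIORITY.get(root_cause, []):
        attempted.append(strategy)
        execute(strategy)
        if is_recovered():
            return strategy, attempted
    # Every strategy was tried without success; the caller should alarm.
    return None, attempted
```

Returning the list of attempted strategies makes it easy to record the recovery history when no strategy succeeds.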

With reference to the first aspect, in some possible implementations, executing the operation indicated by the fault handling strategy includes:

when the root cause is a fault in the hardware resources, executing the operation indicated by the fault handling strategy includes at least: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or when the root cause is a fault in the operating system, executing the operation indicated by the fault handling strategy includes: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or when the root cause is a fault in the application, executing the operation indicated by the fault handling strategy includes: invoking the second agent application to execute the operation indicated by the corresponding fault handling strategy.

When fault recovery is needed, the PaaS management platform initiates it based on the fault handling strategy, which guarantees the recovery capability. The PaaS management platform can call the IaaS management platform or the agent application to perform the recovery and uses different recovery measures for different root causes, so recovery capability and efficiency are guaranteed by the PaaS. In other words, the HA solution provided by the embodiments of the present invention is not constrained by the HA capability of the IaaS: whatever that capability is, the reliability of the applications running on the cloud computing server can be guaranteed. The HA solution provided by the embodiments of the present invention is therefore general-purpose.
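The routing rule in this implementation (hardware and operating system faults go to the IaaS management platform interface, application faults go to the second agent application) can be sketched as a simple dispatcher. The callables `iaas_call` and `agent_call` are hypothetical placeholders; the text does not describe the actual IaaS interface or agent protocol.

```python
def dispatch(root_cause, strategy, iaas_call, agent_call):
    """Send a recovery strategy to the component that should carry it out.

    iaas_call and agent_call are hypothetical callables standing in for the
    IaaS management platform interface and the second agent application.
    """
    if root_cause in ("hardware", "os"):
        return iaas_call(strategy)
    if root_cause == "application":
        return agent_call(strategy)
    raise ValueError(f"unknown root cause: {root_cause!r}")
```

Keeping the routing in the PaaS layer is what lets the scheme work regardless of the HA features the underlying IaaS offers.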

With reference to the first aspect, in some possible implementations, executing the operation indicated by the fault handling strategy further includes:

generating a fault log based on the fault information, archiving the fault log, and reporting the fault log to a network management system, where the fault information includes the hardware resource fault information, the operating system fault information and the application fault information. The fault log records information such as the time, location and type of the fault and the fault recovery history.

When none of the PaaS management platform's fault handling strategies and corresponding recovery operations achieves recovery, the PaaS management platform raises an alarm to the network management system and reports the fault log, so that operation and maintenance personnel can discover the fault promptly through the network management system and carry out manual maintenance. This avoids downtime of the cloud computing server caused by unrecovered faults and safeguards the high availability of the cloud computing platform.
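A fault log carrying the four kinds of information named above (time, location, fault type, recovery history) could be represented as follows. The field names, the JSON serialization and the sample values are assumptions made for illustration; the text does not prescribe a log format or a reporting protocol.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FaultLog:
    """Illustrative fault log record; field names are assumptions."""
    occurred_at: str                 # time of the fault, e.g. ISO 8601
    location: str                    # where the fault occurred, e.g. a VM id
    fault_type: str                  # "hardware", "os" or "application"
    recovery_history: list = field(default_factory=list)

    def add_attempt(self, strategy, recovered):
        """Append one recovery attempt to the history."""
        self.recovery_history.append(
            {"strategy": strategy, "recovered": recovered}
        )

    def to_json(self):
        """Serialize for archiving or for reporting to a management system."""
        return json.dumps(asdict(self), sort_keys=True)
```

A log in which every attempt has `recovered` set to false is the case in which the platform would raise an alarm for manual maintenance.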

In a second aspect, an embodiment of the present invention provides a device for implementing fault recovery of a cloud computing server, including a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, to execute the fault recovery method for a cloud computing server provided in the first aspect, where:

The fault detection module is configured to obtain hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and detects hardware resource fault information for those hardware resources, and the IaaS management platform is independent of the cloud computing server; it is further configured to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a fault in the operating system installed on the cloud computing server; and it is further configured to obtain application fault information of the cloud computing server, where the application fault information indicates a fault in an application installed on the operating system;

The fault analysis module is configured to determine the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;

The fault strategy module is configured to determine a fault handling strategy according to the root cause;

The fault recovery module is configured to perform fault recovery according to the operation indicated by the fault handling strategy.

In a third aspect, an embodiment of the present invention provides another device (server) for implementing fault recovery of a cloud computing server, including a memory and, coupled to the memory, a processor, a transmitter and a receiver, where the transmitter is configured to send instruction data to external parties, the receiver is configured to receive externally sent data, the memory is configured to store program code and related data, and the processor is configured to execute the program code stored in the memory to perform a fault recovery method for a cloud computing server, the method being the method described in the first aspect.

In a fourth aspect, an embodiment of the present invention provides a management system including an IaaS management platform, a PaaS management platform and a SaaS service platform. The PaaS management platform includes a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, and the SaaS service platform includes an agent application. The modules of the PaaS management platform are connected to the IaaS management platform through a first interface (IF), which is a periodic communication interface, and to the SaaS service platform through a second IF. The management system is used to implement the fault recovery method for a cloud computing server described in the first aspect.

In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores instructions (implementation code) which, when run on a computer, cause the computer to execute the method described in the first aspect.

In a seventh aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the method described in the first aspect.

By implementing the embodiments of the present invention, after an enterprise migrates a legacy system to a cloud computing server on a cloud computing platform, the PaaS can monitor faults in the hardware resource layer through the IaaS and can monitor the running state of the operating system and of the legacy system through the agent application. When the PaaS obtains fault information, it continues to collect other fault information for a preset time (for example, 2 minutes). When the preset time ends, it analyzes all the collected fault information together to determine the root cause, determines the specific fault handling strategy from the root cause, and then calls the IaaS or the agent application to perform the corresponding fault recovery and raise fault alarms. This ensures the high availability of the legacy system on the cloud computing platform. The HA solution of the embodiments of the present invention is comprehensive, accurate and general-purpose.
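The collection window mentioned above (keep gathering fault information for a preset time after the first report, then analyze everything together) can be sketched as below. The event representation as `(timestamp_seconds, layer)` tuples and the function name are assumptions; the 120-second default mirrors the 2-minute example in the text.

```python
def collect_window(events, window_s=120.0):
    """Split fault events into those inside the first collection window
    and those that arrive later.

    events: iterable of (timestamp_seconds, layer) tuples (assumed format).
    The window opens at the earliest event and lasts window_s seconds.
    Returns (set_of_layers_in_window, list_of_remaining_events).
    """
    ordered = sorted(events)
    if not ordered:
        return set(), []
    deadline = ordered[0][0] + window_s
    in_window = {layer for ts, layer in ordered if ts <= deadline}
    remaining = [(ts, layer) for ts, layer in ordered if ts > deadline]
    return in_window, remaining
```

The set of layers returned for the window is the input a root-cause rule would work from; events after the deadline would start a new window.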

Description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the background more clearly, the drawings used in the embodiments or in the background are described below.

FIG. 1 is a schematic diagram of a cloud computing platform architecture in the prior art;

FIG. 2 is a schematic flowchart of a fault recovery method for a cloud computing server provided by an embodiment of the present invention;

FIG. 3 is a schematic flowchart of another fault recovery method for a cloud computing server provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a PaaS comprehensively detecting faults of a cloud computing server, provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a PaaS judging whether a cloud computing server has failed, provided by an embodiment of the present invention;

FIG. 6 is a schematic flowchart of a PaaS selecting a fault handling strategy based on priority, provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of a device for implementing fault recovery of a cloud computing server provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of another device for implementing fault recovery of a cloud computing server provided by an embodiment of the present invention;

FIG. 9 shows a management system provided by an embodiment of the present invention;

FIG. 10 shows another management system provided by an embodiment of the present invention.

Detailed description

The embodiments of the present invention are described below with reference to the accompanying drawings.

In today's cloud era of rapidly developing Internet and big data technology, cloud computing has gradually become the mainstream computing paradigm for new information systems. Cloud computing is the product of the convergence of a range of network and computing technologies, including parallel computing, distributed computing, utility computing and virtualization. Referring to FIG. 1, a schematic diagram of a cloud computing platform architecture in the prior art, cloud computing platforms are usually divided by service layer into three service models: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). PaaS and IaaS can serve platform users directly through a service-oriented architecture (SOA) or a web server, or can serve end users indirectly as the supporting platform for the SaaS model. Specifically:

In the standalone IaaS (I-layer) service model, the I layer provides virtual machines and the operating systems (OS) running on them, together with virtualized server computing, virtual storage, and virtual network resources. Users are typically concerned with the virtual machine type and its configuration (CPU, memory, disk, network, and so on); the middleware, runtime, and applications above the virtual machine's operating system are all deployed by the users themselves. The service that IaaS offers consumers is the use of all facilities, including processing, storage, networking, and other basic computing resources, on which users can deploy and run arbitrary software. Consumers do not manage or control the underlying cloud computing infrastructure, but they control the choice of operating system, the storage space, and the deployed applications, and may also obtain limited control over selected network components.

In the standalone PaaS (P-layer) service model, the P layer deploys the development languages and development tools that users need onto the cloud computing infrastructure and provides users with an application runtime environment, middleware services, lifecycle management, and so on. Customers do not need to manage or control the underlying cloud infrastructure, including the network, servers, operating systems, storage, middleware, and runtime, but they can monitor the deployed applications (applications, application systems, and so on) and may also control the configuration of the hosting environment in which the applications run. Users generally focus only on developing application software and on deploying the related data and applications in PaaS.

In the standalone SaaS (S-layer) service model, the S layer provides users with application (application, application system) services running on the cloud computing infrastructure. Users can access them from various devices through a client interface such as a browser, and thereby directly obtain the application services provided by the S layer. Users do not need to manage or control any cloud computing infrastructure, including the network, servers, operating systems, storage, development environment, and applications.

In the prior art, computing resources in a cloud computing platform are quite likely to fail. Cloud computing therefore needs to ensure that the application systems it runs have high availability (HA). High availability means that a system is highly reliable: it rarely fails, or it recovers quickly after a failure. In other words, when an application system in the cloud computing platform suffers a computing resource failure or another fault, a corresponding HA mechanism must restore the application system as soon as possible and shorten the downtime caused by planned routine maintenance or unplanned system crashes, thereby avoiding service interruption and improving the availability of the application system. A cloud management platform is usually used to provide the HA solution for a cloud computing platform. At present, the most popular cloud management platform is OpenStack, an open-source project that aims to provide software for building and managing public and private clouds. Organizations and individuals in the OpenStack community use OpenStack as a common front end for IaaS resources. OpenStack has become the de facto deployment standard for IaaS cloud platforms in industry and academia, and it is widely used across industries. Its primary goal is to simplify cloud deployment and provide good scalability. From the OpenStack point of view, IaaS is the supporting infrastructure of cloud computing in a cloud computing platform: it provides elastic, scalable infrastructure services and can supply large-scale, on-demand computing, storage, and network services to upper-layer applications. The network service of an IaaS cloud platform, as its most central service, is key to the quality of service of all kinds of cloud applications. OpenStack is therefore deployed on top of the underlying physical computing, storage, and network resources of the cloud platform to manage these resources in a unified way and to provide unified cloud infrastructure services at the IaaS layer.

OpenStack has proposed a virtual machine (VM) HA solution that focuses on fault monitoring and fault recovery at the infrastructure layer. The OpenStack HA solution mainly includes: (1) monitoring, which detects faults at the virtualization layer and faults of compute nodes; (2) fencing, which isolates failed compute nodes; and (3) recovery, which restores the failed virtual machines.

As the above description shows, the existing OpenStack architecture is designed to handle only faults at the IaaS layer, and achieves HA for the cloud computing platform by ensuring HA at the IaaS layer. For the operating system and application levels above it, OpenStack holds that faults should be resolved by the application itself, so the existing OpenStack HA solution cannot detect faults at the upper operating system or application level. In fact, the OpenStack community has so far not produced a complete virtual machine HA solution.

Moreover, for siloed ("chimney-style") legacy systems, once an enterprise migrates a legacy system to a cloud computing platform, faults related to the legacy system may occur at the infrastructure layer, the OS layer, and the application layer, because the architecture of the legacy system is difficult to reconcile with that of the cloud computing platform; the OpenStack HA solution cannot fully solve the HA problem at the application level. In addition, the OpenStack HA solution obtains only infrastructure-layer fault information and does not perform a comprehensive analysis that takes the application state into account, so it is prone to misjudgment, which can create new faults. Furthermore, the OpenStack HA solution requires the IaaS to be capable of automatic fault recovery. However, the VM HA capabilities of different IaaS management platforms are inconsistent, and some IaaS management platforms lack automatic fault recovery altogether. The OpenStack HA solution is therefore not universally applicable to all IaaS management platforms.

To overcome these shortcomings of the prior art, embodiments of the present invention provide a fault recovery method for a cloud computing server, a related apparatus, and a management system. They establish a multi-level, comprehensive fault detection and handling mechanism spanning the IaaS layer, the OS layer, and the application layer, solve the problem of how to keep an enterprise's aging legacy system running reliably after it is migrated to a cloud computing platform, and ensure the reliability of the application (legacy system) to the greatest extent.

Referring to FIG. 9, which shows a management system according to an embodiment of the present invention, the management system is formed by connecting a SaaS service platform (hereinafter SaaS), a PaaS management platform (hereinafter PaaS), and an IaaS management platform (hereinafter IaaS). The management system provides corresponding management services for the different layers (I layer, P layer, and S layer). In a specific implementation, the IaaS management platform, the PaaS management platform, and the SaaS service platform may each run on a different server, or they may run on the same server.

Specifically, the IaaS management platform may be a cloud computing infrastructure platform offered to private, public, or hybrid IaaS cloud users. It can centrally manage very large numbers of servers, storage resources, and network resources to form a cloud computing resource pool that is managed and scheduled in a unified way, and it provides users with computing capacity that can be used on demand and scheduled elastically. The IaaS management platform can integrate multiple virtualization technologies at the same time and provides unified resource management, scheduling, and monitoring for I-layer hardware resources. It manages the full lifecycle of virtual machines, storage resources, and network resources, from creation through detection to destruction; provides I-layer virtual machines with a series of high-availability guarantees such as rapid creation, elastic scaling, local rebuild, and live migration; and provides operating system support for the virtual machines.

Specifically, the PaaS management platform may be built on top of the IaaS management platform; that is, the IaaS management platform manages the local hardware resources directly. When the PaaS management platform needs to obtain or use information about local hardware resources, it can request it from, or call, the IaaS management platform directly. In addition, the PaaS management platform provides end-to-end monitoring and management of applications and related resources, routes requests to valid application instances, and relies on components such as an agent application, a cloud controller, and a health manager to manage and monitor the state and operating parameters of operating systems, applications, and related services.

Specifically, in the embodiments of the present invention, the SaaS service platform may be built on the infrastructure formed by the PaaS management platform and the IaaS management platform. The SaaS service platform focuses only on providing application services to cloud service operators or enterprises; the applications and application systems in the SaaS service platform are managed, monitored, and controlled by the PaaS management platform.

The management system provides an HA solution for the cloud computing platform (cloud computing server). PaaS is at the core of this HA solution: it has a global view of the information and coordinates fault detection, analysis, policy selection, and recovery.

First, when a fault occurs at some layer of the cloud computing server, the fault is no longer handled directly within that layer but is instead aggregated at PaaS, which analyzes and judges it comprehensively. The fault detection scope covers the hardware resource level, the operating system level, and the application level of the cloud computing server, so the HA solution provided by the embodiments of the present invention is comprehensive.

Second, faults from the I layer, P layer, and S layer are all aggregated at PaaS for handling. PaaS performs root cause analysis that combines the states of the hardware resources, the operating system, and the application, and accurately determines the root cause of the fault, which prevents false alarms and missed alarms. The HA solution provided by the embodiments of the present invention is therefore accurate.

Third, PaaS initiates fault recovery according to a fault handling policy and applies different recovery measures for different root causes, so the capability and efficiency of fault recovery are guaranteed by PaaS. In other words, the HA solution provided by the embodiments of the present invention is not constrained by the HA capability of the IaaS: whatever that capability is, the reliability of the applications running on the cloud computing server can be ensured. The HA solution provided by the embodiments of the present invention is therefore universal.

An embodiment of the present invention further provides a fault recovery method for a cloud computing server. Referring to FIG. 2, the method includes:

Step S101: Obtain hardware resource fault information, operating system fault information, and application fault information of the cloud computing server.

In the embodiments of the present invention, PaaS is designed as the fault management core of the entire cloud computing platform (cloud computing server). PaaS sits between IaaS and SaaS; it can collect the cloud service data that PaaS itself manages as well as the data submitted by IaaS and SaaS. The PaaS is independent of the cloud computing server.

In the embodiments of the present invention, PaaS obtains fault information of the cloud computing server. The fault information specifically includes hardware resource fault information, operating system fault information, and application fault information. The hardware resource fault information indicates faults at the hardware resource level, such as insufficient storage resources, network anomalies, or virtual machine operation faults. The operating system fault information indicates faults at the operating system (OS) level, such as operating system login anomalies or system hangs. The application fault information indicates faults of an application, such as an application abort or application system anomalies.
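The three kinds of fault information described above can be modelled as a small record type. The following is a minimal sketch only; the names `FaultLayer`, `FaultInfo`, and `server_id` are illustrative assumptions and do not appear in the embodiments.

```python
from dataclasses import dataclass, field
from enum import Enum
import time


class FaultLayer(Enum):
    """Layer at which a fault was observed (names are illustrative)."""
    HARDWARE = "hardware"          # e.g. insufficient storage, network anomaly, VM fault
    OPERATING_SYSTEM = "os"        # e.g. login anomaly, system hang
    APPLICATION = "application"    # e.g. application abort


@dataclass(frozen=True)
class FaultInfo:
    """One piece of fault information as collected by PaaS."""
    layer: FaultLayer
    server_id: str                 # hypothetical identifier of the cloud computing server
    description: str
    timestamp: float = field(default_factory=time.time)


# Example record for a hardware resource fault reported by IaaS.
fault = FaultInfo(FaultLayer.HARDWARE, "server-01", "network anomaly")
```

Tagging each record with its layer lets a later analysis step decide which combination of layers reported faults within the same time window.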

Specifically, obtaining the fault information of the cloud computing server by PaaS includes the following steps S201 to S203:

Step S201: Obtain the hardware resource fault information sent by the Infrastructure as a Service (IaaS) system.

In the embodiments of the present invention, IaaS manages hardware resources, including computing, storage, and network resources, and also detects faults arising in the hardware resources of the cloud computing server. The IaaS is independent of the cloud computing server.

In the embodiments of the present invention, IaaS can monitor local hardware resources in real time and dynamically present the running state of computing resources, storage resources, network resources, and the related virtual machines. Specifically, IaaS can perform resource capacity queries, resource usage control, VM running state monitoring, fault alarming, and so on, and reports the related information to PaaS.

In a specific embodiment, when a virtual machine (VM) running on the cloud computing server fails, or the related hardware (CPU, memory, disk, network, and so on) of the cloud computing server fails, IaaS detects the fault, generates the corresponding hardware resource fault information in real time, and sends it to PaaS. PaaS thereby obtains the hardware resource fault information.

Step S202: Obtain the operating system fault information of the cloud computing server.

PaaS manages the runtime environment of application software and the middleware services, and provides lifecycle management for the applications to be run. PaaS can obtain state information about the operating system on which the middleware and applications depend. For an operating system running on a virtual machine, when the operating system suffers a fault such as a disconnection or a crash, PaaS can obtain the related operating system fault information. In a specific implementation, PaaS may place an agent in the operating system to be monitored, communicate with the agent, and judge the running state of the OS hosting the agent from the quality of that communication.
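One way to judge OS state from agent communication is a heartbeat timeout. The sketch below is an assumption about how "communication quality" could be evaluated; the embodiments do not specify a protocol, and the class `AgentMonitor`, its methods, and the 30-second default are hypothetical.

```python
import time


class AgentMonitor:
    """Tracks heartbeats from in-guest agents (hypothetical protocol)."""

    def __init__(self, timeout_s: float = 30.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed threshold, not taken from the text
        self._clock = clock          # injectable clock to make the logic testable
        self._last_seen = {}         # vm_id -> time of the last heartbeat

    def record_heartbeat(self, vm_id: str) -> None:
        """Called whenever the agent in the given VM's OS reports in."""
        self._last_seen[vm_id] = self._clock()

    def os_fault_suspected(self, vm_id: str) -> bool:
        """True if the agent has been silent for longer than the timeout."""
        last = self._last_seen.get(vm_id)
        if last is None:
            return False             # never heard from: no detection info, not a fault
        return self._clock() - last > self.timeout_s
```

A VM that has never reported is treated as "no detection information" rather than as faulty, which mirrors the NA state mentioned alongside Table 1.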

Step S203: Obtain the application fault information of the cloud computing server.

In the embodiments of the present invention, SaaS focuses only on providing application (application software, application, application system, and so on) services and does not directly manage or monitor them; the role of managing and monitoring the applications is in fact played by PaaS. When an application in the SaaS fails, PaaS detects the fault in real time and obtains the corresponding application fault information.

It should be noted that steps S201, S202, and S203 need not be performed in any particular order. In specific embodiments, any two of steps S201, S202, and S203 may be performed at the same time, or all three may be performed at the same time. The description of the above embodiments should not be construed as limiting the present invention.

Step S102: Determine the root cause of the fault of the cloud computing server according to the hardware resource fault information, the operating system fault information, and the application fault information.

After obtaining a piece of fault information, PaaS determines its source, starts a timer for it, and continues to check whether further fault information arrives within a preset time (for example, 3 minutes). When the preset time of the timer expires, PaaS performs a comprehensive analysis of all the fault information obtained within that time to determine the root cause of the fault of the cloud computing server, that is, the specific cause of the fault and the specific location at which it occurred.
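The timer-based collection described above can be sketched as a correlation window that opens on the first fault and is evaluated once the preset time has elapsed. This is a minimal illustration; the class `FaultWindow` and its methods are hypothetical, and timestamps are passed in explicitly so the logic does not depend on a real clock.

```python
class FaultWindow:
    """Collects the layers that report faults within a preset time of the first fault."""

    def __init__(self, window_s: float = 180.0):
        self.window_s = window_s     # 3 minutes, the example value given in the text
        self._opened_at = None       # time of the first fault, None while idle
        self._layers = set()

    def report(self, layer: str, now: float) -> bool:
        """Record a fault; return False if it arrived after the window expired."""
        if self._opened_at is None:
            self._opened_at = now    # the first fault starts the timer
        if now - self._opened_at > self.window_s:
            return False
        self._layers.add(layer)
        return True

    def close(self, now: float):
        """Return the collected layers once the timer has expired, else None."""
        if self._opened_at is None or now - self._opened_at < self.window_s:
            return None
        layers, self._opened_at, self._layers = frozenset(self._layers), None, set()
        return layers


window = FaultWindow()
window.report("application", now=0.0)
window.report("os", now=60.0)
collected = window.close(now=180.0)  # frozenset of the layers seen in the window
```

Waiting for the window to close before analysing gives lower-layer faults time to surface, so that an application fault is not handled in isolation when its cause lies in the OS or hardware.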

Referring to Table 1, Table 1 shows the correspondence, in some specific embodiments of the present invention, between the fault information that PaaS obtains within the preset time and the root cause that PaaS determines.

Table 1

Note: √ indicates a healthy state, × indicates that a fault was detected, and NA indicates that no detection information is available.

As can be seen, the result of fault detection by PaaS falls into one of three cases: the state is healthy, a fault is detected, or no detection information is reported. The root cause analysis covers the following cases:

When PaaS detects hardware resource fault information and operating system fault information within the preset time, PaaS determines that the root cause is a fault in the hardware resource layer of the cloud computing server;

When PaaS detects hardware resource fault information and application fault information within the preset time, PaaS determines that the root cause is a fault in the hardware resource layer of the cloud computing server;

When PaaS detects operating system fault information and application fault information within the preset time, and detects no hardware resource fault information within that time, PaaS determines that the root cause is a fault in the operating system (OS) of the cloud computing server; or

When PaaS detects only application fault information within the preset time, and detects neither hardware resource fault information nor operating system fault information, PaaS determines that the root cause is a fault in the application layer of the cloud computing server.

When PaaS detects only operating system fault information, only hardware resource fault information, or only hardware resource fault information and application fault information, and in similar cases, PaaS determines that the fault information that appeared within the preset time is a false alarm from the cloud computing platform (cloud computing server). In these cases PaaS ignores the fault information.
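The cases above can be written as a small decision function. This is a sketch under stated assumptions: the text lists the combination of hardware resource and application fault information both as a hardware root cause and as a false alarm, and the code follows the first of these rules. The names `RootCause` and `determine_root_cause` are illustrative.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    HARDWARE = "hardware resource layer"
    OPERATING_SYSTEM = "operating system"
    APPLICATION = "application layer"
    FALSE_ALARM = "false alarm"


def determine_root_cause(hw: bool, os_fault: bool, app: bool) -> Optional[RootCause]:
    """Map the fault flags collected within the preset time to a root cause.

    Each flag is True when fault information for that layer was detected.
    Returns None when no fault information was detected at all.
    """
    if not (hw or os_fault or app):
        return None
    if hw and (os_fault or app):
        # Assumption: a hardware fault confirmed by an upper layer is a hardware root cause.
        return RootCause.HARDWARE
    if os_fault and app:
        return RootCause.OPERATING_SYSTEM   # no hardware fault, OS and application faulty
    if app:
        return RootCause.APPLICATION        # only the application reported a fault
    # Remaining cases: only hardware, or only operating system, reported a fault.
    return RootCause.FALSE_ALARM


cause = determine_root_cause(hw=False, os_fault=True, app=True)
```

Checking the lowest layer first reflects the idea that a fault in a lower layer explains the symptoms seen in the layers above it.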

Step S103: Determine a fault handling policy according to the root cause.

After the root cause has been determined in step S102, PaaS determines the corresponding fault handling policy according to it.

Referring to Table 2, Table 2 shows some correspondences between fault root causes and fault handling policies in the embodiments of the present invention.

Table 2

In one application scenario, when the root cause is a hardware resource fault, the fault handling policy includes at least rebooting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine.

In one application scenario, when the root cause is an operating system fault, the fault handling policy includes at least rebooting the virtual machine; when the virtual machine is rebooted, it loads its operating system, which completes the restart of the operating system. In special cases, if the operating system can be restarted without rebooting the virtual machine, the fault handling policy is to restart the operating system directly.

In one application scenario, when the root cause is an application-layer fault, the fault handling policy includes at least restarting the application and rebooting the virtual machine. Restarting the application means restarting the affected application directly. Rebooting the virtual machine means first rebooting the virtual machine that hosts the application, which loads its operating system, and then running the application in that operating system. In special cases, if the operating system can be restarted without rebooting the virtual machine, the fault handling policy is to restart the operating system directly and then run the application in it.
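The correspondence between root causes and handling policies can be expressed as a lookup table. The sketch below assumes string keys and action names of its own choosing, and treats each list as ordered from least to most disruptive; that ordering is an assumption, since the text only states which operations each policy includes.

```python
# Root cause -> candidate recovery actions (all names are illustrative).
POLICIES = {
    "hardware": ["reboot_vm", "rebuild_vm", "migrate_vm"],
    "os": ["reboot_vm"],
    "application": ["restart_app", "reboot_vm"],
}


def select_policy(root_cause: str) -> list:
    """Return the recovery actions for a root cause; empty for false alarms."""
    return list(POLICIES.get(root_cause, []))


actions = select_policy("application")
```

Returning a copy keeps callers from mutating the shared table, and an unknown or false-alarm root cause simply yields no action.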

Step S104: Perform fault recovery according to the operation indicated by the fault handling policy.

After the fault handling policy has been determined, PaaS can recover from the fault by carrying out the operation that the policy indicates.

In the embodiments of the present invention, IaaS is responsible for managing and controlling hardware resources, including adjusting virtual machine CPU and memory, expanding disks, and rebooting, locally rebuilding, and live-migrating virtual machines, so as to preserve virtual machine service continuity as far as possible and to reduce or even eliminate the service impact of virtual machine faults.

Therefore, when the operation indicated by the fault handling policy targets the hardware resource layer, PaaS issues an instruction to IaaS that contains the indicated operation, and IaaS executes the instruction to recover from the fault.

For example, when the fault handling policy is to reboot the virtual machine, PaaS calls the IaaS interface to reboot the virtual machine and then checks the virtual machine state to determine whether the fault has been recovered.

When the fault handling policy is to rebuild the virtual machine locally, and PaaS determines that the system disk of the virtual machine on the cloud computing server is a shared disk, PaaS calls the IaaS interface to rebuild the virtual machine locally on that shared disk and then checks the task state of the virtual machine to determine whether the fault has been recovered.

When the fault handling policy is to migrate the virtual machine, PaaS calls the IaaS interface to migrate the virtual machine on the faulty cloud computing server to another host.

When the fault handling policy is to restart the operating system, PaaS calls the IaaS interface to reboot the virtual machine, and the corresponding operating system is loaded after the virtual machine reboots.

In special cases, if the operating system can be restarted without rebooting the virtual machine, PaaS calls the IaaS interface to restart the operating system directly.
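The recovery operations above can be sketched as a loop in which PaaS calls an IaaS interface and then checks whether the fault has cleared. Everything here is hypothetical: `FakeIaaS` stands in for a real IaaS interface, its method names are invented for illustration, and skipping the local rebuild when the system disk is not shared is an assumption drawn from the shared-disk condition in the text.

```python
class FakeIaaS:
    """Stand-in for an IaaS interface; method names are hypothetical."""

    def __init__(self, healthy_after: str, shared_disk: bool = True):
        self.calls = []                      # record of the operations invoked
        self.shared_disk = shared_disk       # whether the VM's system disk is shared
        self.healthy = False                 # VM state as seen by a status check
        self._healthy_after = healthy_after  # operation that fixes the simulated fault

    def _do(self, name: str) -> None:
        self.calls.append(name)
        self.healthy = (name == self._healthy_after)

    def reboot_vm(self) -> None:
        self._do("reboot_vm")

    def rebuild_vm(self) -> None:
        self._do("rebuild_vm")

    def migrate_vm(self) -> None:
        self._do("migrate_vm")


def recover(iaas, actions):
    """Try each action in order; return the one that restored health, or None."""
    for action in actions:
        if action == "rebuild_vm" and not iaas.shared_disk:
            continue                         # assumed: local rebuild needs a shared system disk
        getattr(iaas, action)()              # PaaS issues the instruction to IaaS
        if iaas.healthy:                     # PaaS checks the VM state afterwards
            return action
    return None


iaas = FakeIaaS(healthy_after="migrate_vm", shared_disk=False)
result = recover(iaas, ["reboot_vm", "rebuild_vm", "migrate_vm"])
```

In this run the reboot does not clear the simulated fault, the rebuild is skipped because the disk is not shared, and the migration succeeds.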

As can be seen, by implementing the embodiments of the present invention, after an enterprise migrates an application (such as an application program, application system, IT system, or legacy system) to a cloud computing server of a cloud computing platform, PaaS can monitor faults in the hardware resource layer through IaaS and can monitor the running state of the operating system and of the application through the agent application. When PaaS obtains a piece of fault information, it continues to collect other fault information within the preset time. After the preset time expires, it analyzes all the collected fault information together, determines the root cause of the fault, selects a specific fault handling policy based on that root cause, and then calls IaaS or the agent application to perform the corresponding fault recovery. This ensures high availability for the application on the cloud computing platform. The HA solution of the embodiments of the present invention is thus comprehensive, accurate, and universal.

Referring to FIG. 3 to FIG. 6 together, FIG. 3 shows another fault recovery method for a cloud computing server provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps:

Step S301: The IaaS detects hardware resource fault information and sends the hardware resource fault information to the PaaS.

In a specific embodiment, referring to FIG. 4, the IaaS monitors the hardware resources of the cloud computing server to determine whether the I layer of the cloud computing server has failed. When a hardware resource fails, the IaaS reports hardware resource fault information to the PaaS. For example, after an application (an application program, application system, enterprise IT system, etc.) is deployed on a virtual machine allocated by the IaaS, an administrator registers the IaaS tenant, user account, virtual machine address, and other information with the PaaS as needed, so that the PaaS can provide the high-availability scheme to the IaaS. For instance, when the application is a siloed ("chimney-style") legacy system of an enterprise, the administrator registers the information about that enterprise's legacy system on the IaaS with the PaaS so that the legacy system can obtain high availability. After recognizing this information, the PaaS automatically subscribes with the IaaS to fault alarms for the cloud computing server (host machine) on which the legacy system runs and for its virtual machines. The IaaS monitors the running state of the virtual machines and host machines in real time. When it detects a fault, the IaaS generates the corresponding hardware resource fault information (for example, abnormal VM running state information) and reports it to the PaaS for subsequent fault handling.

Step S302: The PaaS obtains operating system fault information by detecting heartbeat information of a first agent application, where the heartbeat information is used to indicate the operating system fault information.

Step S302 uses the term "first agent application" only to distinguish it from the second agent application in step S303.

As shown in FIG. 4, the PaaS installs an agent application (Agent application) on every virtual machine on which the application is deployed; that is, the first agent application is deployed at the OS layer of the cloud computing server. The agent application maintains a heartbeat with the PaaS, and the PaaS periodically checks the heartbeat of the first agent application to determine whether the OS layer has failed. When the heartbeat of a first agent application disappears, the PaaS sends a heartbeat request. If the first agent application still fails to answer the heartbeat request in time, the PaaS has lost its connection to the operating system (or virtual machine) hosting that first agent application, and the PaaS therefore generates the corresponding operating system fault information.
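The heartbeat check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `HeartbeatMonitor`, the `timeout_s` threshold, the `probe` callback (standing in for the explicit heartbeat request), and the fault record fields are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """PaaS-side view of the heartbeats sent by first agent applications."""

    timeout_s: float                  # hypothetical: silence longer than this counts as a lost heartbeat
    probe: Callable[[str], bool]      # hypothetical: sends a heartbeat request, True if answered in time
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, vm_id: str, now: float) -> None:
        """Record a heartbeat received from the agent on vm_id."""
        self.last_seen[vm_id] = now

    def check(self, now: float) -> List[dict]:
        """Return one OS fault record per VM whose agent is silent and also fails the probe."""
        faults = []
        for vm_id, seen in list(self.last_seen.items()):
            if now - seen <= self.timeout_s:
                continue                      # heartbeat is still fresh
            if self.probe(vm_id):
                self.last_seen[vm_id] = now   # agent answered the explicit request
                continue
            faults.append({"layer": "os", "vm": vm_id, "kind": "heartbeat_lost", "time": now})
        return faults
```

Passing the clock value in as `now` keeps the sketch deterministic; a real monitor would presumably read a monotonic clock on each periodic check.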

Step S303: The PaaS obtains the application fault information by having a second agent application detect the running state of the application.

Specifically, the application is deployed at the application layer of the cloud computing server and is integrated by the SaaS into an application-level cloud service, so that the cloud service can be provided to cloud service operators or enterprises.

Step S303 uses the term "second agent application" only to distinguish it from the agent application in step S302. The second agent application is likewise deployed on all virtual machines, and the applications on each virtual machine node are managed through the agent application. In specific embodiments, the second agent application and the first agent application may be the same application or different applications, which is not limited in the embodiments of the present invention.

As shown in FIG. 4, the second agent application is likewise deployed at the application layer. It can be used to manage the applications on the cloud computing server, and it periodically monitors the running state of the applications on the virtual machine; for example, the second agent application monitors the running state through a status detection script provided by the application. In a specific application scenario, the application dynamically provides a status detection script while it runs, for example status.sh. status.sh defines return value 1 to mean that the application is running, return value 2 to mean that the application has stopped, and return value 3 to mean that the application is in an abnormal condition. status.sh is placed in the installation directory of the application (application system). The second agent application periodically invokes status.sh in that installation directory and obtains the return value. The second agent application determines the running state of the application from the script's return value and sends that running state to the SaaS. When the return value is 2 or 3, the second agent application generates application fault information and sends it to the PaaS, and the PaaS thereby obtains the application fault information.
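A minimal sketch of how the second agent application might interpret status.sh is given below. It assumes that the "return value" is the script's process exit code and that the script is run with `sh`; the function names and the fault record layout are hypothetical and do not come from the original text.

```python
import subprocess
from typing import Optional

# Return values defined for status.sh in the text above.
STATUS_BY_CODE = {1: "running", 2: "stopped", 3: "abnormal"}


def interpret_status(code: int) -> str:
    """Map a status.sh return value to a running state."""
    return STATUS_BY_CODE.get(code, "unknown")


def run_status_script(install_dir: str, timeout_s: float = 10.0) -> int:
    """Run status.sh from the application's installation directory.

    Assumption: the return value is delivered as the process exit code.
    """
    result = subprocess.run(["sh", "status.sh"], cwd=install_dir, timeout=timeout_s)
    return result.returncode


def application_fault(app: str, code: int) -> Optional[dict]:
    """Build application fault information for return values 2 and 3 only."""
    status = interpret_status(code)
    if status in ("stopped", "abnormal"):
        return {"layer": "app", "app": app, "kind": status}
    return None  # the text defines no fault for other return values
```

An agent loop would call `run_status_script` on a timer and forward any non-`None` result of `application_fault` to the PaaS.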

Step S304: The PaaS determines the root cause of the fault of the cloud computing server according to the hardware resource fault information, the operating system fault information, and the application fault information.

In the embodiments of the present invention, the PaaS obtains fault information from all layers, so it can analyze the fault information of the hardware resource layer, the OS layer, and the application layer together to determine the root cause accurately.

Referring to FIG. 5, the PaaS defines the initial working state of the cloud computing server as the normal state. When the PaaS receives fault information, it starts a preset time T (for example, 2 minutes) from the moment that fault information is received, and continues to collect other fault information within the preset time T. After the preset time ends, the PaaS makes a comprehensive determination based on all the fault information collected.
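The collection window can be sketched as a small accumulator. This is only an illustration under assumptions: the class name `FaultWindow`, its fields, and the explicit `now` argument are hypothetical; the 120-second default mirrors the 2-minute example for T.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaultWindow:
    """Collects fault records for a preset time T after the first one arrives."""

    window_s: float = 120.0                  # preset time T, 2 minutes in the example
    opened_at: Optional[float] = None
    faults: List[dict] = field(default_factory=list)

    def add(self, fault: dict, now: float) -> None:
        if self.opened_at is None:
            self.opened_at = now             # the first fault starts the window
        self.faults.append(fault)

    def expired(self, now: float) -> bool:
        """True once T has elapsed since the first fault was received."""
        return self.opened_at is not None and now - self.opened_at >= self.window_s

    def drain(self) -> List[dict]:
        """Hand over everything collected and reset to the idle (normal) state."""
        collected, self.faults, self.opened_at = self.faults, [], None
        return collected
```

Once `expired` returns true, the drained list is what the comprehensive determination would operate on.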

If the collected fault information satisfies a preset condition, the PaaS defines the working state of the cloud computing server as the fault state and goes on to determine the root cause. If the collected fault information does not satisfy the preset condition, the PaaS continues to define the working state of the cloud computing server as the normal state.

As shown in FIG. 5, if, after the preset time T ends, the fault information collected by the PaaS contains both a VM state abnormality and a heartbeat disconnection, the PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a fault in the hardware resource layer. If, after the preset time T ends, the collected fault information contains only a VM state abnormality or only a heartbeat disconnection, the PaaS discards that fault information and defines the working state of the cloud computing server as the normal state.

Similarly, if, after the preset time T ends, the fault information collected by the PaaS contains both an application abnormality and a heartbeat disconnection, the PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a fault in the OS layer. If, after the preset time T ends, the collected fault information contains only an application abnormality or only a heartbeat disconnection, the PaaS discards that fault information and defines the working state of the cloud computing server as the normal state.

In addition, if, after the preset time T ends, the fault information collected by the PaaS contains only an application abnormality and no other fault information, the PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a fault in the application layer.
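The determination rules above can be written as a small decision function. The sketch below is an assumption-laden illustration: the fault kind labels (`vm_abnormal`, `heartbeat_lost`, `app_abnormal`) and the returned layer names are hypothetical, and a lone application abnormality is treated as an application-layer fault, following the last rule stated above.

```python
from typing import Iterable, Optional


def root_cause(kinds: Iterable[str]) -> Optional[str]:
    """Return the faulty layer, or None when the collected faults are discarded.

    `kinds` holds the fault kinds collected within the preset time T.
    """
    seen = set(kinds)
    vm = "vm_abnormal" in seen        # hardware resource fault information from the IaaS
    hb = "heartbeat_lost" in seen     # operating system fault information (heartbeat)
    app = "app_abnormal" in seen      # application fault information from the agent

    if vm and hb:
        return "hardware"             # VM abnormality together with heartbeat disconnection
    if app and hb:
        return "os"                   # application abnormality together with heartbeat disconnection
    if app and not vm and not hb:
        return "application"          # application abnormality and nothing else
    return None                       # isolated or unmatched signals: server stays in the normal state
```

Checking the hardware rule first means that a window containing all three signals is attributed to the hardware resource layer, which is a design choice of this sketch rather than something the text spells out.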

Step S305: The PaaS determines a fault handling strategy according to the root cause of the fault.

In a specific application scenario, the PaaS determines the specific fault handling strategy based on the type of fault. For example, a fault diagnosis database may be preset in the PaaS. The database stores multiple kinds of fault information, and fault information belonging to the same layer is assigned different fault levels, such as fault level 1, fault level 2, fault level 3, and so on. For example, among the fault handling strategies preset for hardware resource layer faults, fault level 1 corresponds to restarting the virtual machine, fault level 2 corresponds to rebuilding the virtual machine locally, fault level 3 corresponds to migrating the virtual machine, and so on. After determining the root cause, the PaaS analyzes the hardware resource layer fault actually obtained, determines the fault level corresponding to it, and determines the fault handling strategy according to that fault level.
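The fault-level lookup can be pictured as a table keyed by layer and level. The sketch below only encodes the three hardware-layer levels named in the text; the table name, the strategy identifiers, and the choice to return `None` for unlisted entries are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the preset fault diagnosis database.
FAULT_DIAGNOSIS_DB = {
    ("hardware", 1): "restart_vm",
    ("hardware", 2): "rebuild_vm_locally",
    ("hardware", 3): "migrate_vm",
}


def strategy_for(layer: str, level: int) -> Optional[str]:
    """Look up the preset fault handling strategy for a layer and fault level."""
    return FAULT_DIAGNOSIS_DB.get((layer, level))
```

How a concrete fault is classified into a level is not specified in the text, so that step is deliberately left out.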

For example, in a specific application scenario, referring to FIG. 6, the fault handling strategies for the hardware resource layer are, in descending order of priority: restarting the virtual machine, rebuilding the virtual machine locally, migrating the virtual machine, and reporting to the network management system. After determining that the root cause is a hardware resource fault, the PaaS selects restarting the virtual machine (the highest priority in this layer) as the fault handling strategy. If restarting the virtual machine fails to recover the fault, the PaaS reselects rebuilding the virtual machine locally as the fault handling strategy. If rebuilding the virtual machine locally fails to recover the fault, the PaaS reselects migrating the virtual machine as the fault handling strategy. If migrating the virtual machine fails to recover the fault, the PaaS reselects reporting to the network management system as the fault handling strategy, and ends the fault recovery procedure after reporting to the network management system.

Reporting to the network management system specifically includes the following: the PaaS generates a fault log based on the fault information and archives the fault log, where the fault log indicates information such as the time and location of the fault, the fault type, and the fault recovery history. The PaaS reports the fault log to the network management system so that operation and maintenance personnel can discover the fault promptly through the network management system and perform manual maintenance.
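A fault log carrying the fields listed above might look like the following sketch. The JSON encoding, the field names, and the function name `build_fault_log` are assumptions; the text only says which information the log indicates, not how it is stored or transmitted.

```python
import json
from datetime import datetime
from typing import List


def build_fault_log(
    occurred_at: datetime,
    location: str,
    fault_type: str,
    recovery_history: List[str],
) -> str:
    """Serialize one fault log entry with time, location, type, and recovery history."""
    record = {
        "time": occurred_at.isoformat(),
        "location": location,                   # e.g. host and virtual machine identifiers
        "fault_type": fault_type,               # e.g. the layer named as the root cause
        "recovery_history": recovery_history,   # strategies already attempted, in order
    }
    return json.dumps(record, sort_keys=True)
```

The returned string can be archived as is and sent to the network management system over whatever interface that system offers.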

As another example, in another specific application scenario, the fault handling strategies for the operating system include at least restarting the virtual machine and reporting to the network management system. When the root cause is an operating system fault, the PaaS selects restarting the virtual machine (the highest priority in this layer) as the fault handling strategy. If restarting the virtual machine fails to recover the fault, the PaaS reselects reporting to the network management system as the fault handling strategy, and ends the fault recovery procedure after reporting to the network management system.

As another example, in another specific application scenario, the fault handling strategies for the application are, in descending order of priority: restarting the application, restarting the virtual machine, and reporting to the network management system. When the root cause is an application layer fault, the PaaS selects restarting the application (the highest priority in this layer) as the fault handling strategy. If restarting the application fails to recover the fault, the PaaS reselects restarting the virtual machine as the fault handling strategy. If restarting the virtual machine fails to recover the fault, the PaaS reselects reporting to the network management system as the fault handling strategy, and ends the fault recovery procedure after reporting to the network management system.
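The three priority ladders described above share one escalation pattern, sketched below. The strategy identifiers and the `execute`, `recovered`, and `report` callbacks are hypothetical placeholders for the IaaS interface, the agent application, the PaaS recovery check, and the network management system; none of them is an API named in the original text.

```python
from typing import Callable, Dict, List

# Strategies per root cause, in descending priority, as listed in the text.
LADDERS: Dict[str, List[str]] = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm"],
    "os": ["restart_vm"],
    "application": ["restart_app", "restart_vm"],
}


def recover(
    root_cause: str,
    execute: Callable[[str], None],         # hypothetical: carries out one strategy
    recovered: Callable[[], bool],          # hypothetical: PaaS check that the fault is gone
    report: Callable[[List[str]], None],    # hypothetical: reports to the network management system
) -> List[str]:
    """Try each strategy in priority order; report to the NMS if none recovers the fault."""
    tried: List[str] = []
    for action in LADDERS[root_cause]:
        execute(action)
        tried.append(action)
        if recovered():
            return tried                    # fault recovered, procedure ends
    report(list(tried))                     # every strategy failed
    return tried + ["report_to_nms"]
```

The returned list doubles as the recovery history that a fault log would record.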

Step S306: The PaaS sends a fault recovery instruction to the IaaS, where the fault recovery instruction includes the determined fault handling strategy, and the IaaS correspondingly performs the operation indicated by the fault handling strategy to recover the I layer; or the PaaS sends a fault recovery instruction to the second agent application of the SaaS, where the fault recovery instruction includes the determined fault handling strategy, and the second agent application correspondingly performs the operation indicated by the fault handling strategy to recover the S layer.

Specifically, the PaaS sends a fault recovery instruction to the IaaS, where the instruction includes the determined I-layer fault handling strategy. The operations that the IaaS performs, as indicated by the fault handling strategy, to recover the I layer include restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine. As shown in FIG. 6, if the PaaS determines that the fault has been recovered after the IaaS performs these operations, the procedure ends. If the PaaS determines that the fault has not been recovered after the IaaS performs these operations, the PaaS reports to the network management system.

Specifically, the PaaS sends a fault recovery instruction to the second agent application, where the instruction includes the determined S-layer fault handling strategy. The operation that the second agent application performs, as indicated by the fault handling strategy, to recover the S layer includes restarting the application. If the PaaS determines that the fault has been recovered after the second agent application performs this operation, the procedure ends. If the PaaS determines that the fault has not been recovered after this operation, the PaaS instructs the second agent application to restart the application after the virtual machine has been restarted. If the fault still has not been recovered, the PaaS reports to the network management system.

It can be seen that, by implementing the embodiments of the present invention, after an enterprise migrates an application to a cloud computing server of the cloud computing platform, the PaaS can monitor faults of the hardware resource layer through the IaaS, and can monitor the running state of the operating system and of the legacy system through the agent application. When the PaaS obtains fault information, it continues to collect other fault information within a preset time. After the preset time ends, it performs a comprehensive analysis of all the collected fault information to determine the root cause of the fault, determines a specific fault handling strategy based on that root cause, and then calls the IaaS or the agent application to perform the corresponding fault recovery. If none of the fault handling strategies achieves recovery, the PaaS raises a fault alarm to the network management system for further maintenance. This ensures the high availability of the legacy system on the cloud computing platform. The HA scheme of the embodiments of the present invention is comprehensive, accurate, and general.

Based on the same inventive concept, an embodiment of the present invention provides an apparatus 70 for implementing fault recovery of a cloud computing server. Referring to FIG. 7, the control node 70 includes a transmitter 703, a receiver 704, a memory 702, and a processor 701 coupled to the memory 702. The transmitter 703, the receiver 704, the memory 702, and the processor 701 may be connected by a bus or in other ways (FIG. 7 takes connection by a bus as an example). Wherein:

The processor 701 may be one or more central processing units (Central Processing Unit, CPU); FIG. 7 takes one processor as an example. When the processor 701 is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.

The memory 702 includes, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), or compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM). The memory 702 is used to store related instructions and data, and is also used to store program code, where the program code is specifically used to implement the functions of the control node in the embodiment of FIG. 5 or FIG. 8;

The transmitter 703 is used to send instructions and data to the outside;

The receiver 704 is used to receive data from the outside;

Specifically, the processor 701 is configured to call the program code stored in the memory 702 and perform the following steps:

using the receiver 704 to obtain hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server;

using the receiver 704 to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a fault of the operating system installed on the cloud computing server;

using the receiver 704 to obtain application fault information of the cloud computing server, where the application fault information indicates a fault of an application installed on the operating system;

determining, by the processor 701, the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information, and application fault information;

determining, by the processor 701, a fault handling strategy according to the root cause of the fault;

using the transmitter 703 to perform fault recovery according to the operation indicated by the fault handling strategy.

Specifically, the operating system further has a first agent application;

Using the receiver 704 to obtain the operating system fault information of the cloud computing server includes: using the receiver 704 to determine the operating system fault information by detecting heartbeat information of the first agent application, where the heartbeat information indicates whether the operating system has failed.

Specifically, the operating system further has a second agent application;

Using the receiver 704 to obtain the application fault information of the cloud computing server includes: using the receiver 704 to invoke the status detection script of the application through the second agent application, and determining the application fault information according to the return value of the status detection script.

The processor 701 determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information, and application fault information includes at least: when both the hardware resource fault information and the operating system fault information are detected within a preset time, the processor 701 determines that the root cause is a fault of the hardware resources; or when the operating system fault information and the application fault information are detected within the preset time and no hardware resource fault information is detected, the processor 701 determines that the root cause is a fault of the operating system; or when only application fault information is detected within the preset time, the processor 701 determines that the root cause is a fault of the application.

The processor 701 determining the fault handling strategy according to the root cause of the fault includes:

Specifically, when the root cause is a fault of the hardware resources, the fault handling strategies include restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine. More specifically: when the root cause is a fault of the hardware resources, the fault handling strategy is to restart the virtual machine; when restarting the virtual machine fails to recover the hardware resource fault, the fault handling strategy is to rebuild the virtual machine locally; when neither restarting the virtual machine nor rebuilding it locally recovers the hardware resource fault, the fault handling strategy is to migrate the virtual machine.

Specifically, when the root cause is a fault of the application, the fault handling strategies include at least restarting the virtual machine and restarting the application. More specifically: when the root cause is a fault of the application, the fault handling strategy is to restart the application; when restarting the application fails to recover the application fault, the fault handling strategy is to restart the virtual machine.

Specifically, performing the operation indicated by the fault handling strategy includes: when the root cause is a fault of the hardware resources, calling the IaaS management platform interface to perform the operation indicated by the corresponding fault handling strategy; or when the root cause is a fault of the operating system, calling the IaaS management platform interface to perform the operation indicated by the corresponding fault handling strategy; or when the root cause is a fault of the application, calling the second agent application to perform the operation indicated by the corresponding fault handling strategy.

The processor 701 performing the operation indicated by the fault handling strategy further includes:

The processor 701 generates a fault log based on the fault information, archives the fault log, and uses the transmitter 703 to report the fault log to the network management system, where the fault information includes the hardware resource fault information, the operating system fault information, and the application fault information.

It should be noted that, from the detailed description of the embodiments of FIG. 2 to FIG. 6 above, those skilled in the art can clearly understand how each functional unit of the apparatus 70 is implemented, so the details are not repeated here for brevity.

Based on the same inventive concept, an embodiment of the present invention provides an apparatus 80 for implementing fault recovery of a cloud computing server. Referring to FIG. 8, the apparatus 80 includes multiple functional modules, which are described in detail below.

A fault detection module 801, configured to obtain hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server; further configured to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a fault of the operating system installed on the cloud computing server; and further configured to obtain application fault information of the cloud computing server, where the application fault information indicates a fault of an application installed on the operating system;

A fault analysis module 802, configured to determine the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information, and application fault information;

A fault strategy module 803, configured to determine a fault handling strategy according to the root cause of the fault;

A fault recovery module 804, configured to perform fault recovery according to the operation indicated by the fault handling strategy.
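The way the four modules hand results to one another can be sketched as a simple pipeline. This is only a structural illustration: the class name `Apparatus80`, the callable-based wiring, and the `handle` method are assumptions; the text defines the modules' responsibilities, not their programming interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Apparatus80:
    """Hypothetical wiring of modules 801 to 804 as plain callables."""

    detect: Callable[[], List[str]]                  # 801: returns the collected fault information
    analyze: Callable[[List[str]], Optional[str]]    # 802: returns the root cause, or None
    choose: Callable[[str], str]                     # 803: maps a root cause to a strategy
    recover: Callable[[str], bool]                   # 804: performs the indicated operation

    def handle(self) -> Optional[str]:
        """Run one detection-to-recovery pass and return the strategy applied, if any."""
        cause = self.analyze(self.detect())
        if cause is None:
            return None                              # no fault state, nothing to recover
        strategy = self.choose(cause)
        self.recover(strategy)
        return strategy
```

Keeping each module behind a callable makes it straightforward to substitute a test double for the IaaS interface or the agent application.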

In a specific embodiment, the operating system has a first agent application. The fault detection module 801 being further configured to obtain the operating system fault information of the cloud computing server includes: the fault detection module 801 is further configured to determine the operating system fault information by detecting heartbeat information of the first agent application, where the heartbeat information indicates whether the operating system has failed.

In a specific embodiment, a second agent application is installed in the operating system. The fault detection module 801 obtaining the application fault information of the cloud computing server includes: the fault detection module 801 invokes a state detection script of the application through the second agent application, and determines the application fault information according to the return value of the state detection script.
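As a rough illustration of deriving application fault information from a script's return value, the following Python sketch runs a check command and inspects its exit code. The function name `app_fault_detected`, the convention that exit code 0 means healthy, and the treatment of a hung or missing script as a fault are assumptions; the patent does not specify them.

```python
import subprocess
import sys
from typing import Sequence


def app_fault_detected(check_cmd: Sequence[str], timeout_s: float = 5.0) -> bool:
    """Run a state detection script and report whether it signals a fault.

    Assumed convention: exit code 0 means the application is healthy,
    any other exit code means an application fault.
    """
    try:
        result = subprocess.run(
            list(check_cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Assumption: a script that hangs or cannot be started counts as a fault.
        return True
    return result.returncode != 0


# Example with a stand-in "script" that exits successfully.
healthy_check = [sys.executable, "-c", "raise SystemExit(0)"]
print(app_fault_detected(healthy_check))  # prints False
```

In a real deployment the second agent application would run the script inside the guest operating system and forward the result; here the call is local purely to keep the sketch self-contained.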

In a specific embodiment, the fault analysis module 802 determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least:

The fault analysis module 802 determines that the root cause is a fault of the hardware resources when both the hardware resource fault information and the operating system fault information are detected within a preset time; or the fault analysis module 802 determines that the root cause is a fault of the operating system when the operating system fault information and the application fault information are detected within the preset time and no hardware resource fault information is detected; or the fault analysis module 802 determines that the root cause is a fault of the application when only the application fault information is detected within the preset time.
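The three rules above map directly onto a small decision function. The following Python sketch is illustrative only: the names `RootCause` and `determine_root_cause` are hypothetical, and each boolean flag is assumed to mean "this kind of fault information was detected within the preset time". Combinations not covered by the three rules return `None`, since the text lists them only as a minimum set.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    HARDWARE = "hardware resource fault"
    OPERATING_SYSTEM = "operating system fault"
    APPLICATION = "application fault"


def determine_root_cause(hw_fault: bool, os_fault: bool, app_fault: bool) -> Optional[RootCause]:
    """Apply the three root cause rules to the faults seen in one time window."""
    if hw_fault and os_fault:
        # Hardware and OS fault information both present: blame the hardware.
        return RootCause.HARDWARE
    if os_fault and app_fault and not hw_fault:
        # OS and application fault information without hardware: blame the OS.
        return RootCause.OPERATING_SYSTEM
    if app_fault and not hw_fault and not os_fault:
        # Only application fault information: blame the application.
        return RootCause.APPLICATION
    return None  # combination not covered by the listed rules
```

The ordering reflects the layering of the stack: a fault in a lower layer explains the faults reported by the layers above it, so the lowest faulty layer is taken as the root cause.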

In a specific embodiment, the fault strategy module 803 determining a fault handling strategy according to the root cause of the fault includes:

When the root cause is a fault of the hardware resources, the fault handling strategy includes restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine, specifically:

When the root cause is a fault of the hardware resources, the fault handling strategy is to restart the virtual machine; if restarting the virtual machine does not recover the hardware resource fault, the fault handling strategy is to rebuild the virtual machine locally; if neither restarting nor locally rebuilding the virtual machine recovers the hardware resource fault, the fault handling strategy is to migrate the virtual machine.

When the root cause is a fault of the application, the fault handling strategy includes at least restarting the virtual machine and restarting the application, specifically:

When the root cause is a fault of the application, the fault handling strategy is to restart the application; if restarting the application does not recover the application fault, the fault handling strategy is to restart the virtual machine.
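Both escalation sequences above, for hardware resource faults and for application faults, follow the same pattern: try the least disruptive action first and move to the next one only if recovery fails. A minimal Python sketch of that pattern is given below. The function `recover_with_escalation`, the step names, and the stub actions are hypothetical; in the described system the real actions would be carried out through the IaaS management platform or the agent application.

```python
from typing import Callable, Optional, Sequence, Tuple

# A recovery step is a name plus an action that returns True if recovery succeeded.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover_with_escalation(steps: Sequence[RecoveryStep]) -> Optional[str]:
    """Try each step in order; return the name of the first that recovers, else None."""
    for name, action in steps:
        if action():
            return name
    return None


# Hardware resource fault: restart, then rebuild locally, then migrate (stub outcomes).
hardware_steps = [
    ("restart_vm", lambda: False),
    ("rebuild_vm_locally", lambda: False),
    ("migrate_vm", lambda: True),
]

# Application fault: restart the application, then restart the virtual machine.
application_steps = [
    ("restart_application", lambda: True),
    ("restart_vm", lambda: True),
]

print(recover_with_escalation(hardware_steps))     # prints migrate_vm
print(recover_with_escalation(application_steps))  # prints restart_application
```

Expressing each strategy as an ordered list keeps the escalation order in data rather than in control flow, so a different root cause only needs a different list.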

In a specific embodiment, the fault recovery module 804 performing fault recovery according to the operation indicated by the fault handling strategy includes:

In a specific embodiment, the device 80 further includes a fault alarm module 805, which is configured to generate a fault log based on fault information, archive the fault log, and report the fault log to a network management system. The fault information includes the hardware resource fault information, the operating system fault information and the application fault information.
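The generate, archive and report sequence of the fault alarm module can be sketched as follows. This is an assumption-laden illustration rather than the patented design: the record layout, the JSON encoding, and the functions `build_fault_log` and `archive_and_report` are hypothetical, and the archive and the network management system are represented by plain callables.

```python
import json
import time
from typing import Callable, Dict, List, Optional


def build_fault_log(hw_faults: List[str], os_faults: List[str], app_faults: List[str],
                    timestamp: Optional[float] = None) -> Dict[str, object]:
    """Assemble one fault log record from the three kinds of fault information."""
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "hardware_resource_faults": list(hw_faults),
        "operating_system_faults": list(os_faults),
        "application_faults": list(app_faults),
    }


def archive_and_report(log: Dict[str, object],
                       archive: Callable[[str], None],
                       report: Callable[[Dict[str, object]], None]) -> None:
    """Archive the log as one JSON line, then report it to the network management system."""
    archive(json.dumps(log, ensure_ascii=False, sort_keys=True))
    report(log)
```

For example, `archive` could append the line to a file and `report` could send the record over whatever interface the network management system exposes; both are left abstract here because the patent does not describe them.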

It should be noted that, from the detailed description of the embodiments of FIG. 2 to FIG. 6 above, those skilled in the art can clearly understand how each functional unit included in the device 80 is implemented, so for brevity the details are not repeated here.

Based on the same inventive concept, an embodiment of the present invention further provides a management system. Referring to FIG. 10, the management system includes an IaaS management platform 901, a PaaS management platform 902 and a SaaS service platform 903, where the PaaS management platform 902 includes a fault detection module 801, a fault analysis module 802, a fault strategy module 803 and a fault recovery module 804, and the SaaS service platform 903 includes an agent application 806. The modules of the PaaS management platform 902 are connected to the IaaS management platform 901 through periodic communication interfaces (IF), and the modules of the PaaS management platform 902 are likewise connected to the SaaS service platform 903 through IFs. The interfaces are described as follows:

Interface name    Interface connection
IF1               Connects the fault detection module 801 and the fault analysis module 802
IF2               Connects the fault strategy module 803 and the fault analysis module 802
IF3               Connects the fault recovery module 804 and the fault analysis module 802
IF4               Connects the fault detection module 801 and the IaaS management platform 901
IF5               Connects the fault detection module 801 and the agent application 806
IF6               Connects the fault recovery module 804 and the IaaS management platform 901
IF7               Connects the fault recovery module 804 and the agent application 806

It should be noted that the functions of the management platforms, modules and interfaces in the management system have been described in the embodiments above; for details, refer to the descriptions of FIG. 2 to FIG. 9, which are not repeated here.

The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (for example coaxial cable, optical fiber or digital subscriber line) or wireless means (for example infrared or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid-state drive).

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

Claims

1. A fault recovery method for a cloud computing server, characterized in that the method is applied to a cloud computing server and comprises:

obtaining hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, wherein the IaaS management platform is used to detect the hardware resource fault information of the hardware resources;

obtaining operating system fault information of the cloud computing server, wherein the operating system fault information indicates a fault occurring in the operating system installed on the cloud computing server;

obtaining application fault information of the cloud computing server, wherein the application fault information indicates a fault occurring in an application installed on the operating system;

determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;

determining a fault handling strategy according to the root cause of the fault;

performing fault recovery according to the operation indicated by the fault handling strategy.

2. The method according to claim 1, characterized in that the operating system further has a first agent application;

obtaining the operating system fault information of the cloud computing server comprises:

determining the operating system fault information by detecting heartbeat information of the first agent application, wherein the heartbeat information indicates whether the operating system has failed.

3. The method according to claim 1 or 2, characterized in that the operating system further has a second agent application;

obtaining the application fault information of the cloud computing server comprises:

invoking a state detection script of the application through the second agent application, and determining the application fault information according to the return value of the state detection script.

4. The method according to any one of claims 1 to 3, characterized in that determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information comprises at least:

when both the hardware resource fault information and the operating system fault information are detected within a preset time, determining that the root cause is a fault of the hardware resources; or

when the operating system fault information and the application fault information are detected within the preset time and no hardware resource fault information is detected, determining that the root cause is a fault of the operating system; or

when only the application fault information is detected within the preset time, determining that the root cause is a fault of the application.

5. The method according to claim 4, characterized in that determining the fault handling strategy according to the root cause of the fault comprises:

when the root cause is a fault of the hardware resources, the fault handling strategy comprises one or more of restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine; or

when the root cause is a fault of the operating system, the fault handling strategy comprises restarting the virtual machine; or when the root cause is a fault of the application, the fault handling strategy comprises one or both of restarting the application and restarting the virtual machine.

6. The method according to claim 5, characterized in that, when the root cause is a fault of the hardware resources, the fault handling strategy comprising one or more of restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine is specifically:

when the root cause is a fault of the hardware resources, the fault handling strategy is to restart the virtual machine;

when restarting the virtual machine does not recover the hardware resource fault, the fault handling strategy is to rebuild the virtual machine locally;

when neither restarting the virtual machine nor rebuilding the virtual machine locally recovers the hardware resource fault, the fault handling strategy is to migrate the virtual machine.

7. The method according to claim 5, characterized in that, when the root cause is a fault of the application, the fault handling strategy comprising one or both of restarting the application and restarting the virtual machine is specifically:

when the root cause is a fault of the application, the fault handling strategy is to restart the application;

when restarting the application does not recover the application fault, the fault handling strategy is to restart the virtual machine.

8. The method according to claim 5, characterized in that executing the operation indicated by the fault handling strategy comprises:

when the root cause is a fault of the hardware resources, executing the operation indicated by the fault handling strategy comprises at least: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or

when the root cause is a fault of the operating system, executing the operation indicated by the fault handling strategy comprises: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or

when the root cause is a fault of the application, executing the operation indicated by the fault handling strategy comprises: calling the second agent application to execute the operation indicated by the corresponding fault handling strategy.

9. A device for implementing fault recovery of a cloud computing server, characterized by comprising:

a fault detection module, configured to obtain hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, wherein the IaaS management platform is used to detect the hardware resource fault information of the hardware resources; further configured to obtain operating system fault information of the cloud computing server, wherein the operating system fault information indicates a fault occurring in the operating system installed on the cloud computing server; and further configured to obtain application fault information of the cloud computing server, wherein the application fault information indicates a fault occurring in an application installed on the operating system;

a fault analysis module, configured to determine the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;

a fault strategy module, configured to determine a fault handling strategy according to the root cause of the fault;

a fault recovery module, configured to perform fault recovery according to the operation indicated by the fault handling strategy.

10. The device according to claim 9, characterized in that the operating system has a first agent application;

the fault detection module being further configured to obtain the operating system fault information of the cloud computing server comprises:

the fault detection module is further configured to determine the operating system fault information by detecting heartbeat information of the first agent application, wherein the heartbeat information indicates whether the operating system has failed.

11. The device according to claim 9 or 10, characterized in that a second agent application is installed in the operating system;

the fault detection module being further configured to obtain the application fault information of the cloud computing server comprises:

the fault detection module is further configured to invoke a state detection script of the application through the second agent application, and to determine the application fault information according to the return value of the state detection script.

12. The device according to any one of claims 9 to 11, characterized in that the fault analysis module 802 being configured to determine the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information comprises at least:

the fault analysis module is configured to determine that the root cause is a fault of the hardware resources when both the hardware resource fault information and the operating system fault information are detected within a preset time; or

the fault analysis module is configured to determine that the root cause is a fault of the operating system when the operating system fault information and the application fault information are detected within the preset time and no hardware resource fault information is detected; or

the fault analysis module is configured to determine that the root cause is a fault of the application when only the application fault information is detected within the preset time.

13. The device according to claim 12, characterized in that the fault strategy module 803 being configured to determine the fault handling strategy according to the root cause of the fault comprises:

when the root cause is a fault of the hardware resources, the fault handling strategy comprises one or more of restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine; or

when the root cause is a fault of the operating system, the fault handling strategy comprises at least restarting the virtual machine; or

when the root cause is a fault of the application, the fault handling strategy comprises at least one or both of restarting the application and restarting the virtual machine.

14. The device according to claim 13, characterized in that, when the root cause is a fault of the hardware resources, the fault handling strategy comprising one or more of restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine is specifically:

15. The device according to claim 13 or 14, characterized in that, when the root cause is a fault of the application, the fault handling strategy comprising at least one or both of restarting the application and restarting the virtual machine is specifically:

16. The device according to claim 15, characterized in that the fault recovery module 804 being configured to perform fault recovery according to the operation indicated by the fault handling strategy comprises:

17. A device for implementing fault recovery of a cloud computing server, characterized by comprising: a memory, and a processor, a transmitter and a receiver coupled to the memory, wherein: the transmitter is used to send instruction data to the outside, the receiver is used to receive data sent from the outside, the memory is used to store program code and related data, and the processor is used to execute the program code stored in the memory so as to perform a fault recovery method for a cloud computing server, wherein the method is the method according to any one of claims 1 to 8.

18. A management system, comprising an IaaS management platform, a PaaS management platform and a SaaS service platform, wherein the PaaS management platform comprises a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, the SaaS service platform comprises an agent application, and the PaaS management platform is connected to the IaaS management platform and to the SaaS service platform through periodic communication interfaces; the management system is used to implement the method according to any one of claims 1 to 8.

19. A computer-readable storage medium, characterized by comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 8.