WO2015037975A1

WO2015037975A1 - System and method for self-maintaining the cloud system

Info

Publication number: WO2015037975A1
Application number: PCT/MY2014/000143
Authority: WO
Inventors: Zakaria Bin Alli MOHAMAD; Syafuan Bin Salim AHMAD; Bin Wijee NAZARUDIN
Original assignee: Mimos Bhd
Current assignee: Mimos Bhd
Priority date: 2013-09-10
Filing date: 2014-05-23
Publication date: 2015-03-19
Anticipated expiration: 2016-03-10

Abstract

The present invention provides a system and method for managing system maintenance of a plurality of hosts accessible by client devices (105, 107, 109) over a cloud-computing network. The system provides a maintenance controller (110) having a migration agent (112) and an analyze agent (114) for supervising the maintenance task. Each host is deployed with a monitoring agent and a workflow agent (124) for monitoring the respective host's activities and automating maintenance tasks during the maintenance process.

Description

System and Method for Self-Maintaining the Cloud System

Field of the Invention

[0001] The present invention relates to an infrastructure maintenance system.

More specifically, the present invention relates to an apparatus and method for maintaining cloud-computing services.

Background

[0002] The popularity of mobile communication devices and the availability of high-capacity networks have led to tremendous growth in cloud computing (or computing cloud) to meet the demand. As with all computing devices, regular system maintenance is required to maximize the effectiveness of the shared resources.

[0003] System maintenance activities need to be scheduled periodically to ensure that the system (i.e. operating system, software and hardware) is updated, patched and repaired accordingly. During a system maintenance activity, the system must be shut down or rebooted. During the downtime, users who are logged onto the system are required to leave it. It is, however, difficult to determine "a right time" to shut down the physical and/or virtual system for maintenance, because in a multi-tenant environment the users are dispersed across different geographical locations and time zones.

[0004] US patent no. 7,814,490 discloses an apparatus and methods for carrying out maintenance for a computing network in an opportunistic manner. It manages the downtime during the maintenance. The system requires an administrator to manage the idle time manually.

[0005] US patent no. 7,873,505 discloses a method and apparatus for predicting scheduled downtime for the computing network. It predicts a scheduled downtime based on previous scheduled or unscheduled system downtime events for a distributed computing system.

Summary

[0006] In accordance with one aspect of the present invention, there is provided a system for managing system maintenance of a plurality of hosts accessible by client devices over a cloud-computing network. The system comprises a monitoring agent deployed on each of the hosts, which operably monitors and records system activities of the hosts to determine if the host requires maintenance; a workflow agent deployed on each of the hosts, which operably automates the maintenance tasks at a suitable time; a maintenance controller for supervising the overall maintenance of the plurality of hosts; a migration agent deployed on the maintenance controller and adapted for migrating virtual machines of the respective hosts when a migration is required for shutting down the hosts; and an analyze agent deployed on the maintenance controller, which operatively analyzes the system activities of the respective hosts collected by the monitoring agent to determine the suitable time to execute the maintenance tasks.

[0007] In one embodiment, the analyze agent operatively determines a usage pattern from the recorded system activities of the host to identify a period where the system activities are at their lowest, and schedules the suitable time at that period. The system activities may include CPU usage, memory usage, disk I/O usage, and network transmissions. In a further embodiment, the time durations in which each activity is at its lowest are compared with one another to identify an overlapping period, and the overlapping period is scheduled as the suitable time for executing the maintenance tasks. When no overlapping period is identified, the system schedules the suitable time based on a maintenance attribute priority file set by the system administrator.
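By way of illustration only, the comparison of lowest-usage durations described in [0007] may be sketched as an interval intersection. The sketch assumes that each activity's low-usage duration is already known as a (start hour, end hour) pair within a single day that does not wrap past midnight; the function name overlapping_period and the sample values are hypothetical and do not appear in the specification.

```python
from typing import Dict, Optional, Tuple

# (start_hour, end_hour), end exclusive; assumed not to wrap past midnight.
Window = Tuple[int, int]


def overlapping_period(low_windows: Dict[str, Window]) -> Optional[Window]:
    """Return the period shared by every activity's low-usage window, or None."""
    start = max(w[0] for w in low_windows.values())  # latest start
    end = min(w[1] for w in low_windows.values())    # earliest end
    return (start, end) if start < end else None


# Hypothetical low-usage windows for the four monitored activities.
windows = {
    "cpu": (1, 5),
    "memory": (2, 6),
    "disk_io": (0, 4),
    "network": (2, 5),
}
print(overlapping_period(windows))  # (2, 4)
```

A return value of None corresponds to the case in which no overlapping period is identified, so that the suitable time is scheduled from the maintenance attribute priority file instead.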

[0008] In another embodiment, the workflow agent automates the maintenance tasks based on a maintenance task list configured for the respective hosts by the system administrator. It may also determine a mean time to recovery based on the scheduled maintenance task list, and instruct the migration agent to migrate all available virtual machines when needed, before carrying out the maintenance tasks.

[0009] In another aspect, there is also provided a method of managing system maintenance of a plurality of hosts accessible by client devices over a cloud-computing network. The method comprises deploying a monitoring agent and a workflow agent on the respective hosts; providing a maintenance controller having a migration agent and an analyze agent; monitoring and recording system activities by the monitoring agent from the respective hosts for determining if maintenance is required on the host; analyzing a suitable time for executing maintenance tasks through the analyze agent based on the system activities; and executing the maintenance tasks automatically at the suitable time. When required, the method further comprises instructing the migration agent to migrate virtual machines residing on the host.

[0010] In one embodiment, the method further comprises determining a usage pattern from the recorded system activities of the host; identifying a period where the system activities are at their lowest by identifying a time duration in which each activity is at its lowest and comparing the time durations with one another to identify an overlapping period; and scheduling a suitable time at that overlapping period. The system activities include CPU usage, memory usage, disk I/O usage, and network transmissions.

[0011] In yet a further embodiment, the method may further comprise analyzing maintenance tasks to be carried out on the host to be maintained; determining a mean time to recovery; and determining from the maintenance tasks whether restarting or shutting down of the host is required, in order to determine whether migration of virtual machines is required.

Brief Description of the Drawings

[0012] Preferred embodiments according to the present invention will now be described with reference to the accompanying figures, in which like reference numerals denote like elements:

[0013] FIG. 1 illustrates a schematic diagram of a computing network in accordance with one embodiment of the present invention;

[0014] FIG. 2 illustrates an overall maintenance process in accordance with one embodiment of the present invention;

[0015] FIG. 3 illustrates an analyzing process in accordance with one embodiment of the present invention; and

[0016] FIG. 4 illustrates a process for managing maintenance workflow in accordance with one embodiment of the present invention.

Detailed Description

[0017] Embodiments of the present invention shall now be described in detail, with reference to the attached drawings. It is to be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

[0018] FIG. 1 illustrates a schematic diagram of a computing network 100 in accordance with one embodiment of the present invention. The computing network 100 comprises a cloud infrastructure 101 having clients 105, 107 and 109 connecting thereto. The cloud infrastructure 101 may be deployed in a different jurisdiction from the clients 105, 107, 109. For purposes of illustration and not limitation, the cloud infrastructure 101 is herein deployed in Malaysia, the client 105 in the United States, the client 107 in the United Kingdom and the client 109 in Japan. The cloud infrastructure 101 further comprises a plurality of hosts, namely Host A, Host B, ..., Host n, and a maintenance controller 110. The maintenance controller 110 is adapted for providing regular maintenance on the plurality of hosts. The maintenance controller 110 is able to identify a suitable and desirable time for carrying out the system maintenance. The maintenance controller 110 is capable of providing a self-maintenance service on the physical and/or virtual machine.

[0019] Each of the plurality of hosts comprises a monitoring agent (MON) 122 and a workflow agent (WA) 124. The MON 122 continuously monitors and records the host's system activities, such as CPU usage percentage, memory usage percentage, I/O activity and network activity. The information collected by the MON 122 is used to analyze the Node conditions to determine whether maintenance is required. The WA 124 facilitates automation of the maintenance tasks. The MONs 122 are controlled by the maintenance controller 110.
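As a non-limiting sketch, the activities recorded by the MON 122 may be represented as timestamped samples and reduced to an hourly usage profile for later analysis. The record layout, the field names, the units and the function hourly_cpu_profile are assumptions made for illustration; the specification does not prescribe a data format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass
class ActivitySample:
    """One hypothetical record written by a monitoring agent."""
    hour: int        # hour of day (0-23) at which the sample was taken
    cpu_pct: float   # CPU usage percentage
    mem_pct: float   # memory usage percentage
    disk_io: float   # I/O activity; unit assumed for illustration
    net_io: float    # network activity; unit assumed for illustration


def hourly_cpu_profile(samples: Iterable[ActivitySample]) -> Dict[int, float]:
    """Average the CPU usage of all samples that fall in the same hour."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s.hour].append(s.cpu_pct)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


samples = [
    ActivitySample(2, 10.0, 30.0, 5.0, 1.0),
    ActivitySample(2, 20.0, 32.0, 6.0, 1.5),
    ActivitySample(14, 80.0, 70.0, 40.0, 25.0),
]
print(hourly_cpu_profile(samples))  # {2: 15.0, 14: 80.0}
```

The same reduction could be applied to the memory, I/O and network fields to obtain one profile per activity.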

[0020] The maintenance controller 110 comprises a migration agent (MA) 112 and an analyzer agent (AA) 114. The MA 112 is responsible for seamlessly migrating virtual machines prior to the maintenance activity. The virtual machines may be migrated to other hosts when necessary. The MA 112 carries out the migration for the host that is scheduled to be shut down and/or rebooted for maintenance. The AA 114 is adapted to identify and schedule a suitable time for system downtime. It operationally analyzes data to determine the suitable time for Node maintenance. Preferably, the suitable time for system downtime is the time when the Nodes have minimum access and usage from clients, regardless of their locations and time zones.

[0021] FIG. 2 illustrates an overall maintenance process in accordance with one embodiment of the present invention. The maintenance process can be carried out under the network 100 of FIG. 1. The process comprises collecting data from the host at step 201; analyzing host health at step 203; determining if maintenance is required at step 205; identifying a suitable time for maintenance at step 207; and executing maintenance activities at step 209. At the step 201, the MON 122 collects the required information from the respective hosts. The required information includes CPU usage, memory usage, disk I/O activities and network activity. The information is collected continuously by the respective MONs 122. The collection can be configured with a timeframe set by the system administrator. The collected data is transmitted to the maintenance controller 110, where it is stored on the database of the maintenance controller 110 and processed. Based on the collected information, the MON 122 analyzes the hosts' health at the step 203. The hardware conditions of the respective hosts are also considered by the MON 122.

[0022] At the step 205, the maintenance controller 110 determines if maintenance is required on the hosts. If maintenance is required, the MON 122 instructs the AA 114 of the maintenance controller 110 to initiate the maintenance procedures. At the step 207, the AA 114 retrieves the host's information from the database of the maintenance controller 110 to identify a suitable time for executing the maintenance procedures. At the step 209, the WA 124 of the relevant host that requires maintenance initiates the maintenance procedures based on the maintenance scripts. Examples of the maintenance tasks include, but are not limited to, upgrading the hypervisor, a host kernel and/or the like. These types of upgrades require the host to restart, and therefore the virtual machines thereon must be migrated to other hosts.
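The ordering of steps 201 to 209 of FIG. 2 may be sketched as a single control-flow function, shown here for illustration only. Each step is passed in as a callable so that no particular health criterion or scheduling rule is implied; the function maintenance_cycle, its parameter names and the placeholder lambdas are hypothetical.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]


def maintenance_cycle(
    collect: Callable[[], List[Sample]],                # step 201
    needs_maintenance: Callable[[List[Sample]], bool],  # steps 203 and 205
    find_suitable_time: Callable[[List[Sample]], int],  # step 207
    execute: Callable[[int], None],                     # step 209
) -> bool:
    """Run one pass of the process; return True if maintenance was executed."""
    samples = collect()
    if not needs_maintenance(samples):
        return False
    when = find_suitable_time(samples)
    execute(when)
    return True


log = []
ran = maintenance_cycle(
    collect=lambda: [{"hour": 3, "cpu": 5.0}, {"hour": 14, "cpu": 80.0}],
    needs_maintenance=lambda samples: True,  # placeholder criterion
    find_suitable_time=lambda samples: min(samples, key=lambda s: s["cpu"])["hour"],
    execute=lambda hour: log.append(hour),
)
print(ran, log)  # True [3]
```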

[0023] FIG. 3 illustrates an analyzing process in accordance with one embodiment of the present invention. The analyzing process is mainly carried out on the AA 114 of the maintenance controller 110 of FIG. 1. The analysis comprises identifying a time duration at step 301; identifying an overlapping time duration at step 302; determining if the overlapping time duration can be identified at step 303; when no overlapping time duration can be identified, selecting a suitable time for executing maintenance processes at step 304; when the overlapping time duration is identified, setting a suitable time for executing maintenance processes at step 305; determining at step 306 whether the scheduled time for the maintenance process has been reached; and instructing the WA 124 to execute the maintenance process at step 307.

[0024] At the step 301, the AA 114 identifies the usage pattern to determine the respective periods where the CPU, memory, disk and network are at their lowest usage and activity. At step 302, the periods determined earlier (where the CPU, memory, disk and network are at their lowest usage and activity) are compared to find whether any of them overlap. If an overlapping period is identified at the step 303, that period is scheduled for the next maintenance at the step 305. If no such overlapping period can be identified at the step 303, the system selects a suitable time at step 304 based on a maintenance attribute priority 310 set by the system administrator.

[0025] Once the suitable time is set for the maintenance process, the system checks in step 306 whether the current time has reached the time for executing the maintenance process. The checking process is looped until the current time reaches the scheduled maintenance time, i.e. the suitable time. Once the scheduled maintenance time is reached, the WA 124 of the relevant host is triggered to execute the maintenance process accordingly at the step 307.
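One possible reading of steps 301 to 305 is sketched below for illustration. Two assumptions are made that the specification leaves open: an hour counts as "lowest usage" when it lies within a tolerance of the profile's minimum, and the maintenance attribute priority 310 is an ordered list of activity names whose first entry decides the fallback. The function names and sample profiles are hypothetical.

```python
from typing import Dict, List, Set


def low_hours(profile: List[float], tolerance: float = 5.0) -> Set[int]:
    """Step 301: hours whose usage is within `tolerance` of the profile minimum."""
    floor = min(profile)
    return {hour for hour, value in enumerate(profile) if value <= floor + tolerance}


def select_hours(profiles: Dict[str, List[float]], priority: List[str]) -> Set[int]:
    """Steps 302-305: common low hours, else the top-priority activity's low hours."""
    per_activity = {name: low_hours(p) for name, p in profiles.items()}
    common = set.intersection(*per_activity.values())  # step 302
    if common:                                          # step 303
        return common                                   # step 305
    return per_activity[priority[0]]                    # step 304 (assumed rule)


# Shortened hypothetical profiles (one value per hour) to keep the example small.
overlapping = {"cpu": [50, 10, 12, 60, 70, 80], "network": [40, 45, 8, 9, 50, 60]}
disjoint = {"cpu": [10, 50, 60], "network": [70, 80, 5]}

print(select_hours(overlapping, ["network", "cpu"]))  # {2}
print(select_hours(disjoint, ["network", "cpu"]))     # {2}
```

In the first call hour 2 is low for both activities; in the second call no common hour exists, so the low hours of the first-listed activity (network) are used.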

[0026] FIG. 4 illustrates a process for managing maintenance workflow in accordance with one embodiment of the present invention. The maintenance workflow is controlled and managed by the WAs 124 deployed on the respective hosts. Briefly, the process comprises retrieving maintenance task script at step 401; analyzing maintenance task list at step 402; determining a Mean-Time to Recovery (MTTR) at step 403; determining if virtual machines migration is needed at step 405; instructing MA to migrate the virtual machines at step 406; and executing maintenance tasks at step 407.

[0027] At the step 401, the WA 124 retrieves a list of maintenance tasks that are to be executed. The tasks are pre-defined by the system administrator and stored on the maintenance task scripts 410 of the maintenance controller 110. At the step 402, the maintenance task list is analyzed by the WA 124. The maintenance task scripts 410 contain all the tasks that need to be carried out in order to keep the hosts in a proper running condition. For example, the tasks may include upgrading a software package or a dependency library. The task list is analyzed for, for example, severity, impact and duration. Based on the tasks to be carried out, the WA 124 further estimates the duration of each maintenance task to be executed to render a total time to complete the entire maintenance process. The WA 124 notifies and alerts the system administrator of the total time at step 404 and, at the same time, proceeds to determine at step 405 whether any task in the list of maintenance tasks requires virtual machine migration. When no virtual machine migration is needed, the WA 124 executes the maintenance tasks as scheduled on the task list at step 407. Typically, virtual machine migration is required when any of the maintenance tasks requires the host to be shut down or rebooted. When virtual machine migration is required at the step 405, the WA 124 first instructs the MA 112 at the maintenance controller 110 to migrate the virtual machines on the host at step 406. Once the migration is completed and in order, the WA 124 automates the maintenance according to the scripts at the step 407.
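The workflow of FIG. 4 may be sketched as follows, for illustration only. The sketch assumes that each task carries an administrator-supplied duration estimate and a flag indicating whether it shuts down or reboots the host; the Task fields, the function run_workflow and the callback names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    est_minutes: int      # estimated duration, assumed to be supplied per task
    needs_restart: bool   # True if the task shuts down or reboots the host


def run_workflow(
    tasks: List[Task],
    notify: Callable[[int], None],
    migrate: Callable[[], None],
    execute: Callable[[Task], None],
) -> int:
    """Estimate the total time, migrate virtual machines if needed, then run the tasks."""
    total = sum(t.est_minutes for t in tasks)   # step 403
    notify(total)                               # step 404
    if any(t.needs_restart for t in tasks):     # step 405
        migrate()                               # step 406
    for t in tasks:                             # step 407
        execute(t)
    return total


events = []
tasks = [Task("upgrade package", 10, False), Task("upgrade host kernel", 25, True)]
total = run_workflow(
    tasks,
    notify=lambda minutes: events.append(f"notify {minutes}"),
    migrate=lambda: events.append("migrate"),
    execute=lambda t: events.append(f"run {t.name}"),
)
print(total)   # 35
print(events)  # ['notify 35', 'migrate', 'run upgrade package', 'run upgrade host kernel']
```

Migration is placed before the first task so that no virtual machine remains on the host when a restart occurs.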

[0028] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations, and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims

1. A system (100) for managing system maintenance of a plurality of hosts accessible by client devices (105, 107, 109) over a cloud-computing network, the system comprising:

a monitoring agent (122) deployed on each of the hosts, operably monitors and records system activities of the hosts to determine if the host requires a maintenance;

a workflow agent (124) deployed on each of the hosts, operably automating the maintenance tasks at a suitable time;

a maintenance controller (110) for supervising the overall maintenance of the plurality of hosts;

a migration agent (112) deployed on the maintenance controller (110), the migration agent (112) is adapted for migrating virtual machines of the respective hosts when a migration is required for shutting down the hosts;

an analyze agent (114) deployed on the maintenance controller (110), the analyze agent (114) operatively analyzes the system activities of the respective hosts collected by the monitoring agent (122) to determine the suitable time to execute the maintenance tasks.

2. The system (100) according to claim 1, wherein the analyze agent (114) operatively determines a usage pattern from the recorded system activities of the host to identify a period where the system activities are at the lowest, and schedules the suitable time at the period.

3. The system (100) according to claim 2, wherein the system activities include CPU usage, memory usage, disk I/O usage, and network transmissions.

4. The system (100) according to claim 3, wherein the time durations of each system activity at its lowest are compared with one another to identify an overlapping period of the time durations, and the overlapping period is scheduled as the suitable time for executing the maintenance tasks.

5. The system (100) according to claim 4, wherein when no overlapping period is identified, the system schedules the suitable time based on a maintenance attribute priority file set by the system administrator.

6. The system (100) according to claim 1, wherein the workflow agent (124) automates the maintenance tasks based on a maintenance task list configured for the respective hosts by the system administrator.

7. The system (100) according to claim 1, wherein the workflow agent (124) operationally determines a mean time to recovery based on the scheduled maintenance task list; and instructs the migration agent (112) to migrate all available virtual machines when needed, before carrying out the maintenance tasks.

8. A method of managing system maintenance of a plurality of hosts accessible by client devices over a cloud-computing network, the method comprising: deploying a monitoring agent (122) and a workflow agent (124) on the respective hosts; providing a maintenance controller (110) having a migration agent (112) and an analyze agent (114); monitoring and recording (201) system activities by the monitoring agent (122) from the respective hosts for determining (205) if maintenance is required on the host; analyzing (207) a suitable time for executing maintenance tasks through the analyze agent (114) based on the system activities; and executing maintenance tasks (209) automatically at the suitable time, wherein when it is required, instructing the migration agent (112) to migrate (406) virtual machines residing on the host.

9. The method of claim 8, further comprising:

determining (301) a usage pattern from the recorded system activities of the host; identifying (302) a period where the system activities are at the lowest by identifying a time duration of each activity at its lowest and comparing the time durations with one another to identify an overlapping period of the time durations; and scheduling (305) a suitable time at that overlapping period,

wherein the system activities include CPU usage, memory usage, disk I/O usage, and network transmissions.

10. The method of claim 9, further comprising:

analysing (402) maintenance tasks for carrying out over the host to be maintained;

determining (403) mean time to recovery;

determining (405) through the maintenance tasks if restarting or shutting down of the host is required to determine if the migration of virtual machines is required.