WO2024213056A1 - Method for controlling high-performance computing cluster, and electronic device and storage medium - Google Patents
- Publication number
- WO2024213056A1 (PCT/CN2024/087280)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- job
- computing node
- target
- node
- computing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/61—Installation
- G06F8/63—Image based installation; Cloning; Build to order
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/65—Updates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/4557—Distribution of virtual machine instances; Migration and load balancing
Definitions
- the present application relates to the field of control of computing clusters, and in particular to a control method, electronic device and storage medium for a high-performance computing cluster.
- Traditional high-performance computing clusters (HPC clusters) are mainly composed of login nodes, scheduling nodes, and computing nodes. After the scheduler on the scheduling node receives a job submission request, it selects a suitable computing node from the HPC cluster to run the job. Once the job starts running on the computing node, it is difficult for the scheduler to perceive the computing node where the job is located. If a computing node fails, the entire job fails.
- the embodiments of the present application provide a control method, an electronic device, and a storage medium for a high-performance computing cluster, so as to at least solve the technical problem of low reliability of the high-performance computing cluster in the related art.
- a control method for a high-performance computing cluster including: when a scheduling node receives a job migration request, obtaining the job status of a target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster, and determines a target process on the first computing node, wherein the target job is running on the first computing node, and the target process is a process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- a control method for a high-performance computing cluster including: a first computing node receives migration information sent by a scheduling node, wherein a target job is running on the first computing node, the migration information is sent by the scheduling node when receiving a job migration request and determining that the job status of the target job is running, and the job migration request is used to migrate the target job; based on the migration information, the first computing node dumps a target process image corresponding to the target job to a preset storage device.
- a high-performance computing cluster including: a computing node for running jobs; a preset storage device for storing process images; a scheduling node connected to the computing node, and used to obtain the job status of the target job when a job migration request is received, and when the job status is running, determine the first computing node and determine the target process on the first computing node, wherein the job migration request is used to migrate the target job, the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the first computing node is connected to the preset storage device, and is used to dump the target process image to the preset storage device.
- an electronic device including: a memory storing an executable program; and a processor for running the program, wherein the program executes any one of the methods in the above embodiments when running.
- a computer-readable storage medium including a stored executable program, wherein when the executable program is running, the device where the storage medium is located is controlled to execute any one of the methods in the above embodiments.
- the scheduling node obtains the job status of the target job when receiving the job migration request, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster, and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process corresponding to the target job on the first computing node; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to the preset storage device, so as to achieve the purpose of improving the reliability of the high-performance computing cluster.
- the computing nodes in the high-performance cluster can be checked, and the first computing node running the target job can be determined, and the target process image on the first computing node can be dumped to the preset storage device, so as to achieve the dumping of the target process corresponding to the target job without the user's perception, and the target job on the potential fault node can be easily migrated to reduce the risk of job failure, thereby improving the reliability of the target job operation and improving the reliability of the high-performance computing cluster.
- the technical problem of low reliability of the high-performance computing cluster in the related technology is solved.
- FIG1 is a hardware structure block diagram of a computer terminal (or mobile device) for implementing a control method for a high-performance computing cluster according to an embodiment of the present application;
- FIG2 is a block diagram of a computing environment according to an embodiment of the present application.
- FIG3 is a structural block diagram of a service grid according to an embodiment of the present application.
- FIG4 is a flow chart of a control method for a high performance computing cluster according to Embodiment 1 of the present application.
- FIG5 is a schematic diagram of a traditional high performance computing cluster according to an embodiment of the present application.
- FIG6 is a schematic diagram of a control structure of a high performance computing cluster according to an embodiment of the present application.
- FIG. 7 is a flow chart of a high performance computing cluster performing a dump operation process to persistent storage according to an embodiment of the present application.
- FIG. 8 is a flowchart of restoring a high performance computing cluster job from a dumped process image according to an embodiment of the present application.
- FIG9 is a schematic diagram of a control structure of another high performance computing cluster according to an embodiment of the present application.
- FIG10 is a schematic diagram of a control structure of another high performance computing cluster according to an embodiment of the present application.
- FIG11 is a flow chart of a control method for a high performance computing cluster according to Embodiment 2 of the present application.
- FIG12 is a schematic diagram of a high performance computing cluster according to Embodiment 2 of the present application.
- FIG13 is a schematic diagram of a control device for a high performance computing cluster according to Embodiment 4 of the present application.
- FIG14 is a schematic diagram of a control device for a high performance computing cluster according to Embodiment 5 of the present application.
- FIG. 15 is a structural block diagram of a computer terminal according to an embodiment of the present application.
- Job migration: a job can be migrated from one node computer to another node computer that has a lighter workload or is better suited to processing the job.
- High Performance Computing (HPC) clusters: clusters that can handle complex computing problems simultaneously by connecting multiple machines.
- HPC job scheduling: job scheduling performed by a scheduler on a high-performance computing cluster.
- Mirror dump process: the information storage process that produces a mirror view of the same data on any two or more disks.
- a control method for a high-performance computing cluster is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in an order different from that shown here.
- FIG1 is a hardware structure block diagram of a computer terminal (or mobile device) for implementing a control method for a high-performance computing cluster according to an embodiment of the present application.
- a computer terminal 10 may include one or more processors 102 (shown in the figure as 102a, 102b, ..., 102n), a memory 104 for storing data, and a transmission module 106 for communication functions, wherein the processor 102 may include but is not limited to a processing device such as a microcontroller unit (MCU) or a programmable logic device such as a field-programmable gate array (FPGA).
- the computer terminal 10 may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the BUS bus), a network interface, a power supply and/or a camera.
- FIG. 1 is only illustrative and does not limit the structure of the above-mentioned electronic device.
- the computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
- the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits".
- the data processing circuits may be embodied in whole or in part as software, hardware, firmware, or any other combination thereof.
- the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other components in the computer terminal 10 (or mobile device).
- the data processing circuit acts as a processor control (e.g., selection of a variable resistor terminal path connected to an interface).
- the memory 104 can be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the control method of the high-performance computing cluster in the embodiment of the present application.
- the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, the control method of the high-performance computing cluster described above is realized.
- the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the computer terminal 10 via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the transmission device 106 is used to receive or send data via a network.
- the specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10.
- the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
- the transmission device 106 can be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
- the display may be, for example, a touch screen liquid crystal display (LCD), which enables a user to interact with a user interface of the computer terminal 10 (or mobile device).
- the hardware structure block diagram shown in FIG1 can be used not only as an exemplary block diagram of the above-mentioned computer terminal 10 (or mobile device), but also as an exemplary block diagram of the above-mentioned server.
- FIG2 shows an embodiment of using the computer terminal 10 (or mobile device) shown in FIG1 as a computing node in a computing environment 201 in a block diagram.
- FIG2 is a structural block diagram of a computing environment according to an embodiment of the present application. As shown in FIG2, the computing environment 201 includes multiple computing nodes (such as servers), illustrated in the figure as 210-1, 210-2, ..., running on a distributed network.
- the computing nodes all contain local processing and memory resources, and the terminal user 202 can remotely run applications or store data in the computing environment 201.
- the application can be provided as multiple services 220-1, 220-2, 220-3 and 220-4 in the computing environment 201, representing services “A”, “D”, “E” and “H” respectively.
- the end user 202 can provide and access services through a web browser or other software application on the client.
- the end user 202's provision and/or request can be provided to the ingress gateway 230.
- the ingress gateway 230 may include a corresponding agent to handle the provision of and/or requests for services (one or more services provided in the computing environment 201).
- Services are provided or deployed based on various virtualization technologies supported by the computing environment 201.
- services can be provided based on virtual machine (VM)-based virtualization, container-based virtualization, and/or similar methods.
- Virtual machine-based virtualization simulates a real computer by initializing a virtual machine that executes programs and applications without directly touching any actual hardware resources. Whereas a virtual machine virtualizes the machine, container-based virtualization starts a container that virtualizes the operating system (OS), so that multiple workloads can run on a single operating system instance.
- a service 220-2 can be equipped with one or more Pods 240-1, 240-2, ..., 240-N (collectively referred to as Pods).
- a Pod can include a proxy 245 and one or more containers 242-1, 242-2, ..., 242-M (collectively referred to as containers).
- One or more containers in a Pod process requests related to one or more corresponding functions of the service, and the proxy 245 typically controls network functions related to the service, such as routing, load balancing, etc.
- Other services can also be equipped with similar Pods.
- executing a user request from the end user 202 may require invoking one or more services in the computing environment 201, and executing one or more functions of one service may require invoking one or more functions of another service.
- for example, when service “A” 220-1 receives a user request of the end user 202 from the ingress gateway 230, service “A” 220-1 may call service “D” 220-2, and service “D” 220-2 may request service “E” 220-3 to execute one or more functions.
- the computing environment described above can be a cloud computing environment, where the allocation of resources is managed by the cloud service provider, allowing the development of functions without considering the implementation, adjustment or expansion of servers.
- the computing environment allows developers to execute code in response to events without building or maintaining complex infrastructure. Services can be divided into a set of functions that can be automatically and independently scaled, rather than expanding a single hardware device to handle potential loads.
- FIG3 shows a block diagram of an embodiment of using the computer terminal 10 (or mobile device) shown in FIG1 as a service grid.
- FIG3 is a structural block diagram of a service grid according to an embodiment of the present application.
- the service grid 300 is mainly used to facilitate secure and reliable communication between multiple microservices. Microservices refer to decomposing an application into multiple smaller services or instances and distributing them on different clusters/machines to run.
- microservices may include application service instance A and application service instance B, which form a functional application layer of the service grid 300.
- application service instance A runs in the form of container/process 308 on machine/workload container group 314 (Pod)
- application service instance B runs in the form of container/process 310 on machine/workload container group 316 (Pod).
- application service instance A may be a product query service
- application service instance B may be a product ordering service
- application service instance A and grid proxy (sidecar) 303 coexist in machine/workload container group 314, and application service instance B and grid proxy 305 coexist in machine/workload container group 316.
- Grid proxy 303 and grid proxy 305 form the data plane layer (dataplane) of service grid 300.
- Grid proxy 303 and grid proxy 305 run in the form of container/process 304 and container/process 306, respectively, and can receive request 312 for commodity query service, and grid proxy 303 and application service instance A can communicate bidirectionally, and grid proxy 305 and application service instance B can communicate bidirectionally.
- grid proxy 303 and grid proxy 305 can also communicate bidirectionally.
- the traffic of application service instance A is routed to the appropriate destination through grid proxy 303.
- the network traffic of application service instance B is routed to the appropriate destination through the grid proxy 305.
- the network traffic mentioned here includes but is not limited to Hypertext Transfer Protocol (HTTP), Representational State Transfer (REST), gRPC (a high-performance, general-purpose open-source remote procedure call framework), Redis (an open-source in-memory data structure store), etc.
- the function of extending the data plane layer can be achieved by writing a custom filter for the proxy (Envoy) in the service mesh 300.
- the service mesh proxy configuration can be to enable the service mesh to correctly proxy service traffic and achieve service intercommunication and service governance.
- Mesh proxy 303 and mesh proxy 305 can be configured to perform at least one of the following functions: service discovery, health checking, routing, load balancing, authentication and authorization, and observability.
- the service grid 300 also includes a control plane layer.
- the control plane layer may be a group of services running in a dedicated namespace, and these services are hosted by a hosted control plane component 301 in a machine/workload container group (machine/Pod) 302.
- the hosted control plane component 301 communicates bidirectionally with the grid agent 303 and the grid agent 305.
- the hosted control plane component 301 is configured to perform some control management functions.
- the hosted control plane component 301 receives telemetry data transmitted by the grid agent 303 and the grid agent 305, and can further aggregate the telemetry data.
- the hosted control plane component 301 can also provide a user-oriented application programming interface (API) to more easily manipulate network behavior and provide configuration data to the grid agent 303 and the grid agent 305.
- FIG4 is a flow chart of a control method for a high-performance computing cluster according to Embodiment 1 of the present application. As shown in FIG4, the method includes:
- Step S402 upon receiving the job migration request, the scheduling node obtains the job status of the target job.
- the job migration request is used to migrate the target job.
- the above-mentioned target job can be a task to be processed.
- the target job can be an order, product information, etc.
- the job can be patient information, medical images, etc.
- the target job may be a job on a node with relatively low utilization, or may be a job on a potential faulty computing node.
- the risk of job failure may be reduced by migrating the job on the potential faulty computing node.
- the above-mentioned job migration request may be generated according to the demand for monitoring the target job, or may be generated periodically, which is not limited here, and the job migration request of the target job may be generated according to the actual situation.
- the above-mentioned scheduling node may be a scheduler.
- the cluster administrator may generate a job migration request by triggering a checkpoint operation on the target job.
- the scheduler may check the job status of the target job to migrate the target job according to the job status.
- the job status of the target job can be checked to determine whether the job status of the target job is running. If the job status is that the target job is running, the target process of the target job can be dumped to facilitate monitoring of the target job.
- Step S404 When the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster, and determines a target process on the first computing node.
- the target job is running on the first computing node, and the target process is a process on the first computing node corresponding to the target job.
- the above-mentioned high-performance computing cluster can be used to represent a computing cluster that processes complex computing problems simultaneously by connecting multiple machines.
- the first computing node mentioned above may be one or more, which is not limited here.
- the scheduler can search the scheduling database for a list of computing nodes corresponding to the target job and a list of process identifiers (IDs) on the computing nodes, that is, determine the first computing node from the high-performance cluster and determine the target process on the first computing node.
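The lookup described in step S404 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `JobRecord` type, the in-memory `db` dictionary standing in for the scheduling database, and the state codes "R" (running) and "Q" (queued) are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """One row of a hypothetical scheduling database."""
    job_id: str
    state: str  # "R" = running, "Q" = queued (state codes are assumed)
    # computing node name -> process IDs of the job on that node
    processes: dict[str, list[int]] = field(default_factory=dict)


def find_checkpoint_targets(db: dict[str, JobRecord], job_id: str) -> dict[str, list[int]]:
    """Return {node: [pid, ...]} for a running job, or {} if it is not running."""
    record = db.get(job_id)
    if record is None or record.state != "R":
        return {}
    # Copy the lists so callers cannot mutate the database record by accident.
    return {node: list(pids) for node, pids in record.processes.items()}


# Example data: job-1 runs on two nodes, job-2 is still waiting.
db = {
    "job-1": JobRecord("job-1", "R", {"node-a": [4101, 4102], "node-b": [977]}),
    "job-2": JobRecord("job-2", "Q"),
}
targets = find_checkpoint_targets(db, "job-1")
```

Here `targets` maps each first computing node to the target processes that would receive the migration information; a job that is not running yields an empty mapping, so no dump is requested.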
- Step S406 The scheduling node sends migration information to the first computing node.
- the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- the above-mentioned preset storage device may be any pre-set storage device.
- the preset storage device may be a device including a persistent storage path, but is not limited thereto. This is only described as an example.
- a checkpoint-restart service in a high-performance cluster may be called to perform a checkpoint dump on a target process of a target job, and the target process may be dumped to a preset storage device.
- the checkpoint-restart service can return the result to the scheduler.
- the scheduler aggregates the results returned by each node in the node list.
- if every node reports success, the checkpoint operation of the target job is considered to be successfully completed.
- in that case, the scheduling node can update the job status, move the target job to the waiting queue, and set the conditions for the job to be scheduled again.
- otherwise, the scheduling node communicates with the nodes on which the dump succeeded so that they resume running the job, and can return the operation failure and related information to the administrator.
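The way the scheduler combines per-node results can be sketched as below. The sketch assumes that a checkpoint counts as successful only when every node in the node list reports success, and that otherwise the nodes that did succeed are told to resume the job; the function name, the boolean result format, and the "R" state code are hypothetical, while "C" follows the CheckPointed state named in this description.

```python
def finalize_checkpoint(node_results: dict[str, bool]) -> tuple[str, list[str]]:
    """Combine per-node dump results into (new_job_state, nodes_to_resume).

    node_results maps a computing node name to True if its dump succeeded.
    """
    if not node_results:
        raise ValueError("no node results to aggregate")
    if all(node_results.values()):
        # Every node dumped its processes: the job becomes CheckPointed ("C")
        # and nothing has to be resumed.
        return "C", []
    # At least one node failed: the job stays running ("R", assumed code) and
    # the nodes that already dumped successfully must resume the job.
    succeeded = sorted(node for node, ok in node_results.items() if ok)
    return "R", succeeded


all_ok = finalize_checkpoint({"node-a": True, "node-b": True})
partial = finalize_checkpoint({"node-a": True, "node-b": False})
```

Returning the list of nodes to resume, rather than resuming inside the function, keeps the aggregation step free of network calls and easy to test.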
- the utilization rate of the HPC cluster can be improved and the cluster throughput can be increased.
- the target job in the high-performance cluster can be migrated, so as to free up more available nodes for the subsequent jobs, and the jobs that could not be put into operation immediately on the scheduling queue can be put into operation as soon as possible, thereby increasing the number of jobs that can be executed by the cluster per unit time, thereby improving the cluster throughput and increasing efficiency.
- the failure probability of HPC jobs can be reduced.
- the running HPC jobs can be migrated to other nodes in the cluster in advance for nodes with potential failure risks, thereby reducing job failures caused by unexpected downtime or other failures.
- the power consumption of the HPC cluster can also be reduced, which is helpful for carbon neutrality.
- the jobs on the nodes with relatively low utilization on the HPC cluster can be migrated.
- the jobs can be concentrated on certain cabinets in the computer room, and then the unloaded nodes or cabinets and other equipment can be powered off to save power or put into hibernation, which helps to reduce power consumption.
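One way to pick the nodes whose jobs are candidates for consolidation is sketched below. The utilization values, the 0.2 threshold, and the function name are assumptions chosen for illustration; the description does not specify how "relatively low utilization" is measured.

```python
def nodes_to_drain(utilization: dict[str, float], threshold: float = 0.2) -> list[str]:
    """Return nodes that are lightly loaded but not idle.

    utilization maps a node name to a load fraction in [0, 1]. Idle nodes
    (0.0) have no jobs to migrate and can be powered off directly.
    """
    return sorted(node for node, load in utilization.items() if 0.0 < load < threshold)


candidates = nodes_to_drain({"n1": 0.9, "n2": 0.1, "n3": 0.0, "n4": 0.15})
```

Jobs on the returned nodes would be checkpointed and restored elsewhere, after which those nodes can be powered off or put into hibernation.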
- the scheduling node first obtains the job status of the target job when receiving the job migration request, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process corresponding to the target job on the first computing node; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to the preset storage device, so as to achieve the purpose of improving the reliability of the high-performance computing cluster.
- the computing nodes in the high-performance cluster can be checked, and the first computing node running the target job can be determined, and the target process image on the first computing node can be dumped to the preset storage device, so as to achieve the dumping of the target process corresponding to the target job without the user's perception, and the target job on the potential fault node can be easily migrated to reduce the risk of job failure, thereby improving the reliability of the target job operation and achieving the purpose of improving the reliability of the high-performance computing cluster.
- the technical problem of low reliability of the high-performance computing cluster in the related technology is solved.
- FIG5 is a schematic diagram of a traditional high-performance computing cluster according to an embodiment of the present application.
- after a cluster user logs in to the login node, the user can submit an HPC job to the job scheduling node of the cluster.
- an HPC job consists of a group of multi-machine parallel computing processes, which are controlled by the script and submission command used to submit the job.
- after the scheduler on the job scheduling node receives the job submission request, it selects a suitable computing node from the computer cluster to run the job, wherein multiple computing nodes in the computer cluster can share storage.
- apart from conventional process control and monitoring, the HPC scheduler cannot perceive the computing node where the job is located. If a computing node fails, the entire job fails.
- FIG6 is a schematic diagram of a control structure of a high performance computing cluster according to an embodiment of the present application. As shown in FIG6:
- the first step is that when the computing cluster is started, the computing nodes contained in the computing cluster will start the checkpoint-restart service, which implements the checkpoint-restart action for the process corresponding to the HPC job and provides an access interface or application programming interface (Application Programming Interface, referred to as API) to the scheduling node;
- the user logs in to the login node and submits the job to the HPC cluster.
- Cluster users submit jobs according to the normal process and can ignore the details of the job submission process.
- the user submits the job to the scheduler in the scheduling queue and waits for the HPC cluster scheduler to select a computing node to execute the job.
- the relevant algorithms, queues, and databases on the scheduling node need to be modified as follows:
- C: CheckPointed
- the dump time depends on the memory size occupied by the job process and the storage write speed, so jobs that occupy a large amount of memory take a long time to dump. A checkpointing state Ct (Checkpointing) can be added as needed to indicate that the dump is in progress.
- C t Checkpointing
- when the cluster administrator performs a checkpoint operation on a job, the job returns from the running queue to the waiting queue, and the conditions for resuming the job process on a computing node are set. A job in the C state is not restored until the cluster resources meet those conditions or the cluster administrator initiates scheduling, at which point the process corresponding to the job is restored on the computing node.
- Checkpoint operations can be performed through commands, interfaces, button controls, etc.
- the scheduling algorithm in FIG. 6 is extended with the communication and call process for the checkpoint-restart service on the computing nodes. It schedules a checkpointed job back to the waiting queue and sets it to the new state, sets the conditions for rescheduling the job to a computing node, resumes job execution when the conditions are met, and updates the job's computing node list and the corresponding process IDs.
- the job in the scheduling queue is dumped and then returned to the scheduling queue in a new state.
- the scheduling database in FIG. 6 can add the job states C and Ct and the storage path corresponding to the job.
- the job enters the running state and the computing node starts running the job.
- the computing node list of the job in the computer cluster and the process ID of the job on the computing node list can be determined.
- the scheduler updates the job status and stores the job-related information in the corresponding scheduling database.
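The information the scheduling database keeps for a job (state, computing node list, process IDs per node, and storage path of the dumped image) can be sketched as a small record type. This is only an illustration: the names `JobState`, `JobRecord`, and `mark_running`, and the labels used for the waiting and running states, are assumptions and do not come from the application; only the state names C and Ct are taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    WAITING = "W"          # hypothetical label for a job in the waiting queue
    RUNNING = "R"          # hypothetical label for a job in the running queue
    CHECKPOINTING = "Ct"   # dump in progress (state Ct in the text)
    CHECKPOINTED = "C"     # dump finished (state C in the text)


@dataclass
class JobRecord:
    job_id: str
    state: JobState = JobState.WAITING
    # computing node name -> process IDs of the job on that node
    node_pids: Dict[str, List[int]] = field(default_factory=dict)
    # persistent storage path of the dumped process image, if any
    image_path: Optional[str] = None


def mark_running(record: JobRecord, node_pids: Dict[str, List[int]]) -> JobRecord:
    """Record that the job started on the given nodes with the given process IDs."""
    record.state = JobState.RUNNING
    record.node_pids = {node: list(pids) for node, pids in node_pids.items()}
    return record
```

A real scheduler would persist such records in its own database; the in-memory dataclass is used here only to keep the sketch self-contained.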
- the cluster administrator triggers the job to perform a checkpoint operation.
- the scheduler checks the status of the job. If the job is running, the scheduler searches the scheduling database for the list of computing nodes corresponding to the job and the list of process IDs on those computing nodes. The scheduler then communicates with each node on the computing node list and calls the job process checkpoint interface (Job Progress CheckPoint) of the checkpoint-restart service.
- the checkpoint-restart service starts to checkpoint the process corresponding to the above job and dumps the process image to the specified persistent storage path. After the dump is completed, the checkpoint-restart service returns the result to the scheduler, and the scheduler integrates the results of each node in the computing node list.
- if every node in the job's computing node list returns success, the scheduling node updates the job status to C, moves the job from the running queue to the waiting queue, and sets the conditions for the job to be scheduled again; if only some nodes return success, or all nodes return failure, the scheduling node instructs the successful nodes to resume running the job and returns the operation failure and related information to the administrator.
- the HPC cluster administrator can schedule the job to execute on a new computing node by presetting conditions or directly controlling the scheduler.
- when the scheduler receives a request from the cluster administrator, or the preset conditions above are met, the scheduler communicates with the computing nodes in a new computing node list, which is either specified by the cluster administrator or determined by the scheduling algorithm, and calls the job process recovery interface (Job Progress Restart) of the checkpoint-restart service on those computing nodes.
- the checkpoint-restart service can restore the execution of the job on the computing node from the job process image on the corresponding storage path.
- the recovery process may also encounter errors, such as the process ID is already occupied, in which case the computing node returns failure to the scheduling node.
- after the scheduling node receives the results returned by the checkpoint-restart service on each node in the new computing node list, the job resumes if every node in the list returns success.
- the scheduler can then update the job status and the job database's new node information. If there is a computing node that fails to resume the job process, the scheduler stops the completed recovery process on other computing nodes and returns the corresponding information to the cluster administrator.
- FIG. 7 is a flow chart of a high performance computing cluster performing a dump operation process to persistent storage according to an embodiment of the present application. As shown in FIG. 7 , the method includes:
- Step S701 the scheduler receives a request to start a checkpoint
- Step S702 checking the job status
- Step S703 determine whether the job is running, if yes (Y), execute step S704, if no (N), execute step S717;
- Step S704 querying the job node list from the scheduling database
- Step S705 establishing a checkpoint-restart service communication call with each computing node
- Step S706 receiving the operation result returned by the computing node
- Step S707 determine whether every computing node returns that the operation is successful, if so, execute step S708, if not, execute step S710;
- Step S708 update the job status, adjust the job queue, and set the re-running conditions
- Step S709 returning to indicate successful completion of the operation
- Step S710 resume the job processes on the nodes that succeeded so that they continue running
- Step S711 return failure and end the operation
- Step S712 the checkpoint-restart service receives the request
- the received request may be a JobProgressCheckPoint request.
- Step S713 obtaining the process identifier and dump path
- Step S714 determine whether the job is running, if so, execute step S715, if not, execute step S706;
- Step S715 performing a checkpoint dump operation on the process
- Step S717 end the operation and return.
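The scheduler-side part of the flow in FIG. 7 (steps S701 to S711) can be sketched as follows. This is a minimal sketch under stated assumptions: `checkpoint_job`, `call_checkpoint`, and `call_resume` are hypothetical names, the two callables stand in for the remote calls to each node's checkpoint-restart service, and queue adjustment and re-running conditions from step S708 are omitted.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for remote calls to the checkpoint-restart service.
CheckpointCall = Callable[[str, List[int], str], bool]  # (node, pids, dump_path) -> success
ResumeCall = Callable[[str, List[int]], None]           # (node, pids)


def checkpoint_job(job: dict, call_checkpoint: CheckpointCall,
                   call_resume: ResumeCall) -> bool:
    """Checkpoint a job on all of its nodes; return True only if every node succeeds.

    `job` is a dict with the keys 'state', 'node_pids', and 'image_path'.
    """
    if job["state"] != "running":                     # S703: job is not running
        return False                                  # S717: end and return
    results: Dict[str, bool] = {}
    for node, pids in job["node_pids"].items():       # S704-S706: call each node
        results[node] = call_checkpoint(node, pids, job["image_path"])
    if all(results.values()):                         # S707: did every node succeed?
        job["state"] = "C"                            # S708: update the job status
        return True                                   # S709: report success
    for node, ok in results.items():                  # S710: undo the partial success
        if ok:
            call_resume(node, job["node_pids"][node])
    return False                                      # S711: report failure
```

The all-or-nothing check reflects the description above: a job is moved to state C only when every node in its computing node list dumps successfully, and nodes that did succeed are told to continue running otherwise.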
- FIG. 8 is a flow chart of restoring a high performance computing cluster job from a dumped process image according to an embodiment of the present application. As shown in FIG. 8, the method includes:
- Step S801 the cluster administrator controls the job recovery or checkpoint preset conditions to be met
- Step S802 the scheduler starts to execute the job process recovery
- Step S803 the scheduler runs the scheduling algorithm to obtain a list of new computing nodes where the job is restored;
- Step S804 establishing checkpoint-restart service communication calls with the different computing nodes
- Step S805 receiving the operation result returned by the computing node
- Step S806 determining whether the return operation of each computing node is successful, if so, executing step S807, if not, executing step S809;
- Step S807 updating the job status and adjusting the job queue
- Step S808 returning to indicate successful completion of the operation
- Step S809 stop the job processes that have already been restored on the nodes that succeeded
- Step S810 return failure and end the operation
- Step S811 the checkpoint-restart service receives the request
- Step S812 obtaining the job process image from the dump path
- Step S814 returning the operation result to the scheduling node.
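The scheduler-side part of the recovery flow in FIG. 8 (steps S802 to S810) mirrors the checkpoint flow and can be sketched in the same way. The names `restore_job`, `call_restart`, and `call_stop` are hypothetical, and the two callables stand in for the remote calls to the checkpoint-restart service on each new computing node.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for remote calls to the checkpoint-restart service.
RestartCall = Callable[[str, str], bool]  # (node, image_path) -> success
StopCall = Callable[[str], object]        # (node); return value is ignored


def restore_job(job: dict, new_nodes: List[str],
                call_restart: RestartCall, call_stop: StopCall) -> bool:
    """Restore a checkpointed job on new_nodes; return True only if all nodes succeed.

    `job` is a dict with the keys 'state', 'image_path', and 'nodes'.
    """
    results: Dict[str, bool] = {}
    for node in new_nodes:                            # S804-S805: call each new node
        results[node] = call_restart(node, job["image_path"])
    if results and all(results.values()):             # S806: did every node succeed?
        job["state"] = "running"                      # S807: update status and node list
        job["nodes"] = list(new_nodes)
        return True                                   # S808: report success
    for node, ok in results.items():                  # S809: stop already restored processes
        if ok:
            call_stop(node)
    return False                                      # S810: report failure
```

On a partial failure the job record is left untouched, so the job stays checkpointed and can be scheduled again later.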
- Figure 9 is a schematic diagram of the control structure of another high-performance computing cluster according to an embodiment of the present application. Compared with the method shown in Figure 6, the present application supplements three steps. First, during the HPC job submission phase, after the scheduler selects a candidate computing node from the computing cluster, before executing the job, the container for the job is first started in the computing node, and then the job process is executed in the container; secondly, when a scenario requiring HPC job migration is encountered during the cluster job execution phase.
- the dump container job process can be executed in different ways.
- One way is as shown in Figure 9.
- the container can be started first and then the HPC job process recovery operation can be run in the container;
- Figure 10 is a control structure diagram of another high-performance computing cluster according to an embodiment of the present application. As shown in Figure 10, another way is to restore the job process to a node without a container environment.
- the container environment can also be started first, and then the execution can be restored in the container environment.
- the control method of the high-performance computing cluster proposed in this application can facilitate cluster users or administrators to further control the migration of cluster jobs in the nodes in the computing cluster.
- the risk of job failure can be reduced by migrating jobs on potential faulty nodes.
- the cluster utilization rate can also be improved through appropriate migration.
- the power consumption of the entire cluster can be reduced by cooperating with measures such as cabinet nodes.
- the scheduling node includes a scheduler, a running queue and a queue to be scheduled, and the scheduling node sends migration information to the first computing node.
- the method also includes: the scheduler receives the dump result fed back by the first computing node, wherein the dump result is used to indicate whether the target process image is successfully dumped; when the dump result is that the target process dump is successful, the scheduler updates the job status of the target job to the migration status, and moves the target job from the running queue to the queue to be scheduled, wherein the running queue is used to store jobs whose job status is running, and the queue to be scheduled is used to store jobs whose job status is not running; the scheduler outputs a first prompt message, wherein the first prompt message is used to prompt that the migration of the target job is successful.
- the above-mentioned scheduler is used to move the job according to the dump result fed back by the computing node.
- the above-mentioned job status is used to indicate that the target job is being executed in the first computing node.
- the scheduler can determine whether the target process image is dumped successfully based on the dump result fed back by the first computing node.
- the scheduler can update the job status of the target job to the migration status, thereby terminating the operation of the target job.
- the scheduler can execute the operation of moving the target job from the running queue to the to-be-scheduled queue.
- when the scheduler outputs the first prompt information, it means that the target job has been migrated successfully.
- the scheduler may store the relevant information of the updated status in the scheduling database corresponding to the target job.
- the migration status includes: migration completed and migration in progress.
- the scheduler updates the job status of the target job to the migration status, including: the scheduler obtains the amount of memory occupied by the target process and the storage speed corresponding to the preset storage device; the scheduler updates the job status to migration completed or migration in progress based on the amount of memory and the storage speed.
- the dump time depends on the amount of memory occupied by the job process and on the storage write speed; jobs that occupy a large amount of memory require a longer time. Therefore, the scheduler can obtain the amount of memory occupied by the target process and the storage speed corresponding to the preset storage device, and determine the time required to complete the migration based on the amount of memory and the storage speed, so as to update the job status to migration completed or migration in progress according to the migration time.
- the scheduler updates the job status to migration completed or migration in progress based on the memory amount and the storage speed, including: when the memory amount is greater than the preset memory amount and the storage speed is less than the preset speed, the scheduler updates the job status to migration in progress; when the memory amount is less than or equal to the preset memory amount, or the storage speed is greater than or equal to the preset speed, the scheduler updates the job status to migration completed.
- the above-mentioned preset memory amount may be a memory amount pre-set according to the memory of a preset storage device, and the preset memory amount may also be set according to actual conditions.
- the above-mentioned preset speed can be a speed preset according to the storage speed of a preset storage device, and the preset speed can also be set according to actual conditions.
- when the amount of memory is greater than the preset memory amount and the storage speed is less than the preset speed, it means that the target process of the target job has not yet been migrated, because the memory amount is large and the speed is low, and the job status can be updated to migration in progress; when the amount of memory is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed, it means that the target process of the target job has been migrated, because the memory amount is small and the speed is high, and the job status can be updated to migration completed; when the amount of memory is less than or equal to the preset memory amount and the storage speed is less than the preset speed, it also means that the target process of the target job has been migrated, because the memory amount is small even though the speed is low, and the job status can be updated to migration completed.
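The threshold rule described above can be written as a short function. This is a sketch, not the application's implementation: the function names, units, and threshold values are assumptions chosen for illustration, and the dump-time estimate assumes that storage write throughput is the only bottleneck.

```python
def estimated_dump_seconds(memory_bytes: int, write_bytes_per_s: float) -> float:
    """Rough dump time, assuming storage write throughput is the bottleneck."""
    if write_bytes_per_s <= 0:
        raise ValueError("write speed must be positive")
    return memory_bytes / write_bytes_per_s


def migration_status(memory_bytes: int, write_bytes_per_s: float,
                     preset_memory: int, preset_speed: float) -> str:
    """Apply the rule from the text: large memory AND slow storage means still migrating."""
    if memory_bytes > preset_memory and write_bytes_per_s < preset_speed:
        return "migration in progress"
    # Memory at or below the preset amount, or storage at or above the preset speed.
    return "migration completed"
```

For example, with a hypothetical preset of 16 GiB and 500 MB/s, a 64 GiB process written at 100 MB/s is reported as "migration in progress", while the same process written at 1 GB/s is reported as "migration completed".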
- the job status during the migration process may be marked with different identifiers, thereby facilitating the scheduler to update the job status.
- when there are multiple first computing nodes, the method also includes: when the dump result of at least one first computing node is a failure to dump the target process, the scheduler determines a second computing node among the multiple first computing nodes, wherein the dump result of the second computing node is a successful dump of the target process; the scheduler sends a resume operation request to the second computing node, wherein the resume operation request is used to request the second computing node to continue running the target process; the scheduler outputs a second prompt message, wherein the second prompt message is used to prompt that the target job migration has failed.
- the prompting method of the second prompt message includes, but is not limited to, text, voice, and image, and can be determined according to actual needs.
- the scheduler can determine a second computing node from the multiple first computing nodes and send a resume operation request to it, which prevents the second computing node from being affected by the migration failure and allows it to continue running the target process.
- after the scheduling node sends the migration information to the first computing node, the method also includes: when the target job meets the recovery conditions or a scheduling request is received, the scheduling node determines a third computing node from the high-performance computing cluster, wherein the scheduling request is used to schedule the target job to the third computing node; the scheduling node sends recovery information to the third computing node, wherein the recovery information is used to control the third computing node to read the target process from a preset storage device and run the target process.
- the above-mentioned recovery condition may be a pre-set condition, wherein the recovery condition may be set in a scheduling algorithm, but is not limited thereto.
- the above scheduling request can be generated by the cluster administrator to trigger the job to perform checkpoint operation.
- the third computing node mentioned above may be a computing node in a new computing node list specified by a cluster administrator, or may be a computing node in a new computing node list calculated by a scheduling algorithm.
- after receiving the scheduling request, the scheduler can determine the third computing node, and the target process can be read from the preset storage device by calling the first preset interface, so as to restore the operation of the target process on the third computing node.
- the scheduling node determines the third computing node from the high-performance computing cluster, including: when the target job meets the recovery condition, the scheduling node determines the third computing node from the high-performance computing cluster based on the scheduling algorithm, wherein the scheduling algorithm is used to allocate the target resources of the third computing node to the target job so that the target job is executed by the target resources; when receiving a scheduling request, the scheduling node determines the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
- the third computing node may be re-determined through the existing scheduling algorithm of the cluster, and resources may be allocated to the third computing node, so that the target job may be executed by the target resources.
- the above scheduling algorithm can be used to determine the communication and calling process of the checkpoint-restart service. It can schedule the jobs that have completed the checkpoint back to the waiting queue and set them to a new state, and set the conditions for rescheduling them back to the computing node. When the conditions are met, the job execution is resumed.
- the scheduling algorithm can also update the job and the computing node list and the corresponding process ID.
- the computing node in the new computing node list may be calculated according to a scheduling algorithm, so as to determine that the computing node is the third computing node.
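The two ways of determining the third computing node (taking the nodes named in a scheduling request, or letting the scheduling algorithm allocate resources) can be sketched as below. The allocation policy shown, which picks the nodes with the most free cores first, is purely a placeholder assumption; the application does not specify the scheduling algorithm, and `select_restore_nodes` and its parameters are hypothetical names.

```python
from typing import Dict, List, Optional


def select_restore_nodes(free_cores: Dict[str, int], cores_needed: int,
                         requested: Optional[List[str]] = None) -> List[str]:
    """Return the computing nodes on which a checkpointed job should be restored."""
    if requested is not None:
        # A scheduling request names the nodes explicitly.
        return list(requested)
    # Placeholder policy: most free cores first, ties broken by node name.
    chosen: List[str] = []
    remaining = cores_needed
    for node, free in sorted(free_cores.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining <= 0:
            break
        if free > 0:
            chosen.append(node)
            remaining -= free
    if remaining > 0:
        raise RuntimeError("not enough free resources to restore the job")
    return chosen
```

Raising an error when resources are insufficient corresponds to the job simply staying in the waiting queue until the recovery condition is met.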
- the scheduling node includes a scheduler, a running queue and a queue to be scheduled.
- the method also includes: the scheduler receives the recovery result fed back by the third computing node, wherein the recovery result is used to characterize whether the target process has recovered successfully; when the recovery result is that the target process has recovered successfully, the scheduler updates the job status of the target job to running, and moves the target job from the queue to be scheduled to the running queue; the scheduler outputs a third prompt message, wherein the third prompt message is used to prompt that the target job has recovered successfully.
- the scheduler receives the recovery result fed back by the third computing node. If the recovery result is that the target process has been successfully recovered, the scheduler can update the job status of the target job to running, so as to facilitate determining the job status of the target job. At this time, the target job can be moved from the to-be-scheduled queue to the running queue, and the scheduler can output a third prompt message to facilitate the user to understand that the target job has been successfully recovered.
- when there are multiple third computing nodes, the method also includes: when the recovery result of at least one third computing node is that the target process recovery fails, the scheduler determines a fourth computing node among the multiple third computing nodes, wherein the recovery result of the fourth computing node is that the target process recovery is successful; the scheduler sends a stop operation request to the fourth computing node, wherein the stop operation request is used to request the fourth computing node to stop running the target process; the scheduler outputs a fourth prompt message, wherein the fourth prompt message is used to prompt that the target job recovery fails.
- the target process in the fourth computing node has been successfully restored, that is, the action of restoring the target process has been completed in the fourth computing node.
- when the recovery result of at least one of the multiple third computing nodes is that the target process recovery failed, it is necessary to determine the fourth computing node, on which the target process has been recovered successfully, and to stop the target process on the fourth computing node so that the recovery status of the target process is consistent across the third computing nodes.
- the fourth prompt information can be output through the scheduler so as to prompt the user that the target job recovery has failed through the fourth prompt information.
- the scheduling algorithm and strategy of the job system can be improved, so that the jobs submitted by users to the HPC cluster can achieve the user-imperceptible migration function, thereby benefiting the HPC cluster system in operation and maintenance, fault handling, cluster power consumption, etc.
- FIG. 11 is a flow chart of a method for controlling a high performance computing cluster according to Embodiment 2 of the present application. As shown in FIG. 11 , the method may include the following steps:
- Step S1102 The first computing node receives migration information sent by the scheduling node.
- the target job is running on the first computing node, and the migration information is sent by the scheduling node when it receives a job migration request and determines that the job status of the target job is running.
- the job migration request is used to migrate the target job.
- Step S1104 the first computing node dumps the target process image corresponding to the target job to a preset storage device based on the migration information.
- the migration information includes at least: process information of the target process and a storage path of a preset storage device.
- the first computing node dumps the target process image corresponding to the target job to the preset storage device based on the migration information, including: the first computing node determines whether there is a target process on the first computing node based on the process information; when the target process exists on the first computing node, the first computing node dumps the target process image to the preset storage device based on the storage path.
- the first computing node obtains the migration information sent by the scheduling node through a first preset interface.
- the above-mentioned first preset interface can be Job Progress CheckPoint, wherein the input information of the first preset interface can be a set of job process IDs and a storage path for the job process image, the return information of the first preset interface can be success or failure, and the action of the first preset interface can be to checkpoint the corresponding processes and dump them to the specified storage.
- the process information of the target process mentioned above may be the name, data, etc. of the target process.
- the storage path of the above-mentioned preset storage device may be a path name, path information, etc. of the storage path, wherein the storage path of the preset storage device may be a persistent storage path.
- the first computing node can obtain the migration information sent by the scheduling node through the first preset interface, and can determine the process information of the target process and the storage path of the preset storage device through the migration information, and can determine whether there is a target process of the target job on the first computing node based on the process information. If there is a target process, it means that the process of the target job has not ended, and the target process image can be dumped to the preset storage device; if there is no target process, it means that the process of the target job has ended, and there is no need to dump the target process image to the preset storage device.
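The node-side handling described above (check whether the target process still exists, and dump it only if it does) can be sketched as follows. The handler name, the three result strings, and the `dump` callable are assumptions made for illustration; `dump` stands in for whatever checkpoint tool the node actually uses. The existence check relies on POSIX semantics of `os.kill(pid, 0)` and should not be used on Windows.

```python
import os
from typing import Callable, Iterable


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)          # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True              # the process exists but belongs to another user
    return True


def handle_checkpoint(pids: Iterable[int], dump_path: str,
                      dump: Callable[[int, str], bool],
                      exists: Callable[[int], bool] = pid_exists) -> str:
    """Dump every live target process to dump_path and report the outcome."""
    live = [pid for pid in pids if exists(pid)]
    if not live:
        return "no-process"      # the job process has ended; nothing to dump
    for pid in live:
        if not dump(pid, dump_path):
            return "failed"
    return "dumped"
```

Passing `exists` and `dump` as parameters keeps the sketch testable without a real checkpoint tool installed.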
- the method also includes: the third computing node receives the recovery information sent by the scheduling node, wherein the recovery information is sent by the scheduling node when the target job meets the recovery conditions or receives the scheduling request; the third computing node reads the target process from the preset storage device; the third computing node runs the target process.
- the third computing node obtains the recovery information sent by the scheduling node through the second preset interface.
- the above-mentioned second preset interface can be Job Progress Restart, wherein the input information of the second preset interface can be the job image path, the return information of the second preset interface can be success or failure, and the action of the second preset interface can be restoring the dumped job image to the computing node.
- the third computing node can obtain the recovery information sent by the scheduling node through the second preset interface, and the third computing node can actively read the target process from the preset storage device according to the storage path to facilitate restoring the target process of the target job to the third computing node.
- the third computing node running the target process includes: the third computing node reads the process information of the target process from the preset storage device; the third computing node determines whether a process corresponding to the process information already exists; if such a process exists, the third computing node stops running the target process; if no such process exists, the third computing node runs the target process.
- the third computing node can read the process information of the target process from the preset storage device, and can determine from it whether the process information of the target process (for example, its process ID) is already occupied on the node. If a process corresponding to the process information exists, the process information is occupied, and the third computing node stops running the target process. If no process corresponding to the process information exists, the process information is not occupied, and the third computing node runs the target process.
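The conflict check on the third computing node can be sketched as a small handler: restoration is refused when any process ID recorded in the image is already in use on the node, and attempted otherwise. The names `handle_restart`, `restore`, and `pid_in_use` are hypothetical, and the callables stand in for the node's actual restore tool and process lookup.

```python
from typing import Callable, Iterable, List, Tuple


def handle_restart(image_pids: Iterable[int],
                   restore: Callable[[], bool],
                   pid_in_use: Callable[[int], bool]) -> Tuple[bool, List[int]]:
    """Restore the dumped job unless one of its recorded PIDs is already occupied.

    Returns (success, conflicting_pids).
    """
    conflicts = [pid for pid in image_pids if pid_in_use(pid)]
    if conflicts:
        # A process with the same ID already exists: report failure to the scheduler.
        return False, conflicts
    return bool(restore()), []
```

Returning the conflicting IDs alongside the failure gives the scheduling node something concrete to pass on to the cluster administrator.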
- user information including but not limited to user device information, user personal information, etc.
- data including but not limited to data used for analysis, stored data, displayed data, etc.
- the method according to the above embodiment can be implemented by means of software plus a necessary general hardware platform, or of course by hardware.
- the technical solution of this application, or the part of it that contributes to the prior art, can be embodied in the form of a software product.
- the computer software product is stored in a storage medium (such as ROM/RAM, disk, CD), including a number of instructions for enabling a terminal device (which can be a mobile phone, computer, server, or network device, etc.) to execute the methods of various embodiments of the present application.
- a high-performance computing cluster for implementing the control method of the above-mentioned high-performance computing cluster is also provided.
- Figure 12 is a schematic diagram of a high-performance computing cluster according to Example 2 of the present application. As shown in Figure 12, the high-performance computing cluster 1200 includes: a computing node 1202, a preset storage device 1204, and a scheduling node 1206.
- the computing node 1202 is used to run jobs; the preset storage device 1204 is used to store processes; the scheduling node 1206 is connected to the computing node, and is used to obtain the job status of the target job when a job migration request is received, and when the job status is running, determine the first computing node and determine the target process on the first computing node, wherein the job migration request is used to migrate the target job, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the first computing node is connected to the preset storage device, and is used to store the target process image to the preset storage device.
- the first computing node mentioned above may be a computing node where the target job is located among the computing nodes.
- the above-mentioned preset storage device may be used to store one or more processes, wherein the target process may be a process corresponding to the target job stored in the preset storage device.
- the computing node is used to run the job;
- the preset storage device is used to store the process image;
- the scheduling node is connected to the computing node, and is used to obtain the job status of the target job when receiving the job migration request, and when the job status is running, determine the first computing node, and determine the target process on the first computing node, wherein the job migration request is used to migrate the target job, wherein the target job is running on the first computing node, and the target process is the process corresponding to the target job on the first computing node;
- the first computing node is connected to the preset storage device, and is used to store the target process image to the preset storage device, so as to achieve the purpose of improving the reliability of the high-performance computing cluster.
- the computing nodes in the high-performance cluster can be checked, and the first computing node running the target job can be determined, and the target process image on the first computing node can be dumped to the preset storage device, so as to achieve the dumping of the target process corresponding to the target job without the user's perception, and the target job on the potential fault node can be easily migrated to reduce the risk of job failure, thereby improving the reliability of the target job operation and improving the reliability of the high-performance computing cluster.
- the technical problem of low reliability of the high-performance computing cluster in the related technology is solved.
- FIG. 13 is a schematic diagram of a control device for a high-performance computing cluster according to Example 4 of the present application. As shown in Figure 13, the device 1300 includes: an acquisition module 1302, a determination module 1304, and a sending module 1306.
- the acquisition module is used to obtain the job status of the target job when the scheduling node receives the job migration request, wherein the job migration request is used to migrate the target job;
- the determination module is used to determine the first computing node from the high-performance computing cluster through the scheduling node when the job status is running, and determine the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node that corresponds to the target job;
- the sending module is used to send migration information to the first computing node through the scheduling node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- the acquisition module 1302, the determination module 1304, and the sending module 1306 correspond to steps S402 to S406 in Example 1, and the three modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the contents disclosed in the above-mentioned Example 1.
- the above-mentioned modules or units may be hardware components or software components stored in a memory (e.g., memory 104) and processed by one or more processors (e.g., processors 102a, 102b, ..., 102n), and the above-mentioned modules may also be part of the device and may be run in the computer terminal 10 provided in Example 1.
- the scheduling node includes a scheduler, a running queue and a queue to be scheduled, and the device also includes: a receiving module, an updating module, and an output module.
- the receiving module is used to receive the dump result fed back by the first computing node through the scheduler, wherein the dump result is used to indicate whether the target process has been dumped successfully;
- the updating module is used to update the job status of the target job to the migration status when the dump result is that the target process has been dumped successfully, and move the target job from the running queue to the to-be-scheduled queue, wherein the running queue is used to store jobs whose job status is running, and the to-be-scheduled queue is used to store jobs whose job status is not running;
- the output module is used for the scheduler to output the first prompt information, wherein the first prompt information is used to prompt that the migration of the target job is successful.
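The queue handling described by the receiving, updating, and output modules can be sketched as follows. The class `Scheduler`, its attribute names, and the message text are illustrative assumptions; only the two queues and the status change come from the text.

```python
from collections import deque


class Scheduler:
    """Toy scheduler holding the two queues named in the text."""

    def __init__(self, running_jobs):
        self.running_queue = deque(running_jobs)  # jobs whose status is running
        self.to_be_scheduled_queue = deque()      # jobs whose status is not running
        self.job_status = {job_id: "running" for job_id in running_jobs}

    def on_dump_result(self, job_id, dumped_ok):
        """React to the dump result fed back by the first computing node."""
        if not dumped_ok:
            return None  # the failure path is handled separately
        # Updating module: change the status and move the job between queues.
        self.job_status[job_id] = "migration"
        self.running_queue.remove(job_id)
        self.to_be_scheduled_queue.append(job_id)
        # Output module: the first prompt message.
        return f"migration of job {job_id} succeeded"
```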
- the migration status includes: migration completed and migration in progress.
- the update module is also used to obtain the amount of memory occupied by the target process and the storage speed corresponding to the preset storage device through the scheduler.
- the update module is also used to update the job status to migration completed or migration in progress based on the memory amount and storage speed through the scheduler.
- the update module is also used to update the job status to migration in progress through the scheduler when the memory amount is greater than the preset memory amount and the storage speed is less than the preset speed; the update module is also used to update the job status to migration completed through the scheduler when the memory amount is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed.
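The status rule can be written as a small decision function. The text defines only two of the four threshold combinations; how the mixed cases (large memory with fast storage, or small memory with slow storage) are handled is not stated, so the fallback below is an assumption chosen to be conservative. The function and parameter names are hypothetical.

```python
from enum import Enum


class MigrationStatus(Enum):
    IN_PROGRESS = "migration in progress"
    COMPLETED = "migration completed"


def migration_status(memory_amount, storage_speed, preset_memory, preset_speed):
    """Choose the migration status from the process size and storage speed."""
    if memory_amount > preset_memory and storage_speed < preset_speed:
        return MigrationStatus.IN_PROGRESS  # large image, slow storage
    if memory_amount <= preset_memory and storage_speed >= preset_speed:
        return MigrationStatus.COMPLETED    # small image, fast storage
    # Assumption: the source does not cover the mixed cases; report
    # "in progress" so the job is not treated as dumped too early.
    return MigrationStatus.IN_PROGRESS
```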
- the device also includes: a determination module.
- the determination module is also used to, when the dump result of at least one first computing node is a failure to dump the target process, the scheduler determines a second computing node among multiple first computing nodes, wherein the dump result of the second computing node is a successful dump of the target process, and the scheduler sends a resume operation request to the second computing node, wherein the resume operation request is used to request the second computing node to continue running the target process, and the scheduler outputs a second prompt message, wherein the second prompt message is used to prompt that the target job migration has failed.
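The rollback on a partial dump failure can be sketched as below. The input format (a mapping from node name to a success flag) and the function name `plan_dump_rollback` are assumptions for illustration; the logic follows the text: if any first computing node fails, the nodes that did succeed (the second computing nodes) are asked to resume the target process, and the migration is reported as failed.

```python
from typing import Dict, List, Tuple


def plan_dump_rollback(dump_results: Dict[str, bool]) -> Tuple[List[str], str]:
    """Return (nodes to send a resume request to, prompt message)."""
    if all(dump_results.values()):
        return [], "migration of the target job succeeded"
    # Second computing nodes: the dump succeeded there, so the target process
    # must be resumed to keep the job consistent with the nodes that failed.
    second_nodes = sorted(node for node, ok in dump_results.items() if ok)
    return second_nodes, "migration of the target job failed"
```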
- the determination module is also used for the scheduling node to determine a third computing node from the high-performance computing cluster when the target job meets the recovery conditions or receives a scheduling request, wherein the scheduling request is used to schedule the target job to the third computing node, and the scheduling node sends recovery information to the third computing node, wherein the recovery information is used to control the third computing node to read the target process from a preset storage device and run the target process.
- the determination module is also used to determine a third computing node from the high-performance computing cluster based on a scheduling algorithm through a scheduling node when the target job meets the recovery conditions, and to obtain the third computing node by determining the computing node corresponding to the scheduling request from the high-performance computing cluster through the scheduling node when a scheduling request is received.
- the scheduling node includes a scheduler, a running queue and a queue to be scheduled, and the receiving module is also used for the scheduler to receive the recovery result fed back by the third computing node, wherein the recovery result is used to indicate whether the target process has recovered successfully;
- the update module is used to update the job status of the target job to running when the recovery result is that the target process has recovered successfully, and move the target job from the to-be-scheduled queue to the running queue;
- the output module is also used for the scheduler to output a third prompt message, wherein the third prompt message is used to prompt that the target job has recovered successfully.
- the device further includes: a sending module.
- the determination module is also used for the scheduler to determine a fourth computing node among multiple third computing nodes when the recovery result of at least one third computing node is a failure to recover the target process, wherein the recovery result of the fourth computing node is a successful recovery of the target process;
- the sending module is used for the scheduler to send a stop operation request to the fourth computing node, wherein the stop operation request is used to request the fourth computing node to stop running the target process;
- the output module is also used for the scheduler to output a fourth prompt message, wherein the fourth prompt message is used to prompt that the recovery of the target job has failed.
- FIG. 14 is a schematic diagram of a control device for a high-performance computing cluster according to Example 5 of the present application.
- the device 1400 includes: a receiving module 1402 and a dump module 1404.
- the receiving module is used to receive the migration information sent by the scheduling node through the first computing node, wherein the target job is running on the first computing node, and the migration information is sent by the scheduling node when it receives the job migration request and determines that the job status of the target job is running, and the job migration request is used to migrate the target job;
- the dump module is used for the first computing node to dump the target process image corresponding to the target job to a preset storage device based on the migration information.
- the migration information includes at least: process information of the target process and a storage path of a preset storage device.
- the dump module is also used to determine whether the target process exists on the first computing node based on the process information through the first computing node. If the target process exists on the first computing node, the target process image is dumped to the preset storage device through the first computing node based on the storage path.
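The check-then-dump behaviour on the first computing node can be sketched as follows. The text does not name a checkpointing tool, so the actual dump is passed in as a hypothetical callable `dump_image`; `process_exists` relies on the POSIX convention that sending signal 0 performs only an existence and permission check, so this sketch is assumed to run on a POSIX system.

```python
import os
from typing import Callable


def process_exists(pid: int) -> bool:
    """POSIX-only check: signal 0 tests for the process without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def handle_migration_info(
    pid: int,
    storage_path: str,
    dump_image: Callable[[int, str], None],  # hypothetical checkpoint routine
    exists: Callable[[int], bool] = process_exists,
) -> bool:
    """Dump the target process image if the process exists; return the dump result."""
    if not exists(pid):
        return False  # the target process is not on this node
    dump_image(pid, storage_path)
    return True
```

The boolean return value plays the role of the dump result that the node feeds back to the scheduler.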
- the device includes: a reading module and a running module.
- the receiving module is used for the third computing node to receive the recovery information sent by the scheduling node, wherein the recovery information is sent by the scheduling node when the target job meets the recovery conditions or receives a scheduling request;
- the reading module is used for the third computing node to read the target process from the preset storage device; and
- the running module is used for the third computing node to run the target process.
- the running module is also used for the third computing node to read the process information of the target process from a preset storage device, and the third computing node determines whether there is a process corresponding to the process information. If it is determined that there is a process corresponding to the process information, the third computing node stops running the target process. If it is determined that there is no process corresponding to the process information, the third computing node runs the target process.
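The conflict check on the third computing node can be sketched as below. The representation of the saved process information as a single process ID, the `live_pids` input, and the `run_process` callable are all assumptions for illustration; the branch structure follows the text.

```python
from typing import Callable, Iterable


def recover_on_third_node(
    saved_pid: int,
    live_pids: Iterable[int],
    run_process: Callable[[int], None],  # hypothetical restore routine
) -> bool:
    """Return True if the target process was run, False if recovery was stopped."""
    if saved_pid in set(live_pids):
        # A process matching the saved process information already exists
        # on this node, so the target process is not run.
        return False
    run_process(saved_pid)
    return True
```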
- the embodiment of the present application can provide a computer terminal, which can be any one of the computer terminal groups.
- the computer terminal may be replaced by a terminal device such as a mobile terminal.
- the computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
- the above-mentioned computer terminal can execute the program code of the following steps in the control method of the high-performance computing cluster: when the scheduling node receives a job migration request, it obtains the job status of the target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster, and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- Figure 15 is a block diagram of a computer terminal according to an embodiment of the present application.
- the computer terminal A may include: one or more (only one is shown in the figure) processors 102, a memory 104, a storage controller, and a peripheral interface, wherein the peripheral interface is connected to a radio frequency module, an audio module, and a display.
- the memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the control method and device of the high-performance computing cluster in the embodiment of the present application.
- the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, realizing the control method of the high-performance computing cluster mentioned above.
- the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- the memory may further include a memory remotely arranged relative to the processor, and these remote memories can be connected to the terminal A via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the processor can call the information and application programs stored in the memory through the transmission device to execute the following steps: when the scheduling node receives a job migration request, it obtains the job status of the target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster, and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- the processor may also execute program code of the following steps: the scheduler receives a dump result fed back by the first computing node, wherein the dump result is used to indicate whether the target process is successfully dumped; when the dump result is that the target process is successfully dumped, the scheduler updates the job status of the target job to a migration status, and moves the target job from the running queue to the to-be-scheduled queue, wherein the running queue is used to store jobs whose job status is running, and the to-be-scheduled queue is used to store jobs whose job status is not running; the scheduler outputs a first prompt message, wherein the first prompt message is used to prompt that the migration of the target job is successful.
- the processor may also execute program code of the following steps: the scheduler obtains the amount of memory occupied by the target process and the storage speed corresponding to the preset storage device; the scheduler updates the job status to migration completed or in progress based on the memory amount and storage speed.
- the processor may further execute the following program code: when the memory amount is greater than the preset memory amount and the storage speed is less than the preset speed, the scheduler updates the job status to migration in progress; when the memory amount is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed, the scheduler updates the job status to migration completed.
- the processor may also execute program code of the following steps: when the target job meets the recovery conditions or receives a scheduling request, the scheduling node determines a third computing node from the high-performance computing cluster, wherein the scheduling request is used to schedule the target job to the third computing node; the scheduling node sends recovery information to the third computing node, wherein the recovery information is used to control the third computing node to read the target process from a preset storage device and run the target process.
- the processor may also execute the program code of the following steps: when the target job meets the recovery conditions, the scheduling node determines a third computing node from the high-performance computing cluster based on the scheduling algorithm; when a scheduling request is received, the scheduling node determines the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
- the processor may also execute the program code of the following steps: when the target job meets the recovery conditions, determining a third computing node from the high-performance computing cluster based on the scheduling algorithm; when a scheduling request is received, determining the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
- the processor may also execute program code of the following steps: the third computing node obtains recovery information sent by the scheduling node through the second preset interface, wherein the recovery information includes at least: a storage path of a preset storage device; and the third computing node reads the target process from the preset storage device based on the storage path.
- the processor may also execute program code of the following steps: the third computing node reads process information of the target process from a preset storage device; the third computing node determines whether there is a process corresponding to the process information; if it is determined that there is a process corresponding to the process information, the third computing node stops running the target process; if it is determined that there is no process corresponding to the process information, the third computing node runs the target process.
- the processor may also execute program code of the following steps: the scheduler receives a recovery result fed back by a third computing node, wherein the recovery result is used to indicate whether the target process has recovered successfully; when the recovery result is that the target process has recovered successfully, the scheduler updates the job status of the target job to running, and moves the target job from the to-be-scheduled queue to the running queue; the scheduler outputs a third prompt message, wherein the third prompt message is used to prompt that the target job has recovered successfully.
- the processor may also execute program code of the following steps: when the recovery result of at least one third computing node is that the target process recovery fails, the scheduler determines a fourth computing node among multiple third computing nodes, wherein the recovery result of the fourth computing node is that the target process recovery is successful; the scheduler sends a stop operation request to the fourth computing node, wherein the stop operation request is used to request the fourth computing node to stop running the target process; the scheduler outputs a fourth prompt message, wherein the fourth prompt message is used to prompt that the target job recovery fails.
- the processor can call the information and application programs stored in the memory through the transmission device to execute the following steps: the first computing node receives the migration information sent by the scheduling node, wherein a target job is running on the first computing node, and the migration information is sent by the scheduling node after receiving a job migration request and determining that the job status of the target job is running, and the job migration request is used to migrate the target job; based on the migration information, the first computing node dumps the target process image corresponding to the target job to a preset storage device.
- the processor may also execute program code of the following steps: the first computing node determines whether a target process exists on the first computing node based on process information; when the target process exists on the first computing node, the first computing node dumps the target process image to a preset storage device based on a storage path.
- the processor may also execute program code of the following steps: the third computing node receives recovery information sent by the scheduling node, wherein the recovery information is sent by the scheduling node when the target job meets the recovery conditions or receives a scheduling request; the third computing node reads the target process from a preset storage device; the third computing node runs the target process.
- the processor may also execute program code of the following steps: the third computing node reads process information of the target process from a preset storage device; the third computing node determines whether there is a process corresponding to the process information; if it is determined that there is a process corresponding to the process information, the third computing node stops running the target process; if it is determined that there is no process corresponding to the process information, the third computing node runs the target process.
- when the scheduling node receives a job migration request, it obtains the job status of the target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster, and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process corresponding to the target job on the first computing node; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device, so as to achieve the purpose of improving the reliability of the high-performance computing cluster.
- the computing nodes in the high-performance cluster can be checked, and the first computing node running the target job can be determined, and the target process image on the first computing node can be dumped to the preset storage device, so as to achieve the dumping of the target process corresponding to the target job without the user's perception, so as to facilitate the migration of the target job on the potential fault node to reduce the risk of job failure, thereby improving the reliability of the target job operation and achieving the purpose of improving the reliability of the high-performance computing cluster.
- the technical problem of low reliability of the high-performance computing cluster in the related technology is solved.
- the structure shown in FIG. 15 is for illustration only, and the computer terminal may also be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a PDA, a mobile Internet device (Mobile Internet Devices, MID), a PAD, and other terminal devices.
- FIG. 15 does not limit the structure of the above-mentioned electronic device.
- the computer terminal A may also include more or fewer components (such as a network interface, a display device, etc.) than those shown in FIG. 15, or have a configuration different from that shown in FIG. 15.
- a person of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments may be completed by instructing the hardware related to the terminal device through a program, and the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
- the embodiment of the present application further provides a storage medium.
- the storage medium can be used to store the program code executed by the control method of the high performance computing cluster provided in the above embodiment 1.
- the above storage medium may be located in any computer terminal in a computer terminal group in a computer network, or in any mobile terminal in a mobile terminal group.
- the storage medium is configured to store program codes for executing the following steps: when the scheduling node receives a job migration request, it obtains the job status of the target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
- the disclosed technical content can be implemented in other ways.
- the device embodiments described above are only schematic, for example, the division of units is only a logical function division, and there may be other division methods in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the computer software product is stored in a storage medium, including several instructions for a computer device (which can be a personal computer, server or network device, etc.) to perform all or part of the steps of the various embodiments of the present application.
- the aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program codes.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
This application claims priority to the Chinese patent application filed with the China Patent Office on April 13, 2023, with application number 202310441527.7 and entitled "Control method, electronic device and storage medium for high-performance computing cluster", the entire contents of which are incorporated herein by reference.
The present application relates to the field of computing cluster control, and in particular to a control method, an electronic device, and a storage medium for a high-performance computing cluster.
A traditional high-performance computing cluster (High Performance Computing, HPC cluster for short) mainly consists of login nodes, scheduling nodes, and computing nodes. After the scheduler on the scheduling node receives a job submission request, it can select suitable computing nodes from the HPC cluster to run the job. Once the job starts running on the computing nodes, however, it is difficult to perceive which computing nodes the job is located on, and the failure of a single computing node causes the entire job to fail.
No effective solution to the above problem has been proposed so far.
Summary of the invention
The embodiments of the present application provide a control method, an electronic device, and a storage medium for a high-performance computing cluster, so as to at least solve the technical problem of low reliability of high-performance computing clusters in the related art.
According to one aspect of the embodiments of the present application, a control method for a high-performance computing cluster is provided, including: when a scheduling node receives a job migration request, obtaining the job status of a target job, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to a preset storage device.
According to one aspect of the embodiments of the present application, a control method for a high-performance computing cluster is further provided, including: a first computing node receives migration information sent by a scheduling node, wherein a target job is running on the first computing node, the migration information is sent by the scheduling node when it receives a job migration request and determines that the job status of the target job is running, and the job migration request is used to migrate the target job; based on the migration information, the first computing node dumps the target process image corresponding to the target job to a preset storage device.
According to another aspect of the embodiments of the present application, a high-performance computing cluster is further provided, including: computing nodes, configured to run jobs; a preset storage device, configured to store process images; and a scheduling node, connected to the computing nodes and configured to obtain the job status of a target job when a job migration request is received and, when the job status is running, to determine a first computing node and a target process on the first computing node, wherein the job migration request is used to migrate the target job, the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the first computing node is connected to the preset storage device and is configured to store the target process image in the preset storage device.
According to another aspect of the embodiments of the present application, an electronic device is further provided, including: a memory storing an executable program; and a processor configured to run the program, wherein the program, when run, executes the method of any one of the above embodiments.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is further provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program runs, the device where the storage medium is located is controlled to execute the method of any one of the above embodiments.
In the embodiments of the present application, the scheduling node first obtains the job status of the target job when it receives a job migration request, wherein the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines the first computing node from the high-performance computing cluster and determines the target process on the first computing node, wherein the target job is running on the first computing node, and the target process is the process on the first computing node corresponding to the target job; the scheduling node sends migration information to the first computing node, wherein the migration information is used to control the first computing node to dump the target process image to the preset storage device, thereby improving the reliability of the high-performance computing cluster. It is easy to notice that, when the job status is running, the computing nodes in the high-performance computing cluster can be checked to determine the first computing node on which the target job is running, and the target process image on the first computing node can be dumped to the preset storage device. The target process corresponding to the target job is thus dumped without the user perceiving it, which makes it easier to migrate the target job away from a potentially faulty node and reduces the risk of job failure, thereby improving the reliability of the target job and of the high-performance computing cluster.
Thus, the technical problem of low reliability of the high-performance computing cluster in the related technology is solved.
It should be noted that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.
In the accompanying drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the drawings. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed in the present application and should not be regarded as limiting its scope.
FIG. 1 is a hardware structure block diagram of a computer terminal (or mobile device) for implementing a control method for a high-performance computing cluster according to an embodiment of the present application;
FIG. 2 is a structural block diagram of a computing environment according to an embodiment of the present application;
FIG. 3 is a structural block diagram of a service mesh according to an embodiment of the present application;
FIG. 4 is a flowchart of a control method for a high-performance computing cluster according to Embodiment 1 of the present application;
FIG. 5 is a schematic diagram of a traditional high-performance computing cluster according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a control structure of a high-performance computing cluster according to an embodiment of the present application;
FIG. 7 is a flowchart of a high-performance computing cluster dumping job processes to persistent storage according to an embodiment of the present application;
FIG. 8 is a flowchart of restoring a high-performance computing cluster job from a dumped process image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a control structure of another high-performance computing cluster according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a control structure of yet another high-performance computing cluster according to an embodiment of the present application;
FIG. 11 is a flowchart of a control method for a high-performance computing cluster according to Embodiment 2 of the present application;
FIG. 12 is a schematic diagram of a high-performance computing cluster according to Embodiment 2 of the present application;
FIG. 13 is a schematic diagram of a control apparatus for a high-performance computing cluster according to Embodiment 4 of the present application;
FIG. 14 is a schematic diagram of a control apparatus for a high-performance computing cluster according to Embodiment 5 of the present application;
FIG. 15 is a structural block diagram of a computer terminal according to an embodiment of the present application.
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
First, some of the nouns and terms that appear in the description of the embodiments of the present application are explained as follows:
Job migration: a job can be moved from one node computer to another node computer that has a lighter workload or is better suited to processing the job, and run there;
High-performance computing cluster (High Performance Computing, HPC cluster): a cluster that connects multiple machines so that they can work on complex computing problems at the same time;
HPC job scheduling: job scheduling performed by a scheduler on a high-performance computing cluster;
Image dump process: an information storage process that produces mirrored views of the same data on any two or more disks.
Embodiment 1
According to an embodiment of the present application, a control method for a high-performance computing cluster is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 is a hardware structure block diagram of a computer terminal (or mobile device) for implementing a control method for a high-performance computing cluster according to an embodiment of the present application. As shown in FIG. 1, the computer terminal 10 (or mobile device) may include one or more processors 102 (shown in the figure as 102a, 102b, ..., 102n), a memory 104 for storing data, and a transmission device 106 for communication functions. The processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA). In addition, the terminal may include a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the bus ports), a network interface, a power supply, and/or a camera. A person of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". A data processing circuit may be embodied in whole or in part as software, hardware, firmware, or any combination of these. In addition, the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other components of the computer terminal 10 (or mobile device). As described in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the control method for a high-performance computing cluster in the embodiments of the present application. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the control method for a high-performance computing cluster described above. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, and this remote memory may be connected to the computer terminal 10 via a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations of these.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The display may be, for example, a touch-screen liquid crystal display (LCD), which enables a user to interact with the user interface of the computer terminal 10 (or mobile device).
The hardware structure block diagram shown in FIG. 1 can serve not only as an exemplary block diagram of the computer terminal 10 (or mobile device) described above, but also as an exemplary block diagram of the server described above. In an optional embodiment, FIG. 2 shows, as a block diagram, an embodiment in which the computer terminal 10 (or mobile device) shown in FIG. 1 is used as a computing node in a computing environment 201. FIG. 2 is a structural block diagram of a computing environment according to an embodiment of the present application. As shown in FIG. 2, the computing environment 201 includes multiple computing nodes (such as servers), shown in the figure as 210-1, 210-2, ..., running on a distributed network. Each computing node contains local processing and memory resources, and an end user 202 can remotely run applications or store data in the computing environment 201. Applications can be provided as multiple services 220-1, 220-2, 220-3, and 220-4 in the computing environment 201, representing services "A", "D", "E", and "H" respectively.
The end user 202 can provision and access services through a web browser or another software application on a client. In some embodiments, provisioning operations and/or requests from the end user 202 can be submitted to an ingress gateway 230. The ingress gateway 230 may include a corresponding proxy that handles provisioning operations and/or requests directed at a service (one or more of the services provided in the computing environment 201).
Services are provided or deployed using the various virtualization technologies supported by the computing environment 201. In some embodiments, services can be provided using virtual machine (VM)-based virtualization, container-based virtualization, and/or similar approaches. VM-based virtualization simulates a real computer by initializing a virtual machine, so that programs and applications execute without directly touching any actual hardware resources. Whereas a virtual machine virtualizes the machine, container-based virtualization starts containers that virtualize the entire operating system (OS), so that multiple workloads can run on a single operating system instance.
In one embodiment based on container virtualization, several containers of a service can be assembled into a Pod (for example, a Kubernetes Pod). For example, as shown in FIG. 2, service 220-2 can be equipped with one or more Pods 240-1, 240-2, ..., 240-N (collectively, Pods). A Pod can include a proxy 245 and one or more containers 242-1, 242-2, ..., 242-M (collectively, containers). The one or more containers in a Pod process requests related to one or more corresponding functions of the service, while the proxy 245 generally controls network functions related to the service, such as routing and load balancing. Other services can be equipped with similar Pods.
During operation, executing a user request from the end user 202 may require invoking one or more services in the computing environment 201, and executing one or more functions of one service may require invoking one or more functions of another service. As shown in FIG. 2, service "A" 220-1 receives the user request of the end user 202 from the ingress gateway 230, service "A" 220-1 can call service "D" 220-2, and service "D" 220-2 can request service "E" 220-3 to execute one or more functions.
The computing environment described above may be a cloud computing environment in which resource allocation is managed by the cloud service provider, so that functions can be developed without having to consider implementing, tuning, or scaling servers. Such an environment lets developers execute code that responds to events without building or maintaining complex infrastructure. Rather than scaling a single hardware device to handle the potential load, a service can be split into a set of functions that scale automatically and independently.
In another optional embodiment, FIG. 3 shows, as a block diagram, an embodiment in which the computer terminal 10 (or mobile device) shown in FIG. 1 is used in a service mesh. FIG. 3 is a structural block diagram of a service mesh according to an embodiment of the present application. As shown in FIG. 3, the service mesh 300 is mainly used to enable secure and reliable communication among multiple microservices. Microservices means that an application is decomposed into multiple smaller services or instances that run distributed across different clusters or machines.
As shown in FIG. 3, the microservices may include application service instance A and application service instance B, which together form the functional application layer of the service mesh 300. In one implementation, application service instance A runs as container/process 308 in machine/workload container group 314 (Pod), and application service instance B runs as container/process 310 in machine/workload container group 316 (Pod).
In one implementation, application service instance A may be a product query service, and application service instance B may be a product ordering service.
As shown in FIG. 3, application service instance A and mesh proxy (sidecar) 303 coexist in machine/workload container group 314, and application service instance B and mesh proxy 305 coexist in machine/workload container group 316. Mesh proxy 303 and mesh proxy 305 form the data plane layer of the service mesh 300. Mesh proxy 303 and mesh proxy 305 run as container/process 304 and container/process 306 respectively and can receive request 312, which is used for the product query service. Mesh proxy 303 and application service instance A can communicate bidirectionally, as can mesh proxy 305 and application service instance B. In addition, mesh proxy 303 and mesh proxy 305 can communicate bidirectionally with each other.
In one implementation, all traffic of application service instance A is routed to the appropriate destination through mesh proxy 303, and all network traffic of application service instance B is routed to the appropriate destination through mesh proxy 305. It should be noted that the network traffic mentioned here includes, but is not limited to, Hypertext Transfer Protocol (HTTP), Representational State Transfer (REST), gRPC (a high-performance, general-purpose open-source remote procedure call framework), and Redis (an open-source in-memory data structure store).
In one implementation, the functions of the data plane layer can be extended by writing custom filters for the proxy (Envoy) in the service mesh 300. The service mesh proxy configuration serves to let the service mesh proxy service traffic correctly and to achieve service interconnection and service governance. Mesh proxy 303 and mesh proxy 305 can be configured to perform at least one of the following functions: service discovery, health checking, routing, load balancing, authentication and authorization, and observability.
As shown in FIG. 3, the service mesh 300 also includes a control plane layer. The control plane layer may consist of a group of services running in a dedicated namespace, and these services are hosted by hosted control plane component 301 in machine/workload container group (machine/Pod) 302. As shown in FIG. 3, hosted control plane component 301 communicates bidirectionally with mesh proxy 303 and mesh proxy 305. Hosted control plane component 301 is configured to perform certain control and management functions. For example, it receives telemetry data sent by mesh proxy 303 and mesh proxy 305 and can further aggregate that data. For these services, hosted control plane component 301 can also provide a user-facing application programming interface (API) that makes it easier to manipulate network behavior, and can provide configuration data to mesh proxy 303 and mesh proxy 305.
In the above operating environment, the present application provides the control method for a high-performance computing cluster shown in FIG. 4. FIG. 4 is a flowchart of the control method for a high-performance computing cluster according to Embodiment 1 of the present application. As shown in FIG. 4, the method includes:
Step S402: upon receiving a job migration request, the scheduling node obtains the job status of the target job.
The job migration request is used to migrate the target job.
The target job (Job) may be a task to be processed. In the e-commerce field, the target job may involve orders, product information, and the like; in the medical field, it may involve patient information, medical images, and the like. The target job is not limited here; it can be set according to the actual situation and may be a task to be processed in any field.
The target job may be a job on a node with relatively low utilization, or a job on a potentially faulty computing node. Migrating jobs away from potentially faulty computing nodes reduces the risk of job failure.
The job migration request may be generated in response to a need to monitor the target job, or it may be generated periodically. This is not limited here, and the job migration request for the target job can be generated according to the actual situation.
The scheduling node may be a scheduler.
In an optional embodiment, a cluster administrator can trigger a checkpoint operation on the target job, which generates the job migration request. After receiving the job migration request, the scheduler can check the job status of the target job so that the target job can be migrated according to that status.
In another optional embodiment, when a job migration request is received, the job status of the target job can be checked to determine whether the target job is running. If the job status indicates that the target job is running, the target process of the target job can be dumped, which makes it easier to monitor the target job.
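The status check in step S402 can be sketched as follows. This is a minimal illustration rather than the application's implementation: the `JobStatus` values, the in-memory `JOB_TABLE`, and the function name `handle_migration_request` are all assumptions introduced for the example.

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical status values; the application only names "running".
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


# Hypothetical in-memory stand-in for the scheduler's job records.
JOB_TABLE = {"job-1": JobStatus.RUNNING, "job-2": JobStatus.PENDING}


def handle_migration_request(job_id, job_table=JOB_TABLE):
    """Return (status, should_dump) for the target job of a migration request."""
    status = job_table.get(job_id)
    if status is None:
        raise KeyError(f"unknown job: {job_id}")
    # Only a running job has processes whose images can be dumped.
    return status, status is JobStatus.RUNNING
```

Calling `handle_migration_request("job-1")` reports a running job that should proceed to the dump step, while a pending job is reported as not eligible.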
Step S404: when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node.
The target job is running on the first computing node, and the target process is the process on the first computing node that corresponds to the target job.
The high-performance computing cluster denotes a computing cluster that connects multiple machines so that they can work on complex computing problems at the same time.
There may be one or more first computing nodes; this is not limited here.
In an optional embodiment, when the job status is running, the scheduler can look up, in the scheduling database, the list of computing nodes corresponding to the target job and the list of process identifiers (IDs) on those computing nodes. In other words, it determines the first computing node from the high-performance computing cluster and determines the target process on the first computing node.
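The lookup in step S404 might look like the sketch below. The application does not describe the scheduling database, so the SQLite store, the `job_processes` table, and its `job_id`, `node`, and `pid` columns are assumptions made purely for illustration.

```python
import sqlite3
from collections import defaultdict


def find_target_processes(conn, job_id):
    """Map each computing node running job_id to that job's PIDs on the node."""
    rows = conn.execute(
        "SELECT node, pid FROM job_processes WHERE job_id = ? ORDER BY node, pid",
        (job_id,),
    ).fetchall()
    by_node = defaultdict(list)
    for node, pid in rows:
        by_node[node].append(pid)
    return dict(by_node)


def make_demo_db():
    """Build a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job_processes (job_id TEXT, node TEXT, pid INTEGER)")
    conn.executemany(
        "INSERT INTO job_processes VALUES (?, ?, ?)",
        [("job-1", "node-a", 4101), ("job-1", "node-a", 4102),
         ("job-1", "node-b", 977), ("job-2", "node-c", 15)],
    )
    return conn
```

The keys of the returned dictionary play the role of the first computing nodes, and the values are the target processes on each of them. A job with no recorded processes yields an empty dictionary.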
Step S406: the scheduling node sends migration information to the first computing node.
The migration information is used to control the first computing node to dump the image of the target process to a preset storage device.
The preset storage device may be any storage device configured in advance. For example, it may be a device that contains a persistent storage path, but it is not limited to this; this is given only as an example.
In an optional embodiment, the checkpoint-restart service in the high-performance computing cluster can be called to perform a checkpoint dump of the target process of the target job, so that the target process is dumped to the preset storage device.
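As a rough sketch of what the migration information in step S406 could carry, the example below builds one message per computing node. The `MigrationInfo` fields, the `build_migration_info` helper, and the per-job, per-node directory layout under the storage root are assumptions; the application states only that the information directs the node to dump the target process image to the preset storage device.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class MigrationInfo:
    # Hypothetical message fields, not taken from the application.
    job_id: str
    node: str
    pids: tuple
    dump_dir: str  # directory on the preset storage device


def build_migration_info(job_id, processes_by_node, storage_root):
    """Create one migration message per node, in node-name order."""
    infos = []
    for node, pids in sorted(processes_by_node.items()):
        # Assumed layout: <storage_root>/<job_id>/<node>
        dump_dir = PurePosixPath(storage_root) / job_id / node
        infos.append(MigrationInfo(job_id, node, tuple(pids), str(dump_dir)))
    return infos
```

Giving each node its own directory keeps the images of a multi-node job from overwriting one another on shared storage, which is one plausible design choice among several.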
Further, after the target process image has been dumped to the preset storage device, the checkpoint-restart service can return the result to the scheduler. The scheduler aggregates the results from every node in the computing node list, and the checkpoint operation on the target job counts as successfully completed only when every node in the list returns success. The scheduling node can then update the job status, move the target job to the waiting queue, and set the conditions under which it will be scheduled again.
If the checkpoint-restart service reports that only some of the nodes in the job's computing node list succeeded, or that all of them failed, the scheduling node communicates with the nodes that succeeded so that they resume running the job, and may return the operation failure and related information to the administrator.
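The all-or-nothing rule described above can be sketched as a small aggregation function. The job record layout (`status`, `queue`, `error` keys) and the status strings are hypothetical; only the decision logic follows the text.

```python
def finish_checkpoint(job, node_results):
    """Apply the all-or-nothing rule to per-node dump results.

    job: dict with "status" and "queue" keys (hypothetical job record).
    node_results: dict mapping node name to True (dump succeeded) or False.
    Returns the nodes that must resume the job; empty when the checkpoint succeeded.
    """
    if node_results and all(node_results.values()):
        # Every node dumped its processes: park the job until it is rescheduled.
        job["status"] = "checkpointed"
        job["queue"] = "waiting"
        return []
    # Partial or total failure: the job keeps running where it was.
    failed = sorted(n for n, ok in node_results.items() if not ok)
    job["status"] = "running"
    job["error"] = "checkpoint failed on: " + ", ".join(failed)
    # Nodes that already dumped successfully must be told to resume.
    return sorted(n for n, ok in node_results.items() if ok)
```

With results `{"a": True, "b": False}` the function leaves the job running, records node `b` in the error text for the administrator, and returns `["a"]` as the node to resume.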
The above approach can improve HPC cluster utilization and increase cluster throughput. Target jobs in the cluster can be migrated to free up more available nodes for later jobs, so that jobs in the scheduling queue that could not otherwise start immediately can start as soon as possible. This increases the number of jobs the cluster can execute per unit of time, which raises cluster throughput and efficiency.
The above approach can also reduce the failure probability of HPC jobs. By monitoring and analyzing the cluster, running HPC jobs on nodes that are at risk of failure can be migrated in advance to other nodes in the cluster, which reduces job failures caused by unexpected downtime or other faults.
The above approach also makes cluster resource maintenance more convenient. When a cluster software update or another operations task needs to be carried out, the system administrator can migrate the jobs off the nodes to be worked on and then perform the maintenance immediately, without waiting for the HPC jobs to finish.
The above approach can also reduce the power consumption of the HPC cluster, which contributes to carbon neutrality. Jobs on nodes with relatively low utilization can be migrated; for example, jobs can be consolidated onto certain cabinets in the machine room, and the idle nodes, cabinets, and other equipment can then be powered down or put into hibernation, which helps reduce power consumption.
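One way to choose which nodes to migrate jobs away from, covering both the at-risk and the low-utilization cases described above, is sketched below. The function name `nodes_to_drain`, the utilization scale of 0.0 to 1.0, and the default threshold of 0.2 are illustrative assumptions; the application does not specify a selection rule.

```python
def nodes_to_drain(utilization, at_risk=(), threshold=0.2):
    """Pick nodes whose jobs are candidates for migration.

    utilization: dict mapping node name to utilization in [0.0, 1.0].
    at_risk: nodes flagged by monitoring as potentially faulty.
    threshold: assumed cutoff below which a node counts as underused.
    """
    picked = set(at_risk)
    picked.update(node for node, used in utilization.items() if used < threshold)
    return sorted(picked)
```

For example, with utilization `{"a": 0.05, "b": 0.9, "c": 0.5}` and node `c` flagged as at risk, the candidates are `a` (underused) and `c` (at risk).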
Through the above steps, the scheduling node first obtains the job status of the target job upon receiving a job migration request, where the job migration request is used to migrate the target job. When the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job. The scheduling node then sends migration information to the first computing node, where the migration information is used to control the first computing node to dump the image of the target process to a preset storage device, thereby improving the reliability of the high-performance computing cluster. It is easy to see that, when the job status is running, the computing nodes in the cluster can be checked to identify the first computing node running the target job, and the image of the target process on that node can be dumped to the preset storage device. The target process corresponding to the target job is thus dumped without the user noticing, which makes it easier to migrate target jobs away from potentially faulty nodes and reduces the risk of job failure. This improves the reliability of the target job and of the high-performance computing cluster, and thereby solves the technical problem in the related art that high-performance computing clusters have low reliability.
Mature virtual machine live migration methods currently exist in cloud computing. When an HPC cluster is built from virtual machines, the virtual machines generally serve as the computing nodes that execute jobs; this does not cover the case where a user's HPC job script itself creates or uses virtual machines. In other words, HPC job systems generally do not consider the use of virtual machines. Bare-metal machines are generally used in HPC scenarios to improve performance, so virtual machine migration solutions implemented in software are not suitable for HPC scenarios.
FIG. 5 is a schematic diagram of a traditional high-performance computing cluster according to an embodiment of the present application. As shown in FIG. 5, after logging in to the login node, a cluster user can submit HPC jobs to the job scheduling node of the cluster. An HPC job usually consists of a group of processes that compute in parallel across multiple machines, as controlled by the job submission script and the submission command. After the scheduler on the job scheduling node receives a job submission request, it selects suitable computing nodes from the computing cluster to run the job, where multiple computing nodes in the cluster can share storage. After the job starts running on the computing nodes, the HPC scheduler provides only routine process control and monitoring; it cannot move the job to other computing nodes without the user noticing. If one of the computing nodes fails, the entire job fails.
FIG. 6 is a schematic diagram of a control structure of a high-performance computing cluster according to an embodiment of the present application. As shown in FIG. 6:
Step 1: when the computing cluster starts, the computing nodes in the cluster start a checkpoint-restart service. This service performs checkpoint and restart actions on the processes corresponding to HPC jobs and provides an access interface or application programming interface (API) to the scheduling node.
Step 2: the user logs in to the login node and submits a job to the HPC cluster. The cluster user submits the job through the usual procedure, so the details of job submission can be ignored here.
Steps 3 and 4: the user's job is submitted to the scheduler, where it waits in the scheduling queue for the HPC cluster scheduler to select computing nodes to execute it. So that the job scheduling node can support job migration, the relevant algorithms, queues, and database on the scheduling node need the following changes:
In addition to the states of a traditional HPC job, such as R (running) and Q (queued), a state C (CheckPointed) is added to indicate that the job has been checkpointed, that is, the processes corresponding to the job have been dumped to persistent storage. A job in the checkpointed state does not occupy the CPU, and its state is held in persistent storage.
The duration of the dump generally depends on the amount of memory occupied by the job's processes and on the write speed of the storage; jobs that occupy a large amount of memory take longer. If needed, a state Ct (Checkpointing) can be added to indicate that the dump is in progress. After the cluster administrator performs a checkpoint operation on a job, the job returns from the running queue to the to-be-scheduled queue, and the conditions for restoring the job's processes on computing nodes are set. For a job in state C, the corresponding processes are restored on computing nodes when cluster resources satisfy those conditions or when the cluster administrator actively initiates scheduling. The checkpoint operation can be triggered through a command, an interface, a button control, or similar means.
In FIG. 6, the scheduling algorithm is extended with the communication and invocation procedure for the checkpoint-restart service on the computing nodes. It schedules a checkpointed job back to the waiting queue and sets it to the new state, sets the conditions for rescheduling the job onto computing nodes, resumes job execution when those conditions are met, and updates the job's computing node list and the corresponding process IDs.
In FIG. 6, a job in the scheduling queue returns to the scheduling queue in the new state after it has been dumped.
In FIG. 6, the scheduling database is extended with the job states C and Ct and with the dump storage path of each job.
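By way of illustration only, the extended job states and the additional database fields described above could be modelled as follows. This is a minimal sketch: the names `JobState`, `JobRecord`, and `ALLOWED`, as well as the exact set of permitted transitions, are assumptions made for the example and are not specified by the present application.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    QUEUED = "Q"          # waiting to be scheduled
    RUNNING = "R"         # processes running on computing nodes
    CHECKPOINTING = "Ct"  # dump to persistent storage in progress
    CHECKPOINTED = "C"    # dump finished; the job occupies no CPU


# Hypothetical transition table inferred from the description above.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.CHECKPOINTING, JobState.CHECKPOINTED},
    JobState.CHECKPOINTING: {JobState.CHECKPOINTED, JobState.RUNNING},
    JobState.CHECKPOINTED: {JobState.RUNNING},
}


@dataclass
class JobRecord:
    """One row of the (hypothetical) scheduling database."""
    job_id: str
    state: JobState = JobState.QUEUED
    # node name -> process IDs of this job on that node
    processes: Dict[str, List[int]] = field(default_factory=dict)
    # dump storage path, set once a checkpoint has been written
    dump_path: Optional[str] = None

    def transition(self, new_state: JobState) -> None:
        """Change state, rejecting transitions not listed in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

A job record would start in state Q, move to R when scheduled, pass through Ct while its processes are being dumped, and rest in C until it is restored.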
Step 5: the job enters the running state and the computing nodes start running it. Once the job is running, the list of computing nodes on which it runs and its process IDs on those nodes are known. The scheduler updates the job status and stores the job-related information in the scheduling database.
Steps 6 and 7: the cluster administrator triggers a checkpoint operation on the job. After receiving the request, the scheduler checks the status of the job. If the job is running, the scheduler looks up the job's computing node list and the list of process IDs on those nodes in the scheduling database, communicates with the nodes on the list, and calls the job process checkpoint routine (Job Progress Check Point) of the checkpoint-restart service. The checkpoint-restart service then performs a checkpoint dump of the processes corresponding to the job, dumping the process images to the specified persistent storage path. When the dump is complete, the checkpoint-restart service returns the result to the scheduler, and the scheduler aggregates the results from the nodes on the computing node list.
The checkpoint operation for the whole job counts as successful only when every node on the computing node list returns success. In that case the scheduling node updates the job status to C, moves the job from the running queue to the waiting queue, and sets the conditions under which it is scheduled again. When only some of the nodes return success, or all nodes return failure, the scheduling node communicates with the nodes that succeeded so that they resume running the job, and returns the operation failure and related information to the administrator.
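The all-or-nothing aggregation described above can be sketched as follows. This is an illustrative sketch only: `checkpoint_job`, `dump_on_node`, and `resume_on_node` are hypothetical names, and the two callables stand in for the calls that the scheduler would make to the checkpoint-restart service on each computing node.

```python
from typing import Callable, Dict, List, Tuple


def checkpoint_job(
    nodes: List[str],
    dump_on_node: Callable[[str], bool],
    resume_on_node: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Checkpoint a job on all of its nodes; return (success, failed_nodes).

    The operation succeeds only if every node reports a successful dump.
    Otherwise the nodes that did succeed are told to resume the job.
    """
    # Ask every node on the job's node list to dump its processes.
    results: Dict[str, bool] = {node: dump_on_node(node) for node in nodes}
    failed = [node for node, ok in results.items() if not ok]
    if not failed:
        return True, []
    # Partial or total failure: resume the job on the nodes that succeeded.
    for node, ok in results.items():
        if ok:
            resume_on_node(node)
    return False, failed
```

On success the caller would then set the job status to C and move the job to the waiting queue; on failure it would report the failed nodes to the administrator.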
Steps 8 and 9: for a job whose checkpoint operation has completed, the HPC cluster administrator can have the job executed on new computing nodes, either by presetting conditions or by directly instructing the scheduler. When the scheduler receives the administrator's request, or when the preset conditions are met, the scheduler communicates with the computing nodes on the new computing node list, which is either specified by the cluster administrator or determined by the scheduling algorithm, and calls the job process restore routine (Job Progress Restart) of the checkpoint-restart service on those nodes. The checkpoint-restart service restores the execution of the job on the computing node from the job process image at the corresponding storage path. The restore may also fail, for example because a process ID is already in use, in which case the computing node returns failure to the scheduling node.
After the scheduling node has received the results returned by the checkpoint-restart service of each node on the new computing node list, the job resumes running if all of those nodes returned success; the scheduler then updates the job status and records the job's new nodes in the job database. If any computing node reports that restoring the job's processes failed, the scheduler stops the processes already restored on the other computing nodes and returns the corresponding information to the cluster administrator.
FIG. 7 is a flowchart of dumping job processes to persistent storage in a high-performance computing cluster according to an embodiment of the present application. As shown in FIG. 7, the method includes:
Step S701: the scheduler receives a request to start a checkpoint;
Step S702: check the job status;
Step S703: determine whether the job is running; if yes (Y), go to step S704; if no (N), go to step S717;
Step S704: look up the job's node list in the scheduling database;
Step S705: establish communication with the checkpoint-restart service on each computing node and call it;
Step S706: receive the operation results returned by the computing nodes;
Step S707: determine whether every computing node returned success; if yes, go to step S708; if no, go to step S710;
Step S708: update the job status, adjust the job queues, and set the conditions for running again;
Step S709: return success and end the operation;
Step S710: resume the job processes on the nodes that succeeded so that they continue running;
Step S711: return failure and end the operation;
Step S712: the checkpoint-restart service receives the request;
The received request may be a JobProgressCheckPoint request.
Step S713: obtain the process identifier and the dump path;
Step S714: determine whether the job is running; if yes, go to step S715; if no, go to step S706;
Step S715: perform the checkpoint dump operation on the process;
Step S716: return the operation result to the scheduling node;
Step S717: end the operation and return.
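The computing-node side of the flow above (steps S712 to S716) could look roughly like the following sketch. The names `CheckpointRequest` and `handle_checkpoint_request` are hypothetical, and the liveness check and the dump itself are passed in as callables because the present application does not specify how they are implemented. Returning a failure result when the process is not running is likewise an assumption about what the branch from S714 to S706 reports.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckpointRequest:
    pid: int        # process ID of the job's process on this node
    dump_path: str  # persistent storage path for the process image


def handle_checkpoint_request(
    request: CheckpointRequest,
    is_running: Callable[[int], bool],
    dump_process: Callable[[int, str], None],
) -> bool:
    """Handle one checkpoint request; the return value is the operation result."""
    # S713: obtain the process identifier and the dump path.
    pid, path = request.pid, request.dump_path
    # S714: if the process is not running, there is nothing to dump.
    if not is_running(pid):
        return False  # assumed to be reported to the scheduler as a failure
    # S715: perform the checkpoint dump of the process.
    try:
        dump_process(pid, path)
    except OSError:
        return False  # for example, the storage path is not writable
    # S716: report success back to the scheduling node.
    return True
```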
FIG. 8 is a flowchart of restoring a high-performance computing cluster job from dumped process images according to an embodiment of the present application. As shown in FIG. 8, the method includes:
Step S801: the cluster administrator requests job restoration, or the preset checkpoint conditions are met;
Step S802: the scheduler starts restoring the job processes;
Step S803: the scheduler runs the scheduling algorithm to obtain the list of new computing nodes on which the job is to be restored;
Step S804: establish communication with the checkpoint-restart service on each of the computing nodes and call it;
Step S805: receive the operation results returned by the computing nodes;
Step S806: determine whether every computing node returned success; if yes, go to step S807; if no, go to step S809;
Step S807: update the job status and adjust the job queues;
Step S808: return success and end the operation;
Step S809: stop the job processes on the nodes where restoration succeeded so that they do not continue running;
Step S810: return failure and end the operation;
Step S811: the checkpoint-restart service receives the request;
Step S812: obtain the job process image from the dump path;
Step S813: the checkpoint-restart service resumes execution from the image;
Step S814: return the operation result to the scheduling node.
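The scheduler side of the restore flow (steps S804 to S810) mirrors the checkpoint flow, except that a partial failure is rolled back by stopping the processes that were already restored. The sketch below is illustrative only; `restore_job`, `restore_on_node`, and `stop_on_node` are hypothetical names standing in for calls to the checkpoint-restart service on the new computing nodes.

```python
from typing import Callable, List


def restore_job(
    new_nodes: List[str],
    restore_on_node: Callable[[str], bool],
    stop_on_node: Callable[[str], None],
) -> bool:
    """Restore a job on every node in new_nodes; return True only if all succeed."""
    restored: List[str] = []
    any_failed = False
    # S804/S805: call the restore routine on each node and collect the results.
    for node in new_nodes:
        if restore_on_node(node):
            restored.append(node)
        else:
            any_failed = True
    # S806 to S808: every node succeeded, so the job is running again.
    if not any_failed:
        return True
    # S809/S810: stop the processes already restored, then report failure.
    for node in restored:
        stop_on_node(node)
    return False
```

On a `True` result the caller would update the job status to running and record the new node list; on `False` the job would remain checkpointed and could be restored again later.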
FIG. 9 is a schematic diagram of a control structure of another high-performance computing cluster according to an embodiment of the present application. Compared with the method shown in FIG. 6, three steps are supplemented. First, in the HPC job submission phase, after the scheduler selects candidate computing nodes from the computing cluster and before the job is executed, a container for the job is started on each computing node, and the job processes are then executed inside the container. Second, when a scenario requiring HPC job migration arises during the job execution phase, there are two ways to apply the checkpoint-restart operation to dump the job. One is to checkpoint the container process directly, since a container is itself a process from the operating system's point of view. The other is to checkpoint the job processes inside the container; compared with the former, there may be more than one job process in the container. Finally, in the job restoration phase, if the container was packaged and dumped as a whole during the dump, the process corresponding to the container is restored. If the job processes inside the container were dumped during the checkpoint operation, restoration can proceed in different ways. One way, shown in FIG. 9, is to start the container first and then run the HPC job process restore operation inside it.
FIG. 10 is a schematic diagram of a control structure of yet another high-performance computing cluster according to an embodiment of the present application. As shown in FIG. 10, the other way is to restore the job processes onto a node without a container environment. In addition, the job processes described for FIG. 6 can also be restored by first starting a container environment and then resuming execution inside it.
The control method for a high-performance computing cluster proposed in the present application makes it easier for cluster users or administrators to control the migration of cluster jobs among the nodes of the computing cluster. Migrating jobs away from potentially faulty nodes reduces the risk of job failure, suitable migration can improve cluster utilization, and, combined with measures applied to cabinets and nodes, the power consumption of the whole cluster can be reduced.
In the above embodiments of the present application, the scheduling node includes a scheduler, a running queue, and a to-be-scheduled queue. After the scheduling node sends the migration information to the first computing node, the method further includes: the scheduler receives a dump result fed back by the first computing node, where the dump result indicates whether the target process image was dumped successfully; when the dump result is that the target process was dumped successfully, the scheduler updates the job status of the target job to a migration state and moves the target job from the running queue to the to-be-scheduled queue, where the running queue stores jobs whose job status is running and the to-be-scheduled queue stores jobs whose job status is not running; and the scheduler outputs first prompt information, where the first prompt information indicates that the target job was migrated successfully.
The above scheduler is used to move the job according to the dump result fed back by the computing node.
The above job status indicates that the target job is being executed on the first computing node.
In an optional embodiment, the scheduler can determine from the dump result fed back by the first computing node whether the target process image was dumped successfully. When the dump result is that the target process image was dumped successfully, the scheduler can update the job status of the target job to the migration state, which ends the running of the target job, and can move the target job from the running queue to the to-be-scheduled queue. When the scheduler outputs the first prompt information, the target job has been migrated successfully.
In another optional embodiment, after the scheduler updates the job status of the target job to the migration state, it can store the information about the updated status in the scheduling database corresponding to the target job.
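As a rough illustration, the queue handling after a dump result arrives might be sketched as follows. The names `Job` and `on_dump_result`, the use of `collections.deque` for the two queues, and the wording of the prompt strings are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Job:
    job_id: str
    state: str  # "R" for running, "C" for checkpointed (migration completed)


def on_dump_result(
    job: Job,
    dump_ok: bool,
    running_queue: Deque[Job],
    pending_queue: Deque[Job],
) -> str:
    """Apply a dump result to the queues and return the prompt text."""
    if not dump_ok:
        # The job stays in the running queue; only the prompt changes.
        return f"job {job.job_id}: migration failed"
    job.state = "C"
    running_queue.remove(job)   # leave the running queue
    pending_queue.append(job)   # wait to be scheduled again
    return f"job {job.job_id}: migration succeeded"
```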
In the above embodiments of the present application, the migration state includes migration completed and migrating. The scheduler updating the job status of the target job to the migration state includes: the scheduler obtains the amount of memory occupied by the target process and the storage speed of the preset storage device; and the scheduler updates the job status to migration completed or migrating based on the amount of memory and the storage speed.
In an optional embodiment, the duration of the dump depends on the amount of memory occupied by the job's processes and on the write speed of the storage, and jobs that occupy a large amount of memory take longer. The scheduler can therefore obtain the amount of memory occupied by the target process and the storage speed of the preset storage device, determine from the amount of memory and the storage speed the time needed to complete the migration, and update the job status to migration completed or migrating according to that time.
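The present application does not give a formula for the migration time. A simple first-order estimate, assuming the dump is limited only by the write speed of the storage, divides the memory footprint by that speed; the function name below is hypothetical.

```python
def estimate_dump_seconds(memory_bytes: int, write_bytes_per_second: float) -> float:
    """Estimate how long dumping a process image takes, ignoring all overheads."""
    if write_bytes_per_second <= 0:
        raise ValueError("write speed must be positive")
    return memory_bytes / write_bytes_per_second


# A process holding 64 GiB, dumped to storage that sustains 2 GiB/s.
print(estimate_dump_seconds(64 * 2**30, 2 * 2**30))  # prints 32.0
```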
In the above embodiments of the present application, the scheduler updating the job status to migration completed or migrating based on the amount of memory and the storage speed includes: when the amount of memory is greater than a preset memory amount and the storage speed is less than a preset speed, the scheduler updates the job status to migrating; and when the amount of memory is less than or equal to the preset memory amount, or the storage speed is greater than or equal to the preset speed, the scheduler updates the job status to migration completed.
The above preset memory amount may be set in advance according to the memory of the preset storage device, or may be set according to the actual situation.
The above preset speed may be set in advance according to the storage speed of the preset storage device, or may be set according to the actual situation.
In an optional embodiment, when the amount of memory is greater than the preset memory amount and the storage speed is less than the preset speed, the target process of the target job is taken not to have finished migrating, because the amount of memory is large and the speed is low, and the job status can be updated to migrating. When the amount of memory is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed, the target process is taken to have finished migrating, because the amount of memory is small and the speed is high, and the job status can be updated to migration completed. When the amount of memory is less than or equal to the preset memory amount and the storage speed is less than the preset speed, the target process is likewise taken to have finished migrating, because the amount of memory is small even though the speed is low, and the job status can be updated to migration completed.
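The threshold rule above reduces to a single condition, sketched here for illustration. The function and parameter names are hypothetical; the returned strings simply name the two migration states.

```python
def migration_state(
    memory: float,
    speed: float,
    preset_memory: float,
    preset_speed: float,
) -> str:
    """Return "migrating" or "migration completed" per the threshold rule."""
    # Large memory footprint AND slow storage: the dump is still in progress.
    if memory > preset_memory and speed < preset_speed:
        return "migrating"
    # Small footprint OR fast storage: treat the dump as completed.
    return "migration completed"
```

Note that a job with a large memory footprint on fast storage also falls into the "migration completed" branch, because the rule requires both conditions to hold before a job is marked as migrating.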
In another optional embodiment, job states during migration can be marked with different identifiers, which makes it easier for the scheduler to update the job status.
In the above embodiments of the present application, when there are multiple first computing nodes, the method further includes: when the dump result of at least one first computing node is that the target process failed to dump, the scheduler determines a second computing node among the multiple first computing nodes, where the dump result of the second computing node is that the target process was dumped successfully; the scheduler sends a resume request to the second computing node, where the resume request asks the second computing node to continue running the target process; and the scheduler outputs second prompt information, where the second prompt information indicates that migration of the target job failed.
The form of the above second prompt information is not limited to text, voice, images, and the like, and can be chosen according to actual needs.
In an optional embodiment, when the dump result of at least one first computing node is that the target process failed to dump, the scheduler can determine the second computing node among the multiple first computing nodes and send the resume request to it. This keeps the second computing node from being affected by the failed migration, so that it can continue running the target process.
In the above embodiments of the present application, after the scheduling node sends the migration information to the first computing node, the method further includes: when the target job satisfies a recovery condition, or when a scheduling request is received, the scheduling node determines a third computing node from the high-performance computing cluster, where the scheduling request is used to schedule the target job onto the third computing node; and the scheduling node sends recovery information to the third computing node, where the recovery information is used to control the third computing node to read the target process from the preset storage device and run the target process.
The above recovery condition may be a preset condition; it may be set in the scheduling algorithm, but is not limited to this.
The communication and invocation procedure for the checkpoint-restart service on the computing nodes can be added, so that a checkpointed job is scheduled back to the waiting queue and set to the new state, the recovery condition for rescheduling it onto computing nodes is set, and job execution is resumed when the condition is met.
The above scheduling request may be generated when the cluster administrator triggers the checkpoint operation on the job.
The above third computing node may be a computing node on a new computing node list specified by the cluster administrator, or a computing node on a new computing node list computed by the scheduling algorithm.
In an optional embodiment, after receiving the scheduling request, the scheduler can determine the third computing node from the high-performance computing cluster, and the target process can be read from the preset storage device by calling a first preset interface, so as to restore the execution of the target process on the third computing node.
In the above embodiments of the present application, the scheduling node determining the third computing node from the high-performance computing cluster includes: when the target job satisfies the recovery condition, the scheduling node determines the third computing node from the high-performance computing cluster based on a scheduling algorithm, where the scheduling algorithm is used to allocate target resources of the third computing node to the target job so that the target job is executed by the target resources; and when a scheduling request is received, the scheduling node determines the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
The third computing node may be re-determined by the cluster's existing scheduling algorithm, and resources may be allocated on the third computing node so that the target job can be executed by the target resources.
The above scheduling algorithm can be used to determine the communication and invocation procedure of the checkpoint-restart service. It can schedule a checkpointed job back to the waiting queue and set it to the new state, set the conditions for rescheduling the job onto computing nodes, and resume job execution when the conditions are met. The scheduling algorithm can also update the job's computing node list and the corresponding process IDs.
In an optional embodiment, the computing nodes on the new computing node list can be computed by the scheduling algorithm, and such a computing node is determined to be the third computing node.
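The two ways of determining the third computing node can be sketched as follows. This is an illustration only: the present application does not describe the scheduling algorithm itself, so the "most free cores first" selection below is a placeholder, and the names `choose_restore_nodes`, `requested`, and `free_cores` are hypothetical.

```python
from typing import Dict, List, Optional


def choose_restore_nodes(
    requested: Optional[List[str]],
    free_cores: Dict[str, int],
    cores_per_node: int,
    node_count: int,
) -> List[str]:
    """Pick the nodes on which a checkpointed job is to be restored."""
    # Case 1: a scheduling request names the nodes explicitly.
    if requested:
        return list(requested)
    # Case 2: the recovery condition was met, so run a scheduling algorithm.
    # Placeholder policy: nodes with enough free cores, most free cores first.
    candidates = [
        node
        for node, cores in sorted(free_cores.items(), key=lambda kv: -kv[1])
        if cores >= cores_per_node
    ]
    if len(candidates) < node_count:
        raise RuntimeError("not enough free computing nodes to restore the job")
    return candidates[:node_count]
```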
In the above embodiments of the present application, the scheduling node includes a scheduler, a running queue, and a to-be-scheduled queue. After the scheduling node sends the recovery information to the third computing node, the method further includes: the scheduler receives a recovery result fed back by the third computing node, where the recovery result indicates whether the target process was restored successfully; when the recovery result is that the target process was restored successfully, the scheduler updates the job status of the target job to running and moves the target job from the to-be-scheduled queue to the running queue; and the scheduler outputs third prompt information, where the third prompt information indicates that the target job was restored successfully.
In an optional embodiment, the scheduler receives the recovery result fed back by the third computing node. If the recovery result is that the target process was restored successfully, the scheduler can update the job status of the target job to running, which makes the job status of the target job easy to determine, and can move the target job from the to-be-scheduled queue to the running queue. The scheduler can also output the third prompt information so that the user knows that the target job was restored successfully.
In the above embodiment of the present application, when there are multiple third computing nodes, the method further includes: when the recovery result of at least one third computing node indicates that restoring the target process failed, the scheduler determines a fourth computing node among the multiple third computing nodes, where the recovery result of the fourth computing node indicates that the target process was restored successfully; the scheduler sends a stop request to the fourth computing node, where the stop request asks the fourth computing node to stop running the target process; and the scheduler outputs fourth prompt information, which indicates that recovery of the target job failed.
The target process on the fourth computing node has been restored successfully, that is, the action of restoring the target process has already been completed on the fourth computing node.
In an optional embodiment, if the recovery result of at least one of the multiple third computing nodes indicates that restoring the target process failed, the scheduler determines the fourth computing node, that is, a third computing node on which the target process was restored successfully, and stops the target process on that node so that the recovery state of the target process is consistent across all third computing nodes. The scheduler then outputs the fourth prompt information to inform the user that recovery of the target job failed.
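The all-or-nothing handling of recovery results can be sketched as follows. This is only an illustrative Python sketch: the function `handle_restore_results`, the `stop_process` callback that stands in for the stop request, and the returned status strings are assumptions, not part of the described method.

```python
from typing import Callable, Dict, List


def handle_restore_results(
    results: Dict[str, bool],
    stop_process: Callable[[str], None],
) -> str:
    """Return "running" if every node restored the process, otherwise roll back.

    `results` maps the name of each third computing node to its recovery result.
    `stop_process` is a hypothetical callback that sends the stop request.
    """
    if all(results.values()):
        return "running"
    # Nodes that restored successfully play the role of fourth computing nodes.
    fourth_nodes: List[str] = [node for node, ok in results.items() if ok]
    for node in fourth_nodes:
        stop_process(node)
    return "restore_failed"


stopped: List[str] = []
outcome = handle_restore_results(
    {"node-a": True, "node-b": False, "node-c": True},
    stop_process=stopped.append,
)
```

Here one failed node causes the already restored processes on `node-a` and `node-c` to be stopped, so no node keeps running a partially restored job.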
With the method proposed in this application, which integrates job-process checkpoint-restart technology into the HPC job system, the scheduling algorithm and policy of the job system can be improved so that jobs submitted by users to the HPC cluster can be migrated without the user noticing, which benefits the HPC cluster system in operation and maintenance, fault handling, and cluster power consumption.
Embodiment 2
According to an embodiment of the present application, another embodiment of a method for controlling a high-performance computing cluster is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.
FIG. 11 is a flowchart of a method for controlling a high-performance computing cluster according to Embodiment 2 of the present application. As shown in FIG. 11, the method may include the following steps:
Step S1102: the first computing node receives migration information sent by the scheduling node.
The target job is running on the first computing node. The migration information is sent by the scheduling node when it has received a job migration request and determined that the job status of the target job is running. The job migration request is a request to migrate the target job.
Step S1104: based on the migration information, the first computing node dumps the target process image corresponding to the target job to a preset storage device.
In the above embodiment of the present application, the migration information includes at least the process information of the target process and the storage path of the preset storage device. The first computing node dumping the target process image corresponding to the target job to the preset storage device based on the migration information includes: the first computing node determines, based on the process information, whether the target process exists on the first computing node; and when the target process exists on the first computing node, the first computing node dumps the target process image to the preset storage device based on the storage path.
In an optional embodiment, the first computing node obtains the migration information sent by the scheduling node through a first preset interface.
The first preset interface may be Job Progress CheckPoint. Its input may be a set of job process IDs and a storage path for the job process image, its return value may be success or failure, and its action may be to checkpoint the corresponding processes and dump them to the specified storage.
Because one job may correspond to multiple processes on a computing node, a set of job process IDs must be specified. The process dump image can be placed in any valid storage; to make migration easier, it is generally placed in shared storage mounted on the cluster's computing nodes. This is only an example and is not limiting.
The process information of the target process may be the name, data, and so on of the target process.
The storage path of the preset storage device may be the path name, path information, and so on of the storage path, and it may be a persistent storage path.
In an optional embodiment, the first computing node obtains the migration information sent by the scheduling node through the first preset interface and determines from it the process information of the target process and the storage path of the preset storage device. Based on the process information, the first computing node determines whether the target process of the target job exists on it. If the target process exists, the process of the target job has not ended, and the target process image can be dumped to the preset storage device. If the target process does not exist, the process of the target job has already ended, and the target process image does not need to be dumped.
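The check-then-dump behaviour of the first computing node can be sketched in Python as follows. The sketch makes several assumptions that the text does not specify: the set `live_pids` stands in for the node's process table, `dump` is an injected callback that stands in for the real checkpoint tool, the per-process file name `<pid>.img` is invented for illustration, and the three status strings (in particular "skipped" for a job whose processes have already ended) are illustrative return values.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Set


def job_progress_checkpoint(
    pids: Iterable[int],
    image_dir: Path,
    live_pids: Set[int],
    dump: Callable[[int, Path], None],
) -> str:
    """Dump every target process that still exists on this node."""
    targets = [pid for pid in pids if pid in live_pids]
    if not targets:
        return "skipped"  # the job's processes have ended; nothing to dump
    try:
        image_dir.mkdir(parents=True, exist_ok=True)
        for pid in targets:
            dump(pid, image_dir / f"{pid}.img")
    except OSError:
        return "failed"
    return "dumped"


def fake_dump(pid: int, target: Path) -> None:
    # Placeholder for a real checkpoint tool; it only writes a marker file.
    target.write_text(f"image of process {pid}")


with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp) / "job-1"
    status = job_progress_checkpoint([101, 102], job_dir, {7, 101, 102}, fake_dump)
    dumped = sorted(p.name for p in job_dir.iterdir())
skipped = job_progress_checkpoint([5], Path("unused"), set(), fake_dump)
```

Passing the dump routine as a parameter keeps the existence check separate from whichever checkpoint mechanism an implementation actually uses.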
In the above embodiment of the present application, after the first computing node dumps the target process image corresponding to the target job to the preset storage device based on the migration information, the method further includes: the third computing node receives recovery information sent by the scheduling node, where the recovery information is sent by the scheduling node when the target job meets the recovery condition or when a scheduling request is received; the third computing node reads the target process from the preset storage device; and the third computing node runs the target process.
In an optional embodiment, the third computing node obtains the recovery information sent by the scheduling node through a second preset interface.
The second preset interface may be Job Progress Restart. Its input may be the job image path, its return value may be success or failure, and its action may be to restore the dumped job image onto the computing node.
Because the job process ID information is already saved in the dump image, the input only needs to specify the job image path.
In an optional embodiment, the third computing node obtains the recovery information sent by the scheduling node through the second preset interface and actively reads the target process from the preset storage device according to the storage path, so that the target process of the target job is restored onto the third computing node.
In the above embodiment of the present application, the third computing node running the target process includes: the third computing node reads the process information of the target process from the preset storage device; the third computing node determines whether a process corresponding to the process information already exists; when such a process exists, the third computing node stops running the target process; and when no such process exists, the third computing node runs the target process.
Errors may occur while the target process is being restored, for example because the process ID is already occupied; in that case the computing node returns a failure to the scheduling node. After the scheduling node has received the responses of the checkpoint-restart service from the nodes in the new computing node list, the job resumes running if every node in the list returned success, and the scheduler updates the job status and the job's new node information in the database. If any computing node returns a failure for restoring the job process, the scheduler stops the processes that have already been restored on the other computing nodes and returns the corresponding information to the cluster administrator.
In an optional embodiment, the third computing node reads the process information of the target process from the preset storage device and uses it to determine whether the target process, for example its process ID, is already occupied. If a process corresponding to the process information exists, the target process is occupied, and the third computing node stops running the target process. If no such process exists, the target process is not occupied, and the third computing node runs the target process.
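The conflict check performed before restoring can be illustrated with the following Python sketch. It assumes, purely for illustration, that the dump directory contains a file named `pids.json` listing the saved process IDs, that `pids_in_use` stands in for the node's process table, and that `restore` is an injected callback standing in for the real restore tool. None of these names comes from the present application.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def job_progress_restart(
    image_dir: Path,
    pids_in_use: Set[int],
    restore: Callable[[int, Path], None],
) -> bool:
    """Restore a dumped job unless one of its saved process IDs is occupied."""
    # Assumed layout: the dump records its process IDs in `pids.json`.
    saved_pids = json.loads((image_dir / "pids.json").read_text())
    if any(pid in pids_in_use for pid in saved_pids):
        return False  # failure is reported back to the scheduling node
    for pid in saved_pids:
        restore(pid, image_dir)
    return True


restored: List[int] = []
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp)
    (image_dir / "pids.json").write_text(json.dumps([101, 102]))
    ok_free = job_progress_restart(image_dir, {7, 8}, lambda pid, d: restored.append(pid))
    ok_busy = job_progress_restart(image_dir, {102}, lambda pid, d: restored.append(pid))
```

Only the image path is passed in, which matches the statement that the process ID information travels with the dump image.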
It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this application are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided so that users can choose to authorize or refuse.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps can be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present application.
Embodiment 3
According to an embodiment of the present application, a high-performance computing cluster for implementing the above method for controlling a high-performance computing cluster is also provided. FIG. 12 is a schematic diagram of a high-performance computing cluster according to Embodiment 2 of the present application. As shown in FIG. 12, the high-performance computing cluster 1200 includes: a computing node 1202, a preset storage device 1204, and a scheduling node 1206.
The computing node 1202 is used to run jobs. The preset storage device 1204 is used to store processes. The scheduling node 1206 is connected to the computing node and is used to obtain the job status of the target job when a job migration request is received and, when the job status is running, to determine the first computing node and the target process on the first computing node, where the job migration request is a request to migrate the target job, the target job is running on the first computing node, and the target process is the process on the first computing node that corresponds to the target job. The first computing node is connected to the preset storage device and is used to store the target process image to the preset storage device.
The first computing node may be the computing node, among the computing nodes, on which the target job is located.
The preset storage device may be used to store one or more processes, and the target process may be the process stored in the preset storage device that corresponds to the target job.
In the above embodiment, the computing node is used to run jobs; the preset storage device is used to store process images; the scheduling node is connected to the computing node and is used to obtain the job status of the target job when a job migration request is received and, when the job status is running, to determine the first computing node and the target process on the first computing node, where the job migration request is a request to migrate the target job, the target job is running on the first computing node, and the target process is the process on the first computing node that corresponds to the target job; and the first computing node is connected to the preset storage device and is used to store the target process image to the preset storage device. This achieves the purpose of improving the reliability of the high-performance computing cluster.
It is easy to see that, when the job status is running, the computing nodes in the high-performance computing cluster can be checked to determine the first computing node on which the target job is running, and the target process image on the first computing node can be dumped to the preset storage device. The target process corresponding to the target job is thus dumped without the user noticing, which makes it easy to migrate a target job away from a potentially faulty node and reduces the risk of job failure. This improves the reliability of the target job and therefore of the high-performance computing cluster, and solves the technical problem in the related art that the reliability of high-performance computing clusters is low.
It should be noted that the preferred implementations involved in the above embodiment of the present application are the same as the solution, application scenarios, and implementation process provided in Embodiment 1, but are not limited to the solution provided in Embodiment 1.
Embodiment 4
According to an embodiment of the present application, a control apparatus of a high-performance computing cluster for implementing the above method for controlling a high-performance computing cluster is also provided. FIG. 13 is a schematic diagram of a control apparatus of a high-performance computing cluster according to Embodiment 4 of the present application. As shown in FIG. 13, the apparatus 1300 includes: an acquisition module 1302, a determination module 1304, and a sending module 1306.
The acquisition module is used to obtain the job status of the target job when the scheduling node receives a job migration request, where the job migration request is a request to migrate the target job. The determination module is used to determine, through the scheduling node and when the job status is running, the first computing node from the high-performance computing cluster and the target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job. The sending module is used to send migration information to the first computing node through the scheduling node, where the migration information controls the first computing node to dump the target process image to the preset storage device.
It should be noted that the acquisition module 1302, the determination module 1304, and the sending module 1306 correspond to steps S402 to S406 in Embodiment 1. The three modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and that the above modules may also run, as part of the apparatus, in the computer terminal 10 provided in Embodiment 1.
In the above embodiment of the present application, the scheduling node includes a scheduler, a running queue, and a to-be-scheduled queue, and the apparatus further includes: a receiving module, an update module, and an output module.
The receiving module is used to receive, through the scheduler, the dump result fed back by the first computing node, where the dump result indicates whether the target process was dumped successfully. The update module is used to have the scheduler, when the dump result indicates that the target process was dumped successfully, update the job status of the target job to a migration status and move the target job from the running queue to the to-be-scheduled queue, where the running queue stores jobs whose job status is running and the to-be-scheduled queue stores jobs whose job status is not running. The output module is used to have the scheduler output first prompt information, which indicates that the target job was migrated successfully.
In the above embodiment of the present application, the migration status includes migration completed and migrating. The update module is also used to obtain, through the scheduler, the amount of memory occupied by the target process and the storage speed of the preset storage device, and to update, through the scheduler, the job status to migration completed or migrating based on the memory amount and the storage speed.
In the above embodiment of the present application, the update module is also used to update the job status to migrating, through the scheduler, when the memory amount is greater than a preset memory amount and the storage speed is less than a preset speed. The update module is also used to update the job status to migration completed, through the scheduler, when the memory amount is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed.
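The two threshold conditions above can be written out as a small Python sketch. The function name, the parameter names, the units, and the status strings are illustrative assumptions. The text specifies only the two combinations shown; the sketch returns `None` for the remaining mixed combinations rather than guessing a behaviour for them.

```python
from typing import Optional

GIB = 1024 ** 3


def migration_state(
    memory_bytes: int,
    storage_bytes_per_s: float,
    memory_threshold: int,
    speed_threshold: float,
) -> Optional[str]:
    """Map memory footprint and storage speed to a migration status."""
    if memory_bytes > memory_threshold and storage_bytes_per_s < speed_threshold:
        return "migrating"  # large image on slow storage: the dump is still in progress
    if memory_bytes <= memory_threshold and storage_bytes_per_s >= speed_threshold:
        return "migration_completed"
    return None  # mixed case: not covered by the conditions stated in the text


large_slow = migration_state(64 * GIB, 0.5 * GIB, 16 * GIB, 1 * GIB)
small_fast = migration_state(4 * GIB, 2 * GIB, 16 * GIB, 1 * GIB)
mixed = migration_state(64 * GIB, 2 * GIB, 16 * GIB, 1 * GIB)
```

With a 16 GiB memory threshold and a 1 GiB/s speed threshold, a 64 GiB process on 0.5 GiB/s storage is reported as migrating, while a 4 GiB process on 2 GiB/s storage is reported as migration completed.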
In the above embodiment of the present application, the apparatus further includes: a determination module.
The determination module is also used so that, when the dump result of at least one first computing node indicates that dumping the target process failed, the scheduler determines a second computing node among the multiple first computing nodes, where the dump result of the second computing node indicates that the target process was dumped successfully; the scheduler sends a resume request to the second computing node, where the resume request asks the second computing node to continue running the target process; and the scheduler outputs second prompt information, which indicates that migration of the target job failed.
In the above embodiment of the present application, the determination module is also used so that, when the target job meets the recovery condition or a scheduling request is received, the scheduling node determines a third computing node from the high-performance computing cluster, where the scheduling request is a request to schedule the target job to the third computing node; and the scheduling node sends recovery information to the third computing node, where the recovery information controls the third computing node to read the target process from the preset storage device and run the target process.
In the above embodiment of the present application, the determination module is also used to determine, through the scheduling node and based on the scheduling algorithm, the third computing node from the high-performance computing cluster when the target job meets the recovery condition, and to obtain the third computing node, when a scheduling request is received, by determining through the scheduling node the computing node in the high-performance computing cluster that corresponds to the scheduling request.
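The two ways of determining the third computing node can be sketched as follows in Python. The function `select_third_nodes`, the representation of the cluster as a mapping from node name to free cores, and the "most free cores first" ranking are assumptions made only for this sketch; the ranking is a stand-in and is not the scheduling algorithm of the present application.

```python
from typing import Dict, List, Optional


def select_third_nodes(
    cluster: Dict[str, int],
    nodes_needed: int,
    requested_nodes: Optional[List[str]] = None,
) -> List[str]:
    """Pick the nodes on which a checkpointed job is restored.

    `cluster` maps a node name to its number of free cores (illustrative).
    """
    if requested_nodes is not None:
        # A scheduling request names the nodes explicitly.
        unknown = [name for name in requested_nodes if name not in cluster]
        if unknown:
            raise ValueError(f"nodes not in cluster: {unknown}")
        return list(requested_nodes)
    # Recovery condition met: fall back to a placeholder scheduling algorithm.
    ranked = sorted(cluster, key=lambda name: (-cluster[name], name))
    return ranked[:nodes_needed]


cluster = {"node-a": 4, "node-b": 16, "node-c": 8}
by_algorithm = select_third_nodes(cluster, nodes_needed=2)
by_request = select_third_nodes(cluster, nodes_needed=1, requested_nodes=["node-a"])
```

An explicit scheduling request takes precedence in this sketch, and the algorithmic choice is used only when no node has been requested.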
In the above embodiment of the present application, the scheduling node includes a scheduler, a running queue, and a to-be-scheduled queue. The receiving module is also used to have the scheduler receive the recovery result fed back by the third computing node, where the recovery result indicates whether the target process was restored successfully. The update module is used to have the scheduler, when the recovery result indicates that the target process was restored successfully, update the job status of the target job to running and move the target job from the to-be-scheduled queue to the running queue. The output module is also used to have the scheduler output third prompt information, which indicates that the target job was restored successfully.
In the above embodiment of the present application, the apparatus further includes: a sending module.
The determination module is also used so that, when the recovery result of at least one third computing node indicates that restoring the target process failed, the scheduler determines a fourth computing node among the multiple third computing nodes, where the recovery result of the fourth computing node indicates that the target process was restored successfully. The sending module is used to have the scheduler send a stop request to the fourth computing node, where the stop request asks the fourth computing node to stop running the target process. The output module is also used to have the scheduler output fourth prompt information, which indicates that recovery of the target job failed.
It should be noted that the preferred implementations involved in the above embodiment of the present application are the same as the solution, application scenarios, and implementation process provided in Embodiment 1, but are not limited to the solution provided in Embodiment 1.
Embodiment 5
According to an embodiment of the present application, a control apparatus of a high-performance computing cluster for implementing the above method for controlling a high-performance computing cluster is also provided. FIG. 14 is a schematic diagram of a control apparatus of a high-performance computing cluster according to Embodiment 5 of the present application. As shown in FIG. 14, the apparatus 1400 includes: a receiving module 1402 and a dump module 1404.
The receiving module is used to receive, through the first computing node, the migration information sent by the scheduling node, where the target job is running on the first computing node, the migration information is sent by the scheduling node when it has received a job migration request and determined that the job status of the target job is running, and the job migration request is a request to migrate the target job. The dump module is used to have the first computing node dump, based on the migration information, the target process image corresponding to the target job to the preset storage device.
In the above embodiment of the present application, the migration information includes at least the process information of the target process and the storage path of the preset storage device. The dump module is also used to determine, through the first computing node and based on the process information, whether the target process exists on the first computing node and, when the target process exists on the first computing node, to dump the target process image to the preset storage device through the first computing node based on the storage path.
In the above embodiment of the present application, the apparatus further includes: a reading module and a running module.
The receiving module is used to have the third computing node receive the recovery information sent by the scheduling node, where the recovery information is sent by the scheduling node when the target job meets the recovery condition or when a scheduling request is received. The reading module is used to have the third computing node read the target process from the preset storage device. The running module is used to have the third computing node run the target process.
In the above embodiment of the present application, the running module is also used so that the third computing node reads the process information of the target process from the preset storage device and determines whether a process corresponding to the process information exists; when such a process exists, the third computing node stops running the target process; and when no such process exists, the third computing node runs the target process.
It should be noted that the preferred implementations involved in the above embodiment of the present application are the same as the solution, application scenarios, and implementation process provided in Embodiment 1, but are not limited to the solution provided in Embodiment 1.
Example 6
An embodiment of the present application may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the computer terminal may be replaced by a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one of a plurality of network devices in a computer network.
In this embodiment, the computer terminal may execute program code for the following steps of the method for controlling a high-performance computing cluster: upon receiving a job migration request, the scheduling node obtains the job status of a target job, where the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job; and the scheduling node sends migration information to the first computing node, where the migration information is used to control the first computing node to dump the target process image to a preset storage device.
Optionally, FIG. 15 is a structural block diagram of a computer terminal according to an embodiment of the present application. As shown in FIG. 15, the computer terminal A may include one or more processors 102 (only one is shown in the figure), a memory 104, a storage controller, and a peripheral interface, where the peripheral interface is connected to a radio frequency module, an audio module, and a display.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and device for controlling a high-performance computing cluster in the embodiments of the present application. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the above method for controlling a high-performance computing cluster. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the terminal A through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The processor may call, through a transmission device, the information and application programs stored in the memory to perform the following steps: upon receiving a job migration request, the scheduling node obtains the job status of a target job, where the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job; and the scheduling node sends migration information to the first computing node, where the migration information is used to control the first computing node to dump the target process image to a preset storage device.
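The scheduling-node steps above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the names `Job`, `JobStatus`, `MigrationInfo`, and `handle_migration_request`, the layout of the storage path, and the `send` callback that stands in for the transport to the computing node are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    # Hypothetical status values; the text only names "running" explicitly.
    PENDING = "pending"
    RUNNING = "running"


@dataclass
class Job:
    job_id: str
    status: JobStatus
    node: str  # computing node on which the job runs
    pid: int   # process on that node corresponding to the job


@dataclass
class MigrationInfo:
    pid: int           # process information of the target process
    storage_path: str  # storage path on the preset storage device


def handle_migration_request(job, storage_root, send):
    """Send migration info for a running job; return the node contacted or None."""
    if job.status is not JobStatus.RUNNING:
        # The steps above only cover jobs whose status is running.
        return None
    first_node = job.node
    # Assumed path layout: one directory per job under the storage root.
    info = MigrationInfo(pid=job.pid, storage_path=f"{storage_root}/{job.job_id}")
    send(first_node, info)
    return first_node
```

Passing `send` as a parameter keeps the sketch independent of any particular messaging mechanism between the scheduling node and the computing nodes.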
Optionally, the processor may further execute program code for the following steps: the scheduler receives a dump result fed back by the first computing node, where the dump result indicates whether the target process was dumped successfully; when the dump result indicates that the target process was dumped successfully, the scheduler updates the job status of the target job to a migration status and moves the target job from the running queue to the to-be-scheduled queue, where the running queue stores jobs whose job status is running and the to-be-scheduled queue stores jobs whose job status is not running; and the scheduler outputs first prompt information, which indicates that the target job was migrated successfully.
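A minimal sketch of this queue bookkeeping is given below. The function name `on_dump_result`, the status string `"migrated"`, and the use of `deque` objects for the two queues are assumptions made for illustration. The paragraph above does not say what happens when the dump fails, so the sketch simply leaves the state unchanged in that case.

```python
from collections import deque


def on_dump_result(job_id, dump_ok, status, running_queue, pending_queue, notify):
    """Apply a dump result reported by the first computing node."""
    if not dump_ok:
        # Failure handling is not described here; leave everything as it was.
        return False
    status[job_id] = "migrated"      # hypothetical name for the migration status
    running_queue.remove(job_id)     # no longer counted as running
    pending_queue.append(job_id)     # waits in the to-be-scheduled queue
    notify(f"job {job_id} migrated successfully")  # first prompt information
    return True
```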
Optionally, the processor may further execute program code for the following steps: the scheduler obtains the amount of memory occupied by the target process and the storage speed of the preset storage device; and the scheduler updates the job status to migration completed or migrating based on the memory amount and the storage speed.
Optionally, the processor may further execute program code for the following steps: when the memory amount is greater than a preset memory amount and the storage speed is less than a preset speed, the scheduler updates the job status to migrating; when the memory amount is less than or equal to the preset memory amount and the storage speed is greater than or equal to the preset speed, the scheduler updates the job status to migration completed.
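The two threshold conditions can be written as a small decision function. The function and parameter names below are hypothetical. Note that the paragraph above states only two of the four possible combinations, so this sketch returns `None` for the mixed cases rather than guessing at the intended behavior.

```python
def migration_status(memory_amount, storage_speed, preset_memory, preset_speed):
    """Map memory amount and storage speed to a job status, or None if unspecified."""
    if memory_amount > preset_memory and storage_speed < preset_speed:
        # Large image on slow storage: the dump is still in progress.
        return "migrating"
    if memory_amount <= preset_memory and storage_speed >= preset_speed:
        # Small image on fast storage: treat the dump as finished.
        return "migration_completed"
    # Mixed cases (large and fast, small and slow) are not covered by the text.
    return None
```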
Optionally, the processor may further execute program code for the following steps: when the target job meets a recovery condition or a scheduling request is received, the scheduling node determines a third computing node from the high-performance computing cluster, where the scheduling request is used to schedule the target job to the third computing node; and the scheduling node sends recovery information to the third computing node, where the recovery information is used to control the third computing node to read the target process from the preset storage device and run the target process.
Optionally, the processor may further execute program code for the following steps: when the target job meets the recovery condition, the scheduling node determines the third computing node from the high-performance computing cluster based on a scheduling algorithm; when a scheduling request is received, the scheduling node determines the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
Optionally, the processor may further execute program code for the following steps: when the target job meets the recovery condition, determining the third computing node from the high-performance computing cluster based on a scheduling algorithm; when a scheduling request is received, determining the computing node corresponding to the scheduling request from the high-performance computing cluster to obtain the third computing node.
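The two ways of choosing the third computing node can be sketched as below. The scheduling algorithm is not specified in this passage, so the sketch substitutes a simple stand-in that prefers the node with the most idle cores. The function name `pick_third_node` and the `free_cores` mapping are likewise assumptions.

```python
def pick_third_node(free_cores, requested_node=None):
    """Return the name of the node that should restore the target job.

    free_cores maps node name -> idle CPU cores (a hypothetical cluster view).
    """
    if requested_node is not None:
        # Scheduling request received: use the node that the request names.
        if requested_node not in free_cores:
            raise KeyError(f"{requested_node} is not in the cluster")
        return requested_node
    # Recovery condition met: stand-in scheduling algorithm that picks
    # the node with the most idle cores.
    return max(free_cores, key=free_cores.get)
```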
Optionally, the processor may further execute program code for the following steps: the third computing node obtains, through a second preset interface, the recovery information sent by the scheduling node, where the recovery information includes at least the storage path of the preset storage device; and the third computing node reads the target process from the preset storage device based on the storage path.
Optionally, the processor may further execute program code for the following steps: the third computing node reads the process information of the target process from the preset storage device; the third computing node determines whether a process corresponding to the process information exists; if such a process exists, the third computing node stops running the target process; if no such process exists, the third computing node runs the target process.
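The existence check before restoring can be sketched as follows, assuming that the process information is a numeric process ID and that the node runs a POSIX system, where sending signal 0 with `os.kill` only performs error checking. The names `pid_exists` and `restore_if_free` and the `restore` callback are hypothetical.

```python
import os


def pid_exists(pid):
    """POSIX-only check: signal 0 tests for the process without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restore_if_free(pid, restore, exists=pid_exists):
    """Run the restore callback only if no process already matches pid."""
    if exists(pid):
        # A matching process already exists: do not run the target process.
        return False
    restore()
    return True
```

The `exists` parameter lets a different check be supplied on platforms where the signal-0 convention does not apply.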
Optionally, the processor may further execute program code for the following steps: the scheduler receives a recovery result fed back by the third computing node, where the recovery result indicates whether the target process was recovered successfully; when the recovery result indicates that the target process was recovered successfully, the scheduler updates the job status of the target job to running and moves the target job from the to-be-scheduled queue to the running queue; and the scheduler outputs third prompt information, which indicates that the target job was recovered successfully.
Optionally, the processor may further execute program code for the following steps: when the recovery result of at least one third computing node indicates that recovery of the target process failed, the scheduler determines a fourth computing node among the multiple third computing nodes, where the recovery result of the fourth computing node indicates that the target process was recovered successfully; the scheduler sends a stop request to the fourth computing node, where the stop request asks the fourth computing node to stop running the target process; and the scheduler outputs fourth prompt information, which indicates that recovery of the target job failed.
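The rollback on partial failure can be sketched as follows. The function name `resolve_recovery`, the shape of `results`, and the two callbacks are assumptions for illustration only.

```python
def resolve_recovery(results, send_stop, notify):
    """Roll back a partially failed recovery.

    results maps each third computing node to True if it restored the
    target process successfully and False otherwise.
    """
    if all(results.values()):
        return True  # every node succeeded; nothing to roll back
    # Nodes that did succeed play the role of the "fourth computing nodes".
    fourth_nodes = [node for node, ok in results.items() if ok]
    for node in fourth_nodes:
        send_stop(node)  # request that the node stop the target process
    notify("target job recovery failed")  # fourth prompt information
    return False
```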
The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: the first computing node receives migration information sent by the scheduling node, where the target job is running on the first computing node, the migration information is sent by the scheduling node after it receives a job migration request and determines that the job status of the target job is running, and the job migration request is used to migrate the target job; and the first computing node dumps, based on the migration information, the target process image corresponding to the target job to the preset storage device.
Optionally, the processor may further execute program code for the following steps: the first computing node determines, based on the process information, whether the target process exists on the first computing node; if the target process exists on the first computing node, the first computing node dumps the target process image to the preset storage device based on the storage path.
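On the computing-node side, the handling of the migration information can be sketched as below. The dictionary keys, the result strings, and the `process_exists` and `dump` callbacks are hypothetical; the actual checkpoint mechanism is deliberately left abstract. Reporting a failure when the process is missing is also an assumption, since the paragraph above only describes the case in which the target process exists.

```python
def handle_migration_info(info, process_exists, dump):
    """Dump the target process if it exists; return a result for the scheduler."""
    pid = info["pid"]             # process information of the target process
    path = info["storage_path"]   # storage path on the preset storage device
    if not process_exists(pid):
        # Assumed behavior: report a failed dump when there is nothing to dump.
        return "dump_failed"
    dump(pid, path)
    return "dump_succeeded"
```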
Optionally, the processor may further execute program code for the following steps: the third computing node receives recovery information sent by the scheduling node, where the recovery information is sent by the scheduling node when the target job meets a recovery condition or when a scheduling request is received; the third computing node reads the target process from the preset storage device; and the third computing node runs the target process.
Optionally, the processor may further execute program code for the following steps: the third computing node reads the process information of the target process from the preset storage device; the third computing node determines whether a process corresponding to the process information exists; if such a process exists, the third computing node stops running the target process; if no such process exists, the third computing node runs the target process.
With the embodiments of the present application, upon receiving a job migration request, the scheduling node first obtains the job status of the target job, where the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job; and the scheduling node sends migration information to the first computing node, where the migration information is used to control the first computing node to dump the target process image to a preset storage device, thereby improving the reliability of the high-performance computing cluster. It is easy to note that, when the job status is running, the computing nodes in the high-performance computing cluster can be checked to identify the first computing node on which the target job is running, and the target process image on the first computing node can be dumped to the preset storage device. The target process corresponding to the target job is thus dumped without the user noticing, which makes it easier to migrate a target job away from a potentially faulty node and reduces the risk of job failure. This improves the reliability of the target job and, in turn, of the high-performance computing cluster, and thereby solves the technical problem in the related art that high-performance computing clusters have low reliability.
A person of ordinary skill in the art can understand that the structure shown in FIG. 15 is merely illustrative, and that the computer terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 15 does not limit the structure of the above electronic device. For example, the computer terminal A may include more or fewer components than shown in FIG. 15 (such as a network interface or a display device), or may have a configuration different from that shown in FIG. 15.
A person of ordinary skill in the art can understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Example 7
An embodiment of the present application further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method for controlling a high-performance computing cluster provided in Example 1.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: upon receiving a job migration request, the scheduling node obtains the job status of a target job, where the job migration request is used to migrate the target job; when the job status is running, the scheduling node determines a first computing node from the high-performance computing cluster and determines a target process on the first computing node, where the target job is running on the first computing node and the target process is the process on the first computing node that corresponds to the target job; and the scheduling node sends migration information to the first computing node, where the migration information is used to control the first computing node to dump the target process image to a preset storage device.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred implementations of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.
Claims (16)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310441527.7A CN116501469A (en) | 2023-04-13 | 2023-04-13 | Control method of high-performance computing cluster, electronic equipment and storage medium |
| CN202310441527.7 | 2023-04-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024213056A1 (en) | 2024-10-17 |
Family
ID=87322400
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/087280 Pending WO2024213056A1 (en) | 2023-04-13 | 2024-04-11 | Method for controlling high-performance computing cluster, and electronic device and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116501469A (en) |
| WO (1) | WO2024213056A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119127651A (en) * | 2024-11-11 | 2024-12-13 | 杭州市北京航空航天大学国际创新研究院(北京航空航天大学国际创新学院) | A multi-dimensional situation awareness method and device for intelligent operation and maintenance of cluster systems |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116501469A (en) * | 2023-04-13 | 2023-07-28 | 阿里巴巴(中国)有限公司 | Control method of high-performance computing cluster, electronic equipment and storage medium |
| CN117290075B (en) * | 2023-11-23 | 2024-02-27 | 苏州元脑智能科技有限公司 | Process migration method, system, device, communication equipment and storage medium |
| CN120223743A (en) * | 2025-05-27 | 2025-06-27 | 天津市天河计算机技术有限公司 | Method, system and storage medium for determining user machine time in HPC cluster |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102411519A (en) * | 2011-09-08 | 2012-04-11 | 曙光信息产业股份有限公司 | Process recovery method and device |
| US10275276B2 (en) * | 2013-08-19 | 2019-04-30 | International Business Machines Corporation | Migrating jobs from a source server from which data is migrated to a target server to which the data is migrated |
| CN111274111A (en) * | 2020-01-20 | 2020-06-12 | 西安交通大学 | A prediction and anti-aging method for microservice aging |
| CN113296690A (en) * | 2020-07-27 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Data migration method, device and equipment |
| CN116501469A (en) * | 2023-04-13 | 2023-07-28 | 阿里巴巴(中国)有限公司 | Control method of high-performance computing cluster, electronic equipment and storage medium |
2023
- 2023-04-13 CN CN202310441527.7A patent/CN116501469A/en active Pending
2024
- 2024-04-11 WO PCT/CN2024/087280 patent/WO2024213056A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN116501469A (en) | 2023-07-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2024213056A1 (en) | | Method for controlling high-performance computing cluster, and electronic device and storage medium |
| US12153499B2 (en) | | System and method for backing up highly available source databases in a hyperconverged system |
| US11995100B2 (en) | | System and method for highly available database service |
| US10628273B2 (en) | | Node system, server apparatus, scaling control method, and program |
| US11907167B2 (en) | | Multi-cluster database management services |
| KR101970839B1 (en) | | Replaying jobs at a secondary location of a service |
| EP2946293B1 (en) | | Healing cloud services during upgrades |
| JP6580035B2 (en) | | Pre-configuration and pre-launch computational resources |
| US9069597B2 (en) | | Operation management device and method for job continuation using a virtual machine |
| JP5948933B2 (en) | | Job continuation management apparatus, job continuation management method, and job continuation management program |
| US20250068645A1 (en) | | Multi-cluster database management system |
| JP2008293358A (en) | | Distributed processing program, distributed processing method, distributed processing apparatus, and distributed processing system |
| CN106559441B (en) | | Virtual machine monitoring method, device and system based on cloud computing service |
| JP6123626B2 (en) | | Process resumption method, process resumption program, and information processing system |
| US11892918B2 (en) | | System and method for availability group database patching |
| CN115277398A (en) | | Cluster network configuration method and device |
| US20240345844A1 (en) | | Cluster Management Method, Device, and Computing System |
| JP2023084777A (en) | | Disaster recovery system and disaster recovery method |
| CN112965790A (en) | | PXE protocol-based virtual machine starting method and electronic equipment |
| CN115480921A (en) | | Task scheduling method, storage medium and electronic device |
| JP5544516B2 (en) | | Highly available server system, high availability server system failure recovery method, and highly available server |
| CN120029828B (en) | | Process state recovery method and device, storage medium and electronic device |
| CN115454450B (en) | | Method and device for resource management of data job, electronic equipment and storage medium |
| CN119003085A (en) | | Micro-service migration method and device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24788158; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |