CN111666134A - Method and system for scheduling distributed tasks

Info

Publication number: CN111666134A
Application number: CN201910163677.XA
Authority: CN
Inventors: 廖耀华
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2020-09-15
Anticipated expiration: 2039-03-05
Also published as: CN111666134B

Abstract

The invention discloses a distributed task scheduling method and system, and relates to the technical field of computers. A specific implementation of the method includes: receiving one or more query requests for a task lock object of a task from one or more servers; querying the task lock object in a database; in response to finding the task lock object, sending a record to the one or more servers, the record including a task lock status and a particular version number; receiving first update data for the task from a first server of the one or more servers, the first update data including a first version number; in response to determining that the first version number is the same as the particular version number, performing a first update to the database; and when an error event occurs in a second server of the one or more servers, causing the other servers of the one or more servers to trigger a listening event. This implementation improves the reliability of the distributed system and reduces the complexity of solution deployment.

Description

A method and system for distributed task scheduling

Technical Field

The present invention relates to the field of computer technology, and in particular to a method and system for distributed task scheduling.

Background

Many distributed task scheduling schemes exist. A common approach is to acquire a distributed lock through Redis or Zookeeper to coordinate task processing across multiple hosts. A distributed lock is exclusive: at any given time only one process can acquire the lock and execute the task, and no other process can hold it simultaneously. The process releases the lock when the task completes, but because computers are not 100% reliable, releasing the lock can fail.

Acquiring a distributed lock through Redis is currently the most widely used technique. The basic principle is that multiple threads compete for the lock through an atomic Redis operation, and only one thread obtains it. The usual remedy for a failed release under this scheme is to set an expiration time on the Redis key, so that the lock is released on expiry even if the explicit release fails. However, it is hard to determine how long the expiration time should be. Acquiring a distributed lock through Zookeeper relies on its ephemeral sequential nodes. This approach is relatively cumbersome to deploy and is rarely used in production environments.

In the process of realizing the present invention, the inventor found at least the following problems in the prior art:

(1) In the Redis scheme, an improperly chosen key expiration time causes lock failure, yet it is hard to determine what expiration time is appropriate.

(2) Because the Zookeeper scheme depends strongly on Zookeeper, the reliability of the distributed system cannot be guaranteed, and the Zookeeper scheme is relatively cumbersome to deploy.

Therefore, current solutions do not adequately address lock failure in distributed systems, which makes the reliability of the distributed system difficult to guarantee.

Summary of the Invention

In view of this, embodiments of the present invention provide a method and system for distributed task scheduling that replace the traditional Redis scheme with database optimistic locking. No expiration time needs to be set, which avoids the lock failure caused by an improperly chosen key expiration time. Using Zookeeper only as an auxiliary check on top of the optimistic lock avoids the deployment difficulties of adopting the Zookeeper scheme directly and removes the strong dependence on Zookeeper. The embodiments thus avoid the problems of the existing schemes while also solving the lock failure problem in distributed task scheduling.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for distributed task scheduling is provided.

The method for distributed task scheduling according to an embodiment of the present invention includes:

receiving one or more query requests for a task lock object of a task from one or more servers;

querying the task lock object in a database;

in response to finding the task lock object, reading a record related to the task lock object from the database and sending the record to the one or more servers, the record including a task lock status and a particular version number;

receiving first update data for the task from a first server of the one or more servers, the first update data including a first version number;

in response to determining that the first version number is the same as the particular version number, performing a first update to the database using the first update data; and

when an error event occurs in a second server of the one or more servers, causing the other servers of the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

Optionally, before receiving the one or more query requests for the task lock object of the task from the one or more servers, the method further includes:

creating a parent node corresponding to a check server; and

creating, under the parent node, one or more child nodes corresponding to the one or more servers, wherein the parent node maintains a list of all child nodes currently alive under it.

Optionally, the one or more child nodes are a list of the IP addresses of the corresponding one or more servers.

Optionally, the parent node and the one or more child nodes are EPHEMERAL type nodes.

Optionally, causing the other servers of the one or more servers to trigger the listening event further includes:

deleting a second child node corresponding to the second server from the list to obtain a new list;

sending the new list to the child nodes in the new list;

receiving a lock query request from a server corresponding to a child node in the new list, wherein the lock query request asks whether the second child node holds an unreleased lock;

querying the database according to the lock query request; and

in response to finding that the second child node holds an unreleased lock, releasing the lock, wherein the lock is an optimistic lock for the task.

Optionally, when the second child node holds an unreleased lock, the task lock status of the task is 1.

Optionally, releasing the lock further includes: setting the task lock status of the task to 0.
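The listening-event flow above can be sketched in a few lines. This is a minimal simulation rather than the patented implementation: a plain Python class (`ChildList`, a hypothetical name) stands in for the Zookeeper parent node and its watches, and the table name `task_lock` and the columns `lock_status` and `owner_ip` are assumptions made for illustration.

```python
import sqlite3


class ChildList:
    """In-memory stand-in for the parent node's list of live child nodes."""

    def __init__(self):
        self.children = []   # IP addresses of servers currently alive
        self.watchers = {}   # ip -> callback(failed_ip, new_list)

    def register(self, ip, on_change):
        self.children.append(ip)
        self.watchers[ip] = on_change

    def node_failed(self, failed_ip):
        # Delete the failed child, then send the new list to every survivor.
        self.children.remove(failed_ip)
        self.watchers.pop(failed_ip, None)
        new_list = list(self.children)
        for ip in new_list:
            self.watchers[ip](failed_ip, new_list)


def make_listener(conn):
    """Build the listening-event handler run by each surviving server."""
    def on_change(failed_ip, new_list):
        # Release any lock the failed server still holds (status 1 -> 0).
        conn.execute(
            "UPDATE task_lock SET lock_status = 0 "
            "WHERE owner_ip = ? AND lock_status = 1",
            (failed_ip,),
        )
        conn.commit()
    return on_change


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_lock (task_type INTEGER PRIMARY KEY, "
             "lock_status INTEGER, owner_ip TEXT)")
conn.execute("INSERT INTO task_lock VALUES (1, 1, '10.0.0.2')")

members = ChildList()
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    members.register(ip, make_listener(conn))

# Server 10.0.0.2 fails while holding the lock for task 1.
members.node_failed("10.0.0.2")
```

Every survivor runs the same release statement, which is harmless here because the update matches only rows that are still locked by the failed server.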

receiving the IP addresses of the one or more servers;

querying the database, according to the IP addresses, for tasks whose task lock status is 1; and

in response to finding a task whose task lock status is 1, setting the task lock status of the task to 0.

Optionally, after querying the task lock object in the database, the method further includes:

in response to not finding the task lock object, writing a record related to the task lock object to the database.

Optionally, the record includes at least the following fields: a task type field, a task description field, a task lock status field, and a version number field.

According to another aspect of the embodiments of the present invention, a system for distributed task scheduling is provided.

The system for distributed task scheduling according to an embodiment of the present invention includes:

a query request receiving module, configured to receive one or more query requests for a task lock object of a task from one or more servers;

a lock object query module, configured to query the task lock object in a database;

a lock object processing module, configured to, in response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a particular version number;

an update receiving module, configured to receive first update data for the task from a first server of the one or more servers, the first update data including a first version number;

an update execution module, configured to, in response to determining that the first version number is the same as the particular version number, perform a first update to the database using the first update data; and

an auxiliary check module, configured to, when an error event occurs in a second server of the one or more servers, cause the other servers of the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

Optionally, the auxiliary check module is further configured to:

Optionally, the auxiliary check module is further configured to set the task lock status of the task to 0.

Optionally, the system further includes:

a server startup module, configured to receive the IP addresses of the one or more servers;

Optionally, the lock object processing module is further configured to:

Optionally, the record includes at least the following fields: a task type field, a task description field, a task lock status field, and a version number field.

According to another aspect of the embodiments of the present invention, an electronic device for distributed task scheduling is provided.

The electronic device for distributed task scheduling according to an embodiment of the present invention includes:

one or more processors; and

a storage system for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for distributed task scheduling provided in the first aspect of the embodiments of the present invention.

According to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided.

The computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the method for distributed task scheduling provided in the first aspect of the embodiments of the present invention.

One embodiment of the above invention has the following advantages or beneficial effects: because task scheduling in the distributed system uses database optimistic locking with Zookeeper as an auxiliary check, the embodiment overcomes the technical problems of lock failure caused by an improperly set key expiration time and of strong dependence on Zookeeper, thereby improving the reliability of the distributed system and reducing the complexity of solution deployment.

Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.

Description of the Drawings

The accompanying drawings are provided for a better understanding of the present invention and do not constitute an improper limitation of it. In the drawings:

FIG. 1 is a schematic diagram of the main flow of a method for distributed task scheduling according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of an exemplary Docker startup phase according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of an exemplary task execution phase according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of an exemplary program check phase according to an embodiment of the present invention;

FIG. 5 is a schematic flowchart of an exemplary Zookeeper check phase according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the main flow of another method for distributed task scheduling according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the main modules of a system for distributed task scheduling according to an embodiment of the present invention;

FIG. 8 is an exemplary system architecture diagram to which an embodiment of the present invention may be applied;

FIG. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

FIG. 1 is a schematic diagram of the main flow of a method for distributed task scheduling according to an embodiment of the present invention. As shown in FIG. 1, the method includes steps S101, S102, S103, S104, S105 and S106.

Step S101: Receive one or more query requests for the task lock object of a task from one or more servers.

This solution uses a database optimistic lock, which is released by the program running on Docker. Steps S101 to S105 are the normal execution flow of the optimistic locking scheme. In this document, the term "Docker" refers to a server, such as a Linux server, on which the program code that performs the method can be deployed. The terms "server" and "Docker" are used interchangeably throughout this document without affecting the implementation or technical effect of the solution.

The technical solution can be divided into two stages: (1) a normal execution stage and (2) an auxiliary check stage. The normal execution stage includes a Docker startup phase and a task execution phase. The auxiliary check stage includes a program check and a Zookeeper check. The Docker startup phase runs whenever a server restarts during the task execution phase. Its main purpose is to solve the problem of a lock that can no longer be released because the Docker restarted while still holding it, which further improves the reliability of the distributed system. The Zookeeper check mainly solves the problem of a lock that cannot be released because a Docker went down during the task execution phase. The program check runs periodically, for example once every 10 minutes, and starts together with the normal execution stage. It mainly solves the problem of the optimistic lock failing to be released when a task finishes during the task execution phase.

In the normal execution stage, while a task is executing, the database optimistic lock is not released until the task completes. However, Docker containers often have to be restarted when new requirements go live, and once a container has restarted, the optimistic lock it held is never released. Processing logic is therefore added to Docker startup to address this: after the Docker container starts, a listener runs automatically and releases the database optimistic locks held by the local machine.

Preferably, before receiving the one or more query requests for the task lock object of the task from the one or more servers (step S101), the method further includes:

The main steps of an exemplary Docker startup phase are described below with reference to FIG. 2.

FIG. 2 is a schematic flowchart of an exemplary Docker startup phase according to an embodiment of the present invention. As shown in FIG. 2, the exemplary Docker startup phase 200 includes steps S201, S202, S203, S204 and S205.

Step S201: Docker startup begins.

In one embodiment, the start of Docker startup includes starting the Docker application, initializing the Web container, and executing the task lock release listener.

Step S202: Execute the task lock release listener.

In one embodiment, the task lock release listener obtains the local IP address and uses it to query the database for tasks that are currently locked, that is, tasks whose lock status in the database is 1.

Step S203: Check whether any task is locked by the local machine.

Step S204: Release the lock.

If a locked task exists ("Yes" at step S203), the task's lock is released by setting the lock status of the task in the database to 0. If none exists ("No" at step S203), nothing is done and the flow skips to step S205.

Step S205: Docker startup is complete.

The Docker startup phase solves the problem of an optimistic lock that cannot be released after a Docker restart, and enhances the reliability of the distributed system. In some embodiments, the Docker startup phase may be omitted before step S101.
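Steps S202 to S204 amount to one conditional update keyed on the local IP address. The following is a minimal sketch, assuming a hypothetical `task_lock` table with `lock_status` and `owner_ip` columns. How the local IP is obtained is not specified in the text, so `local_ip` below is only one possible choice.

```python
import socket
import sqlite3


def local_ip():
    """Best-effort local address; assumes it matches the IP stored with the lock."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"


def release_locks_held_by(conn, ip):
    """Set lock_status back to 0 for every task still locked by `ip`.

    Returns the number of locks released (0 means nothing was locked).
    """
    cur = conn.execute(
        "UPDATE task_lock SET lock_status = 0 "
        "WHERE owner_ip = ? AND lock_status = 1",
        (ip,),
    )
    conn.commit()
    return cur.rowcount


def on_startup(conn):
    """Task lock release listener: run once when the server starts."""
    return release_locks_held_by(conn, local_ip())
```

Folding the check (S203) and the release (S204) into a single `UPDATE` avoids a separate read, and the "No" branch simply shows up as a return value of 0.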

Step S102: Query the task lock object in the database.

A Docker that wants to execute a task must first query whether the task is locked by another Docker. Therefore, after the one or more query requests for the task lock object of the task are received from the one or more servers in step S101, the task lock object is queried in the database to determine whether the task is currently locked by another server.

In one embodiment, the Docker application queries the task lock object by task type.

Step S103: In response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a particular version number.

Preferably, after querying the task lock object in the database, the method further includes:

In one embodiment, if the task lock object is null, no task of this type is stored in the database, so a record is inserted into the database with the following fields: task type (int), task description (varchar), task lock status (int, value 0, indicating that the task is not yet locked), and version number (long, value 1). The local IP is then obtained.

In one embodiment, if the task lock object is not null, the task lock status is read from the task lock object. If its value is 1 (the task is already locked), the flow ends. If its value is 0 (the task is not locked), the local IP is obtained.
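The lookup-or-insert logic can be sketched with an in-memory SQLite database. The table name `task_lock` and the column names are assumptions; the text names only the fields themselves, plus the IP address and update time that are written later when the lock is taken.

```python
import sqlite3

# Hypothetical schema covering the fields listed in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_lock (
    task_type   INTEGER PRIMARY KEY,  -- task type (int)
    description TEXT,                 -- task description (varchar)
    lock_status INTEGER NOT NULL,     -- 0 = not locked, 1 = locked
    version     INTEGER NOT NULL,     -- optimistic-lock version number
    owner_ip    TEXT,                 -- IP of the server holding the lock
    updated_at  REAL                  -- time of the last update
)
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def get_or_create_lock(conn, task_type, description=""):
    """Return the lock record for a task type, inserting it if absent."""
    query = "SELECT * FROM task_lock WHERE task_type = ?"
    row = conn.execute(query, (task_type,)).fetchone()
    if row is None:
        # New records start unlocked (status 0) with version 1.
        conn.execute(
            "INSERT OR IGNORE INTO task_lock "
            "(task_type, description, lock_status, version) VALUES (?, ?, 0, 1)",
            (task_type, description),
        )
        conn.commit()
        row = conn.execute(query, (task_type,)).fetchone()
    return row
```

`INSERT OR IGNORE` together with the primary key is one way to keep two servers that both see a missing record from creating duplicates; the text does not say how that case is handled.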

Step S104: Receive first update data for the task from a first server of the one or more servers, the first update data including a first version number.

A Docker executes the task only after determining that the task is not locked, and several Dockers may simultaneously determine that the task is not locked and want to execute it. The optimistic locking mechanism is implemented by adding a version number (version) field to the database table. When data is read from the database, the version field is read along with it. To update the data and write it back, the version is incremented by 1 and the new data is written to the table together with the new version. The update must also check that the version value currently in the database is still the one read earlier. If it is, the update proceeds normally. If it is not, the update fails, which indicates that another process has updated the data in the meantime.

Therefore, each Docker application attempts to update the task's data in the database. The update conditions are: task type, task lock status (value 0), and version number (1 if the record was just inserted because the task lock object was null; otherwise the version read from the task lock object). The updated fields are: task lock status (set to 1), version number (the original version number plus 1), local IP address, and update time (the current time). Because of the database optimistic lock, only one Docker application can complete the update successfully.
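The conditional update described above can be written as a single SQL statement whose `WHERE` clause carries the version that the server read earlier. The sketch below uses SQLite and hypothetical table and column names; `try_acquire` is an illustrative helper, not a name from the source.

```python
import sqlite3
import time


def try_acquire(conn, task_type, seen_version, ip):
    """Attempt to take the lock; succeeds only if the version is unchanged."""
    cur = conn.execute(
        "UPDATE task_lock "
        "SET lock_status = 1, version = version + 1, owner_ip = ?, updated_at = ? "
        "WHERE task_type = ? AND lock_status = 0 AND version = ?",
        (ip, time.time(), task_type, seen_version),
    )
    conn.commit()
    # Exactly one modified row means this server won the lock.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_lock (task_type INTEGER PRIMARY KEY, lock_status INTEGER, "
    "version INTEGER, owner_ip TEXT, updated_at REAL)"
)
conn.execute("INSERT INTO task_lock (task_type, lock_status, version) VALUES (7, 0, 1)")

# Two servers both read version 1 before either one updates.
first = try_acquire(conn, 7, 1, "10.0.0.1")
second = try_acquire(conn, 7, 1, "10.0.0.2")
print(first, second)  # prints: True False
```

The second attempt fails because the first update has already changed the lock status to 1 and the version to 2, so its `WHERE` clause no longer matches any row.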

Step S105: In response to determining that the first version number is the same as the particular version number, perform a first update to the database using the first update data.

In this case, what is received in step S104 is the update from the server that completes its update earliest among the servers that want to execute the task, so that this server successfully obtains the optimistic lock and completes the update.

As mentioned above, the normal execution stage includes the Docker startup phase and the task execution phase. Each of the multiple Dockers generally goes through the Docker startup phase when it boots, and enters its task execution phase once booted. In one embodiment, in the task execution phase, when a task becomes due for execution, multiple Docker containers attempt to execute it at the same time. All of the Dockers then compete for the database optimistic lock, and only the Docker that obtains the optimistic lock can execute the task.

The main steps of an exemplary task execution phase are described below with reference to FIG. 3.

FIG. 3 is a schematic flowchart of an exemplary task execution phase according to an embodiment of the present invention. As shown in FIG. 3, the exemplary task execution phase 300 includes steps S301, S302, S303, S304, S305, S306 and S307.

Step S301: One or more Dockers prepare to execute the task.

In one embodiment, step S301 may further include sub-steps S301_1 to S301_N corresponding to the one or more Dockers, which respectively represent Docker_1 to Docker_N preparing to execute the task.

If the task lock object is null, no task of this type is stored in the database, so a record is inserted into the database with the following fields: task type (int), task description (varchar), task lock status (int, value 0, indicating that the task is not yet locked), and version number (long, value 1). The local IP address is then obtained.

If the task lock object is not null, the task lock status is read from the task lock object. If its value is 1 (the task is already locked), the flow ends. If its value is 0 (the task is not locked), the local IP address is obtained.

Step S302: The one or more Dockers attempt to acquire the database optimistic lock using the task number and the version number.

Each Docker application attempts to update the task's data in the database. The update conditions are: task type, task lock status (value 0), and version number (1 if the task lock object was found to be null in step S301; otherwise the version read from the task lock object). The updated fields are: task lock status (set to 1), version number (the original version number plus 1), local IP address, and update time (the current time).

Step S303: One of the one or more Dockers successfully acquires the optimistic lock.

Because of the database optimistic lock, only one Docker application can complete the update successfully.

Step S304: The Docker that acquired the optimistic lock starts executing the task.

The Docker application that acquired the optimistic lock starts executing the task. When the task completes, it releases the lock by updating the task lock status to 0.

Step S305: The Docker that acquired the optimistic lock finishes executing the task and releases the optimistic lock.

Step S306: The optimistic lock is released successfully.

Step S307: The optimistic lock release fails, and an alert email is sent.

In step S305, because the database cannot be 100% reliable, releasing the optimistic lock can fail. In addition, a Docker server going down also causes the release to fail, as does deploying the project while a task is executing. If releasing the optimistic lock fails, human intervention is required: an alert email and SMS are sent to promptly notify developers to release the optimistic lock manually.
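Steps S305 to S307 can be sketched as a release function that reports failures through a callback. This is only an illustration under assumed names: the `alert` parameter stands in for the email and SMS notification, and restricting the update to rows owned by the releasing server is an added safeguard that the text does not mention.

```python
import sqlite3


def release_lock(conn, task_type, ip, alert=print):
    """Set lock_status to 0; call `alert` with a message if that fails."""
    try:
        cur = conn.execute(
            "UPDATE task_lock SET lock_status = 0 "
            "WHERE task_type = ? AND owner_ip = ? AND lock_status = 1",
            (task_type, ip),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True   # S306: released successfully
        reason = "no locked record owned by this server"
    except sqlite3.Error as exc:
        reason = str(exc)
    # S307: release failed, notify someone who can release the lock manually.
    alert(f"lock release failed for task {task_type} on {ip}: {reason}")
    return False
```

Passing the notifier in as a parameter keeps the sketch independent of any particular mail or SMS service and makes the failure path easy to exercise.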

As mentioned above, the auxiliary check phase includes a program check and a Zookeeper check. Because the service cannot be guaranteed to be 100% reliable while tasks execute normally, the auxiliary check phase is added. The program check mainly addresses tasks that are interrupted during execution. It is similar to step S307 of the task execution phase described above, but applies at a different point in time: step S307 covers the case where the task has finished executing but releasing the optimistic lock fails, in which case the lock can be released directly by manual intervention, whereas the program check targets a problem during execution that interrupts the task.

The main steps of an exemplary program check phase are described below with reference to FIG. 4.

FIG. 4 is a schematic flowchart of an exemplary program check phase according to an embodiment of the present invention. As shown in FIG. 4, an exemplary program check phase 400 according to an embodiment of the present invention includes steps S401, S402, S403 and S404.

Step S401: The check program starts.

In one embodiment, after starting, the check program obtains the task objects that are currently executing from the task table.

Step S402: Obtain how long the task has held the optimistic lock.

In one embodiment, the time at which the task acquired the optimistic lock is subtracted from the current time to obtain how long the task has held the optimistic lock.

Step S403: Determine whether the holding time exceeds 30 minutes.

In other embodiments, the 30-minute value is configurable according to how long tasks take to execute. It is generally set to the lock holding time of the longest-running task among all tasks, so the condition is normally not triggered unless the program really has a problem.

Step S404: If the holding time is determined to exceed 30 minutes, send an alarm email and SMS.

If a task has held the optimistic lock for more than 30 minutes, an alarm email and SMS are sent. After receiving the alarm, the developer checks whether the task has been interrupted. If it has, the developer releases the optimistic lock manually; if the task is still executing, the alarm is ignored.
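Steps S402 to S404 amount to a subtraction and a threshold comparison. The following is a minimal sketch of that check, assuming the executing tasks are available as a mapping from task name to lock acquisition time; the names `MAX_HOLD` and `overdue_tasks` and the sample task names are hypothetical, and sending the alarm email and SMS is left out.

```python
from datetime import datetime, timedelta

MAX_HOLD = timedelta(minutes=30)  # configurable threshold (hypothetical name)

def overdue_tasks(running, now, max_hold=MAX_HOLD):
    """Return the tasks whose lock has been held for longer than max_hold."""
    return [task for task, acquired_at in running.items()
            if now - acquired_at > max_hold]

now = datetime(2019, 3, 5, 12, 0)
running = {
    "report": datetime(2019, 3, 5, 11, 50),   # held for 10 minutes
    "cleanup": datetime(2019, 3, 5, 11, 15),  # held for 45 minutes
}
alerts = overdue_tasks(running, now)  # tasks for which an alarm would be sent
```

Only the task that has held its lock for 45 minutes is reported; whether that task was actually interrupted is still decided by the developer who receives the alarm.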

Step S106: When an error event occurs in a second server among the one or more servers, cause the other servers among the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

Step S106 is the Zookeeper check, which is performed concurrently with the normal execution flow of the optimistic locking scheme described in steps S101 to S105.

When database optimistic locking is used to schedule a distributed system, a periodic task (for example, one executed every hour) that is currently executing is locked in the database, and the other Docker servers that execute tasks cannot execute it. If the Docker executing the task suddenly goes down halfway through, the task's lock in the database cannot be released, and unless measures are taken to release it, the task will never be executed again (even though it was meant to run every hour). The Zookeeper check provided herein solves the problem of a lock that cannot be released because a Docker went down. In one embodiment, the "release" action is performed by one of the Dockers.

The Zookeeper check mainly addresses Docker downtime. For example, if Docker_1 is executing task A and goes down before task A has ended normally, the optimistic lock held for task A is not released. When Docker_2 then tries to execute task A, it finds that the lock of task A has not been released and gives up, so task A would never be executed again.

Preferably, before receiving the one or more query requests for the task lock object of the task from the one or more servers, the method further includes:

Preferably, causing the other servers among the one or more servers to trigger the listening event further includes:

The main steps of an exemplary Zookeeper check phase are described below with reference to FIG. 5.

FIG. 5 is a schematic flowchart of an exemplary Zookeeper check phase according to an embodiment of the present invention. As shown in FIG. 5, an exemplary Zookeeper check phase 500 according to an embodiment of the present invention includes steps S501, S502, S503, S504, S505, S506 and S507.

Step S501: Zookeeper monitors the heartbeats of the one or more Dockers.

All Dockers that execute tasks register with Zookeeper. The registration logic is as follows. First, a node /SERVERS is created on the Zookeeper server. Then each Docker, on startup, creates an EPHEMERAL node under this node: for example, Docker_1 creates /SERVERS/${IP of Docker_1} and Docker_2 creates /SERVERS/${IP of Docker_2}. Docker_1, Docker_2, ..., Docker_n then all watch the parent node /SERVERS. An important property of EPHEMERAL nodes is that the node disappears when the connection between the client and the Zookeeper server is broken. When a Docker goes down, its node therefore disappears, and every client in the cluster that watches /SERVERS is notified.
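The registration and notification behaviour described above can be illustrated with a small in-memory model. This is only a sketch and does not use the Zookeeper client API: the class `ToyRegistry`, its methods and the session identifiers are hypothetical, and the model keeps watchers registered permanently, whereas real Zookeeper watches are one-time triggers that a client must set again after each notification.

```python
class ToyRegistry:
    """In-memory stand-in for the /SERVERS parent node and its ephemeral children."""

    def __init__(self):
        self.children = {}   # child name (server IP) -> owning session id
        self.watchers = []   # callbacks invoked with the current child list

    def create_ephemeral(self, session, name):
        self.children[name] = session
        self._notify()

    def watch_children(self, callback):
        self.watchers.append(callback)

    def close_session(self, session):
        # Losing the session removes every node that the session created.
        self.children = {n: s for n, s in self.children.items() if s != session}
        self._notify()

    def _notify(self):
        snapshot = sorted(self.children)
        for callback in self.watchers:
            callback(snapshot)

registry = ToyRegistry()
seen = []                                  # child lists delivered to one watcher
registry.watch_children(seen.append)
registry.create_ephemeral("session-1", "10.0.0.1")   # Docker_1 registers
registry.create_ephemeral("session-2", "10.0.0.2")   # Docker_2 registers
registry.close_session("session-1")                  # Docker_1 goes down
```

After the session of Docker_1 is closed, the watcher receives a child list that no longer contains the IP of Docker_1, which is the signal the surviving Dockers act on.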

Step S502: One of the one or more Dockers is found to be down.

Step S503: The surviving Dockers execute the watcher.

If Docker_1 goes down, the other Dockers (all except Docker_1) trigger the listening event and do the following. Each first fetches all child nodes under the /SERVERS node (the list of IPs of all surviving Dockers). Because Docker_1 is down, these child nodes no longer include the IP of Docker_1, so comparing them with the previous IP list yields the IP of Docker_1.
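The comparison with the previous IP list is a set difference. A minimal sketch, assuming each surviving Docker keeps the child list it saw last; the function name `find_crashed` and the sample addresses are hypothetical:

```python
def find_crashed(previous_ips, current_ips):
    """Return the IPs present in the previous child list but missing from the current one."""
    return sorted(set(previous_ips) - set(current_ips))

previous = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # children of /SERVERS seen last time
current = ["10.0.0.2", "10.0.0.3"]                # children of /SERVERS after the crash
crashed = find_crashed(previous, current)
```

Using a set difference also covers the case in which more than one Docker disappears between two notifications.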

Step S504: The surviving Dockers check whether the Docker that went down holds a lock.

Step S505: In response to determining that the Docker that went down holds the lock of a particular task, release the lock.

After obtaining the IP of Docker_1, the other Dockers query the database for the tasks that Docker_1 had not finished executing (that is, tasks whose optimistic lock is still held by Docker_1, which, having gone down, can never release it) and release the optimistic locks that Docker_1 holds on those tasks.
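Releasing the locks held by the Docker that went down can be sketched as one update keyed on its IP address. This is a sketch under assumptions rather than the implementation of the invention: the table `task_lock`, its columns and the function `release_locks_of` are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

def release_locks_of(conn, crashed_ip):
    """Set lock_status back to 0 for every task still locked by crashed_ip."""
    cur = conn.execute(
        "UPDATE task_lock SET lock_status = 0 "
        "WHERE owner_ip = ? AND lock_status = 1",
        (crashed_ip,),
    )
    conn.commit()
    return cur.rowcount  # number of locks released

# Hypothetical schema and sample records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_lock (task_type TEXT PRIMARY KEY, lock_status INTEGER, "
    "version INTEGER, owner_ip TEXT)"
)
conn.executemany(
    "INSERT INTO task_lock VALUES (?, ?, ?, ?)",
    [
        ("report", 1, 2, "10.0.0.1"),   # locked by the crashed Docker
        ("cleanup", 1, 5, "10.0.0.2"),  # locked by a surviving Docker
        ("export", 0, 3, "10.0.0.1"),   # already released
    ],
)
released = release_locks_of(conn, "10.0.0.1")
```

Because the update is restricted to records owned by the crashed IP and still locked, it can safely be issued by every surviving Docker: the first one releases the lock and the others match no rows.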

The terms in this application have their ordinary meanings. Optimistic locking is a more relaxed locking mechanism, and is mostly implemented by recording a version number with the data. "Zookeeper" is a distributed, open-source coordination service for distributed applications; it is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization and group services. "Docker" is an open-source application container engine that lets developers package their applications and dependencies into a portable container, which can then be deployed on any popular Linux machine; it can also be used for virtualization. Containers are fully sandboxed and have no interfaces to one another.

The present invention uses database optimistic locking together with Zookeeper to perform distributed task scheduling. In the lock acquisition phase, database optimistic locking ensures that each task is assigned to only one thread at any given time, and the lock is released once the task has finished executing. Thanks to the reliability and ease of use of the database, this scheme is simple to implement and reliable. One problem remains: when a Docker fails, the lock acquired for its task cannot be released. To handle this, Zookeeper monitors the Docker heartbeats; when a Docker goes down, Zookeeper detects it and the watcher is executed to release the lock. In this scheme Zookeeper is involved in releasing the lock only when a Docker fails, so it serves merely as an auxiliary facility and the scheme does not depend strongly on it. This solves both the problem of Redis lock failure and the problem of Zookeeper reliability. In some embodiments, embodiments of the present application may also be applied to pessimistic locking.

FIG. 6 is a schematic diagram of the main flow of another distributed task scheduling method according to an embodiment of the present invention. As shown in FIG. 6, this method includes steps S601, S602, S603, S604, S605, S606, S607, S608, S609 and S610.

Step S601: Create a parent node corresponding to a check server.

Step S602: Create, under the parent node, one or more child nodes corresponding to the one or more servers, wherein the parent node maintains a list of all child nodes currently alive under it.

Step S603: Receive one or more query requests for a task lock object of a task from the one or more servers.

Step S604: Query the database for the task lock object.

Step S605: In response to not finding the task lock object, write a record related to the task lock object to the database.

Step S606: In response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a specific version number.

Step S607: Receive first update data for the task from a first server among the one or more servers, the first update data including a first version number.

Step S608: In response to determining that the first version number differs from the specific version number, send a notification to the first server.

Step S609: In response to determining that the first version number is the same as the specific version number, perform a first update on the database using the first update data.

Step S610: When an error event occurs in a second server among the one or more servers, cause the other servers among the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.
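The query and update handling of steps S603 to S609 can be illustrated with an in-memory model. This is a sketch under assumptions only: the class `LockStore` and its methods are hypothetical, a dictionary stands in for the database, a newly written record is assumed to start unlocked at version 1, and the notification of step S608 is reduced to a `False` return value.

```python
class LockStore:
    """In-memory stand-in for the database of task lock records."""

    def __init__(self):
        self.records = {}  # task -> {"status": int, "version": int, "owner": str or None}

    def query(self, task):
        # S604/S605: look the record up and write it if it is missing
        # (initial status 0 and version 1 are assumptions of this sketch).
        rec = self.records.setdefault(task, {"status": 0, "version": 1, "owner": None})
        # S606: send the lock status and version number back to the caller.
        return rec["status"], rec["version"]

    def update(self, task, version, owner):
        rec = self.records[task]
        # S608: a stale version number is rejected; this sketch also rejects
        # an update to a task that is already locked.
        if rec["status"] != 0 or rec["version"] != version:
            return False
        # S609: the version numbers match, so the update is applied.
        rec.update(status=1, version=version + 1, owner=owner)
        return True

store = LockStore()
_, v1 = store.query("report")   # first server reads version 1
_, v2 = store.query("report")   # another server reads the same version
ok_first = store.update("report", v1, "10.0.0.1")
ok_second = store.update("report", v2, "10.0.0.2")
```

Both servers read the same version number, but only the first update is applied; the second carries a version number that no longer matches the stored record.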

FIG. 7 is a schematic diagram of the main modules of a distributed task scheduling system according to an embodiment of the present invention. As shown in FIG. 7, a distributed task scheduling system 700 according to an embodiment of the present invention includes:

A query request receiving module 701, configured to receive one or more query requests for a task lock object of a task from one or more servers.

A lock object query module 702, configured to query the database for the task lock object.

A lock object processing module 703, configured to, in response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a specific version number.

An update receiving module 704, configured to receive first update data for the task from a first server among the one or more servers, the first update data including a first version number.

An update execution module 705, configured to perform a first update on the database using the first update data in response to determining that the first version number is the same as the specific version number.

An auxiliary check module 706, configured to, when an error event occurs in a second server among the one or more servers, cause the other servers among the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

Optionally, the auxiliary check module 706 is further configured to:

Optionally, the one or more child nodes are a list of the IP addresses of the corresponding one or more servers.

Optionally, the parent node and the one or more child nodes are EPHEMERAL type nodes.

Optionally, when the second child node holds an unreleased lock, the task lock status of the task is 1.

Optionally, the distributed task scheduling system 700 further includes:

A server startup module 707, configured to receive the IP addresses of the one or more servers;

Optionally, the lock object processing module 703 is further configured to:

FIG. 8 shows an exemplary system architecture 800 to which the distributed task scheduling method or distributed task scheduling system of an embodiment of the present invention may be applied.

As shown in FIG. 8, the system architecture 800 may include terminal devices 801, 802 and 803, a network 804 and a server 805. The network 804 is the medium that provides communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired or wireless communication links or fiber optic cables.

A user may use the terminal devices 801, 802, 803 to interact with the server 805 through the network 804 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients and social platform software (examples only).

The terminal devices 801, 802, 803 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 805 may be a server that provides various services, for example a background management server that supports shopping websites browsed by users with the terminal devices 801, 802, 803 (example only). The background management server may analyze and otherwise process received data such as product information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal devices.

It should be noted that the distributed task scheduling method provided by the embodiments of the present invention is generally executed by the server 805, and accordingly the distributed task scheduling system is generally deployed in the server 805.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 8 are merely illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires.

Referring now to FIG. 9, a schematic structural diagram of a computer system 900 suitable for implementing a terminal device of an embodiment of the present invention is shown. The terminal device shown in FIG. 9 is only an example and should not limit the functions or scope of use of the embodiments of the present invention.

As shown in FIG. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the system 900. The CPU 901, the ROM 902 and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

In particular, according to the embodiments disclosed herein, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above-described functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium of the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor including a query request receiving module, a lock object query module, a lock object processing module, an update receiving module, an update execution module and an auxiliary check module. The names of these modules do not, in some cases, limit the modules themselves; for example, the lock object query module may also be described as "a module for querying the database for the task lock object".

In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: receive one or more query requests for a task lock object of a task from one or more servers; query the database for the task lock object; in response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a specific version number; receive first update data for the task from a first server among the one or more servers, the first update data including a first version number; in response to determining that the first version number is the same as the specific version number, perform a first update on the database using the first update data; and, when an error event occurs in a second server among the one or more servers, cause the other servers among the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

An embodiment of the above invention has the following advantages or beneficial effects: because database optimistic locking is used for task scheduling in a distributed system while Zookeeper is used for auxiliary checking, the technical problems of lock failure caused by an improperly set key expiration time and of strong dependence on Zookeeper are overcome, thereby improving the reliability of the distributed system and reducing the complexity of deploying the solution.

The specific embodiments above do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for distributed task scheduling, characterized by comprising:

receiving one or more query requests for a task lock object of a task from one or more servers;

querying a database for the task lock object;

in response to finding the task lock object, reading a record related to the task lock object from the database and sending the record to the one or more servers, the record including a task lock status and a particular version number;

receiving first update data for the task from a first server of the one or more servers, the first update data including a first version number;

in response to determining that the first version number is the same as the particular version number, performing a first update to the database using the first update data; and

when an error event occurs in a second server of the one or more servers, causing other servers of the one or more servers to trigger a listening event, wherein the first server and the second server are the same or different.

2. The method according to claim 1, further comprising, before receiving the one or more query requests for the task lock object of the task from the one or more servers:

creating a parent node corresponding to an inspection server; and

creating, under the parent node, one or more child nodes corresponding to the one or more servers, wherein the parent node maintains a list of all child nodes currently alive under it.

3. The method according to claim 2, wherein the one or more child nodes are a list of IP addresses of the one or more servers corresponding to the one or more child nodes.

4. The method of claim 2, wherein the parent node and the one or more child nodes are EPHEMERAL type nodes.

5. The method according to claim 2, wherein causing the other servers of the one or more servers to trigger a listening event further comprises:

deleting a second child node corresponding to the second server from the list to obtain a new list;

sending the new list to the child nodes in the new list;

receiving a lock query request from a server corresponding to a child node in the new list, wherein the lock query request asks whether the second child node holds an unreleased lock;

querying the database according to the lock query request; and

in response to finding that the second child node holds an unreleased lock, releasing the lock, wherein the lock is an optimistic lock for the task.
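The failure-handling steps of claim 5 can be sketched as follows. This is an in-memory simulation under stated assumptions, not a Zookeeper client: the class `ParentNode`, its `register` and `fail` methods, the function `release_locks_of`, and the `task_lock` table with a `holder_ip` column are hypothetical names introduced for illustration. A real deployment would rely on Zookeeper's own child-node watches instead of the callback dictionary used here.

```python
import sqlite3


class ParentNode:
    """In-memory stand-in for the parent node: tracks live children and notifies watchers."""

    def __init__(self):
        self.children = []  # IP addresses of the servers currently alive
        self.watchers = {}  # ip -> callback(new_list, failed_ip)

    def register(self, ip, callback):
        """Add a child node for a server and remember its listening callback."""
        self.children.append(ip)
        self.watchers[ip] = callback

    def fail(self, ip):
        """Delete the failed child, then send the new list to every surviving child."""
        self.children.remove(ip)
        self.watchers.pop(ip, None)
        new_list = list(self.children)
        for survivor in new_list:
            self.watchers[survivor](new_list, ip)
        return new_list


def release_locks_of(conn, failed_ip):
    """Set every lock still held (status 1) by failed_ip back to 0; return the row count."""
    cur = conn.execute(
        "UPDATE task_lock "
        "SET lock_status = 0, holder_ip = NULL, version = version + 1 "
        "WHERE holder_ip = ? AND lock_status = 1",
        (failed_ip,),
    )
    conn.commit()
    return cur.rowcount
```

Each surviving server would pass a callback that calls `release_locks_of` with the failed server's IP address. Because the update only matches rows whose status is still 1, the first survivor releases the lock and later survivors change nothing.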

6. The method according to claim 5, wherein when the second child node holds an unreleased lock, the task lock status of the task is 1.

7. The method of claim 5, wherein releasing the lock further comprises: setting the task lock status of the task to 0.

8. The method according to claim 1, further comprising, before receiving the one or more query requests for the task lock object of the task from the one or more servers:

receiving the IP addresses of the one or more servers;

querying the database, according to the IP addresses, for tasks with a task lock status of 1; and

in response to finding a task whose task lock status is 1, setting the task lock status of the task to 0.

9. The method according to claim 1, further comprising, after querying the task lock object in the database:

in response to the task lock object not being found, writing a record related to the task lock object to the database.

10. The method according to claim 1 or 9, wherein the record comprises at least the following fields: a task type field, a task description field, a task lock status field and a version number field.
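The record layout of claim 10 and the write-on-miss step of claim 9 can be sketched together. The following Python example is an assumption-laden illustration: the table name `task_lock`, the column names, the use of the task type as primary key, the default values of 0, and the function `get_or_create_record` are not taken from the patent.

```python
import sqlite3

# Hypothetical schema carrying the four fields listed in claim 10.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_lock (
    task_type   TEXT PRIMARY KEY,            -- task type field
    task_desc   TEXT NOT NULL,               -- task description field
    lock_status INTEGER NOT NULL DEFAULT 0,  -- task lock status: 0 = released, 1 = held
    version     INTEGER NOT NULL DEFAULT 0   -- version number field
)
"""


def get_or_create_record(conn, task_type, task_desc):
    """Return the task's record, writing a new unlocked record if none is found."""
    conn.execute(SCHEMA)
    # INSERT OR IGNORE leaves an existing row untouched, so a second caller
    # that races on the same task type does not overwrite the first record.
    conn.execute(
        "INSERT OR IGNORE INTO task_lock (task_type, task_desc) VALUES (?, ?)",
        (task_type, task_desc),
    )
    conn.commit()
    return conn.execute(
        "SELECT task_type, task_desc, lock_status, version "
        "FROM task_lock WHERE task_type = ?",
        (task_type,),
    ).fetchone()
```

Calling `get_or_create_record` twice for the same task type returns the record written by the first call; the second call's description is ignored.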

11. A system for distributed task scheduling, comprising:

a query request receiving module, configured to receive, from one or more servers, one or more query requests for a task lock object of a task;

a lock object query module, configured to query the task lock object in a database;

a lock object processing module, configured to, in response to finding the task lock object, read a record related to the task lock object from the database and send the record to the one or more servers, the record including a task lock status and a particular version number;

an update receiving module, configured to receive first update data for the task from a first server of the one or more servers, the first update data including a first version number;

an update execution module, configured to perform a first update to the database using the first update data in response to determining that the first version number is the same as the particular version number; and

an auxiliary inspection module, configured to cause the other servers of the one or more servers to trigger a listening event when an error event occurs in a second server of the one or more servers, wherein the first server and the second server are the same or different.

12. The system of claim 11, wherein the auxiliary inspection module is further configured to:

create a parent node corresponding to the inspection server; and

13. The system according to claim 12, wherein the one or more child nodes are a list of IP addresses of the one or more servers corresponding to the one or more child nodes.

14. The system of claim 12, wherein the parent node and the one or more child nodes are EPHEMERAL type nodes.

15. The system of claim 12, wherein the auxiliary inspection module is further configured to:

sending the new list to child nodes in the new list;

querying the database according to the lock query request; and

16. The system of claim 15, wherein when the second child node holds an unreleased lock, the task lock status of the task is 1.

17. The system according to claim 15, wherein the auxiliary inspection module is further configured to: set the task lock status of the task to 0.

18. The system of claim 11, wherein the system further comprises:

a server startup module, configured to receive the IP addresses of the one or more servers;

19. The system according to claim 11, wherein the lock object processing module is further configured to:

20. The system according to claim 11 or 19, wherein the record comprises at least the following fields: a task type field, a task description field, a task lock status field and a version number field.

21. An electronic device for distributed task scheduling, comprising:

one or more processors; and

a storage system for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.

22. A computer-readable medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-10.