CN116303327A - Ceph client lock optimization method and device - Google Patents

Ceph client lock optimization method and device

Info

Publication number
CN116303327A
CN116303327A
Authority
CN
China
Prior art keywords
thread
osdmap
resource object
update
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310260649.6A
Other languages
Chinese (zh)
Inventor
邢典
刘宽
夏勇
黄景平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Cloud Technology Co Ltd
Original Assignee
China Telecom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Cloud Technology Co Ltd filed Critical China Telecom Cloud Technology Co Ltd
Priority to CN202310260649.6A priority Critical patent/CN116303327A/en
Publication of CN116303327A publication Critical patent/CN116303327A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2308Concurrency control
    • G06F16/2336Pessimistic concurrency control approaches, e.g. locking or multiple versions without time stamps
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a Ceph client lock optimization method and device. Following a divide-and-conquer idea, multiple threads each maintain their own OSDMap object, so no read lock is taken during IO reads and writes; a custom load-balancing mapping algorithm maps the load to specific threads, and, combined with the efficient lock-free queue SPDK ring as the inter-thread message synchronization mechanism, lock-free multi-threaded concurrent processing is achieved. The method can be widely applied as a multi-threaded lock-free solution in high-concurrency scenarios.

Description

Ceph client lock optimization method and device

Technical Field

The present application relates to the technical fields of cloud computing and distributed storage, and in particular to a Ceph client lock optimization method and device.

Background Art

Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability, and is a typical client-server distributed system. During Ceph client IO reads and writes, a read lock is taken on the OSDMap resource object; this object contains the attributes the Crush algorithm needs to compute which OSD an IO lands on. When an OSD failure causes the state to change, the cluster's central Monitor node updates the OSDMap resource object. When necessary, the Ceph client fetches the latest OSDMap resource object from the Monitor node, recomputes Crush, and retries the IO. The OSDMap resource object lock is a typical many-readers, few-writers lock.

The Ceph client encapsulates each IO into an Op operation and maintains a global opsMap in the client process that records all in-flight Ops. When an IO needs to be submitted, its Op is inserted into the map; when the IO completes, the Op is removed from the map; the map is also scanned periodically to handle Ops that have timed out. In the original Ceph client design, a mutex is used to synchronize the threads accessing this map, which typically becomes a system performance bottleneck in multi-threaded scenarios.

Summary of the Invention

To solve the above technical problems, the present application provides a Ceph client lock optimization method and device. For the many-readers, few-writers OSDMap resource object lock, the application lets multiple threads each maintain their own OSDMap resource object, avoiding the read lock on the IO read/write path; with the same hardware resources this shortens latency and effectively improves storage system performance. The technical solution adopted by the present application is as follows:

A Ceph client lock optimization method, characterized in that the Ceph client thread pool comprises a plurality of worker threads, one of which is designated as the main thread;

each thread corresponds to one OSDMap resource object and can access its corresponding OSDMap resource object without locking;

when the OSDMap resource object is updated, the main thread is responsible for the update process of the OSDMap resource object.

Further, the main thread being responsible for the update process of the OSDMap resource object specifically includes: when the OSDMap resource object is to be updated, the main thread first receives the update request, starts updating its own local OSDMap resource object, and forwards the update request to the other threads, notifying them to update their corresponding OSDMap resource objects.

Further, after the other threads have updated their corresponding OSDMap resource objects, they feed back an update-completion message to the main thread; once the main thread has received the update-completion messages from all other threads, the OSDMap resource object update is complete, and the main thread replies to the outside as a whole that the OSDMap resource object update is complete.

Further, the worker threads communicate with each other through a task queue plus an event mechanism, or inter-thread communication is accomplished using an SPDK ring and the SPDK polling mechanism.

A Ceph client lock optimization method, wherein the Ceph client thread pool comprises a plurality of worker threads;

when each IO is encapsulated into an Op for submission, a HASH algorithm is used to obtain the id of the thread that owns the Op;

the Op is delivered to the task queue of its owning thread for processing and an event notification is sent, or the owning thread itself takes the Op out through active polling, and the Op is inserted into that thread's opsMap for processing.

Further, after a thread receives an op_reply message from an OSD, it uses the same HASH algorithm on the message's tid to obtain the Op's owning thread and delivers the task to the owning thread's task queue; the owning thread looks up the Op in its own opsMap by the tid and processes it.

Further, the entire Op processing flow requires no write lock, and the submission and reply handling of the same Op take place in the same thread.

Further, inter-thread communication is implemented using a task queue plus an event notification mechanism, or using an SPDK ring and a polling mechanism.

A Ceph client lock optimization device, comprising a processor and a memory storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the above method.

A Ceph client lock optimization device, comprising a processor and a memory storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the above method.

The embodiments of the present application can achieve the following technical effects:

(1) In the improved solution, the HASH algorithm scatters and binds Ops to the service threads, and each thread only needs to maintain part of the total data set, which avoids the use of mutexes, improves system concurrency, and reduces IO latency;

(2) The data set processed by each thread is relatively fixed, so the principle of cache locality can be exploited more fully, yielding better performance with limited hardware resources;

(3) For many-readers, few-writers objects, each thread maintains its own copy of the object, so the IO read/write path requires no read lock, improving system concurrency.

Brief Description of the Drawings

In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the OSDMap resource object access flow in the prior art;

Fig. 2 is a schematic diagram of the Op processing flow in the prior art;

Fig. 3 is a schematic diagram of the improved OSDMap resource object update flow of the present application;

Fig. 4 is a schematic diagram of the improved Op processing flow of the present application.

Detailed Description of the Embodiments

To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of this application.

The Ceph client lock optimization in this application adopts a divide-and-conquer idea with a custom load-balancing mapping algorithm: each Op is assigned an auto-incrementing unique transaction ID, the ID is converted to a string and used as the input of the FNV-1 hash function, and the hash result modulo the number of workers (threads) gives the worker thread that owns the Op. After the load is mapped to a specific thread, the efficient lock-free queue SPDK ring inter-thread message synchronization mechanism is used to forward the task to that thread for processing (SPDK ring is an efficient multi-threaded programming model provided by the SPDK framework that uses a lock-free ring queue, called a ring, for efficient inter-thread message forwarding). This is a general method that can be widely applied as a multi-threaded lock-free solution in high-concurrency scenarios.
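
As an illustration of this mapping, a minimal C++ sketch is given below. The FNV-1 constants are the published 32-bit parameters; the function names and the plain modulo are assumptions for illustration, not the patent's exact implementation:

```cpp
#include <cstdint>
#include <string>

// FNV-1 (32-bit): start from the offset basis, then for each byte
// multiply by the FNV prime and XOR in the byte.
static uint32_t fnv1_32(const std::string &s) {
    uint32_t h = 2166136261u;      // FNV-1 32-bit offset basis
    for (unsigned char c : s) {
        h *= 16777619u;            // FNV-1 32-bit prime
        h ^= c;
    }
    return h;
}

// Map an Op's auto-incremented transaction ID (as a string) to a worker index.
static size_t owner_worker(uint64_t tid, size_t num_workers) {
    return fnv1_32(std::to_string(tid)) % num_workers;
}
```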

Fig. 1 is a schematic diagram of the OSDMap resource object access flow in the prior art. The existing Ceph client process takes an OSDMap read lock while processing IO: during normal IO reads and writes, each Ceph client thread first acquires the read lock on the OSDMap resource object, queries the relevant attributes of the object, and calls the Crush algorithm to compute the OSD node the IO should be sent to. When the OSD state changes, the OSDMap on the cluster's central Monitor node is updated and the OSDMap version number (epoch) is incremented. In some cases a client thread actively fetches the latest OSDMap information from the Monitor and refreshes the OSDMap resource object in local memory; the updating thread first acquires the write lock and then updates the OSDMap resource object.

Fig. 2 is a schematic diagram of the Op processing flow in the prior art. The existing Op processing flow accesses the opsMap under a mutex: each thread first acquires the write lock, inserts the Op into the global opsMap, modifies the Op's attributes, posts it to the network thread's send queue, notifies the network thread, and then releases the lock. After a thread receives the op_reply message from an OSD, it first acquires the write lock, looks up the corresponding Op in the opsMap by the message's transaction tid, deletes the Op from the opsMap after the relevant processing, and finally releases the lock.
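
For contrast with the improved flow described later, a schematic C++ sketch of this locked flow follows; the mutex, map, and function names are illustrative stand-ins, not the actual Ceph client code:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

struct Op { uint64_t tid; /* target OSD, payload, callbacks, ... */ };

static std::mutex ops_lock;                  // global lock contended by all threads
static std::map<uint64_t, Op *> ops_map;     // global map of in-flight Ops

void op_submit(Op *op) {
    std::lock_guard<std::mutex> l(ops_lock); // every submission serializes here
    ops_map[op->tid] = op;
    // ... fill in Op attributes, post to the network thread's send queue ...
}

void handle_op_reply(uint64_t tid) {
    std::lock_guard<std::mutex> l(ops_lock); // every reply serializes here too
    auto it = ops_map.find(tid);
    if (it != ops_map.end()) {
        // ... complete the Op ...
        ops_map.erase(it);
    }
}
```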

Fig. 3 is a schematic diagram of the improved OSDMap resource object update flow of the present application.

The Ceph client thread pool comprises a plurality of worker threads, one of which is designated as the main thread;

each thread corresponds to one OSDMap resource object and can access its corresponding OSDMap resource object without locking;

when the OSDMap resource object is updated, the main thread is responsible for the update process;

The main thread being responsible for the OSDMap resource object update process specifically includes: when the OSDMap resource object is to be updated, the main thread first receives the update request, starts updating its own local OSDMap resource object, and forwards the update request to the other threads, notifying them to update their corresponding OSDMap resource objects; after the other threads have updated their corresponding OSDMap resource objects, they feed back an update-completion message to the main thread; once the main thread has received the update-completion messages from all other threads, the OSDMap resource object update is complete, and the main thread replies to the outside as a whole that the update is complete;
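
A minimal C++ sketch of this coordination is given below, assuming each worker owns a private OSDMap copy and a per-worker task queue; all type and member names are illustrative assumptions, not the patent's implementation:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative stand-ins for the real structures, not the actual Ceph types.
struct OSDMap { int epoch = 0; /* crush map, OSD states, ... */ };
struct Task {
    enum Kind { UPDATE_MAP, UPDATE_DONE } kind;
    OSDMap map;      // used for UPDATE_MAP
    size_t from = 0; // used for UPDATE_DONE
};
struct Worker {
    OSDMap local_osdmap;     // private copy: read lock-free on the IO path
    std::deque<Task> queue;  // stands in for the task queue / SPDK ring
};

static std::vector<Worker> workers(4);
static const size_t main_idx = 0;   // the designated main thread
static size_t pending_acks = 0;

// Main thread: update its own copy, fan the new map out, and start counting acks.
void main_thread_handle_osdmap_update(const OSDMap &newmap) {
    workers[main_idx].local_osdmap = newmap;
    pending_acks = workers.size() - 1;
    for (size_t i = 0; i < workers.size(); ++i)
        if (i != main_idx)
            workers[i].queue.push_back({Task::UPDATE_MAP, newmap, main_idx});
}

// Worker i: apply the update, then ack back to the main thread's queue.
void worker_handle_update(size_t i, const Task &t) {
    workers[i].local_osdmap = t.map;
    workers[main_idx].queue.push_back({Task::UPDATE_DONE, OSDMap{}, i});
}

// Main thread: once every other worker has acked, reply "update complete" as a whole.
void main_thread_handle_update_done(const Task &) {
    if (--pending_acks == 0) {
        // ... reply to the caller that the OSDMap resource object update is complete ...
    }
}
```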

The worker threads communicate with each other through a task queue plus an event mechanism, or use an SPDK ring and the SPDK polling mechanism for inter-thread communication.

Fig. 4 is a schematic diagram of the improved Op processing flow of the present application. Before the Ceph client threads concurrently issue requests to the server, each thread first obtains the OSDMap and assembles an Op operation from the OSDMap information; the Op is eventually sent to the server for processing (op_submit in Fig. 4). While waiting for the server to complete, the client keeps the Op object in its opsMap. After the server completes the Op operation, it replies to the client with an ack message; when the client processes the ack message (op_reply in Fig. 4), it deletes the Op object or retries the Op operation based on the Op object.

In the improved Op processing flow, when each IO is encapsulated into an Op for submission, the HASH algorithm is used to obtain the id of the thread that owns the Op; the Op is delivered to the task queue of its owning thread for processing and an event notification is sent, or the owning thread itself takes the Op out through active polling and inserts it into its own opsMap for processing.

Subsequent processing is consistent with the original flow. After a service thread receives an op_reply message from an OSD, it uses the same HASH algorithm on the message's tid to obtain the Op's owning thread and delivers the task to the owning thread's task queue; the owning thread looks up the Op in its own opsMap by the tid and processes it.
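
A minimal C++ sketch of this per-thread routing follows, reusing the FNV-1 mapping sketched above; the per-worker queue and map members are illustrative, and the reply is completed inline rather than forwarded, for brevity:

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

struct Op { uint64_t tid; /* target OSD, payload, callbacks, ... */ };

struct OpWorker {
    std::deque<Op *> queue;            // per-thread task queue (or SPDK ring)
    std::map<uint64_t, Op *> ops_map;  // this thread's subset of the former global opsMap
};

static std::vector<OpWorker> op_workers(4);

// Same FNV-1-of-tid-modulo-workers mapping as sketched earlier.
static size_t owner_worker(uint64_t tid, size_t n) {
    uint32_t h = 2166136261u;
    for (unsigned char c : std::to_string(tid)) { h *= 16777619u; h ^= c; }
    return h % n;
}

// Submission path: hash the tid to find the owning worker and hand the Op over.
void op_submit(Op *op) {
    op_workers[owner_worker(op->tid, op_workers.size())].queue.push_back(op);
}

// Inside the owning worker: no lock is needed, only this thread touches its ops_map.
void worker_process_submit(OpWorker &w, Op *op) {
    w.ops_map[op->tid] = op;
    // ... compute the target OSD from this worker's local OSDMap and send ...
}

// Reply path: the receiving thread re-hashes the tid, so send and reply handling
// for one Op always land on the same worker thread.
void handle_op_reply(uint64_t tid) {
    OpWorker &w = op_workers[owner_worker(tid, op_workers.size())];
    auto it = w.ops_map.find(tid);
    if (it != w.ops_map.end()) {
        // ... complete or retry the Op ...
        w.ops_map.erase(it);
    }
}
```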

The entire Op processing flow requires no write lock, and the submission and reply handling of the same Op take place in the same thread. Inter-thread communication can be implemented with a task queue plus an event notification mechanism, or with an SPDK ring and a polling mechanism.
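
Where the SPDK ring variant is used, each worker's task queue would be an spdk_ring that only its owner polls. The rough C++ sketch below is based on the SPDK environment API (spdk/env.h) as the author understands it; exact signatures (for example the free-space out-parameter of spdk_ring_enqueue) vary between SPDK versions and should be checked against the version in use:

```cpp
#include <spdk/env.h>  // spdk_ring_create / spdk_ring_enqueue / spdk_ring_dequeue

static struct spdk_ring *g_ring = NULL;

// After spdk_env_init(): one multi-producer, single-consumer ring per worker, so
// any thread may enqueue while only the owning worker dequeues, with no lock.
void create_worker_ring() {
    g_ring = spdk_ring_create(SPDK_RING_TYPE_MP_SC, 4096, SPDK_ENV_SOCKET_ID_ANY);
}

// Producer side: hand a task (an Op, an op_reply, an OSDMap update) to the ring.
void post_task(void *task) {
    // Returns the number of objects actually enqueued; 0 means the ring is full.
    spdk_ring_enqueue(g_ring, &task, 1, NULL);
}

// Consumer side: the owning worker polls its ring from its run loop, so no event
// wakeup is needed; this is the SPDK polling mechanism referred to above.
void poll_tasks() {
    void *tasks[32];
    size_t n = spdk_ring_dequeue(g_ring, tasks, 32);
    for (size_t i = 0; i < n; ++i) {
        // ... dispatch tasks[i] on this worker thread ...
    }
}
```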

In summary, in the technical solution of the present application, each thread accesses its own thread-local OSDMap resource object. When the OSDMap resource object is updated, a designated main thread is responsible for the entire update process: on receiving the OSDMap update, it starts updating the object locally and notifies the other threads to update; after the other threads finish updating, they reply to the main thread with an update-completion message, and once the main thread finds that all other threads have finished updating, it replies to the outside as a whole that the update is complete.

For the opsMap lock, the improved method uses a divide-and-conquer idea: a specific mapping rule maps the Op corresponding to each IO to a back-end thread, so the original global opsMap is scattered across the threads and each thread maintains a subset of it; the threads communicate using a task queue plus an event notification mechanism, or an SPDK ring and a polling mechanism. Since the data set processed by each thread is relatively fixed, the principle of cache locality can be fully exploited, improving the program's running speed.

Although specific embodiments of the present application have been described above, those skilled in the art should understand that these are only examples, and the protection scope of the present application is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present application, and such changes and modifications all fall within the protection scope of the present application.

Claims (10)

1. A Ceph client lock optimization method, characterized in that a Ceph client thread pool comprises a plurality of worker threads, and one of the worker threads is designated as a main thread;
each thread corresponds to one OSDMap resource object and can access the corresponding OSDMap resource object without locking;
when the OSDMap resource object is updated, the main thread is responsible for the update process of the OSDMap resource object.
2. The method according to claim 1, wherein the main thread being responsible for the update process of the OSDMap resource object specifically comprises: when updating the OSDMap resource object, the main thread receives a request for updating the OSDMap resource object, locally starts updating the corresponding OSDMap resource object, forwards the request for updating the OSDMap resource object to the other threads, and informs the other threads to update their corresponding OSDMap resource objects.
3. The method of claim 2, wherein after the other threads update their corresponding OSDMap resource objects, an update-completion message is fed back to the main thread; if the main thread receives the update-completion messages fed back by all other threads, the update of the OSDMap resource object is complete, and the main thread replies the update completion of the OSDMap resource object to the outside as a whole.
4. The method according to any one of claims 1 to 3, wherein communication among the plurality of worker threads is accomplished via a task queue plus event mechanism, or inter-thread communication is accomplished using an SPDK ring and the SPDK polling mechanism.
5. A Ceph client lock optimization method, characterized in that a Ceph client thread pool comprises a plurality of worker threads;
when each IO is encapsulated into an Op for submission, a HASH algorithm is used to obtain the id of the thread that owns the Op;
the Op is delivered to the task queue of its owning thread for processing and an event notification is sent, or the owning thread takes the Op out by polling and inserts the Op into its own opsMap for processing.
6. The method of claim 5, wherein after a thread receives an op_reply message replied by an OSD, the owning thread of the Op is obtained using the same HASH algorithm according to the tid of the message, the task is then delivered to the task queue of the owning thread, and the owning thread obtains the Op by querying its own opsMap according to the tid and processes the Op.
7. The method of claim 6, wherein the entire Op processing flow requires no write lock, and the sending and reply processing of the same Op take place in the same thread.
8. The method according to any one of claims 5 to 7, wherein inter-thread communication is implemented using a task queue plus event notification mechanism, or inter-thread communication is implemented using an SPDK ring and polling mechanism.
9. A Ceph client lock optimization apparatus, comprising a processor and a memory storing instructions executable by the processor, wherein when the instructions are executed by the processor, the processor performs the method of any one of claims 1 to 4.
10. A Ceph client lock optimization apparatus, comprising a processor and a memory storing instructions executable by the processor, wherein when the instructions are executed by the processor, the processor performs the method of any one of claims 5 to 8.
CN202310260649.6A 2023-03-13 2023-03-13 Ceph client lock optimization method and device Pending CN116303327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310260649.6A CN116303327A (en) 2023-03-13 2023-03-13 A kind of Ceph client lock optimization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310260649.6A CN116303327A (en) 2023-03-13 2023-03-13 A kind of Ceph client lock optimization method and device

Publications (1)

Publication Number Publication Date
CN116303327A true CN116303327A (en) 2023-06-23

Family

ID=86830168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310260649.6A Pending CN116303327A (en) 2023-03-13 2023-03-13 A kind of Ceph client lock optimization method and device

Country Status (1)

Country Link
CN (1) CN116303327A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102203737A (en) * 2011-05-20 2011-09-28 华为技术有限公司 Method and device for multi-thread access to multiple copies
CN108829511A (en) * 2018-05-07 2018-11-16 中山大学 Load balancing adjusting method based on state machine replica management model
CN109407970A (en) * 2018-09-12 2019-03-01 新华三技术有限公司成都分公司 Read-write requests processing method, device and electronic equipment
CN111462790A (en) * 2019-01-18 2020-07-28 香港商希瑞科技股份有限公司 Method and apparatus for pipeline-based access management in storage servers
CN115499157A (en) * 2022-07-29 2022-12-20 天翼云科技有限公司 Ceph distributed storage system access method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination