CN118069666A

CN118069666A - PostgreSQL dynamic transferable table space method based on memory separation architecture

Info

Publication number: CN118069666A
Application number: CN202311722297.8A
Authority: CN
Inventors: 胡彬; 李跃森
Original assignee: China Telecom Cloud Technology Co Ltd
Current assignee: China Telecom Cloud Technology Co Ltd
Priority date: 2023-12-14
Filing date: 2023-12-14
Publication date: 2024-05-24

Abstract

The present invention discloses a PostgreSQL dynamic transferable tablespace method based on a storage-computation separation architecture, which comprises the following steps: monitoring data update of a PostgreSQL database, copying the original data row to a tablespace, generating a new version of the data row through the data update, and storing an Undo pointer in the row header of the new version of the data row; in response to transaction rollback of the PostgreSQL database, copying the original data row from the tablespace back to the original location; in response to all sessions no longer needing to access the original data row of the tablespace, clearing the original data row and releasing the tablespace; and configuring the tablespace Async Freeze algorithm to solve the unavailability phenomenon of overlapping XID value domains.

Description

A PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture

技术领域Technical Field

本发明涉及云计算的关系型数据库技术领域，更具体的说是涉及一种基于存算分离架构的PostgreSQL动态可转移表空间方法。The present invention relates to the technical field of relational databases for cloud computing, and more specifically to a PostgreSQL dynamic transferable tablespace method based on a storage-computing separation architecture.

背景技术Background technique

基于传统关系型数据库架构的分布式系统设计具有较高的技术复杂度，其主要有两个方面的原因，也是分布式关系数据库架构需要解决的2个问题：The design of distributed systems based on traditional relational database architecture has high technical complexity. There are two main reasons for this, which are also the two problems that the distributed relational database architecture needs to solve:

1.复杂的关系模型对数据的组织和访问提出了很高要求1. Complex relational models place high demands on data organization and access

2.存储和计算耦合增加了系统在水平扩展、资源调度、故障转移、备份恢复等方面的难度2. Storage and computing coupling increases the difficulty of the system in terms of horizontal expansion, resource scheduling, failover, backup and recovery, etc.

抛开依赖专有硬件的解决方案，业界主流的方案从上面两点入手，对应两大技术方向：Apart from solutions that rely on proprietary hardware, the mainstream solutions in the industry start from the above two points, corresponding to two major technical directions:

1.share-nothing：对数据进行水平分割，引入分布式执行引擎支持跨节点的复杂查询1. Share-nothing: Horizontally split data and introduce a distributed execution engine to support complex queries across nodes

2.share-storage：将数据下沉到分布式存储，计算侧无状态化，存储层池化2. Share-storage: sink data to distributed storage, make the computing side stateless, and pool the storage layer

经过多年的探索，PostgreSQL在上述两个方向上已有很多优秀的实现，但从前面要解决的2个架构问题来看，这些实现都只解决了其中一个。例如众多MPP架构的数据库系统采用share-nothing架构，在集群扩展性方面具有优势，但对于单个数据分片，还是传统的主从架构，问题2未能有效解决。而以存算分离为基础的各种云原生方案，虽然已经发展出多主架构，但为了平衡分布式场景带来的延迟和性能挑战，通常采用乐观锁一类的方式实现状态同步，这在实际应用时对业务提出更高的要求，如果业务不做流量的合理拆分，其写能力理论上也无法线性扩展，也就是问题1未能有效解决。一种可行的方案是结合两种架构，具体的可以将MPP数据库运行在存算分离的架构上，这种架构结合二者的优势，能同时解决上述两个问题。After years of exploration, PostgreSQL has many excellent implementations in the above two directions, but from the perspective of the two architectural problems to be solved, these implementations have only solved one of them. For example, many MPP architecture database systems adopt a share-nothing architecture, which has advantages in cluster scalability, but for a single data shard, it is still a traditional master-slave architecture, and problem 2 has not been effectively solved. Although various cloud native solutions based on storage and computing separation have developed a multi-master architecture, in order to balance the latency and performance challenges brought by distributed scenarios, optimistic locking and other methods are usually used to achieve state synchronization, which puts higher requirements on the business in actual application. If the business does not split the traffic reasonably, its write capability cannot be linearly expanded in theory, that is, problem 1 has not been effectively solved. A feasible solution is to combine the two architectures. Specifically, the MPP database can be run on a storage and computing separation architecture. This architecture combines the advantages of both and can solve the above two problems at the same time.

但当前存算分离架构都是将单个数据库实例的数据作为一个整体与计算逻辑进行切割，其并不关心元数据的组织。这就阻碍了MPP数据库有效利用存算分离架构的优势，实现计算节点的极致弹性。一个简单的场景是：当MPP的一个节点需要进行拆分时，虽然其数据已经与计算分离，但由于数据是一个整体无法拆分，导致计算也无法拆分。如果走传统的数据复制迁移的方式，这又回到了传统MPP架构，存算分离的优势就没有了。However, the current storage-computing separation architecture separates the data of a single database instance from the computing logic as a whole, and does not care about the organization of metadata. This prevents the MPP database from effectively taking advantage of the storage-computing separation architecture and achieving the ultimate elasticity of computing nodes. A simple scenario is: when an MPP node needs to be split, although its data has been separated from the computing, the computing cannot be split because the data is a whole. If the traditional data replication and migration method is adopted, this will return to the traditional MPP architecture, and the advantages of storage-computing separation will be lost.

对比本发明提供一种需要基于存算分离架构的PostgreSQL动态可转移表空间方法，能高效支持数据的拆分，从而令两种架构有机的结合起来，实现计算、存储的极致弹性架构。In contrast, the present invention provides a PostgreSQL dynamic transferable tablespace method based on a storage-computation separation architecture, which can efficiently support data splitting, thereby organically combining the two architectures to achieve an extremely flexible architecture for computing and storage.

发明内容Summary of the invention

针对现有技术存在的不足，本发明的目的在于提供一种基于存算分离架构的PostgreSQL动态可转移表空间方法，能高效支持数据的拆分，从而令两种架构有机的结合起来，实现计算、存储的极致弹性架构。In view of the deficiencies in the prior art, the purpose of the present invention is to provide a PostgreSQL dynamic transferable tablespace method based on a storage-computation separation architecture, which can efficiently support data splitting, thereby organically combining the two architectures to achieve an extremely flexible architecture for computing and storage.

为实现上述目的，本发明提供了如下技术方案：To achieve the above object, the present invention provides the following technical solutions:

一种基于存算分离架构的PostgreSQL动态可转移表空间方法，包括以下步骤：A PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture includes the following steps:

监测到PostgreSQL数据库发生数据更新，将所述原始数据行拷贝到表空间，数据更新产生新版本数据行，于所述新版本数据行的行头存储Undo指针；It is monitored that data update occurs in the PostgreSQL database, the original data row is copied to the table space, the data update generates a new version data row, and an Undo pointer is stored in the row header of the new version data row;

响应于所述PostgreSQL数据库发生事务回滚，将所述原始数据行从所述表空间拷贝回原地；In response to a transaction rollback in the PostgreSQL database, copying the original data row from the table space back to the original location;

响应于所有会话不再需要访问所述表空间的所述原始数据行，清理所述原始数据行，释放所述表空间；In response to all sessions no longer needing to access the original data row in the table space, clearing the original data row and releasing the table space;

配置表空间Async Freeze算法解用于解决XID值域重合不可用现象。The configuration tablespace Async Freeze algorithm solution is used to solve the problem of XID value range overlap and unavailability.

作为本发明的进一步改进，所述Async Freeze算法包括以下步骤：As a further improvement of the present invention, the Async Freeze algorithm comprises the following steps:

在数据页面增加表空间挂载版本号，表空间每次挂载到一个新的PG实例，该版本号加1，页面更新时将当前版本号写入，通过判断该版本号，以获得当前页面的最后一次更新是否由当前实例发起；Add the tablespace mount version number to the data page. Each time the tablespace is mounted to a new PG instance, the version number is incremented by 1. When the page is updated, the current version number is written. By judging the version number, it is possible to determine whether the last update of the current page was initiated by the current instance.

在表空间挂载到新实例后，启动异步任务执行Async Freeze，该任务会扫描所有表空间页面，通过版本号识别出未更新的页面，执行Freeze操作回收XID；After the tablespace is mounted to the new instance, start the asynchronous task to execute Async Freeze, which scans all tablespace pages, identifies the unupdated pages by version number, and executes the Freeze operation to reclaim the XID.

表空间挂载到新实例后即可支持读写操作，如读到的是未Freeze的页面，立即发起Freeze操作；After the tablespace is mounted to the new instance, it can support read and write operations. If the page read is an unfreezed page, the Freeze operation is initiated immediately.

Freeze后的页面，其挂载版本号也同步更新，当做普通页面处理。After the page is frozen, its mounting version number is also updated synchronously and it is treated as a normal page.

作为本发明的进一步改进，基于PostgreSQL在云原生架构基础上，进行表空间的跨实例动态转移，其主要有如下改造步骤：As a further improvement of the present invention, based on PostgreSQL on the cloud native architecture, dynamic cross-instance migration of tablespaces is performed, which mainly includes the following transformation steps:

表空间存储对接分布式存储；Table space storage is connected to distributed storage;

表空间卸载操作；Tablespace unload operation;

表空间挂载操作。Tablespace mount operation.

作为本发明的进一步改进，所述表空间卸载操作包括以下步骤：As a further improvement of the present invention, the table space unloading operation includes the following steps:

停止目标表空间的写操作；Stop writing to the target tablespace;

执行checkpoint，确保表空间数据刷脏；Execute checkpoint to ensure that the tablespace data is flushed;

从PostgreSQL删除该表空间相关元数据；Delete the tablespace-related metadata from PostgreSQL;

卸载分布式文件系统。Unmount the Distributed File System.

作为本发明的进一步改进，所述表空间挂载操作包括以下步骤：As a further improvement of the present invention, the table space mounting operation includes the following steps:

挂载分布式文件系统；Mount the distributed file system;

在PostgreSQL中增加表空间相关元数据；Add tablespace related metadata in PostgreSQL;

开启表空间读写操作；Enable tablespace read and write operations;

启动表空间Async Freeze操作。Start the tablespace Async Freeze operation.

作为本发明的进一步改进，所述监测到PostgreSQL数据库发生数据更新，将所述原始数据行拷贝到表空间，数据更新产生新版本数据行，于所述新版本数据行的行头存储Undo指针，包括如下步骤：所述PostgreSQL数据库运行UndoLauncher进程监测正在运行的操作是否为写操作；响应于所述UndoLauncher进程监测到正在运行写操作，所述UndoLauncher进程向UndoWorker进程发送写操作通知；所述UndoWorker进程接收到写操作通知，将所述原始数据行拷贝至表空间；所述PostgreSQL数据库中的写操作原地执行，得到新版本数据行；于所述新版本数据行的行头存储Undo指针，所述Undo指针指向表空间中所述原始数据行。As a further improvement of the present invention, the monitoring of data update in the PostgreSQL database, copying the original data row to the table space, the data update generating a new version of the data row, and storing an Undo pointer in the row header of the new version of the data row include the following steps: the PostgreSQL database runs an UndoLauncher process to monitor whether the running operation is a write operation; in response to the UndoLauncher process monitoring that a write operation is running, the UndoLauncher process sends a write operation notification to the UndoWorker process; the UndoWorker process receives the write operation notification and copies the original data row to the table space; the write operation in the PostgreSQL database is executed in situ to obtain a new version of the data row; and the Undo pointer is stored in the row header of the new version of the data row, and the Undo pointer points to the original data row in the table space.

作为本发明的进一步改进，所述响应于所有会话不再需要访问所述表空间的所述原始数据行，清理所述原始数据行，释放所述表空间，包括如下步骤：通过UndoLauncher进程监控到所有会话不再需要访问所述原始数据行，通知UndoCleanWorker进程进行数据清理；所述UndoCleanWorker进程接收到数据清理通知，对所述原始数据行进行清理，释放所述表空间。As a further improvement of the present invention, in response to all sessions no longer needing to access the original data rows of the table space, cleaning up the original data rows and releasing the table space, comprises the following steps: monitoring through the UndoLauncher process that all sessions no longer need to access the original data rows, notifying the UndoCleanWorker process to perform data cleaning; the UndoCleanWorker process receives the data cleaning notification, cleans up the original data rows and releases the table space.

作为本发明的进一步改进，还包括：As a further improvement of the present invention, it also includes:

数据更新模块，用以监测到PostgreSQL数据库发生数据更新，将所述原始数据行拷贝到表空间，数据更新产生新版本数据行，于所述新版本数据行的行头存储Undo指针；事务回滚模块，用以响应于所述PostgreSQL数据库发生事务回滚，将所述原始数据行从所述表空间拷贝回原地；数据清理模块，响应于所有会话不再需要访问所述所述表空间的所述原始数据行，清理所述原始数据行，释放所述表空间。A data update module is used to monitor data updates in the PostgreSQL database, copy the original data rows to the table space, generate new version data rows through data updates, and store Undo pointers in the headers of the new version data rows; a transaction rollback module is used to copy the original data rows from the table space back to their original locations in response to transaction rollbacks in the PostgreSQL database; a data cleanup module is used to clean up the original data rows and release the table space in response to all sessions no longer needing to access the original data rows in the table space.

一种计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时，实现如上述所述的基于存算分离架构的PostgreSQL动态可转移表空间方法。A computer-readable storage medium stores a computer program, which, when executed by a processor, implements the PostgreSQL dynamic transferable tablespace method based on the storage-computation separation architecture as described above.

一种设备，包括：A device comprising:

存储器，用于存储指令；A memory for storing instructions;

处理器，用于执行所述指令，使得所述设备执行实现如上述所述的基于存算分离架构的PostgreSQL动态可转移表空间方法的操作。The processor is used to execute the instruction so that the device performs operations to implement the PostgreSQL dynamic transferable tablespace method based on the storage and computing separation architecture as described above.

本发明的有益效果：Beneficial effects of the present invention:

传统MPP架构下，数据节点扩缩容操作需要迁移、删除数据，一方面会对业务负载造成影响，同时整个流程步骤多，耗时长，成功率低。通过本发明，可以有效解决上述缺点，使得计算节点的扩缩容在秒级完成。同时结合云原生架构的优点，真正实现了从计算到存储的极致弹性。In the traditional MPP architecture, the expansion and contraction of data nodes requires migration and deletion of data, which will affect the business load. At the same time, the whole process has many steps, takes a long time, and has a low success rate. Through the present invention, the above shortcomings can be effectively solved, so that the expansion and contraction of computing nodes can be completed in seconds. At the same time, combined with the advantages of cloud native architecture, the ultimate elasticity from computing to storage is truly realized.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为MPP架构图；Figure 1 is a diagram of the MPP architecture;

图2为存算分离架构；Figure 2 shows the storage and computing separation architecture;

图3为云原生分布式架构；Figure 3 shows the cloud-native distributed architecture;

图4为加入表空间的云原生分布式架构。Figure 4 shows the cloud-native distributed architecture with tablespace added.

具体实施方式Detailed ways

为了使得本发明实施例的目的、技术方案和优点更加清楚，下面将结合本发明实施例的附图，对本发明实施例的技术方案进行清楚、完整地描述。显然，所描述的实施例是本发明的一部分实施例，而不是全部的实施例。基于所描述的本发明的实施例，本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solution and advantages of the embodiment of the present invention clearer, the technical solution of the embodiment of the present invention will be clearly and completely described below in conjunction with the drawings of the embodiment of the present invention. Obviously, the described embodiment is a part of the embodiment of the present invention, not all of the embodiments. Based on the described embodiment of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

除非另外定义，本发明使用的技术术语或者科学术语应当为本发明所属领域内具有一般技能的人士所理解的通常意义。本发明中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性，而只是用来区分不同的组成部分。“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同，而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接，而是可以包括电性的连接，不管是直接的还是间接的。“上”、“下”、“左”、“右”等仅用于表示相对位置关系，当被描述对象的绝对位置改变后，则该相对位置关系也可能相应地改变。Unless otherwise defined, the technical terms or scientific terms used in the present invention should be understood by people with ordinary skills in the field to which the present invention belongs. "First", "second" and similar words used in the present invention do not indicate any order, quantity or importance, but are only used to distinguish different components. "Include" or "comprise" and similar words mean that the elements or objects appearing before the word include the elements or objects listed after the word and their equivalents, without excluding other elements or objects. "Connect" or "connected" and similar words are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and the like are only used to indicate relative positional relationships. When the absolute position of the described object changes, the relative positional relationship may also change accordingly.

为了保持本发明实施例的以下说明清楚且简明，本发明省略了已知功能和已知部件的详细说明。In order to keep the following description of the embodiments of the present invention clear and concise, detailed descriptions of well-known functions and well-known components are omitted.

参考图1至图4所示，为本发明一种基于存算分离架构的PostgreSQL动态可转移表空间方法的具体实施方式，包括以下步骤：Referring to FIG. 1 to FIG. 4 , a specific implementation of a PostgreSQL dynamic transferable tablespace method based on a storage-computation separation architecture of the present invention is shown, and the method includes the following steps:

其中监测到PostgreSQL数据库发生数据更新还包括：The data updates of the PostgreSQL database monitored also include:

步骤101，所述PostgreSQL数据库运行UndoLauncher进程监测正在运行的操作是否为写操作；Step 101, the PostgreSQL database runs the UndoLauncher process to monitor whether the running operation is a write operation;

步骤102，响应于所述UndoLauncher进程监测到正在运行写操作，所述UndoLauncher进程向UndoWorker进程发送写操作通知；Step 102, in response to the UndoLauncher process detecting that a write operation is being performed, the UndoLauncher process sends a write operation notification to the UndoWorker process;

步骤103，所述UndoWorker进程接收到写操作通知，将所述原始数据行拷贝至Undo表空间；Step 103, the UndoWorker process receives the write operation notification and copies the original data row to the Undo tablespace;

步骤104，所述PostgreSQL数据库中的写操作原地执行，得到新版本数据行；Step 104, the write operation in the PostgreSQL database is performed in situ to obtain a new version of the data row;

步骤105，于所述新版本数据行的行头存储Undo指针，所述Undo指针指向Undo表空间中所述原始数据行。具体地，数据更新操作包括INSERT操作、UPDATE操作和DELETE操作。Step 105: store an Undo pointer in the row header of the new version data row, wherein the Undo pointer points to the original data row in the Undo tablespace. Specifically, data update operations include INSERT operations, UPDATE operations, and DELETE operations.

所述监测到PostgreSQL数据库发生数据更新，将所述原始数据行拷贝到表空间，数据更新产生新版本数据行，于所述新版本数据行的行头存储Undo指针，包括如下步骤：所述PostgreSQL数据库运行UndoLauncher进程监测正在运行的操作是否为写操作；响应于所述UndoLauncher进程监测到正在运行写操作，所述UndoLauncher进程向UndoWorker进程发送写操作通知；所述UndoWorker进程接收到写操作通知，将所述原始数据行拷贝至表空间；所述PostgreSQL数据库中的写操作原地执行，得到新版本数据行；于所述新版本数据行的行头存储Undo指针，所述Undo指针指向表空间中所述原始数据行。The method monitors data update in the PostgreSQL database, copies the original data row to the table space, generates a new version of the data row through the data update, and stores an Undo pointer at the row header of the new version of the data row, including the following steps: the PostgreSQL database runs an UndoLauncher process to monitor whether the operation being run is a write operation; in response to the UndoLauncher process monitoring that a write operation is being run, the UndoLauncher process sends a write operation notification to the UndoWorker process; the UndoWorker process receives the write operation notification and copies the original data row to the table space; the write operation in the PostgreSQL database is executed in situ to obtain a new version of the data row; and the Undo pointer is stored at the row header of the new version of the data row, the Undo pointer pointing to the original data row in the table space.

其中响应于所述PostgreSQL数据库发生事务回滚还包括：Wherein, in response to the transaction rollback occurring in the PostgreSQL database, the step further includes:

步骤201，通过UndoLauncher进程监测PostgreSQL数据库中正在运行的操作是否为写操作；Step 201, monitoring whether the operation being run in the PostgreSQL database is a write operation through the UndoLauncher process;

步骤202，若为写操作，UndoLauncher进程向UndoWorker进程发送写操作通知；Step 202, if it is a write operation, the UndoLauncher process sends a write operation notification to the UndoWorker process;

步骤203，UndoWorker进程接收到写操作通知，将原始数据行拷贝到Undo表空间；Step 203: The UndoWorker process receives the write operation notification and copies the original data row to the Undo tablespace;

步骤204，写操作原地执行，得到新版本数据行，在新版本数据行的DB_ROL_TRX字段存储Undo指针；Step 204, the write operation is performed in situ to obtain a new version of the data row, and the Undo pointer is stored in the DB_ROL_TRX field of the new version of the data row;

进一步地，Undo指针指向原始数据行所在位置，使查询SQL读取Undo表空间中该行的原始数据进行查询。Furthermore, the Undo pointer points to the location of the original data row, so that the query SQL reads the original data of the row in the Undo tablespace for query.

步骤205，将步骤203和步骤204中的操作写入WAL日志。Step 205: Write the operations in step 203 and step 204 into the WAL log.

进一步地，将步骤203和步骤204中的操作写入WAL日志，用于检查Undo空间中的原始数据行的数据一致性和数据的持久化。其中，WAL日志即WriteAheadLog预写式日志，数据修改提交前先写入log文件中。当服务器崩溃的时候，可以有效地回放日志，使数据恢复到崩溃以前。也就是，服务器崩溃后重新启动，程序可以检查log文件，将服务器崩溃时计划执行的操作内容与实际执行的操作内容进行比较，根据事务的执行情况选择撤销已执行操作、继续完成已执行操作或者保持原样，实现数据的可恢复和提交成功的数据被持久化到磁盘。Further, the operations in step 203 and step 204 are written into the WAL log, which is used to check the data consistency of the original data row in the Undo space and the persistence of the data. Among them, the WAL log is the WriteAheadLog pre-written log, which is written into the log file before the data modification is submitted. When the server crashes, the log can be effectively replayed to restore the data to before the crash. That is, after the server crashes and restarts, the program can check the log file, compare the operation content planned to be executed when the server crashes with the operation content actually executed, and choose to cancel the executed operation, continue to complete the executed operation or keep it as it is according to the execution of the transaction, so as to realize the recoverability of the data and the persistence of the successfully submitted data to the disk.

所述响应于所有会话不再需要访问所述表空间的所述原始数据行，清理所述原始数据行，释放所述表空间，包括如下步骤：通过UndoLauncher进程监控到所有会话不再需要访问所述原始数据行，通知UndoCleanWorker进程进行数据清理；所述UndoCleanWorker进程接收到数据清理通知，对所述原始数据行进行清理，释放所述表空间。In response to all sessions no longer needing to access the original data rows of the tablespace, cleaning up the original data rows and releasing the tablespace, the steps include: monitoring through the UndoLauncher process that all sessions no longer need to access the original data rows, notifying the UndoCleanWorker process to perform data cleaning; the UndoCleanWorker process receives the data cleaning notification, cleans up the original data rows and releases the tablespace.

还包括：Also includes:

本发明基于云原生分布式架构，该架构在存算分离的云原生架构之上，整合MPP的分布式计算框架，以实现计算和存储的极致弹性。从图3可以看到，计算节点是典型的MPP架构，每个节点负责部分数据的读写，各个节点数据分区不重合。所有数据统一持久化在分布式存储系统中，这样每个节点的存储具有了弹性，且不再需要备机来保证数据的可靠性。但计算节点并不具备弹性，我们只能通过升级硬件提升单个节点的性能，如果要增加节点，就需要将原有节点中的部分数据迁移到新节点，然后分别删除部分不需要的数据。The present invention is based on a cloud-native distributed architecture, which integrates the MPP distributed computing framework on top of the cloud-native architecture with storage and computing separation to achieve extreme elasticity in computing and storage. As can be seen from Figure 3, the computing node is a typical MPP architecture, each node is responsible for reading and writing part of the data, and the data partitions of each node do not overlap. All data is uniformly persisted in the distributed storage system, so that the storage of each node is elastic, and no backup machine is required to ensure the reliability of the data. However, the computing nodes are not elastic, and we can only improve the performance of a single node by upgrading the hardware. If we want to add nodes, we need to migrate part of the data in the original node to the new node, and then delete some unnecessary data separately.

举个例子：整个集群的数据分为[A,Z]共26个分区，其中一个节点负责[A,F]的读写，现在想把该节点拆分为两个节点分别负责[A,C]和[D,F]的读写，传统做法是通过复制原节点为新的节点，然后删除原节点中属于[D,F]的数据，同时删除新节点中属于[A,C]的数据。这些操作都基于大量数据的同步和删除，代价巨大。For example, the data of the entire cluster is divided into 26 partitions [A, Z]. One node is responsible for the read and write of [A, F]. Now we want to split the node into two nodes responsible for the read and write of [A, C] and [D, F] respectively. The traditional approach is to copy the original node to a new node, then delete the data belonging to [D, F] in the original node, and delete the data belonging to [A, C] in the new node. These operations are based on the synchronization and deletion of a large amount of data, which is very costly.

PostgreSQL中有表空间的概念，我们把表空间和数据分区这两个概念统一起来。同时在PostgreSQL中增加卸载和挂载表空间(包含数据的非空表空间)的逻辑，这两种操作只需要修改少量表空间相关的元数据，可以快速完成。而真正的表空间数据都在分布式存储系统中，并不需要迁移拷贝，这样就可以实现表空间(数据分区)跨节点的快速迁移。在此基础上可以实现秒级的节点扩缩容、负载调度和容灾切换等。PostgreSQL has the concept of tablespaces. We have unified the concepts of tablespaces and data partitions. At the same time, we have added the logic of unloading and mounting tablespaces (non-empty tablespaces containing data) in PostgreSQL. These two operations only require modifying a small amount of metadata related to the tablespace and can be completed quickly. The actual tablespace data is in the distributed storage system and does not need to be migrated or copied. In this way, fast migration of tablespaces (data partitions) across nodes can be achieved. On this basis, node expansion and contraction, load scheduling, and disaster recovery switching can be achieved in seconds.

PostgreSQL中实现表空间跨实例动态迁移，还需要解决XID回收的问题。因为XID的值域范围局限于产生该XID的数据库实例内，一但表空间被挂载到另一个实例，这些XID的值域会因为和该实例XID的值域重合而不可用。To implement dynamic migration of tablespaces across instances in PostgreSQL, the problem of XID recycling needs to be solved. Because the range of XID values is limited to the database instance that generates the XID, once the tablespace is mounted to another instance, the range of these XID values will overlap with the range of the instance's XID and become unavailable.

所述Async Freeze算法包括以下步骤：The Async Freeze algorithm includes the following steps:

基于PostgreSQL在云原生架构基础上，进行表空间的跨实例动态转移，其主要有如下改造步骤：Based on PostgreSQL's cloud-native architecture, dynamic cross-instance migration of tablespaces is performed. The main transformation steps are as follows:

表空间卸载操作；Tablespace unload operation;

表空间卸载操作包括以下步骤：The tablespace unload operation consists of the following steps:

停止目标表空间的写操作；Stop writing to the target tablespace;

卸载分布式文件系统。Unmount the Distributed File System.

表空间挂载操作。Tablespace mount operation.

表空间挂载操作包括以下步骤：The tablespace mount operation includes the following steps:

挂载分布式文件系统；Mount the distributed file system;

开启表空间读写操作；Enable tablespace read and write operations;

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to the flowchart and/or block diagram of the method, device (system), and computer program product according to the embodiment of the present invention. It should be understood that each process and/or box in the flowchart and/or block diagram, as well as the combination of the process and/or box in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

此外，尽管已经在本发明中描述了示例性实施例，其范围包括任何和所有基于本发明的具有等同元件、修改、省略、组合(例如，各种实施例交叉的方案)、改编或改变的实施例。权利要求书中的元件将被基于权利要求中采用的语言宽泛地解释，并不限于在本说明书中或本申请的实施期间所描述的示例，其示例将被解释为非排他性的。因此，本说明书和示例旨在仅被认为是示例，真正的范围和精神由以下权利要求以及其等同物的全部范围所指示。In addition, although exemplary embodiments have been described in the present invention, the scope includes any and all embodiments based on the present invention with equivalent elements, modifications, omissions, combinations (e.g., various embodiments intersecting schemes), adaptations or changes. The elements in the claims will be interpreted broadly based on the language adopted in the claims, and are not limited to the examples described in this specification or during the implementation of this application, and the examples will be interpreted as non-exclusive. Therefore, this specification and examples are intended to be considered as examples only, and the true scope and spirit are indicated by the following claims and the full scope of their equivalents.

以上描述旨在是说明性的而不是限制性的。例如，上述示例(或其一个或更多方案)可以彼此组合使用。例如本领域普通技术人员在阅读上述描述时可以使用其它实施例。另外，在上述具体实施方式中，各种特征可以被分组在一起以简单化本发明。这不应解释为一种不要求保护的公开的特征对于任一权利要求是必要的意图。相反，本发明的主题可以少于特定的公开的实施例的全部特征。从而，以下权利要求书作为示例或实施例在此并入具体实施方式中，其中每个权利要求独立地作为单独的实施例，并且考虑这些实施例可以以各种组合或排列彼此组合。本发明的范围应参照所附权利要求以及这些权利要求赋权的等同形式的全部范围来确定。The above description is intended to be illustrative rather than restrictive. For example, the above examples (or one or more of them) can be used in combination with each other. For example, a person of ordinary skill in the art can use other embodiments when reading the above description. In addition, in the above-mentioned specific embodiments, various features can be grouped together to simplify the present invention. This should not be interpreted as an intention that a disclosed feature that is not required to be protected is necessary for any claim. On the contrary, the subject matter of the present invention may be less than all the features of a specific disclosed embodiment. Thus, the following claims are incorporated into the specific embodiments as examples or embodiments, wherein each claim is independently used as a separate embodiment, and it is considered that these embodiments can be combined with each other in various combinations or arrangements. The scope of the present invention should be determined with reference to the attached claims and the full scope of equivalent forms granted by these claims.

以上实施例仅为本发明的示例性实施例，不用于限制本发明，本发明的保护范围由权利要求书限定。本领域技术人员可以在本发明的实质和保护范围内，对本发明做出各种修改或等同替换，这种修改或等同替换也应视为落在本发明的保护范围内。The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention. The protection scope of the present invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the present invention within the essence and protection scope of the present invention, and such modifications or equivalent substitutions shall also be deemed to fall within the protection scope of the present invention.

Claims

1. A PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture, characterized by comprising the following steps:

It is monitored that data update occurs in the PostgreSQL database, the original data row is copied to the table space, the data update generates a new version data row, and an Undo pointer is stored in the row header of the new version data row;

In response to a transaction rollback in the PostgreSQL database, copying the original data row from the table space back to the original location;

In response to all sessions no longer needing to access the original data row in the table space, clearing the original data row and releasing the table space;

The configuration tablespace Async Freeze algorithm solution is used to solve the problem of XID value range overlap and unavailability.

2. According to a PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture according to claim 1, it is characterized in that: the Async Freeze algorithm comprises the following steps:

Add the tablespace mount version number to the data page. Each time the tablespace is mounted to a new PG instance, the version number is incremented by 1. When the page is updated, the current version number is written. By judging the version number, it is possible to determine whether the last update of the current page was initiated by the current instance.

After the tablespace is mounted to the new instance, start the asynchronous task to execute Async Freeze, which scans all tablespace pages, identifies the unupdated pages by version number, and executes the Freeze operation to reclaim the XID.

After the tablespace is mounted to the new instance, it can support read and write operations. If the page read is an unfreezed page, the Freeze operation is initiated immediately.

After the page is frozen, its mounting version number is also updated synchronously and it is treated as a normal page.

3. According to a PostgreSQL dynamic transferable tablespace method based on a storage-computing separation architecture according to claim 2, it is characterized in that: based on PostgreSQL on the basis of a cloud native architecture, a cross-instance dynamic transfer of tablespace is performed, which mainly includes the following transformation steps:

Table space storage is connected to distributed storage;

Tablespace unload operation;

Tablespace mount operation.

4. A PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture according to claim 3, characterized in that: the tablespace unloading operation comprises the following steps:

Stop writing to the target tablespace;

Execute checkpoint to ensure that the tablespace data is flushed;

Delete the metadata related to the tablespace from PostgreSQL;

Unmount the Distributed File System.

5. A PostgreSQL dynamic transferable tablespace method based on storage and computing separation architecture according to claim 4, characterized in that: the tablespace mounting operation comprises the following steps:

Mount the distributed file system;

Add tablespace related metadata in PostgreSQL;

Enable tablespace read and write operations;

Start the tablespace Async Freeze operation.

6. According to claim 5, a PostgreSQL dynamic transferable tablespace method based on a storage-computation separation architecture is characterized in that: when data update is detected in the PostgreSQL database, the original data row is copied to the tablespace, the data update generates a new version of the data row, and an Undo pointer is stored in the header of the new version of the data row, comprising the following steps: the PostgreSQL database runs an UndoLauncher process to monitor whether the operation being run is a write operation; in response to the UndoLauncher process detecting that a write operation is being run, the UndoLauncher process sends a write operation notification to the UndoWorker process; the UndoWorker process receives the write operation notification and copies the original data row to the tablespace; the write operation in the PostgreSQL database is executed in situ to obtain a new version of the data row; the Undo pointer is stored in the header of the new version of the data row, and the Undo pointer points to the original data row in the tablespace.

7. A PostgreSQL dynamic transferable tablespace method based on storage-computation separation architecture according to claim 6, characterized in that: in response to all sessions no longer needing to access the original data rows of the tablespace, cleaning up the original data rows and releasing the tablespace, comprises the following steps: monitoring through the UndoLauncher process that all sessions no longer need to access the original data rows, notifying the UndoCleanWorker process to perform data cleaning; the UndoCleanWorker process receives the data cleaning notification, cleans up the original data rows, and releases the tablespace.

8. The PostgreSQL dynamic transferable tablespace method based on the storage and computing separation architecture according to claim 7, characterized in that it also includes:

A data update module is used to monitor data updates in the PostgreSQL database, copy the original data rows to the table space, generate new version data rows through data updates, and store Undo pointers in the headers of the new version data rows; a transaction rollback module is used to copy the original data rows from the table space back to their original locations in response to transaction rollbacks in the PostgreSQL database; a data cleanup module is used to clean up the original data rows and release the table space in response to all sessions no longer needing to access the original data rows in the table space.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, the method for dynamically transferable tablespaces of PostgreSQL based on a storage-computation separation architecture as described in any one of claims 1 to 8 is implemented.

10. A device, comprising:

A memory for storing instructions;

A processor is used to execute the instruction so that the device performs operations to implement the PostgreSQL dynamic transferable tablespace method based on the storage-computation separation architecture as described in any one of claims 1 to 8.