
CN119166717A - A method for real-time synchronization of heterogeneous databases supporting full-increment integration - Google Patents


Info

Publication number
CN119166717A
CN119166717A (application CN202411184715.7A)
Authority
CN
China
Prior art keywords
data
database
real
synchronization
databases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411184715.7A
Other languages
Chinese (zh)
Inventor
王贤锋
吕建成
王常瑞
陈超
王强松
苟启文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Keli Information Industry Co Ltd
Original Assignee
Anhui Keli Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Keli Information Industry Co Ltd filed Critical Anhui Keli Information Industry Co Ltd
Priority to CN202411184715.7A priority Critical patent/CN119166717A/en
Publication of CN119166717A publication Critical patent/CN119166717A/en
Pending legal-status Critical Current


Classifications

    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/275 — Synchronous replication
    • G06F16/25 — Integrating or interfacing systems involving database management systems
    • G06F16/258 — Data format conversion from or to a database
    • G06F9/546 — Message passing systems or structures, e.g. queues (under G06F9/54, Interprogram communication)
    • G06F2209/548 — Queue (indexing scheme relating to G06F9/54)
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for real-time synchronization of heterogeneous databases supporting full-increment integration. The method comprises: designing a unified interface layer to configure the source and target databases, writing a specific adapter for each supported database type through a database adapter pattern; initializing a log parser, a transaction manager, and a data buffer; reading and parsing newly added log records, extracting only the changed portion of each data change record, and placing the extracted incremental data into the data buffer; converting the data format and establishing a mapping relation between the source and target data models; packaging the data produced by the data capture layer into messages and sending them to a message queue, from which the real-time synchronization engine receives them; and monitoring the whole synchronization process, using the message queue to store the logs generated by the data capture layer and the real-time synchronization engine in the target database. The method offers real-time performance and broad database support.

Description

Method for real-time synchronization of heterogeneous databases supporting full-increment integration
Technical Field
The invention relates to the technical field of database management, and in particular to a method for real-time synchronization of heterogeneous databases supporting full-increment integration.
Background
Many organizations and businesses today use multiple different types of databases to store and manage their data, including relational databases, NoSQL databases, columnar databases, and others. Furthermore, these databases may be distributed across different geographical locations, or even run on different cloud platforms, with different ways of storing and querying data. Real-time data synchronization is therefore a key technology for keeping data consistent and current across different databases. However, real-time synchronization between heterogeneous databases faces challenges due to their heterogeneity and complexity.
There are several common methods to achieve data synchronization between heterogeneous databases:
Method one: ETL tools (Extract, Transform, Load). ETL tools are typically used to extract data from one database, transform and process it, and load it into another database. While ETL tools can achieve batch synchronization, they cannot meet real-time synchronization requirements, leaving a delay after each data update, and they require additional conversion and processing steps.
Method two: database replication and synchronization tools. Some database management systems provide built-in replication and synchronization functionality that allows data to be replicated from one database instance to another. However, these tools are typically limited to a particular type of database and lack comprehensive support for heterogeneous databases.
Method three: message queues and event-driven architectures. Using a message queue and an event-driven architecture, asynchronous data transmission between databases can be achieved with a certain degree of real-time performance. However, this approach typically requires developing custom data processing and synchronization logic and may risk data loss or duplication.
It can be seen that the prior art has the following defects in realizing real-time synchronization of heterogeneous databases:
1. Insufficient real-time performance: many existing synchronization schemes cannot provide real-time data synchronization, so a delay remains after each data update, which cannot satisfy application scenarios that demand real-time data.
2. High complexity: some solutions require additional data conversion, format mapping, and processing steps, increasing implementation complexity and cost, and potentially causing problems with data consistency and accuracy.
3. Performance bottlenecks: large-scale data synchronization may create performance bottlenecks, particularly when synchronizing across networks or cloud platforms, affecting the response speed and throughput of the system.
4. Limited support: some database replication and synchronization tools support only certain types of databases and cannot meet the overall synchronization requirements between heterogeneous databases.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for real-time synchronization of heterogeneous databases supporting full-increment integration, comprising the following steps:
A unified interface layer is designed on the heterogeneous data access layer to support connections to various databases, and a specific adapter is written for each supported database type through a database adapter pattern to mask the differences between databases;
Monitoring changes to the database log file, reading and parsing newly added log records, determining the incremental data, and detecting transaction start and end marks in the log;
For each data change record, extracting only the changed portion and placing the extracted incremental data into a data buffer;
When a transaction end mark is detected, checking the data changes in the data buffer and publishing them to the target database;
Converting the data of the source database into a format acceptable to the target database through data format conversion, and ensuring that data in different data models are represented and processed according to a unified standard by establishing a data mapping relation between the source and target data models;
Monitoring the whole synchronization process, and using the message queue to store the logs generated by the data capture layer and the real-time synchronization engine in the target database.
Further, the data capture layer monitors data changes in the source database in real time and can adopt different capture strategies according to the characteristics of different databases, including capture based on a CDC connector, capture based on triggers, and capture based on a timestamp or auto-increment primary key id.
Further, the CDC connector-based method includes:
For databases that provide a log function, the CDC connector captures data changes by analyzing the log, converts the change operations in the log into a processable data format, and extracts only the data newly added or changed since the last synchronization, reducing the load of data transmission and processing.
Further, the trigger-based capture method includes:
Setting triggers on the tables or the database of the source database, automatically triggering a capture operation when data changes, and storing the changed data in a temporary table or queue.
Further, the method of capture based on a timestamp or auto-increment primary key id comprises the following steps:
recording a timestamp or auto-increment primary key id at each synchronization;
at the next synchronization, filtering out the data changed since the last synchronization according to the timestamp or auto-increment primary key id.
Further, reading and parsing the newly added log records includes:
processing log parsing and data capture with multiple threads or processes, each responsible for a portion of the log records;
before data changes are published, using the transaction manager to check transaction integrity, so that each transaction's changes are published only after it commits, ensuring data consistency.
Further, converting the data of the source database into a format acceptable to the target database through data format conversion comprises:
data cleaning: improving data quality by removing duplicate data, correcting erroneous data, and filling missing values;
data standardization: converting data into a unified format or standard to facilitate subsequent processing and analysis;
data adaptation: converting data into a format that meets the needs of a particular system or application.
Further, the method for establishing the data mapping relation between the source and target data models comprises manual, fully automatic, or semi-automatic data mapping, wherein a specific and clear correspondence between data elements is established between the source and target data models, ensuring that data in different data models are represented and processed according to a unified standard. Specifically:
manual data mapping establishes an explicit data correspondence by hand;
fully automatic data mapping automatically matches data fields using natural language processing technology;
semi-automatic data mapping takes the result of fully automatic mapping and, if a field is mapped incorrectly, allows the mapping to be adjusted manually.
Further, the monitoring of the entire synchronization process also includes designing exception handling mechanisms to ensure data integrity and synchronization reliability.
Further, the exception handling mechanism includes:
recording exception information and context using a logging framework;
using database transaction management to ensure data consistency;
using a reasonable error response and feedback mechanism for user requests;
after an exception is caught, cleaning up resources to ensure that all allocated resources are released;
providing a monitoring and alerting system to notify the relevant personnel promptly when an anomaly occurs.
Compared with the prior art, the invention has the following beneficial effects:
1. Strong real-time performance: compared with traditional periodic polling or batch processing, the invention captures data changes in real time through a log-based capture mechanism that directly parses the database change log, triggering the synchronization process immediately and satisfying application scenarios with extremely high real-time requirements.
2. Reduced complexity and cost: an intelligent data conversion and field mapping mechanism automatically identifies and handles format differences between data sources, reducing the need for manual configuration and custom conversion logic, while the integrated, modular design lowers system complexity and maintenance cost.
3. Performance bottlenecks removed: a parallel processing and distributed architecture executes synchronization tasks on multiple nodes in parallel, effectively spreading the processing load and improving the overall performance and throughput of the system.
4. Wide support: through a flexible database adapter pattern and an extensible plug-in architecture, developers can easily add support for new data sources as required, improving the availability and adaptability of the system.
Drawings
FIG. 1 is a schematic block diagram of an embodiment of the present disclosure;
FIG. 2 is a schematic block diagram of the data capture layer implemented with a CDC connector, taking the MySQL binlog as an example, in accordance with an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a data capture algorithm disclosed in an embodiment of the present invention;
FIG. 4 is a timing diagram of a data capture algorithm according to an embodiment of the present invention;
FIG. 5 is a flow chart of monitoring and logging according to an embodiment of the present invention.
Detailed Description
To make the technical solutions and technical effects of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings; the described embodiments are evidently only some, not all, of the embodiments of the invention.
The invention aims to provide a method for real-time synchronization of heterogeneous databases supporting full-increment integration, which realizes efficient, low-latency data synchronization in a heterogeneous database environment, supports both full and incremental data synchronization, and ensures consistency between the target and source databases.
Referring to fig. 1, the present invention is illustrated in terms of a heterogeneous data access layer, a data capture layer, a real-time synchronization engine, and monitoring and logging:
1. heterogeneous data access layer
The unified interface layer designed on the heterogeneous data access layer supports connections to various databases, such as relational databases, non-relational databases, file systems, message queues, and API interfaces, and captures data from these sources in real time or on a schedule.
The heterogeneous data access layer writes a specific adapter for each supported database type through a database adapter pattern to mask the differences between databases. The system automatically selects a data source driver according to the data source type in order to connect to the database.
2. Data capture layer
Through an efficient data capture algorithm, the data capture layer maintains high performance even under large data volumes and high concurrency. When capturing data changes, transaction integrity and consistency must also be considered so that the captured data accurately reflects the state of the source database.
Referring to fig. 3-4, the specific implementation method includes:
2.1 Initialization
Initialize a log parser for parsing the database log file.
Initialize a transaction manager for detecting and managing transaction boundaries.
Initialize a data buffer for temporarily storing captured data.
2.2 Real-time log parsing
Monitor changes to the database log file and read newly added log records when the file changes.
2.3 Parsing log records
Parse the log records that were read, extract the specific content of each data change (such as an INSERT, UPDATE, or DELETE operation), and determine the incremental change according to the operation type.
2.4 Transaction boundary detection
While parsing log records, detect transaction start and end marks so that data changes are not published before the transaction ends, ensuring data consistency.
2.5 Incremental data extraction
For each data change record, extract only the changed portion and place the extracted incremental data into the data buffer.
2.6 Parallel processing
Process log parsing and data capture with multiple threads or processes, each responsible for a portion of the log records, improving processing efficiency.
2.7 Data consistency assurance
Before data changes are published, use the transaction manager to check transaction integrity so that each transaction's changes are published only after it commits, ensuring data consistency.
2.8 Data release
When the transaction end mark is detected, check the data changes in the data buffer and publish them to the target database or other data processing system.
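The buffering-and-publish flow of steps 2.1–2.8 can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the log-record shape (dicts carrying BEGIN/COMMIT marks and before/after row images) is invented for the example, and a real parser would read the database's actual log format.

```python
from collections import defaultdict

def capture(log_records):
    buffers = defaultdict(list)   # per-transaction data buffer (step 2.1)
    published = []                # stands in for the target database (step 2.8)
    for rec in log_records:       # real-time log parsing (steps 2.2-2.3)
        op, txn = rec["type"], rec.get("txn")
        if op == "BEGIN":
            buffers[txn] = []
        elif op in ("INSERT", "UPDATE"):
            # incremental extraction (step 2.5): keep only changed columns
            before = rec.get("before", {})
            change = {k: v for k, v in rec["after"].items()
                      if before.get(k) != v}
            buffers[txn].append((op, rec["table"], change))
        elif op == "DELETE":
            buffers[txn].append((op, rec["table"], rec["before"]))
        elif op == "COMMIT":
            # transaction boundary (step 2.4): publish only after commit
            published.extend(buffers.pop(txn, []))
        elif op == "ROLLBACK":
            buffers.pop(txn, None)  # discard uncommitted changes (step 2.7)
    return published
```

Note how changes of a rolled-back transaction never reach `published`, which is the consistency guarantee of steps 2.4 and 2.7.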
The capture strategy of the data capture layer may be based on a CDC connector, on triggers, or on a timestamp or auto-increment primary key id; each is described in detail below:
a. CDC-based connector
Many database systems provide a log function that records all change operations on the database, such as the binlog in MySQL and the redo log in Oracle. The CDC connector captures data changes by analyzing these logs, converts the change operations in the log into a processable data format, and extracts only the data newly added or changed since the last synchronization, reducing the load of data transmission and processing.
Database-log-based CDC technology, combined with real-time data conversion, can acquire data in real time and non-invasively through a large number of CDC connectors, easily realizing full and incremental integration of the data.
Reference is made to fig. 2, which takes the MySQL binlog as an example:
After the database is configured, the system obtains a driver for establishing a connection with the target data source. At the "is there data" decision point, the system checks whether data is available in the target system; if so, it continues, otherwise it ends.
The next step assembles the connection information: connection parameters are constructed from the collected information and the database is connected; if the connection succeeds, the system starts listening for data and launches an event to parse the binlog.
"Start file state monitoring" is an important link in the flow, responsible for monitoring changes of file state in the target system. After file state monitoring starts, a check thread is created every three minutes to check periodically whether reading is blocked, ensuring that reading proceeds smoothly. If it is, the system then checks whether the file being collected is outdated; if not, it checks whether it is the last file. Once the log file has been fully read, the monitor is closed.
B. Trigger-based capture
Trigger setting: set triggers on the tables or the database of the source database, automatically triggering a capture operation when data changes.
Data capture: after a trigger fires, execute the corresponding capture logic and store the changed data in a temporary table or queue.
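The trigger mechanism above can be demonstrated in a self-contained way with SQLite (chosen here only because it ships with Python; the table and trigger names are illustrative, not from the patent): an AFTER UPDATE trigger copies each changed row into a change table that the synchronizer later drains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(id INTEGER PRIMARY KEY, status TEXT);
-- temporary table holding captured changes for the synchronizer
CREATE TABLE orders_changes(id INTEGER, status TEXT, op TEXT);
-- trigger fires automatically on every UPDATE of the source table
CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changes VALUES (NEW.id, NEW.status, 'UPDATE');
END;
""")
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'paid' WHERE id = 1")
changes = conn.execute("SELECT * FROM orders_changes").fetchall()
# `changes` now holds the captured row for the synchronizer to publish
```

A production setup would add INSERT and DELETE triggers in the same pattern and delete rows from the change table once they are synchronized.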
C. Capture based on a timestamp or auto-increment primary key id
Watermark recording: record the timestamp or auto-increment primary key id at each synchronization during the data synchronization process.
Data filtering: at the next synchronization, filter out the data changed since the last synchronization according to the timestamp or auto-increment primary key id.
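The record-then-filter cycle can be sketched as a simple watermark function (the row shape and function name are illustrative assumptions): remember the largest auto-increment id seen, and on the next pass fetch only rows above it.

```python
def sync_increment(rows, last_id):
    """Return rows newer than the watermark and the new watermark."""
    new_rows = [r for r in rows if r["id"] > last_id]          # data filtering
    next_id = max((r["id"] for r in new_rows), default=last_id)  # record watermark
    return new_rows, next_id
```

The same shape works with a timestamp column instead of an id; the caveat either way is that rows updated in place without bumping the watermark column are missed, which is why the log-based CDC strategy is preferred when a log is available.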
3. Data conversion and mapping layer
3.1 Data conversion
Through data format conversion, the data conversion and mapping layer converts the data of the source database into a format acceptable to the target database, solving problems such as data type differences and inconsistent field names, and ensures that data in different data models are represented and processed according to a unified standard by establishing a data mapping relation between the source and target data models. Data format conversion comprises data cleaning, data standardization, and data adaptation. Specifically:
Data cleaning: improve data quality by removing duplicate data, correcting erroneous data, and filling missing values.
Data standardization: convert data into a unified format or standard to facilitate subsequent processing and analysis.
Data adaptation: convert data into a format that meets the needs of a particular system or application.
Common operations are:
data type conversion, such as converting a string type to a numeric type.
Data splitting and merging: split a complex data structure into simple components, or merge data from multiple sources into a whole.
Data aggregation: aggregate or group data to generate higher-level views of the data.
Data transformation, such as logarithmic or exponential transformation, to improve the readability and analyzability of the data.
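A few of the operations above — duplicate removal, missing-value filling, type conversion, and string standardization — can be combined into one small pass over the records. This is a minimal sketch with invented field names, not the layer's actual implementation:

```python
def clean_and_convert(records):
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:        # data cleaning: drop duplicate records
            continue
        seen.add(r["id"])
        out.append({
            "id": int(r["id"]),    # data type conversion: string -> numeric
            # standardization (trim/lowercase) plus missing-value filling
            "name": r.get("name", "unknown").strip().lower(),
        })
    return out
```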
3.2 Data mapping
The data mapping relation between the source and target data models can be realized by manual, fully automatic, or semi-automatic data mapping:
Manual data mapping: establish an explicit data correspondence by hand; suitable for scenarios with small data volumes and relatively simple data structures.
Fully automatic data mapping: automatically match data fields using natural language processing technology to achieve efficient mapping.
Semi-automatic data mapping: for the result produced by fully automatic mapping, if a field is mapped incorrectly, the mapping can be adjusted manually.
The main features of the data mapping mode are as follows:
A specific and clear correspondence between data elements is established between the source and target data models.
Independence and flexibility: the data mapping approach keeps the persistent data storage layer, the in-memory data presentation layer, and the data mapping itself mutually independent, improving the flexibility and maintainability of the system.
Standardization and consistency: well-defined mapping rules and logic ensure that data in different data models are represented and processed according to a unified standard.
Establishing the data mapping relation between the source and target data models reduces errors and manual intervention in the data integration process through a clear data correspondence, improving data integration efficiency. Converting data into a format that is easier to analyze and query improves query speed, reduces system load, and thereby optimizes data analysis and query performance. Integrating data from multiple data sources into a unified data model also makes cross-source analysis and querying convenient.
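The semi-automatic workflow — an automatic field-matching pass whose result can be corrected by hand — can be sketched as follows. As a stand-in for the NLP matching the text describes, this example uses simple fuzzy string matching from the standard library (`difflib`); the function name, cutoff, and override mechanism are all illustrative assumptions.

```python
import difflib

def propose_mapping(source_fields, target_fields, manual_overrides=None):
    """Propose source->target field mapping; overrides win over fuzzy matches."""
    mapping = {}
    for s in source_fields:
        # automatic pass: pick the closest target field name, if any
        hit = difflib.get_close_matches(s, target_fields, n=1, cutoff=0.6)
        if hit:
            mapping[s] = hit[0]
    mapping.update(manual_overrides or {})   # semi-automatic manual correction
    return mapping
```

In practice the manual override table would come from an operator reviewing the proposed mapping, mirroring the adjust-on-error step described above.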
4. Real-time synchronization engine
The data produced by the data capture layer is packaged into messages and sent to a message queue, such as a specific Kafka topic. The real-time synchronization engine receives these messages and supports batch writes and transactional operations, ensuring data integrity and consistency and realizing asynchronous processing and load balancing. Using multithreading, a batch of data can be split into several parts and written to the database in parallel, which can significantly improve write speed, although database connection management and transaction synchronization then require attention.
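The parallel batch write can be sketched with a thread pool. This is an illustrative skeleton, not the engine itself: `write_chunk` stands in for a real database or Kafka writer, and — as the text cautions — each thread would need its own database connection in practice.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_write(batch, write_chunk, workers=4):
    """Split a batch into `workers` chunks and write them in parallel.

    Returns the total count reported by the per-chunk writers.
    """
    chunks = [batch[i::workers] for i in range(workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(write_chunk, chunks))
    return sum(results)
```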
5. Monitoring and logging
The entire synchronization process is monitored, with fault recovery capability. The monitored data is written to the target database, ClickHouse or Elasticsearch. ClickHouse offers low latency, high throughput, high-performance queries, and low resource consumption, making it well suited to scenarios requiring high-speed data processing and real-time analysis. Refer to the collection, processing, storage, and query flow of log data shown in fig. 5. Logs generated by the data capture layer and the real-time synchronization engine are stored in the target database (ClickHouse or Elasticsearch) using a message queue. The system can efficiently process large amounts of log data and provides powerful query and analysis capabilities.
Various abnormal conditions, such as network interruptions and database faults, may occur while capturing and writing data; a well-designed exception handling mechanism ensures data integrity and synchronization reliability. Specifically:
5.1 Identify potential exception types
Network exceptions, such as connection timeouts, network outages, and DNS resolution failures.
Database exceptions, such as connection failures, SQL syntax errors, transaction conflicts, and deadlocks.
System resource exceptions, such as insufficient memory, insufficient disk space, and excessive CPU utilization.
Business logic exceptions, such as data validation failures and business rule violations.
5.2 Design exception handling policy
Prevention strategy: prevent exceptions through proper resource management and system monitoring, and set reasonable timeouts, resource quotas, and load balancing.
Capture strategy: use try-catch blocks at critical points in the code to catch and handle anticipated exceptions, keeping the caught exception types as specific as possible so that targeted handling measures can be taken.
Recording strategy: record all caught exceptions and their context, such as timestamps, user IDs, and request parameters, for subsequent analysis and debugging.
Rollback strategy: for operations involving database transactions, ensure on exception that the state before the transaction began can be restored, maintaining data consistency.
Retry strategy: for recoverable exceptions, such as a brief network interruption, automatically retry the operation through a retry mechanism until the retry limit is reached.
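The retry strategy for recoverable errors can be sketched as a wrapper with exponential backoff (the retriable exception type, retry count, and delays here are illustrative choices, not values from the patent):

```python
import time

def with_retry(op, retries=3, base_delay=0.01, retriable=(ConnectionError,)):
    """Run op(); on a retriable error, back off and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return op()
        except retriable:
            if attempt == retries:
                raise                              # retries exhausted: surface it
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Non-retriable exceptions (e.g. SQL syntax errors) propagate immediately, which is what routes them to the rollback and recording strategies instead.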
5.3 Processing method of exception handling mechanism
Logging framework: use Log4j to record exception information and context.
Database transaction management: use JDBC transactions and Spring declarative transaction management to ensure data consistency.
Error handling and feedback: use a reasonable error response and feedback mechanism for user requests so that users understand what happened and can take further action if necessary.
Resource cleanup: after an exception is caught, ensure that all allocated resources, such as database connections and file handles, are released.
Monitoring and alerting: set up a monitoring and alerting system to notify the relevant personnel promptly when an exception occurs.
Compared with the prior-art approach of building an API layer and dual-writing to Kafka, this embodiment has significant advantages in development effort, impact on applications, and monitoring and maintenance, as shown in the following table:
The data sources supported by this embodiment include more than 100 databases, such as MySQL, MongoDB, PostgreSQL, Oracle, SQL Server, Elasticsearch, Kafka, OceanBase, Cassandra, ASE, Gaussdb2000, MariaDB, Apache Doris, StarRocks, Databend, DM, HBase, HIVE, MQ, KUDU, Greenplum, TiDB, Hana, TCP/IP, ClickHouse, File and CSV. By combining log-based CDC technology with real-time data conversion and data mapping, and with a large number of built-in CDC connectors, data can be acquired in real time in a non-intrusive manner, easily achieving full-increment integration of the data.
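The buffer-and-publish-on-commit flow that underlies this log-based capture can be sketched as follows. The record format (`begin`/`change`/`commit` entries) and the function name are assumptions made for this illustration; a real CDC connector parses the database's own log format.

```python
def replay_log(records, publish):
    """Buffer changes per transaction and publish them only when a
    transaction end marker (commit) is seen, keeping the target consistent."""
    buffer = []
    for rec in records:
        if rec["type"] == "begin":
            buffer = []  # transaction start marker: open a fresh buffer
        elif rec["type"] == "change":
            buffer.append(rec["data"])  # extract only the changed portion
        elif rec["type"] == "commit":
            publish(buffer)  # transaction end marker: publish buffered changes
            buffer = []
```

Uncommitted changes never reach `publish`, which is how the scheme preserves transactional consistency in the target database.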
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A method for real-time synchronization of heterogeneous databases supporting full-increment integration, comprising:
designing a unified interface layer in the heterogeneous data access layer to support connections to various databases, and writing a specific adapter for each supported database type through a database adapter pattern so as to shield the differences between databases;
monitoring changes to the database log file, reading and parsing newly added log records to determine incremental data, and detecting transaction start and end markers in the log;
for each data change record, extracting only the changed portion and placing the extracted incremental data into a data buffer;
when a transaction end marker is detected, checking the data changes in the data buffer and publishing them to the target database;
converting the data of the source database into a format acceptable to the target database through data format conversion, and establishing a data mapping relationship between the source data model and the target data model to ensure that data in different data models are represented and processed according to a unified standard; and
monitoring the entire synchronization process, and using a message queue to store the logs generated by the data capture layer and the real-time synchronization engine into the target database.
2. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 1, wherein a data capture layer monitors data changes of the source database in real time and adopts different capture strategies according to the characteristics of different databases, the capture strategies comprising capture based on CDC connectors, capture based on triggers, or capture based on a timestamp or auto-increment primary key id.
3. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 2, wherein the CDC-connector-based capture comprises:
for databases that provide log functionality, capturing data changes through the CDC connector by analyzing the log, converting the change operations in the log into a processable data format, and extracting only the data newly added or changed since the last synchronization to reduce the load of data transmission and processing.
4. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 2, wherein the trigger-based capture comprises:
setting a trigger on a table or the database of the source database, so that a capture operation is automatically triggered when data changes and the changed data is stored in a temporary table or queue.
5. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 2, wherein the capture based on a timestamp or auto-increment primary key id comprises:
recording the timestamp or auto-increment primary key id of each synchronization during data synchronization; and
in the next synchronization, filtering out the data changed since the last synchronization according to the timestamp or auto-increment primary key id.
6. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 1, wherein the reading and parsing of newly added log records comprises:
processing the log parsing and data capture using multithreading or multiprocessing, wherein each thread or process is responsible for a portion of the log records; and
before data changes are published, using a transaction manager to check transaction integrity so that the data changes of each transaction are published only after the transaction is committed, ensuring data consistency.
7. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 1, wherein the converting the data of the source database into a format acceptable to the target database comprises:
data cleaning: improving data quality by removing duplicate data, correcting erroneous data and filling missing values;
data standardization: converting the data into a unified format or standard to facilitate subsequent processing and analysis; and
data adaptation: converting the data into a format that meets the needs of a specific system or application.
8. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 1, wherein a specific and clear data-element correspondence is established between the source data model and the target data model so that data in different data models are represented and processed according to a unified standard, and the method of establishing the data mapping relationship comprises manual data mapping, fully automatic data mapping, or semi-automatic data mapping, wherein:
the manual data mapping establishes explicit data correspondences manually;
the fully automatic data mapping automatically matches data fields using natural language processing techniques; and
the semi-automatic data mapping starts from the result of fully automatic data mapping, and if incorrect mappings are produced between fields, the mapped fields can be adjusted manually.
9. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 1, wherein the monitoring of the entire synchronization process further comprises designing an exception handling mechanism to ensure data integrity and synchronization reliability.
10. The method for real-time synchronization of heterogeneous databases supporting full-increment integration according to claim 9, wherein the exception handling mechanism comprises:
recording exception information and context using a logging framework;
using database transaction management to ensure data consistency;
using a reasonable error-response and feedback mechanism for user requests;
after an exception is captured, performing resource cleanup to ensure that all allocated resources are released; and
providing a monitoring and alerting system to notify the relevant personnel promptly when an exception occurs.
CN202411184715.7A 2024-08-27 2024-08-27 A method for real-time synchronization of heterogeneous databases supporting full-increment integration Pending CN119166717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411184715.7A CN119166717A (en) 2024-08-27 2024-08-27 A method for real-time synchronization of heterogeneous databases supporting full-increment integration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411184715.7A CN119166717A (en) 2024-08-27 2024-08-27 A method for real-time synchronization of heterogeneous databases supporting full-increment integration

Publications (1)

Publication Number Publication Date
CN119166717A true CN119166717A (en) 2024-12-20

Family

ID=93877585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411184715.7A Pending CN119166717A (en) 2024-08-27 2024-08-27 A method for real-time synchronization of heterogeneous databases supporting full-increment integration

Country Status (1)

Country Link
CN (1) CN119166717A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120256520A (en) * 2025-03-20 2025-07-04 合肥喆塔科技有限公司 Database synchronization method, device, equipment and storage medium
CN120631637A (en) * 2025-08-13 2025-09-12 中国空气动力研究与发展中心计算空气动力研究所 Database cluster partition operation method, device and medium in network partition scenario
CN120631637B (en) * 2025-08-13 2025-10-24 中国空气动力研究与发展中心计算空气动力研究所 Database cluster partition operation method, equipment and medium in network partition scene

Similar Documents

Publication Publication Date Title
CN111949633B (en) ICT system operation log analysis method based on parallel stream processing
CN110457190B (en) Full link monitoring method, device and system based on block chain
US7673291B2 (en) Automatic database diagnostic monitor architecture
US7155641B2 (en) System and method for monitoring the performance of a server
CN119166717A (en) A method for real-time synchronization of heterogeneous databases supporting full-increment integration
US7376682B2 (en) Time model
CN111901399B (en) Cloud platform block equipment exception auditing method, device, equipment and storage medium
CN110764980A (en) Log processing method and device
CN117112656A (en) An integrated information intelligent management system and method for scientific and technological volunteer service management
CN110737710A (en) Distributed data automatic structured warehousing method and system
JP6633642B2 (en) Method and device for processing data blocks in a distributed database
CN118656307A (en) Fault detection method, server, medium and product of baseboard management controller
CN202150114U (en) Oracle monitoring system
CN115878721A (en) Data synchronization method, device, terminal and computer readable storage medium
CN113242157A (en) Centralized data quality monitoring method under distributed processing environment
CN120196623A (en) A data cleaning method, device and electronic device based on Drools rule engine
CN108712306A (en) A kind of information system automation inspection platform and method for inspecting
CN113778795B (en) Cross-version Oracle monitoring system based on Python language
CN118152354A (en) Slow query log processing method and device, computer equipment and storage medium
CN114385737A (en) Electric power monitoring data monitoring method and platform based on change data capture
CN117271183A (en) Method and device for acquiring database abnormal job scheduling retry strategy
CN115022402A (en) Agent acquisition method and system based on one-stack integration technology
CN119807176B (en) Database deadlock detection and analysis method, device, equipment and medium
CN115705260A (en) Information processing method and device, computer equipment and storage medium
CN116382998A (en) Method and system for collecting apache in cloud environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination