CDL is a simple and efficient real-time data integration service. It captures data change events from various OLTP databases and pushes them to Kafka. The Sink Connector then consumes the data from these topics and imports it into big data ecosystem applications, so that data is ingested into the data lake in real time.
The CDL service contains two roles: CDLConnector and CDLService. CDLConnector is the instance for executing a data capture job, and CDLService is the instance for managing and creating a job.
You can create data synchronization and comparison tasks on the CDLService WebUI.
| Data Source | Destination | Description |
| --- | --- | --- |
| MySQL | Hudi | This task synchronizes data from the MySQL database to Hudi. |
| MySQL | Kafka | This task synchronizes data from the MySQL database to Kafka. |
| PgSQL | Hudi | This task synchronizes data from the PgSQL database to Hudi. |
| PgSQL | Kafka | This task synchronizes data from the PgSQL database to Kafka. |
| Hudi | DWS | This task synchronizes data from Hudi to DWS. |
| Hudi | ClickHouse | This task synchronizes data from Hudi to ClickHouse. |
| ThirdKafka | Hudi | This task synchronizes data from ThirdKafka (a third-party Kafka) to Hudi. |
To check whether binary logging is enabled for the MySQL database:
Use a tool (Navicat in this example) or the CLI to connect to the MySQL database and run the show variables like 'log_%' command to view the configuration.
For example, in Navicat, choose File > New Query to create a query, enter the following SQL statement, and click Run. If log_bin is displayed as ON in the result, the function is enabled successfully.
```sql
show variables like 'log_%'
```
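For reference, output similar to the following indicates that binary logging is enabled (the variable values are illustrative and vary by deployment):

```sql
show variables like 'log_%';
-- Illustrative output:
-- +------------------+-----------+
-- | Variable_name    | Value     |
-- +------------------+-----------+
-- | log_bin          | ON        |
-- | log_bin_basename | mysql-bin |
-- | ...              | ...       |
-- +------------------+-----------+
```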
If the bin log function of the MySQL database is not enabled, perform the following operations:
Modify the MySQL configuration file my.cnf (my.ini for Windows) as follows:
```
server-id         = 223344
log_bin           = mysql-bin
binlog_format     = ROW
binlog_row_image  = FULL
expire_logs_days  = 10
```
After the modification, restart MySQL for the configurations to take effect.
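As a quick sanity check after the restart, you can also query the relevant system variables directly (a minimal sketch; variable availability depends on the MySQL version):

```sql
-- Verify the binlog settings configured in my.cnf above.
SELECT @@log_bin, @@binlog_format, @@binlog_row_image;
-- Expected: 1 (ON), ROW, FULL
```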
To check whether GTID is enabled for the MySQL database:
Run the show global variables like '%gtid%' command to check whether GTID is enabled. For details, see the official documentation of the corresponding MySQL version (for how to enable it in MySQL 8.x, see https://dev.mysql.com/doc/refman/8.0/en/replication-mode-change-online-enable-gtids.html).
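For example, output similar to the following indicates that GTID is enabled (values are illustrative):

```sql
show global variables like '%gtid%';
-- Key variables to check:
-- | gtid_mode                | ON |
-- | enforce_gtid_consistency | ON |
```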
Set user permissions:
To execute MySQL tasks, users must have the SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE and REPLICATION CLIENT permissions.
Run the following command to grant the permissions:
```sql
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'Username' IDENTIFIED BY 'Password';
```
Run the following command to update the permissions:
```sql
FLUSH PRIVILEGES;
```
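Note that MySQL 8.x no longer accepts IDENTIFIED BY inside GRANT; there, create the account first and then grant the permissions. A minimal sketch (the user name 'cdl_user' and host '%' are placeholders):

```sql
-- MySQL 8.x: create the account, then grant the required permissions.
CREATE USER 'cdl_user'@'%' IDENTIFIED BY 'Password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'cdl_user'@'%';
FLUSH PRIVILEGES;
```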
When a delete operation is performed on the source database, the delete event contains only the primary key information. As a result, for deleted data written to Hudi, only the primary key columns have values; all other service fields are null.
When a single row of data in the database is 8 KB or larger, an update event contains only the changed fields. In this case, the values of some fields in the Hudi data are __debezium_unavailable_value.
The related commands are as follows:
```sql
SELECT CASE relreplident
         WHEN 'd' THEN 'default'
         WHEN 'n' THEN 'nothing'
         WHEN 'f' THEN 'full'
         WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class
WHERE oid = 'tablename'::regclass;
```
```
#------------------------------------------------
# WRITE-AHEAD LOG
#------------------------------------------------
# - Settings -
wal_level = logical   # minimal, replica, or logical
                      # (change requires restart)
#fsync = on           # flush data to disk for crash safety
...
```
```
# Stop
pg_ctl stop

# Start
pg_ctl start
```
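If the replica identity query above does not return full, a common fix (a sketch; replace tablename with the actual table name) is to change the table's replica identity so that change events carry all columns, and then confirm the WAL level after the restart:

```sql
-- Make logical-decoding change events carry all columns of the row
-- (addresses the delete and 8 KB update behaviors described above).
ALTER TABLE tablename REPLICA IDENTITY FULL;

-- After restarting PostgreSQL, confirm the WAL level.
SHOW wal_level;  -- expected: logical
```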
Before a synchronization task is started, ensure that both the source and target tables exist with identical table structures, and that the value of ads_last_update_date in the DWS table is the current system time.
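As an illustration only, a hypothetical DWS target table might look as follows (the table and column names are assumptions; the structure must mirror your actual source table):

```sql
-- Hypothetical DWS target table; ads_last_update_date defaults to
-- the current system time, as required above.
CREATE TABLE demo_sink (
  id                   int PRIMARY KEY,
  name                 varchar(64),
  ads_last_update_date timestamp DEFAULT now()
);
```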
The supported upper-layer sources are openGauss and OGG. The Kafka topics at the source end must be consumable by Kafka in the MRS cluster.
You have the permissions to operate ClickHouse. For details, see ClickHouse User and Permission Management.
This section describes the data types supported by CDL synchronization tasks and the mapping between data types of the source database and Spark data types.
| PostgreSQL Data Type | Spark (Hudi) Data Type |
| --- | --- |
| int2 | int |
| int4 | int |
| int8 | bigint |
| numeric(p, s) | decimal[p,s] |
| bool | boolean |
| char | string |
| varchar | string |
| text | string |
| timestamptz | timestamp |
| timestamp | timestamp |
| date | date |
| json, jsonb | string |
| float4 | float |
| float8 | double |
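To make the mapping concrete, the hypothetical PostgreSQL table below (all names are illustrative) would land in Spark (Hudi) with the commented types, per the table above:

```sql
-- Hypothetical PostgreSQL source table with the resulting Spark types.
CREATE TABLE orders (
  order_id   int8,            -- -> bigint
  amount     numeric(10, 2),  -- -> decimal[10,2]
  in_stock   bool,            -- -> boolean
  note       text,            -- -> string
  created_at timestamptz      -- -> timestamp
);
```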
| MySQL Data Type | Spark (Hudi) Data Type |
| --- | --- |
| int | int |
| integer | int |
| bigint | bigint |
| double | double |
| decimal[p,s] | decimal[p,s] |
| varchar | string |
| char | string |
| text | string |
| timestamp | timestamp |
| datetime | timestamp |
| date | date |
| json | string |
| float | double |
| Oracle Data Type | Spark (Hudi) Data Type |
| --- | --- |
| NUMBER(3), NUMBER(5) | bigint |
| INTEGER | decimal |
| NUMBER(20) | decimal |
| NUMBER | decimal |
| BINARY_DOUBLE | double |
| CHAR | string |
| VARCHAR | string |
| TIMESTAMP, DATETIME | timestamp |
| timestamp with time zone | timestamp |
| DATE | timestamp |
| Spark (Hudi) Data Type | DWS Data Type |
| --- | --- |
| int | int |
| long | bigint |
| float | float |
| double | double |
| decimal[p,s] | decimal[p,s] |
| boolean | boolean |
| string | varchar |
| date | date |
| timestamp | timestamp |
| Spark (Hudi) Data Type | ClickHouse Data Type |
| --- | --- |
| int | Int32 |
| long | Int64 (bigint) |
| float | Float32 (float) |
| double | Float64 (double) |
| decimal[p,s] | Decimal(P,S) |
| boolean | bool |
| string | String (LONGTEXT, MEDIUMTEXT, TINYTEXT, TEXT, LONGBLOB, MEDIUMBLOB, TINYBLOB, BLOB, VARCHAR, CHAR) |
| date | Date |
| timestamp | DateTime |
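For illustration, a hypothetical ClickHouse target table using these mappings might look as follows (the table name and the MergeTree engine choice are assumptions):

```sql
-- Hypothetical ClickHouse target table built from the mapped types.
CREATE TABLE demo_target (
  id         Int64,           -- Spark long
  price      Decimal(10, 2),  -- Spark decimal[10,2]
  label      String,          -- Spark string
  event_day  Date,            -- Spark date
  event_time DateTime         -- Spark timestamp
) ENGINE = MergeTree()
ORDER BY id;
```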
Data comparison checks the consistency between data in the source database and that in the target Hive. If the data is inconsistent, CDL can attempt to repair the inconsistent data. For details, see Creating a CDL Data Comparison Job.