Binlog-ETL
Binlog-ETL is a system for synchronising data from MySQL to a HIVE data warehouse on Hadoop. Its main feature is that it keeps
an up-to-date snapshot of the MySQL tables in the data warehouse while the MySQL databases continue to be updated.
Binlog
Binlog is short for binary log. The MySQL binary log contains “events” that describe database changes, such as table creation
operations or changes to table data. MySQL also uses it for master-slave replication. In this repository, the binary log is used to
synchronise MySQL tables to a data warehouse built on HIVE and Hadoop.
Features
- Create HIVE tables and partitions
- Parse MySQL binary logs and extract updates
- Upload data to HDFS
- Support multiple processes
- Support sharded MySQL databases and tables
- Keep the latest snapshot and remove duplicates
- All of the above features run automatically
Requirements
- MySQL version 5.6 or newer
- Binary logging enabled in my.cnf
Data Flow
The Binlog-ETL system sits between the MySQL cluster and the HIVE data warehouse. It actively requests binlogs from the MySQL cluster and creates snapshots automatically.
- Step 1: request the latest binlogs and download them to local disk
- Step 2: parse the binlogs and transform the data into the target format
- Step 3: upload the new data, which holds the latest updates to the MySQL tables, to HDFS
- Step 4: merge the new data with the old snapshot, keeping the latest updates and removing duplicates
- Step 5: create a new snapshot and finish
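The merge in Step 4 can be illustrated with a small in-memory sketch. This is not the repository's implementation, which presumably runs against HIVE tables on HDFS; it only shows the idea of keeping the latest change per row. The record layout (the keys `pos`, `op`, `pk` and `row`) and the function name `merge_snapshot` are assumptions made for this example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_snapshot(old_snapshot: Dict[int, Row], changes: List[dict]) -> Dict[int, Row]:
    """Return a new snapshot with the changes applied in binlog order.

    Each change is a hypothetical record with the keys:
      pos - position of the event in the binlog (larger means later)
      op  - "insert", "update" or "delete"
      pk  - primary key of the affected row
      row - full row contents after the change (None for deletes)
    """
    new_snapshot = dict(old_snapshot)  # leave the old snapshot untouched
    for change in sorted(changes, key=lambda c: c["pos"]):
        if change["op"] == "delete":
            new_snapshot.pop(change["pk"], None)
        else:
            # insert or update: a later row replaces an earlier one,
            # so each primary key ends up with exactly one row
            new_snapshot[change["pk"]] = change["row"]
    return new_snapshot


old = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
changes = [
    {"pos": 120, "op": "update", "pk": 1, "row": {"id": 1, "name": "alicia"}},
    {"pos": 100, "op": "update", "pk": 1, "row": {"id": 1, "name": "al"}},
    {"pos": 130, "op": "delete", "pk": 2, "row": None},
    {"pos": 140, "op": "insert", "pk": 3, "row": {"id": 3, "name": "carol"}},
]
new = merge_snapshot(old, changes)
print(new)
```

Sorting by binlog position before applying the changes means that out-of-order input, for example from several parser processes, still produces the same snapshot, and keying the snapshot by primary key is what removes duplicates.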