Welcome to My GitHub Page

This page is the hub for my projects. Most of them relate to Big Data, though I have also done a good deal of backend development. For now I am staying focused on Big Data: over my career I have realised that data-related work is what I am passionate about, and I want to gain deep insight into Big Data so that I can one day build intelligent applications that make people's lives easier. Sounds like a naive dream? Yes, but that is an engineer's ambition.

Big Data

From my perspective, the applications of Big Data fall into two fields: engineering and algorithms. The engineering field is mainly about tools, such as Hadoop, Spark and ETL tools. These are the foundation platforms that provide the basic data storage, computing and transport functions on which higher-level applications are built. The algorithms, such as machine learning, artificial intelligence and pattern recognition, are the minds that let those applications serve people in a smart way. Most projects on this page come from the engineering field; only Anomaly Detection is an algorithm project. The reason is that the foundation platform is not yet mature enough to support intelligent applications, so I spend more of my time building the platform.

Engineering projects

Binlog ETL framework

This framework synchronises MySQL tables to Hive, which serves as the data warehouse (DW). It keeps the latest snapshot of each MySQL table in Hive, partitioned by time.
http://focus-andy.github.io/Binlog-ETL
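
To illustrate the snapshot idea, here is a minimal sketch, not the framework's actual code. It assumes a simplified, hypothetical event shape of (operation, primary key, row) and a Hive-style `dt=` partition name; real binlog events carry far more detail, and the names `apply_binlog_events` and `partition_path` are made up for this example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical, simplified change event: (operation, primary_key, row).
Event = Tuple[str, int, dict]


def apply_binlog_events(snapshot: Dict[int, dict], events: Iterable[Event]) -> Dict[int, dict]:
    """Return a new snapshot with the change events applied in order."""
    result = dict(snapshot)  # copy, so the previous snapshot stays untouched
    for op, key, row in events:
        if op in ("insert", "update"):
            result[key] = row
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


def partition_path(table: str, day: str) -> str:
    """Build a Hive-style partition directory name (assumed layout)."""
    return f"{table}/dt={day}"


previous = {1: {"name": "ann"}, 2: {"name": "bob"}}
events = [
    ("update", 1, {"name": "anna"}),
    ("delete", 2, {}),
    ("insert", 3, {"name": "carl"}),
]
latest = apply_binlog_events(previous, events)
print(partition_path("users", "2015-01-02"), latest)
```

Each time partition then holds a complete snapshot, so a query only needs to read the newest partition to see the current state of the table.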

Near Realtime ETL framework

This framework synchronises data from realtime pipe applications, for example Kafka, to Hive as the DW. The original data source is usually in text file format, such as web server logs or files written by other programs, but it can be in any format as long as the data can be transferred into the pipes.
http://focus-andy.github.io/Near-RealTime-ETL
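
As a rough sketch of how records arriving from a pipe might be grouped before being loaded into the DW, the example below buckets timestamped log lines into hourly partitions. This is an assumption for illustration only: a plain Python list stands in for the pipe, the hourly `dt=/hr=` layout is a guess at a typical Hive scheme, and `hour_partition` and `bucket_records` are hypothetical names, not part of the framework.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def hour_partition(epoch_seconds: int) -> str:
    """Map an event timestamp (UTC) to an hourly partition key."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("dt=%Y-%m-%d/hr=%H")


def bucket_records(records: Iterable[Tuple[int, str]]) -> Dict[str, List[str]]:
    """Group (timestamp, line) records by the partition they belong to."""
    buckets = defaultdict(list)
    for epoch_seconds, line in records:
        buckets[hour_partition(epoch_seconds)].append(line)
    return dict(buckets)


# A list stands in for messages read from the pipe.
incoming = [
    (0, "GET /index.html"),
    (3599, "GET /about.html"),
    (3600, "POST /login"),
]
batches = bucket_records(incoming)
for partition, lines in batches.items():
    print(partition, lines)
```

In a near-realtime setup, a loop like this would run on small batches every few minutes, appending each bucket to its matching partition.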

Fast Querying Service

This project is still ongoing. The background is that querying information on current Big Data platforms, such as Hadoop and Hive, is very slow. Some queries can even run for hours, and development, deployment and maintenance can take senior engineers months of work. Why not try Fast Querying Service? Clients can query their information immediately after uploading their data to the Fast Querying platform, and that is all they need to do. Fast Querying Service is simple, fast and reliable.

Algorithm projects

Anomaly Detection

Anomaly Detection is a sub-topic of Machine Learning. It is also the goal of a project carried out in collaboration with Intel. RP (Relation Probability) is an unsupervised machine learning algorithm designed to solve this problem. It learns the probability distribution of the relations among the features of all the training data (generally the historical data), then detects anomalies in new data.
http://anomalydetection.github.io/demola
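
The general idea of scoring a record by how probable its feature relations are can be sketched as follows. This is a simple stand-in and not the RP algorithm itself: it assumes categorical features, treats a "relation" as the pair of values two features take together, and uses the average empirical pair frequency as the score. The names `fit_pair_frequencies` and `relation_score` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple

PairCounts = Dict[Tuple[int, int], Counter]


def fit_pair_frequencies(rows: Sequence[Sequence[str]]) -> PairCounts:
    """Count how often each pair of feature values occurs together."""
    counts: PairCounts = {}
    for row in rows:
        for i, j in combinations(range(len(row)), 2):
            counts.setdefault((i, j), Counter())[(row[i], row[j])] += 1
    return counts


def relation_score(row: Sequence[str], counts: PairCounts, n_train: int) -> float:
    """Average empirical probability of the row's feature pairs (low = unusual)."""
    pairs = list(combinations(range(len(row)), 2))
    if not pairs:
        raise ValueError("need at least two features to form a relation")
    total = sum(
        counts.get((i, j), Counter())[(row[i], row[j])] / n_train
        for i, j in pairs
    )
    return total / len(pairs)


# Historical data: three ordinary web requests and one database query.
train = [("web", "GET", "200")] * 3 + [("db", "SELECT", "200")]
model = fit_pair_frequencies(train)

normal = relation_score(("web", "GET", "200"), model, len(train))
odd = relation_score(("db", "GET", "200"), model, len(train))
print(normal, odd)
```

A record such as ("db", "GET", "200") scores lower because the combination of "db" with "GET" never appears in the historical data, even though each value is common on its own.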