Tuesday, March 7, 2017

The infrastructure of a big data processing platform

Figure: Basic infrastructure of the data processing platform
The data processing stages and data flows are shown in the figure above.
Data source:
The current data sources are mainly application data, databases, server logs, and others. The MySQL binlog is mainly used for computing new users and active users. Logs can also be divided by interface.
Data integration:
Data integration mainly refers to the tools that collect the various kinds of data. Data is sent from the source collectors to the data cache center (Kafka) using Java NIO. Some data sources are not integrated into the data cache center and are instead loaded directly into Hadoop through ETL.
Data sorting:
This process is built on the message buffer (Kafka) and its publish/subscribe model. The original data passes through a number of sorting nodes that format, classify, and filter it, and finally produce multiple subscribable data sources (topics). The granularity of the topics can be adjusted to the needs of data analysis by splitting or combining them. The sorted data goes in two directions: one part is consumed by the real-time computing system, and the other is written into time-sliced files that are held for offline computing.
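The work of a single sorting node can be sketched roughly as follows. This is only an illustration in plain Python with no Kafka client involved. The JSON message format, the "type" field, and the topic names are all assumptions made for the example, since the post does not describe the real message layout.

```python
import json
from collections import defaultdict

# Hypothetical mapping from record type to topic name (not from the post).
TOPIC_BY_TYPE = {
    "login": "user_login",
    "order": "order_event",
}


def sort_messages(raw_lines):
    """Format, filter, and classify raw messages into per-topic lists."""
    topics = defaultdict(list)
    for line in raw_lines:
        try:
            record = json.loads(line)  # formatting: raw text -> structured record
        except json.JSONDecodeError:
            continue  # filtering: drop lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # filtering: drop JSON values that are not objects
        record_type = record.get("type")
        if not isinstance(record_type, str):
            continue  # filtering: drop records without a usable type
        topic = TOPIC_BY_TYPE.get(record_type)
        if topic is None:
            continue  # filtering: drop types that no topic subscribes to
        topics[topic].append(record)  # classification: group by topic
    return dict(topics)
```

Splitting or combining topics then comes down to editing the mapping: pointing two record types at the same topic name combines them, and giving each its own name splits them.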
Data storage:
Data storage mainly deals with the time slice files produced by sorting. For performance reasons the original data may be very rough, so special ETL tools are needed to generate the final files in a specific format. The final files are loaded into Hadoop's distributed file system at regular intervals, where they support Hive distributed computing tasks.
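As a rough illustration of the ETL step, the sketch below turns loosely structured records into tab-separated lines with a fixed column order, a layout that a Hive text table can read. The column names are hypothetical, and the post does not say which format the real ETL tools produce. Missing values are written as \N, which is Hive's default null marker for text files.

```python
# Hypothetical column order of the final file (not from the post).
COLUMNS = ["ts", "uid", "action"]


def to_tsv_line(record):
    """Render one record as a tab-separated line with a fixed column order."""
    values = []
    for column in COLUMNS:
        value = record.get(column)
        text = "\\N" if value is None else str(value)
        # Tabs and newlines inside a value would break the row layout,
        # so they are replaced with spaces.
        values.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(values)


def convert_slice(records):
    """Convert all records of one time slice into final-format lines."""
    return [to_tsv_line(record) for record in records]
```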
Distributed computing:
The source data stored in the Hadoop cluster drives a large number of offline computing tasks, which are computed with MapReduce and whose results are displayed on the front end. The result data is finally stored in Hive tables partitioned by time type, time value, and data type.
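The three-level partitioning can be pictured with a small helper. Hive stores each partition in nested key=value directories, so the sketch below builds that relative path. The partition column names time_type, time_value, and data_type are assumptions chosen to match the description above, and the example values are made up.

```python
def partition_path(time_type, time_value, data_type):
    """Build the relative directory of one Hive partition.

    The column names are hypothetical; only the nested key=value
    directory layout is standard Hive behavior.
    """
    parts = [
        ("time_type", time_type),
        ("time_value", time_value),
        ("data_type", data_type),
    ]
    return "/".join(f"{key}={value}" for key, value in parts)


# Example with made-up values: daily results for login data.
example = partition_path("day", "20170307", "login")
```

With a layout like this, a query that filters on all three columns only needs to read one directory.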
Data pre-loading:
The result data is loaded into memory so that it can serve the different data requirements of the front end.
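A minimal sketch of the pre-loading idea, assuming the result rows are keyed the same way as the Hive partitions (time type, time value, data type). The class name and the row shape are hypothetical, and a real deployment would likely use a shared cache service rather than a dictionary inside one process.

```python
class ResultCache:
    """In-memory lookup table keyed by (time_type, time_value, data_type)."""

    def __init__(self):
        self._results = {}

    def preload(self, rows):
        """Load result rows into memory and return how many were loaded."""
        count = 0
        for time_type, time_value, data_type, value in rows:
            self._results[(time_type, time_value, data_type)] = value
            count += 1
        return count

    def get(self, time_type, time_value, data_type, default=None):
        """Return the cached value, or the default if it was not preloaded."""
        return self._results.get((time_type, time_value, data_type), default)
```

Front-end requests are then answered from memory without touching Hive.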
Data display:
At present, the main data displays are the data platform (Web), report emails, and third-party applications.
