Apache Airflow is one of the most common tools for executing routine tasks such as data ETL pipelines. When deploying it in a production environment, scalability and high availability are primary concerns for any data engineer, and Airflow's Celery Executor is one of the tools that comes in handy.
This post is a step-by-step guide to setting up an Airflow 2.1.0 cluster with Python 3.7.3 on Linux CentOS 7. We will construct a simple cluster based on the structure shown in the diagram below.
Installing Python from source requires the GCC compiler on your system. …
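Before getting into the setup itself, here is a minimal sketch of the kind of DAG the finished cluster will run; once airflow.cfg sets executor = CeleryExecutor, its tasks are picked up by Celery workers. The DAG id, schedule, and task callables below are hypothetical placeholders, not part of the original setup.

```python
# Minimal sketch of a DAG whose tasks would be distributed across
# Celery workers when airflow.cfg has `executor = CeleryExecutor`.
# The DAG id, schedule, and callables are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling source data")  # placeholder ETL step


def load():
    print("writing to the warehouse")  # placeholder ETL step


with DAG(
    dag_id="example_etl",             # hypothetical DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```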
AWS is one of the platforms on which you can run your web server or application quickly and easily.
This is a learning note for the course AWS Fundamentals: Going Cloud Native, provided by AWS through Coursera. There are further settings regarding security and accessibility, such as Roles, Groups, Users, and permission policies, as well as balancing your incoming traffic with a Load Balancer, but we are not going to cover those in this post.
A VPC, an Amazon Virtual Private Cloud, is where we put all our applications and services, so that we have full control over how internet traffic flows…
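To make this concrete, here is a small hypothetical sketch (not from the course) that creates a VPC and one subnet inside it using the boto3 library; the region and CIDR blocks are placeholder assumptions.

```python
# Hypothetical sketch: creating a VPC and a subnet with boto3.
# The region and CIDR blocks below are placeholder assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Create the VPC that will contain our applications and services.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Carve out one subnet inside the VPC for our instances.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print(vpc_id, subnet["Subnet"]["SubnetId"])
```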
Setting up an Apache Spark cluster based on Hadoop, step by step
These are step-by-step notes on setting up an Apache Spark cluster for team projects, with help received from Mr. Steven Chiu, Instructor at the Institute for Information Industry, Taiwan.
Spark is probably one of the most popular big data processing platforms in use nowadays. In order to manage the vast amount of data we collect for our team projects, we decided to use the Hadoop and Spark platforms for further data processing.
This note demonstrates setting up a Spark cluster on Ubuntu Linux.
First of all, we need to…
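As a taste of what the finished cluster enables, here is a hypothetical sketch of pointing a PySpark session at a standalone master; the URL spark://master:7077 and the application name are assumptions for illustration, not values from the original notes.

```python
# Hypothetical sketch: connecting PySpark to a standalone cluster.
# The master URL `spark://master:7077` is an assumed hostname/port.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master:7077")   # assumed standalone master URL
    .appName("team-project-etl")     # hypothetical application name
    .getOrCreate()
)

# Quick sanity check: distribute a small range across the workers.
print(spark.range(1000).count())
spark.stop()
```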
A lifelong enthusiast of Data Engineering and Data Analysis.