Setting Up Apache Airflow Celery Executor Cluster

Apache Airflow is one of the most common tools for routine task execution, such as data ETL pipelines. When deploying it in a production environment, scalability and high availability are primary concerns for any data engineer. The Apache Airflow Celery Executor is one of the tools that comes in handy.

This post walks through the step-by-step process of establishing an Airflow 2.1.0 cluster with Python 3.7.3 on Linux CentOS 7, and constructs a simple cluster based on the structure shown in the diagram below.

Install Python 3.7

The Python installation requires the GCC compiler on your system. Log in to your server and use the following command to install the prerequisite packages for Python.

# yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make libffi-devel wget

Download Python from the official Python website using the following command. You can also download a specific version by specifying it.

# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tar.xz

Now extract the downloaded package.

# tar Jxvf Python-3.7.3.tar.xz

# cd Python-3.7.3

Use the commands below to compile the Python source code on your system. The installation path can be specified by changing the --prefix value.

# ./configure --prefix=/usr/local/python3
# make && make install

Now remove downloaded source archive file.

# rm /usr/src/Python-3.7.3.tar.xz

Check the version of the installed Python. The new interpreter is not on the PATH yet, so use the full path.

# /usr/local/python3/bin/python3 -V

Add the following lines to your .bashrc file to make Python 3 your default python and pip.

alias python='/usr/local/python3/bin/python3' 
alias pip='/usr/local/python3/bin/pip3'

After adding the statements above to the .bashrc file, use the source command to put the new settings into effect.

# source ~/.bashrc
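To confirm the aliases are in effect, check the interpreter version again; it should now report 3.7.3.

# python -V
Python 3.7.3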

Python 3 installation reference: https://blog.jiebu-lang.com/centos-7-install-python-3-7/

Install virtualenv

# pip install --upgrade pip
# pip install virtualenv

Then add the following alias to the .bashrc file and source it.

alias virtualenv='/usr/local/python3/bin/virtualenv'

Then create a virtual environment for Airflow.

# virtualenv ~/airflow

Activate the virtual environment.

# source ~/airflow/bin/activate

Install MySQL

Reference:

https://tecadmin.net/install-mysql-8-on-centos/

After installing MySQL, if this MySQL instance needs to accept remote connections from other servers, make sure to open port 3306 in the firewall.

# firewall-cmd --zone=public --add-port=3306/tcp --permanent
success
# firewall-cmd --reload
success
# iptables-save | grep 3306
-A IN_public_allow -p tcp -m tcp --dport 3306 -m conntrack --ctstate NEW -j ACCEPT

Check port status with

# netstat -na |grep 3306

Firewall open reference: https://www.thegeekdiary.com/centos-rhel-7-how-to-open-a-port-in-the-firewall-with-firewall-cmd/

To connect to MySQL remotely, SSH to the other server and run

# mysql -u username -p -h 10.10.10.10

Make sure to substitute username with your pre-assigned username, and 10.10.10.10 with the IP address of the remote MySQL server.

There is a MySQL system variable that needs to be set. Edit the /etc/my.cnf file and add the following line

explicit_defaults_for_timestamp=1

Then restart MySQL (or the whole VM), log in to MySQL, and type the command below to check that the timestamp variable is set to ON.

SHOW GLOBAL VARIABLES LIKE '%timestamp%';
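For reference, the restart and login on CentOS 7 might look like the following sketch, assuming MySQL 8 runs as the mysqld systemd service (the service name may differ depending on how it was installed).

# systemctl restart mysqld
# mysql -u root -p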

Install Redis as the message broker

Download redis

# wget http://download.redis.io/redis-stable.tar.gz

Execute the following command to install the required utilities to build the source code

# yum install make tcl-devel gcc

Extract the redis source code using the following command

# tar -xvf redis-stable.tar.gz

Then navigate to the Redis folder and execute the following commands

# make
# sudo make install

Before starting the Redis server, make sure to configure the necessary settings in the configuration file. If the servers are not exposed to the public internet and there are multiple Airflow worker nodes that need to connect to this broker, either specify a bind address or set protected-mode to no, as in the sketch below.
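For example, the relevant lines in redis.conf might look like this (binding to 0.0.0.0 opens Redis on all interfaces, so only do this on a network that is not reachable from the internet):

bind 0.0.0.0
protected-mode no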

Then start the Redis server with the customized configuration.

# redis-server /path/to/redis.conf
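To verify that the broker is reachable from another node, redis-cli (installed alongside redis-server by make install) can be used; substitute redis-hostname with the broker host.

# redis-cli -h redis-hostname ping
PONG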

Install Apache Airflow 2.1.0

If you plan to set up Airflow with LDAP authentication, then you’ll need to install dependent packages first:

# yum install python-devel openldap-devel

Execute the following steps to install Airflow with the redis and celery extras.

export AIRFLOW_HOME=/root/airflow
AIRFLOW_VERSION=2.1.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[redis,celery]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
pip install pymysql
pip install psycopg2
pip install python-ldap  # optional, for LDAP auth
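A quick sanity check that the installation succeeded inside the virtual environment:

# airflow version
2.1.0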

Setting up the Airflow cluster architecture

There are a number of useful articles on setting up an Airflow cluster with either RabbitMQ or Redis as the broker.


Troubleshooting

When starting the Airflow webserver, the system may prompt an error such as No such file or directory: ‘gunicorn’: ‘gunicorn’.

If so, add the following line to the .bashrc file and source it

export PATH=$PATH:/root/airflow/bin/
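After sourcing .bashrc again, you can confirm that gunicorn now resolves to the copy inside the virtual environment; the output should point at the environment's bin directory.

# source ~/.bashrc
# which gunicorn
/root/airflow/bin/gunicorn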

Setting LDAP Authentication

First, edit the webserver_config.py file under your AIRFLOW_HOME directory. Here is a great reference: LINK
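As a minimal sketch, the LDAP-related part of webserver_config.py can look like the following. It uses Flask-AppBuilder's LDAP settings; the server URL, search base, and bind credentials are placeholders that must be replaced with your directory's own values.

from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://your-ldap-host:389"      # placeholder LDAP server
AUTH_LDAP_SEARCH = "ou=people,dc=example,dc=com"    # placeholder search base
AUTH_LDAP_UID_FIELD = "uid"                         # attribute used as the login name
AUTH_LDAP_BIND_USER = "cn=admin,dc=example,dc=com"  # placeholder bind DN
AUTH_LDAP_BIND_PASSWORD = "bind-password"           # placeholder bind password
AUTH_USER_REGISTRATION = True                       # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Viewer"              # default role for auto-created users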

Airflow.cfg

Before running Airflow as a Celery Executor cluster, there are some settings that need to be configured in the airflow.cfg file located under your AIRFLOW_HOME directory.

The following settings are necessary if you want to run the Celery Executor. The base log folder setting is a little bit tricky: if you are running workers on different machines, it is best practice to mount a shared disk on all worker nodes and point base_log_folder in airflow.cfg to that directory, so that the web server can access logs from tasks that ran on different machines (see the mount sketch after the configuration block below).

[core]
executor = CeleryExecutor
dags_folder = /path/to/your/dags_folder
sql_alchemy_conn = mysql+pymysql://user:password@hostname:3306/db
[logging]
base_log_folder = /to/your/share/folder/that/all/nodes/can/access
[webserver]
base_url = http://hostname:8080
[smtp]
smtp_host = Your smtp server hostname
smtp_starttls = False
smtp_ssl = False
smtp_port = Your smtp server port, usually 25
smtp_mail_from = Name of sender shows on notification email
smtp_timeout = 30
smtp_retry_limit = 5
[celery]
broker_url = redis://redis-hostname:6379/0
result_backend = db+mysql+pymysql://username:password@hostname:3306/db
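As one way to implement the shared log folder mentioned above, every node can mount the same NFS export and point base_log_folder at the mount point. This is a sketch; nfs-server:/export/airflow-logs is a hypothetical export that would have to exist on your network.

# yum install nfs-utils
# mkdir -p /mnt/airflow_logs
# mount -t nfs nfs-server:/export/airflow-logs /mnt/airflow_logs

Then set base_log_folder = /mnt/airflow_logs in airflow.cfg on every node.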

Start Up Airflow
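
Before creating users or starting any component, initialize the metadata database so that the tables behind sql_alchemy_conn exist.

# airflow db init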

Create webserver user

airflow users create -r Admin -u username -e mail@mail.com -f firstname -l lastname -p password

Start up scheduler

airflow scheduler

Start up webserver

airflow webserver

Start up flower

airflow celery flower

Start up worker

airflow celery worker -q queue_1 -H hostname

Or simply use nohup and & to start up all modules in the background at once.

# nohup airflow scheduler & nohup airflow webserver & nohup airflow celery flower & nohup airflow celery worker -q q_1 -H host_name &
