Setting Up Apache Airflow Celery Executor Cluster

Kuan-Chih Wang
5 min read · Jun 8, 2021


Apache Airflow is one of the most common tools for routine task execution, such as data ETL pipelines and workflow orchestration. When installing it in a production environment, scalability and high availability are probably the top two concerns, and the Airflow Celery Executor can take care of them both.

This post walks through the step-by-step process, along with external references, to establish an Airflow 2.1.0 cluster with Python 3.7.3 on Linux CentOS 7 virtual machines. The cluster consists of MySQL as the metadata database, Redis as the Celery message broker, a node running the Airflow webserver, scheduler, and Flower, and one or more Celery worker nodes.

Install Python 3.7

Step 1 — Requirements

Python installation requires the GCC compiler on your system. Log in to your server and use the following command to install the prerequisite packages for Python.

# yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make libffi-devel wget

Step 2 — Download Python 3.7

Download Python from the official Python website using the following command. You can download a different version by specifying it in the URL.

# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tar.xz

Now extract the downloaded package.

# tar Jxvf Python-3.7.3.tar.xz

# cd Python-3.7.3

Step 3 — Install Python 3.7

Use the commands below to compile the Python source code on your system. The installation location can be specified with the --prefix option.

# ./configure --prefix=/usr/local/python3
# make && make install

Now remove the downloaded source archive for the sake of housekeeping.

# rm /usr/src/Python-3.7.3.tar.xz

Step 4 — Check Python Version

Check the version of the installed Python.

# python3 -V
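
If /usr/local/python3/bin is not on your PATH yet, call the freshly built interpreter by its full path instead; with the prefix used above that would be:

# /usr/local/python3/bin/python3 -V
Python 3.7.3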

Step 5 — Set python3 as default python

Add the following statements to your .bashrc file to make python3 your default python.

alias python='/usr/local/python3/bin/python3' 
alias pip='/usr/local/python3/bin/pip3'

After putting the above statements in the .bashrc file, use the source command to make the new settings take effect.

# source ~/.bashrc
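
Note that aliases only affect interactive shells; scripts and services that do not read .bashrc will not see them. As an alternative sketch, you can prepend the install location to PATH in .bashrc instead:

export PATH=/usr/local/python3/bin:$PATH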

Python 3 install reference: https://blog.jiebu-lang.com/centos-7-install-python-3-7/

Install virtualenv

Use pip to install Python virtualenv

# pip install --upgrade pip
# pip install virtualenv

To make the installed virtualenv your default command, add the following statement to the .bashrc file, then source it.

alias virtualenv='/usr/local/python3/bin/virtualenv'

Then create a virtual env for airflow

# virtualenv ~/airflow

Activate the virtual env

# source ~/airflow/bin/activate
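
To confirm the environment was created from the Python 3.7 build, check its interpreter directly:

# ~/airflow/bin/python -V
Python 3.7.3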

Install MySQL

Reference:

https://tecadmin.net/install-mysql-8-on-centos/

After installing MySQL, if this MySQL instance is expected to accept remote connections from other servers, make sure to open the firewall for port 3306.

# firewall-cmd --zone=public --add-port=3306/tcp --permanent
success
# firewall-cmd --reload
success
# iptables-save | grep 3306
-A IN_public_allow -p tcp -m tcp --dport 3306 -m conntrack --ctstate NEW -j ACCEPT

Check the port status with the following command.

# netstat -na |grep 3306

Firewall open reference: https://www.thegeekdiary.com/centos-rhel-7-how-to-open-a-port-in-the-firewall-with-firewall-cmd/

To connect to MySQL remotely, SSH to the other server and run

# mysql -u username -p -h 10.10.10.10

Make sure to substitute username with your pre-assigned username, and 10.10.10.10 with the IP address of the remote MySQL server.
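
Airflow will also need a database and an account of its own on this MySQL instance for its metadata (referenced later by sql_alchemy_conn in airflow.cfg). A minimal sketch, where the database name airflow_db, the user airflow_user, and the password airflow_pass are just placeholders:

-- create the metadata database and a dedicated account (names are placeholders)
CREATE DATABASE airflow_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'airflow_user'@'%' IDENTIFIED BY 'airflow_pass';
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user'@'%';
FLUSH PRIVILEGES;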

There is a MySQL server variable that needs to be set in order to enable explicit timestamp defaults. Edit the /etc/my.cnf file and add the following statement.

explicit_defaults_for_timestamp=1

Then restart the system as well as MySQL, log in to MySQL, and use the command below to check whether the timestamp variable is set to ON.

SHOW GLOBAL VARIABLES LIKE '%timestamp%';
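
For reference, explicit_defaults_for_timestamp is a [mysqld] server setting, so the relevant part of /etc/my.cnf should look roughly like this:

[mysqld]
explicit_defaults_for_timestamp = 1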

Install Redis as the message broker

Download Redis tar object

# wget http://download.redis.io/redis-stable.tar.gz

Execute the following command to install the required utilities for building the source code

# yum install make tcl-devel gcc

Extract the Redis source code using the following command

# tar -xvf redis-stable.tar.gz

Then navigate to the redis-stable folder and execute the following commands

# make
# sudo make install

Before spinning up the Redis server, make sure to configure some necessary settings in the configuration file. If the servers are not exposed to the outside internet, and there are multiple Airflow worker nodes that need to connect to this broker, then either specify a bind address or set protected-mode to no.
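
As an illustration only (adapt it to your own network and security requirements), the relevant lines in redis.conf could look like this, where 10.10.10.10 stands for the Redis server's own IP address:

# option 1: bind to the interface that the worker nodes will connect to
bind 10.10.10.10
# option 2: disable protected mode (only on a trusted internal network)
protected-mode no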

Then start the Redis server, with customized configuration

# redis-server /path/to/redis.conf
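
Once Redis is up, a quick sanity check from the same machine (and from a worker node, if the firewall allows it) is to ping the server:

# redis-cli ping
PONG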

Install Apache Airflow 2.1.0

If you plan to configure Airflow with LDAP authentication, then you’ll need to install dependent packages first:

# yum install python-devel openldap-devel

Execute the following steps to install Airflow with the Redis and Celery extras.

export AIRFLOW_HOME=/root/airflow
AIRFLOW_VERSION=2.1.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[redis,celery]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
pip install pymysql
pip install psycopg2
pip install python-ldap   # optional, for LDAP authentication
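
To verify the installation, print the installed version, which should report 2.1.0:

# airflow version
2.1.0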

Setting Up the Airflow Cluster Architecture

There are a number of external articles on setting up an Airflow cluster with either RabbitMQ or Redis as the broker, which are worth consulting for alternative architectures.


Troubleshooting

When starting the Airflow webserver, the system may prompt an error like No such file or directory: ‘gunicorn’.

Add the following line to the .bashrc file, which helps the system find the gunicorn installed together with Airflow (in this setup, inside the /root/airflow virtual environment).

export PATH=$PATH:/root/airflow/bin/

Setting LDAP Authentication

First, edit the webserver_config.py file in your AIRFLOW_HOME directory. For details on the webserver config contents, refer to this article: LINK
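
As a rough illustration only (the exact keys depend on your LDAP setup; ldap.example.com and the DNs below are placeholders, not values from this cluster), the LDAP part of webserver_config.py typically looks something like this:

from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ldap.example.com:389"   # placeholder LDAP host
AUTH_LDAP_SEARCH = "ou=people,dc=example,dc=com"   # placeholder search base
AUTH_LDAP_UID_FIELD = "uid"                        # attribute used as the username
# automatically register users on their first successful login
AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "Viewer"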

airflow.cfg

Before running Airflow as a Celery Executor cluster, there are some settings that need to be configured in the airflow.cfg file located in your AIRFLOW_HOME directory.

The following settings are necessary if you want to run the Celery Executor. The base log folder setting is a critical one: if you are running workers on different machines, the best practice is to mount an NFS share (or any type of shared disk) on all worker nodes, so that the webserver can access the logs produced by the different worker nodes.

[core]
executor = CeleryExecutor
dags_folder = /path/to/your/dags_folder
sql_alchemy_conn = mysql+pymysql://user:password@hostname:3306/db
[logging]
base_log_folder = /to/your/share/folder/that/all/nodes/can/access
[webserver]
base_url = http://hostname:8080
[smtp]
smtp_host = Your smtp server hostname
smtp_starttls = False
smtp_ssl = False
smtp_port = Your smtp server port, usually 25
smtp_mail_from = Sender name shown on notification emails
smtp_timeout = 30
smtp_retry_limit = 5
[celery]
broker_url = redis://redis-hostname:6379/0
result_backend = db+mysql+pymysql://username:password@hostname:3306/db

Start Up Airflow
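
If the metadata database schema has not been created yet, initialize it before creating users or starting any component; in Airflow 2.x this is done with the following command, which reads sql_alchemy_conn from airflow.cfg:

airflow db init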

Create webserver user

airflow users create -r Admin -u username -e mail@mail.com -f firstname -l lastname -p password

Start up scheduler

airflow scheduler

Start up webserver

airflow webserver

Start up flower

airflow celery flower

Start up worker

airflow celery worker -q queue_1 -H hostname

Or, simply use nohup + & to start all modules in the background at once.

# nohup airflow scheduler & nohup airflow webserver & nohup airflow celery flower & nohup airflow celery worker -q q_1 -H host_name &
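
If firewalld is running on this node, remember to open the UI ports as well, in the same way as for MySQL above; assuming the default ports of 8080 for the webserver and 5555 for Flower:

# firewall-cmd --zone=public --add-port=8080/tcp --permanent
# firewall-cmd --zone=public --add-port=5555/tcp --permanent
# firewall-cmd --reload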

Apache Airflow is constantly being updated and revised, and the package dependencies may vary with the Airflow version; therefore, it is always recommended to check the official Airflow website before you construct the package list for installation.
