Setting Up Apache Airflow Celery Executor Cluster
Apache Airflow is one of the most common tools for routine task execution such as data ETL pipelines and workflow orchestration. When installing it in a production environment, scalability and high availability will probably be the top two concerns, and the Airflow Celery Executor can take care of both.
This post consists of a step-by-step process, along with external references, for establishing an Airflow 2.1.0 cluster with Python 3.7.3 on Linux CentOS 7 virtual machines. The cluster will have the structure shown in the diagram below.
Install Python 3.7
Step 1 — Requirements
Building Python from source requires the GCC compiler on your system. Log in to your server and use the following command to install the prerequisite packages for Python.
# yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make libffi-devel wget
Step 2 — Download Python 3.7
Download Python from the official Python website using the following commands. To download a different version, change the version number in the URL.
# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tar.xz
Now extract the downloaded package.
# tar Jxvf Python-3.7.3.tar.xz
# cd Python-3.7.3
Step 3 — Install Python 3.7
Use the commands below to compile the Python source code on your system. The installation path can be changed with the --prefix option.
# ./configure --prefix=/usr/local/python3
# make && make install
Now remove the downloaded source archive to keep the system tidy.
# rm /usr/src/Python-3.7.3.tar.xz
Step 4 — Check Python Version
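Check the version of the newly installed Python. Because /usr/local/python3/bin is not on the PATH yet, call the binary by its full path.
# /usr/local/python3/bin/python3 -V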
Step 5 — Set python3 as default python
Add the following statements to your .bashrc file to make python3 your default Python.
alias python='/usr/local/python3/bin/python3'
alias pip='/usr/local/python3/bin/pip3'
After adding the statements above to the .bashrc file, use the source command to put the new settings into effect.
# source ~/.bashrc
Python3 install reference: https://blog.jiebu-lang.com/centos-7-install-python-3-7/
Install virtualenv
Use pip to install Python virtualenv
# pip install --upgrade pip
# pip install virtualenv
To make the installed virtualenv your default command, add the following statement to the .bashrc file, then source it
alias virtualenv='/usr/local/python3/bin/virtualenv'
Then create a virtual env for Airflow
# virtualenv ~/airflow
Activate the virtual env
# source ~/airflow/bin/activate
Shell aliases take precedence over the virtual env's PATH entries, so remove the python and pip aliases in this shell before installing packages; otherwise pip keeps installing into /usr/local/python3
# unalias python pip
Install MySQL
Reference:
https://tecadmin.net/install-mysql-8-on-centos/
After installing MySQL, if this MySQL instance is expected to accept remote connections from other servers, make sure to open the firewall for port 3306.
# firewall-cmd --zone=public --add-port=3306/tcp --permanent
success
# firewall-cmd --reload
success
# iptables-save | grep 3306
-A IN_public_allow -p tcp -m tcp --dport 3306 -m conntrack --ctstate NEW -j ACCEPT
Check that MySQL is listening on the port with the following command.
# netstat -na | grep 3306
Firewall open reference: https://www.thegeekdiary.com/centos-rhel-7-how-to-open-a-port-in-the-firewall-with-firewall-cmd/
To connect to MySQL remotely, SSH to the other server and run
# mysql -u username -p -h 10.10.10.10
Make sure to replace username with your pre-assigned username, and 10.10.10.10 with the IP address of the remote MySQL server.
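If the mysql client cannot connect, it helps to first confirm that the port is reachable from the other server at all. The following is a minimal sketch of such a check; the function name check_port is made up for this example, and 10.10.10.10 is the same placeholder address as above. It relies on bash's built-in /dev/tcp redirection and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# check_port HOST PORT: succeed if a TCP connection can be opened within 3 seconds.
# Uses bash's /dev/tcp redirection, so no extra client tools are required.
check_port() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# 10.10.10.10 is a placeholder; replace it with the address of your MySQL server.
if check_port 10.10.10.10 3306; then
    echo "port 3306 is reachable"
else
    echo "port 3306 is NOT reachable"
fi
```

A failure here usually points to the firewall rule or to MySQL listening only on localhost, rather than to a wrong username or password.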
Airflow needs the MySQL server system variable explicit_defaults_for_timestamp to be enabled. Edit the /etc/my.cnf file and add the following statement under the [mysqld] section
explicit_defaults_for_timestamp=1
Then restart MySQL, log in to MySQL, and use the command below to check that the variable is set to ON
SHOW GLOBAL VARIABLES LIKE '%timestamp%';
Install Redis as message broker
Download the Redis source tarball
# wget http://download.redis.io/redis-stable.tar.gz
Execute the following command to install the required utilities for building the source code
# yum install make tcl-devel gcc
Extract the Redis source code using the following command
# tar -xvf redis-stable.tar.gz
Then navigate to the extracted redis-stable folder, and execute the following commands
# cd redis-stable
# make
# make install
Before spinning up the Redis server, make sure to configure the necessary settings in the configuration file. If the servers are not exposed to the public internet, and multiple Airflow worker nodes need to connect to this broker, then either specify a bind address that the workers can reach or set protected-mode to no.
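As a rough illustration of the two options, the sketch below writes a minimal configuration to a scratch directory and prints the active directives. The address 10.10.10.20 is a placeholder for the Redis server's private IP, not a value from this setup. Depending on the Redis version, protected mode may still reject remote clients when no password is set, so check the comments in the redis.conf shipped with your source tree.

```shell
# Write a minimal Redis config to a scratch directory (example values only).
conf_dir="$(mktemp -d)"
cat > "${conf_dir}/redis.conf" <<'EOF'
# Option A: listen on loopback plus this server's private address (placeholder IP).
bind 127.0.0.1 10.10.10.20
port 6379
# Option B (only on a trusted, isolated network): disable protected mode.
# protected-mode no
EOF

# Show the directives that are actually active (skip comments and blank lines).
grep -Ev '^(#|$)' "${conf_dir}/redis.conf"
```

In practice you would apply the same directives to your real redis.conf rather than to a scratch file.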
Then start the Redis server with the customized configuration
# redis-server /path/to/redis.conf
Install Apache Airflow 2.1.0
If you plan to configure Airflow with LDAP authentication, you'll need to install the required packages first:
# yum install python-devel openldap-devel
Execute the following steps to install Airflow with the Redis and Celery extras.
export AIRFLOW_HOME=/root/airflow
AIRFLOW_VERSION=2.1.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[redis,celery]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
pip install pymysql
pip install psycopg2
pip install python-ldap  # optional, for LDAP auth
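To see what the PYTHON_VERSION pipeline and the constraint URL resolve to, here is a small sketch that feeds in the version string by hand instead of calling the interpreter. The variable version_output is made up for this example and simply mimics what python --version prints for the Python built earlier.

```shell
# Simulate the output of `python --version` for the interpreter built above.
version_output="Python 3.7.3"
AIRFLOW_VERSION=2.1.0

# Field 2 of the space-separated output is "3.7.3";
# fields 1-2 of that, split on ".", give the major.minor version.
PYTHON_VERSION="$(echo "${version_output}" | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

echo "${PYTHON_VERSION}"    # prints 3.7
echo "${CONSTRAINT_URL}"    # prints .../constraints-2.1.0/constraints-3.7.txt
```

If the printed version does not match the Python you intend to run Airflow on, the wrong interpreter is first on your PATH, and the constraint file will pin packages for the wrong Python.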
Setting Airflow cluster architecture
Below are articles about setting up an Airflow cluster with RabbitMQ or Redis as the broker.
Reference:
or
Troubleshooting
When starting the Airflow webserver, the system may report an error like No such file or directory: 'gunicorn': 'gunicorn'
If so, add the following line to the .bashrc file, which helps the system find the gunicorn that was installed with Airflow
export PATH=$PATH:/root/airflow/bin/
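One side effect of a plain export in .bashrc is that the directory is appended again every time the file is sourced. A possible variant, sketched below, only appends it when it is missing and then reports which gunicorn the shell will pick up. The variable name AIRFLOW_BIN is an assumption for this example; the path is the one used in this post.

```shell
# Directory where the virtual env places its executables (path from this post's setup).
AIRFLOW_BIN=/root/airflow/bin

# Append the directory only if it is not already on PATH,
# so that re-sourcing .bashrc does not create duplicate entries.
case ":${PATH}:" in
    *":${AIRFLOW_BIN}:"*) ;;
    *) PATH="${PATH}:${AIRFLOW_BIN}" ;;
esac
export PATH

# Report which gunicorn (if any) the shell will now find.
command -v gunicorn || echo "gunicorn not found on PATH"
```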
Setting LDAP Authentication
First, edit the webserver_config.py file in your AIRFLOW_HOME directory. For the contents of the webserver config, refer to this article: LINK
airflow.cfg
Before running Airflow as a Celery Executor cluster, a few settings need to be configured in the airflow.cfg file located in your AIRFLOW_HOME directory.
The following settings are necessary if you want to run on the Celery Executor. The base log folder setting is a critical one. If you are running workers on different machines, the best practice is to mount an NFS share or another type of shared disk on all worker nodes, so that the web server can access the logs produced by the different worker nodes.
[core]
executor = CeleryExecutor
dags_folder = /path/to/your/dags_folder
sql_alchemy_conn = mysql+pymysql://user:password@hostname:3306/db
[logging]
base_log_folder = /to/your/share/folder/that/all/nodes/can/access
[webserver]
base_url = http://hostname:8080
[smtp]
smtp_host = Your smtp server hostname
smtp_starttls = False
smtp_ssl = False
smtp_port = Your smtp server port, usually 25
smtp_mail_from = Name of sender shows on notification email
smtp_timeout = 30
smtp_retry_limit = 5
[celery]
broker_url = redis://redis-hostname:6379/0
result_backend = db+mysql+pymysql://username:password@hostname:3306/db
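Airflow also reads any of these options from environment variables named AIRFLOW__{SECTION}__{KEY}, which can be handy when the same settings have to be repeated on every worker node. The sketch below shows how the names are formed; the helper airflow_env_name is made up for this example, and the values are the same placeholders as in the config above.

```shell
# airflow_env_name SECTION KEY: print the environment variable name
# that corresponds to an airflow.cfg option.
airflow_env_name() {
    printf 'AIRFLOW__%s__%s\n' "$1" "$2" | tr '[:lower:]' '[:upper:]'
}

airflow_env_name core executor       # prints AIRFLOW__CORE__EXECUTOR
airflow_env_name celery broker_url   # prints AIRFLOW__CELERY__BROKER_URL

# Equivalent of two of the airflow.cfg entries above (placeholder hostname).
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
export AIRFLOW__CELERY__BROKER_URL="redis://redis-hostname:6379/0"
```

An environment variable takes precedence over the value in airflow.cfg, so running airflow config get-value core executor is a quick way to confirm which value is actually in effect.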
Start Up Airflow
Initialize the metadata database
airflow db init
Create webserver user
airflow users create -r Admin -u username -e mail@mail.com -f firstname -l lastname -p password
Start up scheduler
airflow scheduler
Start up webserver
airflow webserver
Start up flower
airflow celery flower
Start up worker
airflow celery worker -q queue_1 -H hostname
Or simply use nohup and & to start all components in the background at once
# nohup airflow scheduler & nohup airflow webserver & nohup airflow celery flower & nohup airflow celery worker -q queue_1 -H hostname &
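With the one-liner above, all four processes write to the same nohup.out and their process IDs are lost. A possible refinement, sketched below, gives each component its own log and PID file. LOG_DIR, DRY_RUN, and start_component are names invented for this example and are not Airflow settings; by default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: start each Airflow component in the background with its own log and PID file.
LOG_DIR="${LOG_DIR:-/tmp/airflow-startup}"   # assumed location, change as needed
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands, 0 = start them

start_component() {
    name="$1"
    shift
    if [ "${DRY_RUN}" = "1" ]; then
        echo "nohup $* > ${LOG_DIR}/${name}.log 2>&1 &"
    else
        mkdir -p "${LOG_DIR}"
        nohup "$@" > "${LOG_DIR}/${name}.log" 2>&1 &
        # Remember the PID so the process can be stopped later.
        echo "$!" > "${LOG_DIR}/${name}.pid"
    fi
}

start_component scheduler airflow scheduler
start_component webserver airflow webserver
start_component flower    airflow celery flower
start_component worker    airflow celery worker -q queue_1 -H hostname
```

Run it with DRY_RUN=0 to actually start the processes; a component can then be stopped with kill "$(cat /tmp/airflow-startup/scheduler.pid)", adjusting the file name for the component in question.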
Apache Airflow is constantly being updated and revised, and its package dependencies may vary from version to version. It is therefore always recommended to check the official Airflow website before you put together the list of packages to install.