The technicalities 16 April 2024

Monitoring Airflow jobs with TIG 1: system metrics

Author
Amadeusz Kosik

Big Data Engineer

Like many server applications, Airflow can – and should – be monitored for metrics and logs. In this article, we will look into the former and the integration with the TIG stack (Telegraf, InfluxDB, Grafana). The goal is to pull example metrics into a time series database and visualize them in a web application. This article focuses on Airflow (AF) system metrics, such as its performance, load, or health. We will cover reporting data metrics into the same database in an upcoming article, “TITLE”. So, stay tuned.

Why?

Airflow itself finally got a built-in Cluster status dashboard. It is therefore valid to question the need for a separate dashboard that requires additional effort and maintenance. Why bother, then? There are several possible reasons to introduce a separate monitoring stack:

  • Aggregated view: looking into one app is easier than browsing through 10 of them. TIG (or any other monitoring stack) can be used for monitoring multiple instances and apps and is easily integrated with custom ones.
  • Security: there is no need to give access to AF itself (or any other application) just to see the metrics and pinpoint errors. This aligns with data mesh and other data democratization approaches.
  • Support friendly: your 1st line support has a starting point to check the status of the data processing and can call the right people in case of a problem.

The toolkit

Out of the box (but with the right pip packages) Airflow supports sending its internal metrics to a statsd server. We can leverage that and set up such a server via Telegraf to proxy the metrics further into a time series database: InfluxDB. As a convenient UI, Grafana can be used.
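To make the data flow more concrete, here is a minimal sketch of what a statsd client puts on the wire. It assumes the common statsd text format (`name:value|type`) sent over UDP, and a Telegraf statsd listener on `localhost:8125`; the host, the port and the helper names `format_statsd` and `send_statsd` are illustrative assumptions, not part of Airflow or Telegraf.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str) -> bytes:
    """Build one statsd datagram in the form <name>:<value>|<type>.

    metric_type is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported statsd type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the client does not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical values; "airflow" stands for the configured metrics prefix.
    send_statsd(format_statsd("airflow.executor.open_slots", 32, "g"))
    send_statsd(format_statsd("airflow.scheduler_heartbeat", 1, "c"))
```

Because the transport is UDP, a missing or misconfigured listener does not cause errors on the sending side, so it is worth verifying on the receiving end that the metrics actually arrive.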

TIG stack configuration

Firstly, let’s configure the TIG stack to accept the metrics:

  1. Install an InfluxDB instance. Nothing fancy here.
  2. Install Telegraf. In its configuration, enable the statsd input and configure the InfluxDB output. The latter requires at least authentication settings. Note that Telegraf supports many more inputs and can also monitor, e.g., system resources (CPU, memory, disk space, etc.), which is a good idea to configure in a production environment.
  3. Install Grafana and configure a data source that points to InfluxDB.
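After the three steps above, a quick reachability check can save some debugging time. The sketch below assumes the default ports (8086 for InfluxDB, 3000 for Grafana) on `localhost`; adjust them to your setup. The helper names are illustrative. Telegraf's statsd input is not covered, since it typically listens on UDP and cannot be probed with a TCP connection.

```python
import socket

# Assumed default ports; adjust them to your own deployment.
SERVICES = {
    "InfluxDB": ("localhost", 8086),
    "Grafana": ("localhost", 3000),
}


def is_tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: dict) -> dict:
    """Map each service name to whether its TCP port accepts connections."""
    return {
        name: is_tcp_port_open(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, reachable in check_services(SERVICES).items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```

An open port only shows that the service is listening; it does not prove that authentication or the Grafana data source is configured correctly.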

We have prepared a Docker Compose-based demo in the GitHub repository with a pre-configured environment. There you can find example configuration for Telegraf (telegraf.conf), InfluxDB (influxdb.env) and Grafana (grafana-provisioning/datasources). Please note that this is only for dev/demo purposes, and real production environments need to be set up more securely (including the use of HTTPS and secure password handling).

Airflow

On the Airflow side, there are two significant points to be addressed.

  1. Airflow requires the apache-airflow[statsd] package extra to have the statsd client available; you can install it via pip.
  2. In the Airflow configuration file (usually airflow.cfg), the [metrics] section must be configured to enable statsd, point it to the Telegraf instance and prefix the metrics (very useful in case of multiple Airflow instances).
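As an illustration of the second point, the sketch below renders a [metrics] section with Python's configparser. The option names (`statsd_on`, `statsd_host`, `statsd_port`, `statsd_prefix`) are the ones Airflow uses for statsd; the host `telegraf`, the port 8125 and the prefix `airflow_prod` are assumptions for the example, and `render_metrics_section` is a hypothetical helper.

```python
import configparser
import io


def render_metrics_section(host: str, port: int, prefix: str) -> str:
    """Render the [metrics] section of airflow.cfg as text."""
    config = configparser.ConfigParser()
    config["metrics"] = {
        "statsd_on": "True",       # enable the statsd client
        "statsd_host": host,       # where Telegraf's statsd input listens
        "statsd_port": str(port),
        "statsd_prefix": prefix,   # distinguishes multiple Airflow instances
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Assumed values; replace them with your Telegraf host and chosen prefix.
    print(render_metrics_section("telegraf", 8125, "airflow_prod"))
```

The same options can also be set through environment variables following Airflow's `AIRFLOW__SECTION__KEY` convention, for example `AIRFLOW__METRICS__STATSD_ON`, which is often more convenient in containerized deployments.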

Airflow can send a number of metrics; the complete list is available in the Airflow documentation. The same documentation also describes how to limit which ones are actually reported to Telegraf. Be aware that even though most of the metrics are said to be reported in seconds, you need to validate the units yourself (as in this example).
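To illustrate the filtering, here is a rough sketch of how a prefix-based allow list behaves. It assumes the allow list is a comma-separated list of metric name prefixes (the option is called `metrics_allow_list` or `statsd_allow_list`, depending on the Airflow version) and that names are compared without the configured statsd prefix. The function `is_reported` is a hypothetical stand-in, not Airflow's own code.

```python
def is_reported(metric_name: str, allow_list: str) -> bool:
    """Return True if metric_name starts with any prefix from allow_list.

    allow_list is a comma-separated string such as "scheduler,executor".
    Assumption: an empty allow list means that nothing is filtered out.
    """
    prefixes = tuple(
        item.strip().lower() for item in allow_list.split(",") if item.strip()
    )
    if not prefixes:
        return True
    return metric_name.strip().lower().startswith(prefixes)


if __name__ == "__main__":
    allow = "scheduler,executor"
    for name in ("executor.running_tasks", "pool.open_slots.default_pool"):
        print(name, "->", "sent" if is_reported(name, allow) else "dropped")
```

Starting with a short allow list and widening it as dashboards grow keeps the number of series in InfluxDB under control.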

Dashboards

The last step is to configure Grafana and set up some dashboards. The UI offers a WYSIWYG editor that you can use to tailor the dashboards to your needs. The example available on GitHub might serve as a starting point, as it shows:

  • state of the Airflow executors (queues and tasks being run at the moment) to see whether any processing is going on,
  • state of the pools (default and the custom ones) to check potential bottlenecks if you use multiple pools,
  • task times to identify unexpected stragglers (and compare instances’ run times and find any performance challenges early on).

An example dashboard, available on our GitHub. It contains metrics useful to monitor the load on the Airflow instance and identify potential overload or bottlenecks in the data pipeline processing.

Summary

In conclusion, for Airflow monitoring, you can use specialized metrics tools like the TIG stack and aggregate the metrics from multiple AF instances in one place. The stack can accommodate AF system metrics as well as data from other applications, including your custom ones. Sending data quality metrics is what we will look into in the second part. There is a demo environment on our GitHub that you can use to check whether this approach works for you.