Implement Airflow DAG monitoring with DataDog integration for production workflows

Intermediate · 45 min · Apr 04, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive monitoring for Apache Airflow DAGs using DataDog integration. This tutorial covers DataDog agent installation, metrics collection configuration, custom dashboard creation, and alerting rules for production workflow observability.

Prerequisites

  • Apache Airflow 2.0+ installed
  • DataDog account and API key
  • Root or sudo access
  • Python 3.8+ environment

What this solves

Apache Airflow DAG monitoring becomes critical in production environments where workflow failures can impact business operations. DataDog integration provides comprehensive observability into DAG execution metrics, task duration, failure rates, and resource utilization. This monitoring solution helps identify bottlenecks, predict failures, and maintain SLA compliance for your data pipelines.

Step-by-step installation

Update system packages

Start by updating your package manager to ensure you get the latest versions of dependencies.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gnupg2 software-properties-common

# RHEL-family (AlmaLinux/Rocky)
sudo dnf update -y
sudo dnf install -y curl wget gnupg2

Install DataDog agent

Download and install the DataDog agent using the official installation script. Replace YOUR_API_KEY with your actual DataDog API key from your DataDog dashboard.

DD_API_KEY=YOUR_API_KEY DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

Configure DataDog agent for Airflow

Create the Airflow integration configuration file at /etc/datadog-agent/conf.d/airflow.d/conf.yaml to enable metrics collection from your Airflow instance. Replace the placeholder credentials below with your real webserver credentials.

init_config:

instances:
  - url: http://localhost:8080
    username: admin
    password: admin
    tags:
      - environment:production
      - service:airflow
    collect_default_metrics: true
    dag_bag_timeout: 300
    dag_run_timeout: 300
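Before restarting the agent, it can help to confirm that the webserver endpoint the check will scrape is actually reachable. A minimal sketch, assuming the default URL configured above; `parse_health` and `check_health` are illustrative helpers, not part of the DataDog integration:

```python
import json
from urllib.request import urlopen

def parse_health(payload: str) -> bool:
    """Return True when every component in Airflow's /health response reports healthy."""
    status = json.loads(payload)
    return all(c.get("status") == "healthy" for c in status.values())

def check_health(url: str = "http://localhost:8080/health") -> bool:
    # Live check against the webserver URL configured above.
    with urlopen(url, timeout=5) as resp:
        return parse_health(resp.read().decode())
```

Airflow's /health endpoint reports per-component status for the metadatabase and scheduler, so a single `all()` over the components covers both.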

Enable Airflow StatsD metrics

Configure Airflow to send StatsD metrics to DataDog by updating the airflow.cfg configuration file.

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
statsd_allow_list = 
stat_name_handler = 
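With statsd_on enabled, Airflow emits plain UDP datagrams in StatsD format; DataDog extends this with a `|#tag1,tag2` suffix for tags. This sketch shows the wire format and a minimal sender (function names are illustrative, not part of any library):

```python
import socket

def format_metric(name: str, value, mtype: str = "g", tags=None) -> str:
    """Build a DogStatsD datagram: <name>:<value>|<type>[|#tag1,tag2]."""
    datagram = f"{name}:{value}|{mtype}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP, the same transport Airflow's StatsD client uses.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode(), (host, port))

# Example: send_metric(format_metric("airflow.custom.test", 1, "c", ["env:production"]))
```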

Configure DataDog DogStatsD

Enable DogStatsD in the main DataDog agent configuration (/etc/datadog-agent/datadog.yaml) to receive metrics from Airflow. This allows real-time metric collection.

# DogStatsD configuration
use_dogstatsd: true
dogstatsd_port: 8125
dogstatsd_non_local_traffic: false
dogstatsd_stats_enable: true
dogstatsd_queue_size: 1024
dogstatsd_buffer_size: 8192

Tags for all metrics

tags:
  - env:production
  - service:airflow
  - datacenter:us-east-1

Install Python DataDog library

Install the DataDog Python library in your Airflow environment to enable custom metrics and enhanced monitoring capabilities.

sudo -u airflow pip install datadog
sudo -u airflow pip install 'apache-airflow[statsd]'

Create custom DAG monitoring script

Create a monitoring script that collects custom metrics about DAG performance and sends them to DataDog.

from datadog import initialize, statsd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.models import DagRun, TaskInstance
from datetime import datetime, timedelta
import logging

# Initialize DataDog
options = {
    'statsd_host': '127.0.0.1',
    'statsd_port': 8125,
}
initialize(**options)

def collect_dag_metrics():
    """Collect custom DAG metrics and send to DataDog"""
    try:
        # Get recent DAG runs
        recent_runs = DagRun.find(
            execution_date_gte=datetime.now() - timedelta(hours=1)
        )
        success_count = len([r for r in recent_runs if r.state == 'success'])
        failed_count = len([r for r in recent_runs if r.state == 'failed'])

        # Send metrics to DataDog
        statsd.gauge('airflow.dag_runs.success', success_count, tags=['env:production'])
        statsd.gauge('airflow.dag_runs.failed', failed_count, tags=['env:production'])
        logging.info(f"Sent metrics: {success_count} success, {failed_count} failed")
    except Exception as e:
        logging.error(f"Error collecting metrics: {e}")
        statsd.increment('airflow.metrics.collection.error', tags=['env:production'])
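The state-counting step of the monitoring script can be factored into a pure helper, which makes it easy to unit-test without a live metadata database. A sketch; `summarize_runs` is a hypothetical helper, not an Airflow API:

```python
from collections import Counter

def summarize_runs(states):
    """Count DAG-run states the same way the metrics script tallies success/failure."""
    counts = Counter(states)
    return {"success": counts.get("success", 0), "failed": counts.get("failed", 0)}
```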

Configure log collection

Add a logs section to the Airflow integration's conf.yaml so DataDog captures DAG processor, scheduler, and task logs.

logs:
  - type: file
    path: /opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
    service: airflow
    source: airflow
    sourcecategory: airflow
    tags:
      - env:production
      - component:dag_processor
  - type: file
    path: /opt/airflow/logs/scheduler/latest/*.log
    service: airflow
    source: airflow
    sourcecategory: airflow
    tags:
      - env:production
      - component:scheduler
  - type: file
    path: /opt/airflow/logs/*/*/*/*.log
    service: airflow
    source: airflow
    sourcecategory: airflow
    tags:
      - env:production
      - component:task

Enable log collection in DataDog agent

Update the main DataDog agent configuration (/etc/datadog-agent/datadog.yaml) to enable log collection and processing.

logs_enabled: true
log_level: INFO

Log processing configuration

logs_config:
  container_collect_all: false
  processing_rules:
    - type: exclude_at_match
      name: exclude_debug
      pattern: "DEBUG"
    - type: mask_sequences
      name: mask_passwords
      pattern: "password=\\S+"
      replace_placeholder: "password=***"
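The mask_sequences rule is a regex substitution applied before logs ship. You can sanity-check the pattern above with Python's re module (the agent's regex engine differs, but a simple pattern like this behaves the same):

```python
import re

# Same pattern and placeholder as the mask_passwords rule above.
PASSWORD_RULE = re.compile(r"password=\S+")

def mask(line: str) -> str:
    """Mirror the mask_sequences rule: replace each match with the placeholder."""
    return PASSWORD_RULE.sub("password=***", line)
```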

Restart services

Restart both the DataDog agent and Airflow services to apply the new configuration.

sudo systemctl restart datadog-agent
sudo systemctl restart airflow-webserver
sudo systemctl restart airflow-scheduler

Configure custom dashboards

Create DAG performance dashboard

Use the DataDog web interface to create a comprehensive dashboard. Here's the JSON configuration for a production-ready dashboard.

{
  "title": "Airflow DAG Monitoring",
  "description": "Production Airflow DAG monitoring and performance metrics",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:airflow.dag_run.duration{*} by {dag_id}",
            "display_type": "line"
          }
        ],
        "title": "DAG Execution Duration"
      }
    },
    {
      "definition": {
        "type": "query_value",
        "requests": [
          {
            "q": "sum:airflow.dag_runs.failed{*}",
            "aggregator": "last"
          }
        ],
        "title": "Failed DAG Runs (Last Hour)"
      }
    },
    {
      "definition": {
        "type": "toplist",
        "requests": [
          {
            "q": "top(avg:airflow.task_instance.duration{*} by {task_id}, 10, 'mean', 'desc')"
          }
        ],
        "title": "Slowest Tasks"
      }
    }
  ]
}
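Before importing the JSON above through the DataDog UI or API, a quick structural check catches missing keys early. A minimal sketch; `validate_dashboard` is an illustrative helper, not a DataDog API call:

```python
def validate_dashboard(doc: dict) -> list:
    """Return the widget titles, raising if required top-level keys are missing."""
    for key in ("title", "widgets"):
        if key not in doc:
            raise ValueError(f"missing required key: {key!r}")
    return [w["definition"].get("title", "<untitled>") for w in doc["widgets"]]
```

Loading the saved JSON with `json.load` and passing it through this check before upload avoids a failed import over a typo.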

Configure alerting rules

Create DAG failure alerts

Set up alerts for DAG failures, long-running tasks, and scheduler health issues.

# DAG Failure Alert
name: "Airflow DAG Failures"
type: "metric alert"
query: "sum(last_5m):sum:airflow.dag_runs.failed{env:production} > 0"
message: |
  Airflow DAG Failure Detected
  
  One or more DAGs have failed in the last 5 minutes.
  
  Check the Airflow web interface: http://localhost:8080
  
  @slack-airflow-alerts @pagerduty

Long Running Task Alert

name: "Airflow Long Running Tasks"
type: "metric alert"
query: "avg(last_15m):avg:airflow.task_instance.duration{env:production} > 3600"
message: |
  Long Running Task Detected

  Tasks are taking longer than 1 hour to complete.
  This may indicate performance issues or stuck processes.

  @slack-airflow-alerts

Scheduler Health Alert

name: "Airflow Scheduler Down"
type: "service check"
query: '"airflow.scheduler_heartbeat".over("env:production").last(2).count_by_status()'
message: |
  Airflow Scheduler is Down

  The Airflow scheduler has stopped responding.
  Immediate action required to restore workflow processing.

  @pagerduty @slack-airflow-alerts
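Conceptually, the metric alerts above fire when an aggregate over a rolling window crosses a threshold. The decision logic amounts to something like the following (an illustration of the semantics, not DataDog's implementation):

```python
def should_alert(samples, window: int = 5, threshold: float = 0) -> bool:
    """Fire when the sum over the last `window` samples exceeds `threshold`,
    mirroring a query like "sum(last_5m):...failed{...} > 0"."""
    return sum(samples[-window:]) > threshold
```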

Verify your setup

# Check DataDog agent status
sudo datadog-agent status

Verify Airflow integration

sudo datadog-agent check airflow

Check if metrics are being sent

sudo datadog-agent flare

Test StatsD connectivity

echo "custom.metric:1|c" | nc -u -w1 127.0.0.1 8125

Check Airflow logs for metric collection

tail -f /opt/airflow/logs/scheduler/latest/*.log | grep -i statsd

Note: It may take 5-10 minutes for metrics to appear in DataDog after configuration. Check the DataDog Metrics Explorer for airflow.* metrics to confirm data is flowing.

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| No metrics in DataDog | StatsD not configured | Verify statsd_on = True in airflow.cfg and restart services |
| Agent check fails | Wrong Airflow URL/credentials | Update conf.yaml with correct webserver URL and credentials |
| Permission denied on logs | DataDog agent can't read log files | sudo chown -R dd-agent:dd-agent /opt/airflow/logs |
| High memory usage | Too many metrics collected | Add metric filtering in dogstatsd configuration |
| Missing task logs | Log path pattern incorrect | Verify log path matches your Airflow log structure |
Never use chmod 777 on log directories. This gives every user full access to sensitive workflow logs. Instead, use proper ownership with chown and minimal permissions like 755 for directories.
