Configure InfluxDB 2.7 clustering for high availability with data replication and automated failover

Advanced 45 min May 24, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up a production-ready InfluxDB Enterprise cluster with automatic data replication, failover mechanisms, and comprehensive monitoring using Grafana dashboards for time-series workloads.

Prerequisites

  • 3 or more servers with 4GB+ RAM each
  • Root or sudo access on all nodes
  • Network connectivity between cluster nodes on ports 8086, 8088, 8091
  • Basic understanding of time-series databases
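Before going further, it can help to confirm that the nodes can reach each other on the ports listed above. The following Python sketch is one way to do that. The script name check_ports.py and the helper names are illustrative assumptions, and data-node-1 through data-node-3 are the hostnames used later in this guide. A port only shows as open once a service is listening on it, so a refused connection before installation does not by itself indicate a firewall problem.

```python
import socket
import sys

# Ports listed in the prerequisites.
CLUSTER_PORTS = (8086, 8088, 8091)


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_hosts(hosts, ports=CLUSTER_PORTS, timeout: float = 2.0):
    """Map every (host, port) pair to its reachability as True/False."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}


def main(hosts):
    """Print one line per host/port pair."""
    for (host, port), ok in sorted(check_hosts(hosts).items()):
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name):
    #   python3 check_ports.py data-node-1 data-node-2 data-node-3
    main(sys.argv[1:])
```

Running it from each node in turn gives a quick picture of which host and port pairs are blocked in either direction.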

What this solves

InfluxDB clustering provides high availability and horizontal scaling for time-series workloads. This tutorial sets up an InfluxDB Enterprise cluster with data replicated across multiple nodes, so the database stays available when a node fails. The configuration files and influx CLI commands in this guide follow the InfluxDB Enterprise 1.x conventions (influxdb.conf and InfluxQL); InfluxDB OSS 2.x has no built-in clustering. You'll configure load balancing, automated failover, and monitoring to build a cluster that can handle enterprise-scale metrics and IoT data streams.

Step-by-step installation

Install InfluxDB Enterprise on all nodes

Start by installing InfluxDB Enterprise on each cluster node. This creates the foundation for your high-availability setup.

wget -q https://repos.influxdata.com/influxdata-archive_compat.key
echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list
sudo apt update
sudo apt install -y influxdb2-enterprise chronograf kapacitor
cat <

Configure data nodes

Configure the first data node with clustering enabled by editing influxdb.conf (package installs usually place it at /etc/influxdb/influxdb.conf). This node stores time-series data and replicates it across the cluster.

[meta]
  dir = "/var/lib/influxdb/meta"
  hostname = "data-node-1"
  bind-address = ":8088"
  http-bind-address = ":8091"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false
  raft-promotion-enabled = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  series-id-set-cache-size = 100
  query-log-enabled = true
  cache-max-memory-size = "1g"
  cache-snapshot-memory-size = "25m"
  cache-snapshot-write-cold-duration = "10m"
  compact-full-write-cold-duration = "4h"
  max-concurrent-compactions = 0
  compact-throughput = "48m"
  compact-throughput-burst = "48m"
  max-index-log-file-size = "1m"
  max-series-per-database = 1000000
  max-values-per-tag = 100000

[cluster]
  shard-writer-timeout = "5s"
  shard-mapper-timeout = "5s"
  write-timeout = "10s"
  max-remote-write-connections = 3
  pool-max-idle-streams = 100
  pool-max-idle-time = "1m"
  max-concurrent-queries = 0
  query-timeout = "0s"
  log-queries-after = "0s"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m"

[shard-precreation]
  enabled = true
  check-interval = "10m"
  advance-period = "30m"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = true
  log-enabled = true
  write-tracing = false
  pprof-enabled = true
  pprof-auth-enabled = true
  debug-pprof-enabled = false
  ping-auth-enabled = false
  https-enabled = true
  https-certificate = "/etc/ssl/certs/influxdb.crt"
  https-private-key = "/etc/ssl/private/influxdb.key"
  max-row-limit = 0
  max-connection-limit = 0
  shared-secret = "your-cluster-shared-secret-change-this"
  realm = "InfluxDB"

[logging]
  format = "auto"
  level = "info"
  suppress-logo = false

Generate SSL certificates for secure communication

Create SSL certificates for encrypted communication between cluster nodes and clients.

sudo mkdir -p /etc/ssl/certs /etc/ssl/private
sudo openssl req -x509 -newkey rsa:4096 -keyout /etc/ssl/private/influxdb.key -out /etc/ssl/certs/influxdb.crt -days 365 -nodes -subj "/C=US/ST=State/L=City/O=Organization/CN=example.com"
sudo chmod 600 /etc/ssl/private/influxdb.key
sudo chmod 644 /etc/ssl/certs/influxdb.crt
sudo chown influxdb:influxdb /etc/ssl/private/influxdb.key /etc/ssl/certs/influxdb.crt

Configure additional data nodes

Configure the second and third data nodes the same way, changing only the hostname. The example below is for data-node-2; use data-node-3 on the third node.

[meta]
  dir = "/var/lib/influxdb/meta"
  hostname = "data-node-2"
  bind-address = ":8088"
  http-bind-address = ":8091"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false
  raft-promotion-enabled = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  series-id-set-cache-size = 100
  query-log-enabled = true
  cache-max-memory-size = "1g"
  cache-snapshot-memory-size = "25m"
  cache-snapshot-write-cold-duration = "10m"
  compact-full-write-cold-duration = "4h"
  max-concurrent-compactions = 0
  compact-throughput = "48m"
  compact-throughput-burst = "48m"
  max-index-log-file-size = "1m"
  max-series-per-database = 1000000
  max-values-per-tag = 100000

[cluster]
  shard-writer-timeout = "5s"
  shard-mapper-timeout = "5s"
  write-timeout = "10s"
  max-remote-write-connections = 3
  pool-max-idle-streams = 100
  pool-max-idle-time = "1m"
  max-concurrent-queries = 0
  query-timeout = "0s"
  log-queries-after = "0s"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = true
  log-enabled = true
  write-tracing = false
  pprof-enabled = true
  pprof-auth-enabled = true
  debug-pprof-enabled = false
  ping-auth-enabled = false
  https-enabled = true
  https-certificate = "/etc/ssl/certs/influxdb.crt"
  https-private-key = "/etc/ssl/private/influxdb.key"
  max-row-limit = 0
  max-connection-limit = 0
  shared-secret = "your-cluster-shared-secret-change-this"
  realm = "InfluxDB"
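Since the additional nodes differ from the first only in their hostname, the per-node files can be derived from a single base file rather than edited by hand. The sketch below shows one possible approach in Python; the function name config_for_node and the shortened base text are assumptions made for illustration, and in practice the base would be the full configuration of the first node.

```python
import re


def config_for_node(base_config: str, hostname: str) -> str:
    """Return base_config with the first hostname = "..." value replaced."""
    new_text, count = re.subn(
        r'^(\s*hostname\s*=\s*)"[^"]*"',
        # A function is used as the replacement so the hostname is inserted literally.
        lambda m: f'{m.group(1)}"{hostname}"',
        base_config,
        count=1,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("no hostname setting found in base config")
    return new_text


# Shortened stand-in for the first node's configuration.
base = '''[meta]
  dir = "/var/lib/influxdb/meta"
  hostname = "data-node-1"
  bind-address = ":8088"
'''

for node in ("data-node-2", "data-node-3"):
    print(f"# --- config for {node} ---")
    print(config_for_node(base, node))
```

Generating the files this way keeps every other setting, including the shared secret, identical across nodes.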

Initialize the cluster

Start InfluxDB on the first node and initialize the cluster with the first meta node.

sudo systemctl enable influxdb
sudo systemctl start influxdb
sudo systemctl status influxdb

Join additional nodes to the cluster

Add the remaining nodes to form a complete cluster with data replication.

sudo systemctl enable influxdb
sudo systemctl start influxdb
influx -ssl -unsafeSsl -host data-node-1 -port 8086 -execute "CREATE USER admin WITH PASSWORD 'secure-password' WITH ALL PRIVILEGES"
influx -ssl -unsafeSsl -host data-node-1 -port 8086 -username admin -password 'secure-password' -execute "SHOW SERVERS"

Configure HAProxy for load balancing

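Set up HAProxy to distribute client connections across cluster nodes with health checks. Install the package for your distribution, then place the configuration that follows in /etc/haproxy/haproxy.cfg.

# Ubuntu/Debian
sudo apt install -y haproxy
# AlmaLinux/Rocky Linux
sudo dnf install -y haproxy
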
global
    daemon
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    log stdout local0

defaults
    mode http
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

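frontend influxdb_frontend
    # TCP pass-through: the data nodes terminate TLS themselves (https-enabled = true)
    mode tcp
    bind *:8086
    default_backend influxdb_backend

backend influxdb_backend
    mode tcp
    balance roundrobin
    option httpchk GET /ping
    # Health checks must use TLS against the HTTPS-only nodes; the certificates are self-signed
    server data-node-1 data-node-1:8086 check check-ssl verify none
    server data-node-2 data-node-2:8086 check check-ssl verify none
    server data-node-3 data-node-3:8086 check check-ssl verify none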

frontend influxdb_stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE

Enable HAProxy and test load balancing

Start HAProxy and verify it properly distributes connections across your InfluxDB cluster nodes.

sudo systemctl enable haproxy
sudo systemctl start haproxy
sudo systemctl status haproxy
curl -k https://localhost:8086/ping
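A single curl call only exercises one backend. To see the load balancer respond consistently across several requests, the ping can be repeated from a short script. This is a minimal Python sketch under a few assumptions: the function names and the file name ping_lb.py are hypothetical, certificate verification is disabled to match curl -k with the self-signed certificate, and a healthy 1.x-style /ping endpoint is expected to answer with status 204.

```python
import ssl
import sys
import urllib.request


def ping(url: str, verify_tls: bool = False, timeout: float = 5.0) -> int:
    """Send a GET to a /ping URL and return the HTTP status code."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Same effect as curl -k: accept the self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status


def ping_many(url: str, attempts: int = 6) -> list:
    """Ping repeatedly; with roundrobin balancing, successive requests go to successive backends."""
    return [ping(url) for _ in range(attempts)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name):
    #   python3 ping_lb.py https://localhost:8086/ping
    print(ping_many(sys.argv[1]))
```

An exception from urlopen on any attempt suggests that at least one backend is failing or that the load balancer itself is not reachable.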

Configure data replication and retention policies

Set up automatic data replication across cluster nodes with appropriate retention policies for different data types.

influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "CREATE DATABASE metrics"
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "CREATE RETENTION POLICY \"one_hour\" ON \"metrics\" DURATION 1h REPLICATION 3 DEFAULT"
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "CREATE RETENTION POLICY \"one_day\" ON \"metrics\" DURATION 24h REPLICATION 3"
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "CREATE RETENTION POLICY \"one_week\" ON \"metrics\" DURATION 168h REPLICATION 2"
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "SHOW RETENTION POLICIES ON metrics"
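The replication factor determines how many copies of each shard exist, and it cannot usefully exceed the number of data nodes. The arithmetic behind the three policies above can be sketched as follows. The function name is hypothetical and the node count of 3 is the cluster size assumed throughout this guide; the sketch only counts surviving copies and says nothing about write consistency.

```python
def tolerated_failures(replication_factor: int, node_count: int) -> int:
    """Number of shard-owning nodes that can be lost while one copy survives."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if replication_factor > node_count:
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"the {node_count} available data nodes"
        )
    # With N copies of a shard, N - 1 of them can disappear.
    return replication_factor - 1


# Retention policies created above, with their REPLICATION values.
POLICIES = {"one_hour": 3, "one_day": 3, "one_week": 2}
NODE_COUNT = 3  # assumed cluster size

for name, rf in POLICIES.items():
    lost = tolerated_failures(rf, NODE_COUNT)
    print(f"{name}: {rf} copies per shard, survives the loss of {lost} owning node(s)")
```

Checking the factor against the node count before creating a policy avoids the "data not replicating" symptom listed under common issues.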

Install and configure Telegraf for metrics collection

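Set up Telegraf to collect system and InfluxDB cluster metrics for monitoring. Install the package for your distribution, then place the configuration that follows in /etc/telegraf/telegraf.conf.

# Ubuntu/Debian
sudo apt install -y telegraf
# AlmaLinux/Rocky Linux
sudo dnf install -y telegraf
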
[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["https://localhost:8086"]
  database = "telegraf"
  username = "admin"
  password = "secure-password"
  skip_database_creation = false
  insecure_skip_verify = true

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.influxdb]]
  urls = [
    "https://data-node-1:8086/debug/vars",
    "https://data-node-2:8086/debug/vars",
    "https://data-node-3:8086/debug/vars"
  ]
  username = "admin"
  password = "secure-password"
  insecure_skip_verify = true

Setup automated failover with Kapacitor

Configure Kapacitor for automated alerting and failover actions when cluster nodes become unavailable. The settings below go in /etc/kapacitor/kapacitor.conf.

hostname = "kapacitor-server"
data_dir = "/var/lib/kapacitor"

[http]
  bind-address = ":9092"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  shutdown-timeout = "10s"

[logging]
  file = "STDOUT"
  level = "INFO"

[replay]
  dir = "/var/lib/kapacitor/replay"

[storage]
  boltdb = "/var/lib/kapacitor/kapacitor.db"

[task]
  dir = "/var/lib/kapacitor/tasks"
  snapshot-interval = "1m0s"

[[influxdb]]
  enabled = true
  name = "influxdb-cluster"
  default = true
  urls = ["https://localhost:8086"]
  username = "admin"
  password = "secure-password"
  ssl-ca = ""
  ssl-cert = ""
  ssl-key = ""
  insecure-skip-verify = true
  timeout = "0s"
  disable-subscriptions = false
  subscription-protocol = "http"
  kapacitor-hostname = ""
  http-port = 0
  udp-bind = ""
  udp-buffer = 1000
  udp-read-buffer = 0
  startup-timeout = "5m0s"
  subscriptions-sync-interval = "1m0s"

[smtp]
  enabled = true
  host = "localhost"
  port = 587
  username = ""
  password = ""
  no-verify = false
  global = false
  state-changes-only = false
  from = "kapacitor@example.com"
  idle-timeout = "30s"

Create failover alerting script

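Save the following TICKscript as /var/lib/kapacitor/node_health.tick. It monitors node health and triggers alerts when a node stops serving requests. The script passed to .exec() must already exist on the Kapacitor host and be executable.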

stream
    |from()
        .measurement('influxdb_httpd')
        .groupBy('host')
    |window()
        .period(1m)
        .every(30s)
    |mean('requests_per_sec')
    |alert()
        .id('influxdb-node-health')
        .message('InfluxDB node {{ index .Tags "host" }} may be down - requests per second: {{ index .Fields "mean" }}')
        .warn(lambda: "mean" < 1.0)
        .crit(lambda: "mean" < 0.1)
        .post('http://localhost:9093/api/v1/alerts')
        .email()
            .to('admin@example.com')
        .exec('/usr/local/bin/influxdb-failover.sh', '{{ index .Tags "host" }}')

Start cluster services

Enable and start all services required for the InfluxDB cluster with monitoring.

sudo systemctl enable telegraf kapacitor
sudo systemctl start telegraf kapacitor
sudo systemctl status telegraf kapacitor
kapacitor define node_health -type stream -tick /var/lib/kapacitor/node_health.tick -dbrp telegraf.autogen
kapacitor enable node_health

Install and configure Grafana for monitoring

Set up Grafana to visualize cluster health and performance metrics with automated dashboards.

# Ubuntu/Debian (apt-key is deprecated, so store the key in a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux/Rocky Linux
sudo tee /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
sudo dnf install -y grafana

# All distributions
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Verify your setup

Test your InfluxDB cluster configuration with these verification commands.

# Check cluster status
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -execute "SHOW SERVERS"

# Test data replication: write via localhost, then read from data-node-2
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -database metrics -execute "INSERT cpu,host=server01 value=0.85"
influx -ssl -unsafeSsl -host data-node-2 -port 8086 -username admin -password 'secure-password' -database metrics -execute "SELECT * FROM cpu LIMIT 5"

# Check HAProxy status
curl http://localhost:8404/stats

# Verify Telegraf is collecting metrics
influx -ssl -unsafeSsl -host localhost -port 8086 -username admin -password 'secure-password' -database telegraf -execute "SHOW MEASUREMENTS"

# Check Kapacitor tasks
kapacitor list tasks

# Test Grafana connectivity
curl http://localhost:3000
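The replication test can also be driven through the HTTP API, which is convenient when the check needs to run from a monitoring host without the influx CLI. The following Python sketch builds the same write and query as the commands above using the 1.x-style /write and /query endpoints with basic authentication. The function names are hypothetical, and certificate verification is skipped on the assumption that the self-signed certificate from this guide is in use.

```python
import base64
import ssl
import urllib.parse
import urllib.request


def _auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def write_request(base_url, db, line, user, password):
    """Build a POST to /write carrying one line-protocol point."""
    url = f"{base_url}/write?" + urllib.parse.urlencode({"db": db})
    req = urllib.request.Request(url, data=line.encode("utf-8"), method="POST")
    req.add_header("Authorization", _auth_header(user, password))
    return req


def query_request(base_url, db, influxql, user, password):
    """Build a GET to /query for one InfluxQL statement."""
    params = urllib.parse.urlencode({"db": db, "q": influxql})
    req = urllib.request.Request(f"{base_url}/query?{params}", method="GET")
    req.add_header("Authorization", _auth_header(user, password))
    return req


def send(req, timeout: float = 10.0):
    """Send a request without certificate checks (self-signed certs, like curl -k)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
        return resp.status, resp.read().decode("utf-8")


def replication_check(write_url, read_url, user, password):
    """Write on one endpoint, read from another, and return the raw query response."""
    send(write_request(write_url, "metrics", "cpu,host=server01 value=0.85", user, password))
    _, body = send(query_request(read_url, "metrics", "SELECT * FROM cpu LIMIT 5", user, password))
    return body
```

A call such as replication_check("https://localhost:8086", "https://data-node-2:8086", "admin", "secure-password") should return a JSON body containing the point written a moment earlier if replication is working.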

Configure monitoring dashboards

This tutorial integrates with our existing monitoring setup. For comprehensive cluster monitoring, you can enhance your Grafana dashboards by following our advanced Grafana configuration guide and combine it with Telegraf custom plugins for deeper InfluxDB metrics collection.

Common issues

Symptom | Cause | Fix
Nodes can't join cluster | Network connectivity or shared secret mismatch | Check firewall rules on ports 8086, 8088, 8091 and verify shared-secret matches
SSL certificate errors | Self-signed certificate or hostname mismatch | Use insecure_skip_verify = true for testing or generate proper certificates
Data not replicating | Replication factor higher than available nodes | Adjust replication factor to match or be less than node count
High memory usage | Cache settings too high for available RAM | Reduce cache-max-memory-size in influxdb.conf
Query timeout errors | Heavy queries overwhelming cluster | Increase query-timeout or optimize queries with better indexing
HAProxy health checks failing | Authentication required for /ping endpoint | Set ping-auth-enabled = false or configure HAProxy basic auth
Kapacitor alerts not firing | Wrong measurement name or field in TICKscript | Check measurement names with SHOW MEASUREMENTS and verify field names
