Upgrading Traefik Hub API Gateway In Nomad

High‑Availability Upgrade Strategy

Nomad provides native rolling‑upgrade semantics via the update stanza. To achieve zero‑downtime upgrades you must:

  • Run ≥ 2 Traefik Hub allocations (count >= 2).
  • Add an update stanza with canary, max_parallel, and stagger.
  • Expose a health check (/ping) so Nomad only shifts traffic once the new allocation is healthy (a manual probe is shown after this list).
  • Drain existing connections gracefully with Traefik’s lifecycle gracetimeout and a kill_timeout that covers it.
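
As a quick sanity check, the readiness endpoint can be probed directly; substitute the address of a node that runs an allocation:

# a healthy Traefik Hub instance answers 200 OK on its ping endpoint
curl -fsS http://<node-ip>:8080/ping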

Below is a minimal, production‑ready job spec that fulfils those requirements:

Nomad Job
job "traefik-hub" {
datacenters = ["dc1"]
type = "service"

#––– Rolling‑upgrade policy –––––––––––––––––––––––––––––––––––––––
update {
stagger = "30s" # wait 30 s between replacements
max_parallel = 1 # replace one allocation at a time
canary = 1 # spin up a single canary allocation first
}

group "traefik" {
count = 3 # run three Hub instances for HA

# spread them across different nodes
spread {
attribute = "${node.unique.name}"
weight = 100
}

network {
mode = "bridge"
port "web" { static = 8080 }
}

    service {
      name     = "traefik"
      provider = "nomad"
      port     = "web"

      check {
        type     = "http"
        path     = "/ping" # Hub readiness endpoint
        interval = "10s"
        timeout  = "2s"
      }

      # standard tags for the Nomad provider
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.api.entrypoints=web",
        "traefik.http.routers.api.rule=PathPrefix(`/api`) || PathPrefix(`/dashboard`)",
        "traefik.http.routers.api.service=api@internal",
        "traefik.http.services.dummy-svc.loadbalancer.server.port=9999",
      ]
    }

task "traefik" {
driver = "docker"

config {
image = "ghcr.io/traefik/traefik-hub:v3.16.0" # You can update the tag here if needed

args = [
"traefik-hub",
"--entrypoints.web.address=:8080/tcp",
"--entrypoints.web.transport.lifecycle.gracetimeout=20s", # connection draining
"--api.dashboard=true",
"--providers.nomad.endpoint.address=${NOMAD_ADDR}",
"--providers.nomad.exposedByDefault=false",
"--hub.token=${HUB_TOKEN}",
"--log.level=INFO",
]

ports = ["web"]
cap_add = ["NET_BIND_SERVICE"]
cap_drop = ["ALL"]
}

      resources {
        cpu    = 500
        memory = 256
      }

      # Nomad sends kill_signal and waits kill_timeout before force‑killing the
      # task; keep kill_timeout ≥ gracetimeout so in‑flight requests can finish
      kill_signal  = "SIGTERM"
      kill_timeout = "25s"
    }
  }
}
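
Before the first deployment, the spec can be checked and dry‑run against the cluster:

# syntax‑check the job file
nomad job validate traefik-hub.nomad

# dry run: shows what the scheduler would change without applying anything
nomad job plan traefik-hub.nomad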

Zero‑Downtime Upgrade Procedure

# 1 Update the image tag in the job file, e.g. v3.17.0
sed -i 's/v3\.16\.0/v3.17.0/' traefik-hub.nomad

# 2 Run a rolling upgrade – Nomad will start a canary and continue only if healthy
nomad job run -detach traefik-hub.nomad

# 3 Watch progress (Ctrl‑C to exit)
watch -n 5 nomad job status traefik-hub
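
If you prefer to gate promotion manually, drop auto_promote from the update stanza and promote the canary yourself once it reports healthy:

# manually promote the canary allocations of the latest deployment
nomad deployment promote <deployment-id>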

If the canary allocation reports an unhealthy status, Nomad aborts the deployment and, because auto_revert is set, rolls back to the last good version automatically, ensuring continuous availability.
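
To inspect a failed deployment, list the job’s deployments and drill into the one that failed its health checks:

# list deployments and show details for a specific one
nomad deployment list
nomad deployment status <deployment-id>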

Rollback:

nomad job revert traefik-hub <previous_version>
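
The version number to revert to can be looked up from the job’s stored history:

# show all stored versions of the job, newest first
nomad job history traefik-hub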

Blue‑Green With reusePort

On Linux you can combine --entrypoints.<name>.reusePort=true with --entrypoints.<name>.transport.lifecycle.gracetimeout=<seconds> to implement true blue‑green upgrades:

  • Deploy the new job in parallel on the same host/port.
  • Because reusePort lets the kernel distribute incoming connections across every socket bound to the port, traffic is automatically shared between the old and new processes.
  • After verifying the new job, stop the old one; existing connections drain gracefully. A sketch of this flow follows below.
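
A minimal sketch of that flow, assuming two otherwise identical job files traefik-hub-blue.nomad and traefik-hub-green.nomad (hypothetical names) whose entrypoints both set reusePort=true:

# start the new ("green") job next to the running "blue" one;
# SO_REUSEPORT lets both bind the same host port
nomad job run traefik-hub-green.nomad

# once green is verified, stop blue; gracetimeout drains its connections
nomad job stop traefik-hub-blue
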
Kernel caveat

reusePort relies on the SO_REUSEPORT socket option. Some older Linux kernels may trigger sporadic TCP resets (see https://lwn.net/Articles/853637/). Upgrade the kernel or disable the flag if you observe anomalies.
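
A quick way to confirm which kernel a Nomad client node is running before enabling the flag:

# print the running kernel release
uname -r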