Troubleshooting

Quick reference

| Symptom | Likely cause | Jump to |
| --- | --- | --- |
| TLS cert not issuing after first docker compose up | DNS not propagated, or port 443 not open | TLS certificate not issuing |
| Subscribers getting 401 on a key that should work | Key scope mismatch, revoked key, or auth service down | Subscribers getting 401 |
| Subscribers getting 404 on a path that should exist | Component name in URL not provisioned via the components API | Subscribers getting 401 |
| All requests returning 503 | Auth service is down (fail-closed by design) | All requests returning 503 |
| Admin API returns connection refused / 000 | SSH tunnel not open | Admin API unreachable |
| Container shows unhealthy or exited | Crash loop, disk full, or DB permissions | Container unhealthy or crashing |
| Promotion workflow fails at GPG step | lts.asc is still a placeholder, or secrets not set | Promotion pipeline failures |

TLS certificate not issuing

Traefik uses the TLS-ALPN-01 challenge — it completes over port 443 with no dependency on port 80.

Check 1 — DNS propagation

The TLS-ALPN-01 challenge requires pkg.example.org to resolve to the VM's IP before Traefik can obtain a certificate. Verify from an external host:

dig +short pkg.example.org A
# must return the VM's public IP

If DNS has not propagated, wait and retry. Do not restart Traefik repeatedly: Let's Encrypt allows only 5 failed validations per account, per hostname, per hour. Exhausting this limit blocks certificate issuance for that hostname for up to an hour.
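The comparison against the VM's IP can be scripted, so that waiting for propagation never involves touching Traefik. This is a minimal sketch: `dns_points_to` is a hypothetical helper name, and `VM_IP` is a placeholder you would set to the VM's public address.

```shell
# dns_points_to EXPECTED_IP
# Reads `dig +short` output on stdin and succeeds only if one of the
# returned lines equals the expected IP exactly (-x whole line, -F literal).
dns_points_to() {
    grep -qxF -- "$1"
}

# Usage (VM_IP is a placeholder, not part of the original setup):
#   dig +short pkg.example.org A | dns_points_to "$VM_IP" \
#     && echo "DNS ready" \
#     || echo "DNS not ready: wait, do not restart Traefik"
```

Matching whole lines avoids a false positive when the expected address is a prefix of the returned one.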

Check 2 — Port 443 open

Confirm the VM's firewall allows inbound TCP 443:

# From an external host
curl -v https://pkg.example.org/gpg/lts.asc
# Connection refused → firewall blocking 443
# Certificate error → port open, cert not yet issued (normal on first boot)
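When the check runs unattended, curl's exit code carries the same information as the verbose output. The sketch below maps the commonly seen codes to the cases above; `explain_curl_exit` is a hypothetical helper, and the wording of each message is an interpretation rather than curl's own text.

```shell
# explain_curl_exit CODE
# Translates a curl exit code into the likely cause for this check.
explain_curl_exit() {
    case "$1" in
        0)     echo "request succeeded: port open and certificate valid" ;;
        6)     echo "hostname did not resolve: go back to the DNS check" ;;
        7)     echo "connection failed: port 443 closed or rejected" ;;
        28)    echo "timed out: a firewall may be dropping packets on 443" ;;
        35|60) echo "port open, TLS or certificate error: cert likely not yet issued" ;;
        *)     echo "unrecognised curl exit code $1: rerun with -v" ;;
    esac
}

# Usage:
#   curl -sS -o /dev/null --max-time 10 https://pkg.example.org/gpg/lts.asc
#   explain_curl_exit $?
```

A firewall that silently drops packets shows up as a timeout (28) rather than a refused connection (7), which is why `--max-time` is worth setting.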

Check 3 — Traefik logs

docker compose logs traefik | grep -iE 'acme|certificate|error'

Successful issuance looks like:

msg="Certificate obtained successfully" domain=pkg.example.org

Check 4 — acme.json in the certs volume

Traefik stores issued certificates in acme.json on the traefik-certs volume. If the volume was deleted (e.g. by docker compose down -v), Traefik requests a new certificate on the next start. Do not use -v unless you intend to re-request the certificate.

docker run --rm -v traefik-certs:/certs alpine sh -c 'cat /certs/acme.json'
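acme.json also contains the account and certificate private keys, so printing only the stored domain names is safer in a shared terminal or a pasted log. The sketch below assumes the layout Traefik normally writes (a top-level object keyed by resolver name, each with a `Certificates` array whose entries carry `domain.main`); check it against your own file. `acme_domains` is a hypothetical helper and requires jq on the host.

```shell
# acme_domains
# Reads acme.json on stdin and prints the main domain of every stored
# certificate, one per line. Prints nothing if no certificate is stored.
acme_domains() {
    jq -r '.[] | (.Certificates // []) | .[] | .domain.main'
}

# Usage:
#   docker run --rm -v traefik-certs:/certs alpine cat /certs/acme.json \
#     | acme_domains
```

An empty result with a non-empty file suggests the ACME account was registered but no certificate has been issued yet.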

Subscribers getting 401

A 401 response can have several independent causes. Work through the four steps below in order.

Step 1 — Is the auth service running?

docker compose ps auth
# State must be: running (healthy)

If the state is unhealthy or exited, see Container unhealthy or crashing. If auth is down, Traefik returns 503, not 401 — but some HTTP clients or package managers may display this differently.

Step 2 — Does the key exist and is it active?

Open an SSH tunnel to the admin API, then look up the key:

ssh -L 8088:127.0.0.1:8088 deploy@pkg.example.org -N &
curl -s http://127.0.0.1:8088/api/v1/keys | jq '.[] | select(.id == "<KEY>")'

If the key is absent, it was most likely created after the last backup and lost during a keystore restore. Re-provision it:

curl -s -X POST http://127.0.0.1:8088/api/v1/keys \
-H 'Content-Type: application/json' \
-d '{"component": "core", "label": "re-provisioned"}'

If the key is present but "active": false, it has been revoked. Issue a new key.

Step 3 — Is the key scoped to the right component?

Each key is scoped to a single component. A core key cannot access /rpm/minion/ — that is the expected behaviour, not a bug. Confirm the subscriber is using a key whose component matches the path they are requesting.

curl -s http://127.0.0.1:8088/api/v1/keys | jq '.[] | {id, component, label, active}'
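To compare a subscriber's failing URL with the key's component, the component name can be pulled out of the request path. This sketch assumes paths follow the `/<format>/<component>/...` shape suggested by `/rpm/minion/`; both function names are hypothetical.

```shell
# path_component PATH
# Prints the second path segment (the component), or fails if the path
# has fewer than two segments.
path_component() {
    rest=${1#/}                 # drop the leading slash
    case "$rest" in
        */*) ;;                 # at least two segments present
        *)   return 1 ;;
    esac
    rest=${rest#*/}             # drop the first segment (the format)
    printf '%s\n' "${rest%%/*}" # keep everything up to the next slash
}

# key_matches_path KEY_COMPONENT PATH
# Succeeds when the key's component equals the component in the path.
key_matches_path() {
    [ "$1" = "$(path_component "$2")" ]
}

# Usage:
#   key_matches_path core /rpm/minion/ || echo "scope mismatch: expect 401"
```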

Step 4 — Is the component marked public?

If a component has visibility: public (as set via POST /api/v1/components or updated via PATCH /api/v1/components/{name}), the auth service allows requests to its paths without inspecting credentials. Forward-auth reads visibility from the database on every request — changes take effect immediately without a restart. If a subscriber reports getting 401 on a public component path, confirm the component's current visibility:

curl -s http://127.0.0.1:8088/api/v1/components/core | jq .visibility

Restart semantics — when a restart is and is not required:

  • Key creation (POST /api/v1/keys) and forward-auth decisions both query the database on every request — a restart is not needed for new components, deleted components, or visibility changes to take effect.
  • Key list filter (GET /api/v1/keys?component=<name>) validates the component name against an in-memory map loaded at startup — it returns 400 INVALID_COMPONENT for a newly provisioned component until the service is restarted.
  • component_visibility in key responses is also derived from the startup-loaded map — it may show a stale value after a PATCH /api/v1/components/{name} visibility change until the service is restarted. This is cosmetic only; forward-auth always uses the live value.
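Until the service is restarted, one way around the stale filter is to fetch the unfiltered key list and filter on the client. A minimal sketch, assuming each key object exposes a `component` field as in the jq projection used in Step 3; `filter_keys_by_component` is a hypothetical helper.

```shell
# filter_keys_by_component NAME
# Reads the JSON array from GET /api/v1/keys on stdin and prints only
# the keys whose component equals NAME, as a JSON array.
filter_keys_by_component() {
    jq --arg c "$1" '[.[] | select(.component == $c)]'
}

# Usage (through the SSH tunnel):
#   curl -s http://127.0.0.1:8088/api/v1/keys | filter_keys_by_component minion
```

Passing the name with `--arg` keeps it out of the jq program text, so component names containing quotes cannot break the filter.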

POST /api/v1/components returns 500 RPM_INIT_FAILED

The auth service failed to create the RPM directory tree for the new component. The component record was rolled back — the name is safe to reuse.

Common causes:

| Cause | How to confirm | Fix |
| --- | --- | --- |
| rpm-data volume not mounted to auth container | docker compose exec auth ls /data/rpm (should list an rpm/ subdirectory) | Verify compose.yml has rpm-data:/data/rpm under the auth service volumes |
| RPM_DATA_ROOT mismatch | docker compose exec auth printenv RPM_DATA_ROOT | Ensure RPM_DATA_ROOT matches the volume mount point (default: /data/rpm) |
| Wrong permissions on the volume | docker compose exec auth ls -la /data/rpm | docker compose exec rpm chown -R nobody:nobody /usr/share/nginx/html/rpm or adjust mount ownership |
| Disk full | docker compose exec auth df -h /data/rpm | Free disk space |

After fixing the underlying cause, retry POST /api/v1/components with the same body — the name is available again.
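Before retrying, the same conditions can be checked in one pass from inside the auth container. This is a sketch only: `check_rpm_root` is a hypothetical helper, it assumes the auth image ships a POSIX shell, and the probe directory name is arbitrary.

```shell
# check_rpm_root [DIR]
# Verifies that DIR (default /data/rpm) exists, is writable, and that a
# directory can actually be created in it. Prints one line and returns
# non-zero on the first failed check.
check_rpm_root() {
    dir=${1:-/data/rpm}
    if [ ! -d "$dir" ]; then
        echo "missing: $dir is not a directory (volume not mounted?)"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir (check ownership)"
        return 1
    fi
    probe="$dir/.write-probe.$$"
    if ! mkdir "$probe" 2>/dev/null; then
        echo "mkdir failed in $dir (disk full or read-only mount?)"
        return 1
    fi
    rmdir "$probe"   # leave the volume as it was found
    echo "ok: $dir"
}

# Usage: paste the function into `docker compose exec auth sh`, then run
#   check_rpm_root "${RPM_DATA_ROOT:-/data/rpm}"
```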


All requests returning 503

The auth service is configured fail-closed: if it is unreachable or returns an unexpected error, Traefik returns 503 rather than allowing the request through.

Diagnose:

docker compose ps auth
docker compose logs auth --tail=50

Common causes:

| Cause | Indicator in logs | Action |
| --- | --- | --- |
| Auth container exited | exited (1) in ps | docker compose start auth, then check logs |
| SQLite DB corrupt or missing | unable to open database | Follow Restore Keystore |
| Out of disk space | no space left on device | Free disk space, then docker compose restart auth |
| OOM killed | exit code 137 | Increase VM RAM or reduce other workloads |
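The two log indicators in the table can be matched mechanically when triaging from a script or an alert hook. A minimal sketch: `classify_auth_logs` is a hypothetical helper, and it only recognises the two messages listed above.

```shell
# classify_auth_logs
# Reads auth service logs on stdin and prints the matching action from
# the table, or a fallback line when neither indicator is present.
classify_auth_logs() {
    logs=$(cat)
    if printf '%s\n' "$logs" | grep -qi 'unable to open database'; then
        echo "database: follow Restore Keystore"
    elif printf '%s\n' "$logs" | grep -qi 'no space left on device'; then
        echo "disk: free space, then docker compose restart auth"
    else
        echo "unmatched: read the full logs and check the exit code in ps"
    fi
}

# Usage:
#   docker compose logs auth --tail=50 2>&1 | classify_auth_logs
```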

Admin API unreachable

The admin API listens on 127.0.0.1:8088 (loopback only). It is not reachable from outside the VM without an SSH tunnel.

Open the tunnel:

ssh -L 8088:127.0.0.1:8088 deploy@pkg.example.org -N &

Verify:

curl -s http://127.0.0.1:8088/api/v1/keys
# Expected: JSON array (empty if no keys)
# Connection refused → tunnel not open, or auth service down

Note: port 8088 serves plain HTTP (not HTTPS) — do not use -k or https://.

If the admin API returns 404 on port 443 (the public entrypoint), that is correct — the admin route is intentionally not exposed publicly.


Container unhealthy or crashing

Check all container states:

docker compose ps

Get logs for a specific service:

docker compose logs <service> --tail=100
# services: traefik, auth, rpm, deb, zot, aptly, rustfs, static, backup

Auth container crash loop — common causes:

docker compose logs auth | tail -20
| Log message | Cause | Fix |
| --- | --- | --- |
| unable to open database file | auth-db volume missing or wrong permissions | docker compose down && docker compose up -d |
| no space left on device | Disk full | Free space, then docker compose restart auth |
| exit code 137 | OOM killed | Increase VM RAM |
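Exit code 137 follows the shell convention of 128 plus the signal number, so it means the process received signal 9 (SIGKILL). The kernel OOM killer is the usual sender, but not the only possible one. The sketch below decodes any exit code the same way; `describe_exit_code` is a hypothetical helper.

```shell
# describe_exit_code CODE
# Explains a container exit code: values above 128 are reported as
# 128 + signal number, everything else as a normal exit status.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "clean exit"
    elif [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "application error (exit $code)"
    fi
}

# To confirm that a SIGKILL really was an out-of-memory kill, Docker
# records it on the container state:
#   docker inspect --format '{{.State.OOMKilled}}' "$(docker compose ps -aq auth)"
```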

Restart a single service without affecting others:

docker compose restart <service>

Promotion pipeline failures

ERROR: lts.asc is still a placeholder

The promotion workflows verify static/content/gpg/lts.asc contains a real GPG public key before signing. This error means the placeholder has not been replaced.

Follow §4.1 of Production Deployment to generate a key and commit lts.asc.

GPG signing fails — secret key not available

The GPG_PRIVATE_KEY or GPG_KEY_ID secret is missing or incorrect in GitHub Actions repository settings. Verify all 10 secrets are set (§3 of Production Deployment).

SSH connection to VM fails

The promotion workflows SSH into the VM to run docker exec commands. If the connection fails:

  1. Verify SSH_PRIVATE_KEY and SSH_KNOWN_HOST secrets match the current VM.
  2. Check the VM's deploy user still has the correct authorized key:
    cat /home/deploy/.ssh/authorized_keys
  3. If the VM was rebuilt, re-run ssh-keyscan pkg.example.org and update SSH_KNOWN_HOST.

RustFS artifact not found during promotion

The promotion workflow downloads a staged artifact from RustFS before signing. If the artifact is not found:

  1. Confirm the artifact was staged with the correct component, series, and os values that match the workflow inputs.
  2. Open an SSH tunnel to RustFS and list the staging bucket:
    ssh -L 9000:localhost:9000 deploy@pkg.example.org -N &
    AWS_ACCESS_KEY_ID=<key> AWS_SECRET_ACCESS_KEY=<secret> \
    aws s3 ls s3://staging/ \
    --endpoint-url http://localhost:9000 \
    --region us-east-1 \
    --recursive