Troubleshooting
Quick reference
| Symptom | Likely cause | Jump to |
|---|---|---|
| TLS cert not issuing after first docker compose up | DNS not propagated, or port 443 not open | TLS certificate not issuing |
| Subscribers getting 401 on a key that should work | Key scope mismatch, revoked key, or auth service down | Subscribers getting 401 |
| Subscribers getting 404 on a path that should exist | Component name in URL not provisioned via the components API | Subscribers getting 401 |
| All requests returning 503 | Auth service is down (fail-closed by design) | All requests returning 503 |
| Admin API returns connection refused / 000 | SSH tunnel not open | Admin API unreachable |
| Container shows unhealthy or exited | Crash loop, disk full, or DB permissions | Container unhealthy or crashing |
| Promotion workflow fails at GPG step | lts.asc is still a placeholder, or secrets not set | Promotion pipeline failures |
TLS certificate not issuing
Traefik uses the TLS-ALPN-01 challenge — it completes over port 443 with no dependency on port 80.
Check 1 — DNS propagation
The TLS-ALPN-01 challenge requires pkg.example.org to resolve to the VM's IP before Traefik can obtain a certificate. Verify from an external host:
dig +short pkg.example.org A
# must return the VM's public IP
If DNS has not propagated, wait and retry. Do not restart Traefik repeatedly — Let's Encrypt rate limits allow 5 failed validation attempts per domain per hour. Exhausting this limit will block cert issuance for up to an hour.
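The DNS check above can be scripted so that nobody restarts Traefik before the record is live. This is a minimal sketch: `dns_points_to` is a hypothetical helper name, and `203.0.113.10` stands in for the VM's real public IP.

```shell
# Sketch only: dns_points_to is a hypothetical helper, not part of this stack.
# Succeeds when one line of the resolver output equals the expected address.
dns_points_to() {
  expected_ip=$1
  resolved=$2   # output of: dig +short pkg.example.org A
  # dig +short can print several lines (CNAME targets, multiple A records),
  # so compare line by line using a fixed-string, whole-line match.
  printf '%s\n' "$resolved" | grep -Fxq -- "$expected_ip"
}

# Usage, from an external host (203.0.113.10 is a placeholder):
#   if dns_points_to 203.0.113.10 "$(dig +short pkg.example.org A)"; then
#     echo "DNS ready"
#   else
#     echo "DNS not ready; wait before touching Traefik"
#   fi
```

Comparing whole lines rather than substrings avoids a false match such as `203.0.113.1` against `203.0.113.10`.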
Check 2 — Port 443 open
Confirm the VM's firewall allows inbound TCP 443:
# From an external host
curl -v https://pkg.example.org/gpg/lts.asc
# Connection refused → firewall blocking 443
# Certificate error → port open, cert not yet issued (normal on first boot)
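For unattended checks, the two outcomes above can be told apart by curl's exit status rather than by reading the verbose output. The sketch below maps the standard curl exit codes 7 (failed to connect), 28 (timeout), and 60 (certificate could not be verified) to short labels; `classify_curl_exit` and the labels are hypothetical.

```shell
# Sketch only: classify_curl_exit is a hypothetical helper.
# Maps a curl exit status to a short label for the port 443 check.
classify_curl_exit() {
  case $1 in
    0)  echo "ok" ;;              # connection and certificate verification succeeded
    7)  echo "connect-failed" ;;  # refused: firewall blocking 443 or nothing listening
    28) echo "timeout" ;;         # packets possibly dropped silently by a firewall
    60) echo "cert-untrusted" ;;  # port open, certificate not yet issued
    *)  echo "other:$1" ;;        # any other curl error, reported with its code
  esac
}

# Usage, from an external host:
#   curl -s -o /dev/null --max-time 10 https://pkg.example.org/gpg/lts.asc
#   classify_curl_exit $?
```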
Check 3 — Traefik logs
docker compose logs traefik | grep -i "acme\|certificate\|error"
Successful issuance looks like:
msg="Certificate obtained successfully" domain=pkg.example.org
Check 4 — acme.json in the certs volume
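If the traefik-certs volume was deleted (e.g. by docker compose down -v), Traefik will attempt to re-issue the certificate on the next start. After a successful issuance the certificate is stored in the volume, so do not remove it with -v unless you intend to request a new one. To confirm that a certificate is stored, inspect acme.json (the file also contains private keys, so do not share the output):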
docker run --rm -v traefik-certs:/certs alpine sh -c 'cat /certs/acme.json'
Subscribers getting 401
A 401 response can have several independent causes. Work through the steps below in order.
Step 1 — Is the auth service running?
docker compose ps auth
# State must be: running (healthy)
If the state is unhealthy or exited, see Container unhealthy or crashing. If auth is down, Traefik returns 503, not 401 — but some HTTP clients or package managers may display this differently.
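The state check in Step 1 can be reduced to a single label for use in scripts. The sketch below assumes the state string is produced by `docker inspect` with the Go template shown in the usage comment, which yields text such as `running (healthy)`; `classify_state` and its labels are hypothetical.

```shell
# Sketch only: classify_state is a hypothetical helper.
# Input is a string such as "running (healthy)" or "exited".
classify_state() {
  case $1 in
    *"(healthy)"*)   echo "ok" ;;
    *"(unhealthy)"*) echo "unhealthy" ;;
    *"(starting)"*)  echo "starting" ;;    # health check has not passed yet
    exited*|dead*)   echo "exited" ;;
    restarting*)     echo "crash-loop" ;;
    *)               echo "unknown" ;;     # e.g. running with no health check defined
  esac
}

# Usage on the VM (-a includes stopped containers):
#   state=$(docker inspect \
#     --format '{{.State.Status}}{{if .State.Health}} ({{.State.Health.Status}}){{end}}' \
#     "$(docker compose ps -aq auth)")
#   classify_state "$state"
```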
Step 2 — Does the key exist and is it active?
Open an SSH tunnel to the admin API, then look up the key:
ssh -L 8088:127.0.0.1:8088 deploy@pkg.example.org -N &
curl -s http://127.0.0.1:8088/api/v1/keys | jq '.[] | select(.id == "<KEY>")'
If the key is absent, it may have been created after the last backup and lost during a keystore restore. Re-provision it:
curl -s -X POST http://127.0.0.1:8088/api/v1/keys \
-H 'Content-Type: application/json' \
-d '{"component": "core", "label": "re-provisioned"}'
If the key is present but "active": false, it has been revoked. Issue a new key.
Step 3 — Is the key scoped to the right component?
Each key is scoped to a single component. A core key cannot access /rpm/minion/ — that is the expected behaviour, not a bug. Confirm the subscriber is using a key whose component matches the path they are requesting.
curl -s http://127.0.0.1:8088/api/v1/keys | jq '.[] | {id, component, label, active}'
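The scope rule in Step 3 can be checked mechanically once the key's component is known. The sketch below assumes that the component is the second path segment, as in /rpm/minion/; that layout is inferred from the example above and may not hold for every repository type. Both function names are hypothetical.

```shell
# Sketch only: assumes URLs look like /<repo-type>/<component>/...
path_component() {
  rest=${1#/}        # drop the leading slash:   rpm/minion/pkg.rpm
  rest=${rest#*/}    # drop the first segment:   minion/pkg.rpm
  printf '%s\n' "${rest%%/*}"   # keep up to the next slash: minion
}

# Succeeds when the key's component matches the component in the path.
key_matches_path() {
  key_component=$1
  request_path=$2
  [ "$(path_component "$request_path")" = "$key_component" ]
}

# Usage:
#   key_matches_path core /rpm/minion/ || echo "scope mismatch: expect 401"
```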
Step 4 — Is the component marked public?
If a component has visibility: public (as set via POST /api/v1/components or updated via PATCH /api/v1/components/{name}), the auth service allows requests to its paths without inspecting credentials. Forward-auth reads visibility from the database on every request — changes take effect immediately without a restart. If a subscriber reports getting 401 on a public component path, confirm the component's current visibility:
curl -s http://127.0.0.1:8088/api/v1/components/core | jq .visibility
Restart semantics — when a restart is and is not required:
- Key creation (POST /api/v1/keys) and forward-auth decisions both query the database on every request, so a restart is not needed for new components, deleted components, or visibility changes to take effect.
- Key list filter (GET /api/v1/keys?component=<name>) validates the component name against an in-memory map loaded at startup. It returns 400 INVALID_COMPONENT for a newly provisioned component until the service is restarted.
- component_visibility in key responses is also derived from the startup-loaded map. It may show a stale value after a PATCH /api/v1/components/{name} visibility change until the service is restarted. This is cosmetic only; forward-auth always uses the live value.
POST /api/v1/components returns 500 RPM_INIT_FAILED
The auth service failed to create the RPM directory tree for the new component. The component record was rolled back — the name is safe to reuse.
Common causes:
| Cause | How to confirm | Fix |
|---|---|---|
| rpm-data volume not mounted to auth container | docker compose exec auth ls /data/rpm (should list a rpm/ subdirectory) | Verify compose.yml has rpm-data:/data/rpm under the auth service volumes |
| RPM_DATA_ROOT mismatch | docker compose exec auth env \| grep RPM_DATA_ROOT | Ensure RPM_DATA_ROOT matches the volume mount point (default: /data/rpm) |
| Wrong permissions on the volume | docker compose exec auth ls -la /data/rpm | docker compose exec rpm chown -R nobody:nobody /usr/share/nginx/html/rpm or adjust mount ownership |
| Disk full | docker compose exec auth df -h /data/rpm | Free disk space |
After fixing the underlying cause, retry POST /api/v1/components with the same body — the name is available again.
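Before retrying, the mount, permission, and disk-space causes from the table can be checked in one pass. This is a rough sketch: `check_data_root` is a hypothetical helper, the 1024 KB default threshold is arbitrary, and it assumes the auth image ships `sh`, `df`, and `awk`.

```shell
# Sketch only: check_data_root is a hypothetical pre-flight helper.
# Returns 0 when the directory exists, is writable, and has enough free space.
check_data_root() {
  dir=$1
  min_kb=${2:-1024}   # arbitrary default threshold, in KB
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 2; }
  # df -P prints one POSIX-format line per filesystem; field 4 is available KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$min_kb" ] || { echo "low disk: ${avail_kb} KB free"; return 3; }
  echo "ok: $dir"
}

# Usage, from a shell inside the auth container
# (docker compose exec auth sh), after defining the function:
#   check_data_root "${RPM_DATA_ROOT:-/data/rpm}"
```

The writability test reflects the user running the shell, so it is only meaningful when that matches the user the auth process runs as.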
All requests returning 503
The auth service is configured fail-closed: if it is unreachable or returns an unexpected error, Traefik returns 503 rather than allowing the request through.
Diagnose:
docker compose ps auth
docker compose logs auth --tail=50
Common causes:
| Cause | Indicator in logs | Action |
|---|---|---|
| Auth container exited | exited (1) in ps | docker compose start auth then check logs |
| SQLite DB corrupt or missing | unable to open database | Follow Restore Keystore |
| Out of disk space | no space left on device | Free disk space, then docker compose restart auth |
| OOM killed | exit code 137 | Increase VM RAM or reduce other workloads |
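The log indicators in the table above lend themselves to a quick first-pass triage. The sketch below matches only the two log strings listed there; `classify_auth_logs` and its output labels are hypothetical, and an unmatched log still needs a manual read.

```shell
# Sketch only: classify_auth_logs is a hypothetical helper.
# Reads log text on stdin and prints a label for the first known pattern.
classify_auth_logs() {
  logs=$(cat)
  case $logs in
    *"unable to open database"*) echo "database" ;;  # follow Restore Keystore
    *"no space left on device"*) echo "disk-full" ;; # free space, restart auth
    *)                           echo "unknown" ;;   # check ps for exit code 137 (OOM)
  esac
}

# Usage on the VM:
#   docker compose logs auth --tail=50 2>&1 | classify_auth_logs
```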
Admin API unreachable
The admin API listens on 127.0.0.1:8088 (loopback only). It is not reachable from outside the VM without an SSH tunnel.
Open the tunnel:
ssh -L 8088:127.0.0.1:8088 deploy@pkg.example.org -N &
Verify:
curl -s http://127.0.0.1:8088/api/v1/keys
# Expected: JSON array (empty if no keys)
# Connection refused → tunnel not open, or auth service down
Note: port 8088 serves plain HTTP (not HTTPS) — do not use -k or https://.
If the admin API returns 404 on port 443 (the public entrypoint), that is correct — the admin route is intentionally not exposed publicly.
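The 000 mentioned in the quick reference is what curl's `%{http_code}` write-out variable prints when no HTTP response was received at all. A small sketch for scripting the tunnel check on that basis follows; `classify_admin_code` and its labels are hypothetical.

```shell
# Sketch only: classify_admin_code is a hypothetical helper.
# Input is the value printed by curl -w '%{http_code}'.
classify_admin_code() {
  case $1 in
    000) echo "unreachable" ;;    # no HTTP response: tunnel not open, or auth down
    200) echo "ok" ;;
    *)   echo "unexpected:$1" ;;  # reached something, but not the expected reply
  esac
}

# Usage, on the machine where the tunnel was opened:
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8088/api/v1/keys)
#   classify_admin_code "$code"
```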
Container unhealthy or crashing
Check all container states:
docker compose ps
Get logs for a specific service:
docker compose logs <service> --tail=100
# services: traefik, auth, rpm, deb, zot, aptly, rustfs, static, backup
Auth container crash loop — common causes:
docker compose logs auth | tail -20
| Log message | Cause | Fix |
|---|---|---|
| unable to open database file | auth-db volume missing or wrong permissions | docker compose down && docker compose up -d |
| no space left on device | Disk full | Free space, then docker compose restart auth |
| exit code 137 | OOM killed | Increase VM RAM |
Restart a single service without affecting others:
docker compose restart <service>
Promotion pipeline failures
ERROR: lts.asc is still a placeholder
The promotion workflows verify static/content/gpg/lts.asc contains a real GPG public key before signing. This error means the placeholder has not been replaced.
Follow §4.1 of Production Deployment to generate a key and commit lts.asc.
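A local check before committing can catch a leftover placeholder early. The sketch below only tests for the standard ASCII-armor header and footer of an OpenPGP public key; the exact test the promotion workflows run is not documented here and may differ. `is_armored_pubkey` is a hypothetical name.

```shell
# Sketch only: is_armored_pubkey is a hypothetical helper.
# Succeeds when the file contains both ASCII-armor markers of a public key.
is_armored_pubkey() {
  grep -qF -e '-----BEGIN PGP PUBLIC KEY BLOCK-----' "$1" &&
    grep -qF -e '-----END PGP PUBLIC KEY BLOCK-----' "$1"
}

# Usage, from the repository root:
#   is_armored_pubkey static/content/gpg/lts.asc \
#     || echo "lts.asc does not look like an exported public key"
```

This confirms the shape of the file only; it does not validate that the key material itself is intact.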
GPG signing fails — secret key not available
The GPG_PRIVATE_KEY or GPG_KEY_ID secret is missing or incorrect in GitHub Actions repository settings. Verify all 10 secrets are set (§3 of Production Deployment).
SSH connection to VM fails
The promotion workflows SSH into the VM to run docker exec commands. If the connection fails:
- Verify SSH_PRIVATE_KEY and SSH_KNOWN_HOST secrets match the current VM.
- Check the VM's deploy user still has the correct authorized key: cat /home/deploy/.ssh/authorized_keys
- If the VM was rebuilt, re-run ssh-keyscan pkg.example.org and update SSH_KNOWN_HOST.
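To tell whether a rebuilt VM is the cause, the stored host key can be compared with a fresh scan. This sketch assumes SSH_KNOWN_HOST holds unhashed lines in the same format ssh-keyscan prints; `known_host_is_current` is a hypothetical helper.

```shell
# Sketch only: known_host_is_current is a hypothetical helper.
# Succeeds when at least one scanned key line equals a stored line.
known_host_is_current() {
  stored=$1    # value of the SSH_KNOWN_HOST secret (one or more lines)
  scanned=$2   # output of: ssh-keyscan pkg.example.org
  # Drop comment lines, then match whole lines as fixed strings.
  # grep treats each line of $stored as a separate pattern.
  printf '%s\n' "$scanned" | grep -v '^#' | grep -Fxq -- "$stored"
}

# Usage, with the stored value saved locally in known_host.txt:
#   known_host_is_current "$(cat known_host.txt)" \
#     "$(ssh-keyscan pkg.example.org 2>/dev/null)" \
#     || echo "host key changed: update SSH_KNOWN_HOST"
```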
RustFS artifact not found during promotion
The promotion workflow downloads a staged artifact from RustFS before signing. If the artifact is not found:
- Confirm the artifact was staged with the correct component, series, and os values that match the workflow inputs.
- Open an SSH tunnel to RustFS and list the staging bucket:
ssh -L 9000:localhost:9000 deploy@pkg.example.org -N &
AWS_ACCESS_KEY_ID=<key> AWS_SECRET_ACCESS_KEY=<secret> \
aws s3 ls s3://staging/ \
--endpoint-url http://localhost:9000 \
--region us-east-1 \
--recursive