Five signatures of installs that stop right after success
Official troubleshooting still centers on gateway.mode, non-loopback binds, and gateway.auth.token, and those remain frequent in 2026. Yet on VPS images you often meet a quieter class of failure: the user session never really exists, so systemctl --user fails before any unit is written, or units exist but die when the SSH session ends because linger is off and XDG_RUNTIME_DIR was only exported interactively. Another cluster is stale listeners: EADDRINUSE, duplicate gateway installs, or mismatched ports between the supervisor and the JSON config. Upgrades add drift when defaults tighten auth and suddenly refuse a LAN bind without a token. Security layers add SELinux denials or cloud security groups that block outbound webhooks. Finally, there is the policy plane, where channels look connected yet messages never reach the agent; that belongs to a different article in this blog series.
The pain list below separates process health from message policy so you do not spend nights tuning ports when systemd is the real blocker. If you only care about silent drops after connected badges, read the channel troubleshooting guide first, then return here to validate the base host.
Install ends with systemctl --user unavailable: common on minimal Ubuntu or vendor images without a working user manager, often misread as a broken installer.
Gateway dies when SSH closes: classic missing linger or missing persistent XDG_RUNTIME_DIR in the service user shell profile.
Logs refuse non-loopback without token: tighten bind or add auth instead of reinstalling channels repeatedly.
Doctor reports CLI versus service config mismatch: align with gateway install --force and restart instead of letting two ports coexist.
Desktop-class side effects dominate: patching browsers and signing stacks on Linux may cost more effort than moving to bare-metal macOS.
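As a rough illustration of how two of these signatures could be recognized automatically, the sketch below maps a single log or error line to a coarse label. The function name `classify_line` and the labels are hypothetical, and the patterns are assumptions based on the usual `systemctl --user` bus error and the Node `EADDRINUSE` error, not an official OpenClaw error catalog.

```shell
# classify_line LINE: map one log or error line to a coarse host-level signature.
# Patterns and labels are illustrative assumptions, not an official catalog.
classify_line() {
  case "$1" in
    # systemctl --user could not reach the user manager over D-Bus
    *"Failed to connect to "*bus*) echo "user-session" ;;
    # Node reports that the listen port is already bound
    *EADDRINUSE*)                  echo "stale-listener" ;;
    *)                             echo "unclassified" ;;
  esac
}

# Usage:
#   openclaw logs --follow | while IFS= read -r line; do classify_line "$line"; done
```

Anything labeled `unclassified` still needs a human read; the point is only to route the two host-level cases away from channel debugging early.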
Once signatures live in your runbook, triage moves from hours to minutes: confirm gateway runtime, confirm the user session, then descend into channels and pairing. For distributed teams, also draw which region hosts the gateway relative to members and model API endpoints, or latency will still feel broken even when the process is running.
Add a cold-start check after reboots: wait two minutes before channels probe so transient DNS or certificate jitter is not mistaken for policy failure. Diff doctor output across two upgrades to spot creeping drift before it becomes an outage.
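One possible way to make the doctor comparison repeatable is a small wrapper around `diff`. This is a minimal sketch: the function name `doctor_drift` and the snapshot file names are hypothetical, and it assumes you already saved `openclaw doctor` output to text files before and after each upgrade.

```shell
# doctor_drift OLD NEW: print a unified diff of two saved doctor outputs.
# Exit status: 0 when identical, 1 when they differ, 2 when a snapshot is unreadable.
doctor_drift() {
  [ -r "$1" ] && [ -r "$2" ] || { echo "missing snapshot" >&2; return 2; }
  diff -u "$1" "$2"
}

# Usage (file names are placeholders):
#   openclaw doctor > doctor-after.txt 2>&1
#   doctor_drift doctor-before.txt doctor-after.txt || echo "doctor output changed"
#
# Cold-start check after a reboot: wait two minutes, then probe channels.
#   sleep 120 && openclaw channels status --probe
```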
Capacity planning also intersects with init behavior: a two-vCPU VPS can host a gateway for a small team, yet the moment you add scheduled jobs, browser-backed scrapers, and always-on sub-agents, CPU starvation looks like flaky transports because probes time out before the event loop answers. Capturing a short `top` or `pidstat` sample during peak hours helps you separate saturation from misconfiguration. Likewise, disk pressure from verbose logs can cause failed writes that truncate credential or state files; pairing log rotation with disk alerts prevents silent corruption that only surfaces during pairing resets.
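A quick first pass before reaching for `pidstat` might compare the 1-minute load average with the CPU count. The helper name `is_saturated` is hypothetical, and treating load above core count as saturation is only a rough heuristic assumed for this sketch.

```shell
# is_saturated LOAD NCPU: succeed when the load average exceeds the CPU count.
# "Load above core count" is a rough heuristic, not a precise saturation test.
is_saturated() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN { exit !(load + 0 > ncpu + 0) }'
}

# Usage on a Linux host:
#   read -r load1 _ < /proc/loadavg
#   if is_saturated "$load1" "$(nproc)"; then
#     echo "CPU looks saturated; capture a sample with: pidstat 5 3"
#   fi
```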
Same budget: Linux VPS or bare-metal cloud Mac for Gateway
The matrix avoids a single price column because session stability, desktop dependencies, and isolation effort dominate total cost when OpenClaw runs 24/7 with frequent browser automation.
| Dimension | Headless Linux VPS | Bare-metal cloud Mac host |
|---|---|---|
| Session and supervision | Depends on user systemd, linger, and XDG paths behaving | launchd and macOS session stack mature for long-lived agents |
| Typical fit | Light relays, webhook ingress, CLI-only flows | Browser automation, desktop permissions, shared team isolation |
| Operational load | Wide distro variance to maintain | More uniform Apple stack, fewer surprise images |
| Multi-region | Many clouds but uneven compliance and images | Singapore, Tokyo, Seoul, Hong Kong, US East, US West options near users |
| Hidden cost | Engineer minutes on SSH repair loops | Higher rent, though often below the total cost of firefighting |
Run the five-command ladder described below before picking a host; do not tune channel settings while systemd is still flapping.
If Lobster canvas, frequent browser opens, or macOS keychain-class needs already appear in your workload, stacking more packages on a VPS only delays migration. A week-long trial on a weekly billed bare-metal node in Singapore or US West usually settles the debate with evidence.
Observability should include synthetic webhook checks from outside your office network, because corporate VPN paths can mask broken public listeners. A tiny cron job that curls your health endpoint from another region costs almost nothing and catches security-group regressions early. Pair that with alerting on TLS expiry for any reverse proxy you terminate in front of the gateway so renewals do not land on the same weekend as a major release.
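The two checks above could be sketched as a pair of small functions run from cron on a machine in another region. The names `probe_health` and `cert_valid_for`, the example URL, and the host name are all hypothetical; the sketch assumes `curl` and `openssl` are installed and that the reverse proxy terminates TLS on port 443.

```shell
# probe_health URL: succeed only if the endpoint answers without an HTTP error
# (status below 400) within 10 seconds.
probe_health() {
  curl -fsS --max-time 10 -o /dev/null "$1"
}

# cert_valid_for HOST DAYS: succeed if the certificate served on HOST:443
# will still be valid DAYS days from now.
cert_valid_for() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1
}

# Usage (placeholder host and path):
#   probe_health "https://gateway.example.com/health" || echo "public listener unreachable"
#   cert_valid_for gateway.example.com 14 || echo "certificate expires within 14 days"
```

Keeping the probe outside the office network is the design point: a check that passes over the VPN says nothing about the public path.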
Keep a pinned note with the exact package versions you used when the ladder last passed; replaying upgrades in a staging VPS before production reduces surprises.
The five-command ladder and a minimal log baseline
The documented order is deliberate: `status` gives the overview, `gateway status` proves runtime and probes, `logs` captures signatures, `doctor` scans unit and config drift, and `channels status --probe` advances from process health into transports. Skipping the gateway check while chasing model errors wastes tokens and forces repeated logins.
openclaw status
openclaw gateway status
openclaw logs --follow
openclaw doctor
openclaw channels status --probe
On VPS hosts, archive a healthy baseline snippet for Runtime, Connectivity probe, and Capability lines from gateway status. After upgrades, if only one line changes, rollback paths shorten. When doctor warns about duplicate system and user units, follow repair guidance instead of hand-deleting files that can leave half listeners behind.
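Archiving the baseline can be scripted so every upgrade produces a comparable snapshot. The following is a minimal sketch: the function name `capture_ladder`, the `OPENCLAW_BIN` override, and the output file names are assumptions for illustration, and it relies on `timeout` from GNU coreutils to bound the log sample.

```shell
# capture_ladder DIR: save the output of each ladder command into DIR.
# OPENCLAW_BIN and the file names are assumptions of this sketch.
capture_ladder() {
  out=$1
  bin=${OPENCLAW_BIN:-openclaw}
  mkdir -p "$out" || return 1
  "$bin" status                   > "$out/1-status.txt"         2>&1
  "$bin" gateway status           > "$out/2-gateway-status.txt" 2>&1
  # logs --follow does not exit on its own, so keep only a 10-second sample
  timeout 10 "$bin" logs --follow > "$out/3-logs.txt"           2>&1 || true
  "$bin" doctor                   > "$out/4-doctor.txt"         2>&1
  "$bin" channels status --probe  > "$out/5-channels-probe.txt" 2>&1
  return 0
}

# Usage:
#   capture_ladder "baselines/$(date +%Y%m%d-%H%M%S)"
```

Numbered file names keep the ladder order visible in a directory listing, which makes a later `diff -ru` between two baseline directories easier to read.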
Before moving to cloud Mac, rerun the same ladder and compare baselines to prove whether pain follows the host or the configuration. That experiment beats arguing about distro purity.
If root and normal users both installed OpenClaw, verify HOME and OPENCLAW_STATE_DIR point at one state tree; split brains trigger Config (cli) versus Config (service) warnings and deserve consolidation before more JSON edits.
Note: When logs mention gateway.mode or auth blocks, cross-check the long Gateway deployment article for bind and token sections before widening exposure.
Six steps to keep a VPS gateway maintainable
Freeze distro and Node baselines: record image name, kernel, and Node major in the repo README to avoid mystery variance.
Validate user systemd: run systemctl --user status under the service user; fix linger and dbus before gateway install.
Persist XDG_RUNTIME_DIR: add `export XDG_RUNTIME_DIR=/run/user/$(id -u)` to profiles that non-interactive shells load too.
Capture ladder baselines: store the five outputs before upgrades as rollback triggers.
Add probes separate from channel badges: monitor TCP listen, process liveness, and disk watermark independently.
Quarterly host review: tally incident minutes from session-class failures and compare to the matrix for migration timing.
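Steps two and three can be checked with a couple of small helpers before running the gateway install. This is a sketch under assumptions: the function names `linger_enabled` and `runtime_dir_ok` are hypothetical, and it relies on `loginctl show-user` printing a `Linger=yes` or `Linger=no` property line.

```shell
# linger_enabled: read `loginctl show-user USER --property=Linger` output on stdin
# and succeed only when lingering is on.
linger_enabled() {
  grep -qx 'Linger=yes'
}

# runtime_dir_ok DIR: the runtime directory must exist and be writable.
runtime_dir_ok() {
  [ -d "$1" ] && [ -w "$1" ]
}

# Usage as the service user:
#   loginctl show-user "$(id -un)" --property=Linger | linger_enabled \
#     || sudo loginctl enable-linger "$(id -un)"
#   runtime_dir_ok "/run/user/$(id -u)" && export XDG_RUNTIME_DIR="/run/user/$(id -u)"
#   systemctl --user status
```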
Three checks reviewers actually ask for
Listener matches unit metadata: the port reported in the gateway status JSON must match ExecStart in the installed unit, or doctor repair will loop indefinitely.
Non-loopback exposure: any LAN or public bind pairs with token or reverse-proxy policy and needs firewall confirmation.
Post-upgrade channel probe: rerun channels status --probe within 24 hours and archive the output as a rollback condition.
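The first two checks can be backed by evidence from the host's socket table rather than from config files alone. The sketch below parses `ss -ltn` output; the function names `listeners_on_port` and `is_loopback` are hypothetical, and the port 18789 in the usage lines is only an example value to replace with your configured port.

```shell
# listeners_on_port PORT: read `ss -ltn` output on stdin and print every
# local address that is listening on PORT.
listeners_on_port() {
  awk -v port="$1" '$1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == port) print $4 }'
}

# is_loopback ADDR: succeed for IPv4 127.0.0.0/8 and IPv6 ::1 listen addresses.
is_loopback() {
  case "$1" in
    127.*|"[::1]":*) return 0 ;;
    *)               return 1 ;;
  esac
}

# Usage (example port, substitute your configured value):
#   ss -ltn | listeners_on_port 18789 | while IFS= read -r addr; do
#     is_loopback "$addr" || echo "non-loopback listener: $addr (confirm token and firewall)"
#   done
```

An empty result from `listeners_on_port` is itself a finding: the unit may be running on a different port than the one in the JSON config.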
Caution: Complete security review before public bind; this article documents auditable fixes, not auth bypass tricks.
Overall, Linux VPS fits light ingress or experiments, yet production message buses with desktop side effects outgrow stripped images quickly. Bare-metal cloud Mac across major hubs gives a predictable Apple session model so effort returns to workflows instead of init debugging. MESHLAUNCH Mac Mini cloud rental is usually the stronger operational choice for dedicated compute, elastic daily-to-quarterly terms, and keeping Gateway beside long-lived agents in one auditable footprint.
Update firewall allowlists and on-call contacts whenever public exposure changes to avoid the second incident where the gateway is healthy but the security group still targets an old address; attach a rollback command to the change ticket for faster recovery.
Finally, document which person owns the break-glass SSH key and which owns the cloud console login; gateway incidents at two in the morning fail when credentials live only on the laptop that is offline. A short RACI table in the same folder as your ladder output closes that gap without expanding scope into full ITIL.
Start with the "connected but no reply" guide for policy-layer triage, then return here for systemd and bind checks.
Use the "install and Lobster workflow" guide for orchestration; this page focuses on host and systemd foundations.