2026 antirez ds4 on Mac
DeepSeek V4, the 96GB Wall, Cloud Rental

DwarfStar 4 · unified memory · Metal · high-RAM cloud Mac

ds4 local DeepSeek V4 inference on cloud Mac
If you want frontier-class open weights offline on a Mac, software is no longer the blocker—RAM is. Redis author antirez shipped ds4 (DwarfStar 4) in May 2026: pure C, Metal-first, built only for DeepSeek V4 Flash. This post is for AI engineers who hit the 96GB unified-memory floor: what ds4 actually does, a quant/memory matrix, and a six-step runbook to compile, load weights, and wire ds4-server to Cursor on a high-RAM cloud Mac without buying a six-figure Studio.
01

What is ds4 in 2026 and why antirez went single-model

llama.cpp, Ollama, and MLX already load many GGUFs. ds4 does the opposite: one model family, end-to-end—Metal graph execution, asymmetric quants, on-disk KV snapshots, tool calling, and ds4-server with OpenAI- and Anthropic-compatible endpoints. In his write-up, antirez argues the gap was never “another runtime,” but “weights fast enough to replace daily Claude calls on personal gear.”

01

Momentum: github.com/antirez/ds4 crossed 10k+ stars within days—developers want depth on one checkpoint, not another generic loader.

02

Self-contained: no llama.cpp dependency; macOS production path is Metal (CPU is debug-only; README warns macOS VM bugs can kernel-panic on CPU inference).

03

Agent-ready: point Cursor, opencode, or Claude Code at your instance—data stays on your disk, not a hosted API.

04

Long context: design targets up to ~1M tokens with compressed KV plus ds4 disk snapshots so sessions survive restarts.

05

Real blocker: 96GB–512GB unified memory—that is what cloud Mac rental is meant to unblock.

02

Metal, disk KV, and 2-bit routing quants: how ds4 differs

Community reports on M-series Max machines cite roughly 463 tok/s prefill and 34 tok/s generation for Flash—always benchmark on your own box before signing SLAs.

Capabilityds4Generic Ollama / llama.cpp
ScopeDeepSeek V4 Flash pathhundreds of GGUF architectures
macOS GPUMetal as primary targetmulti-backend, less DS-specific tuning
KV stateRAM + disk snapshotsoften lost on process exit
Quant2-bit on routed experts onlysingle global quant tier
Coding agentsbuilt-in tools + compatible APIsextra gateway assembly

Apple Silicon unified memory (UMA) lets CPU and GPU share one pool—why ds4 pairs Metal with fast NVMe for KV persistence instead of treating Mac as an afterthought.

Citable baseline: official docs bind production inference to Metal/CUDA; asymmetric 2/8-bit Flash weights expect 96GB or 128GB UMA—below that is outside the supported path.

03

How much RAM for DeepSeek V4 Flash and PRO: 2026 matrix

Model / quantMin unified RAMTypical hardwareBuy-side order of magnitude
V4 Flash · q296 GBMacBook Pro M3/M4/M5 Max~$4k+ USD class
V4 Flash · q4256 GBMac Studio Ultra~$8k+ USD class
V4 PRO · q2512 GBMac Studio M3 Ultra maxed~$15k+ USD class
A

Pilot tier (96–128GB): enough for Flash q2 plus Cursor tool-calling smoke tests—ideal for daily cloud rental.

B

Production coding (128–256GB): parallel agents plus long context—keep ~20% RAM headroom to avoid swap thrash.

C

PRO experiments (512GB): rent by the week on cloud metal instead of capitalizing a one-off purchase.

04

Six steps to run ds4 on a cloud Mac end to end

01

Pick RAM for your quant: Flash pilot → 128GB instance; q4 or PRO → 256GB / 512GB to avoid re-downloading weights mid-project.

02

Validate Metal: system_profiler SPDisplaysDataType; ensure Command Line Tools via xcode-select -p.

03

Build ds4: git clone https://github.com/antirez/ds4.git && cd ds4 && make inside tmux so SSH drops do not kill the compile.

04

Stage weights on local NVMe: follow the repo for official vectors/GGUF paths—hundreds of GB; never use iCloud-synced folders.

05

Start ds4-server: bind loopback or private IP; curl /v1/models to confirm Metal, not CPU debug backend.

06

Agent acceptance: tunnel or Tailscale Serve; run a tool-calling coding task; verify KV snapshots survive reconnect without full prefill.

SSH port forward
ssh -N -L 8080:127.0.0.1:PORT user@your-cloud-mac.example.com
export OPENAI_BASE_URL=http://127.0.0.1:8080/v1
05

Skip the six-figure Mac: rent Flash, burst to PRO when needed

Buying locks capital and depreciation; cloud bare-metal turns RAM into a dial—128GB this week for Flash plugins, 512GB next week for PRO benchmarks, then power off.

DimensionBuy Studio UltraHigh-RAM cloud Mac
Cash upfrontfive-figure purchasehourly / daily / monthly
Elasticitynew machine = new purchaseresize 128GB ↔ 512GB
Team sharingone laptop per personone instance, SSH roles, shift inference
Privacyphysical controldedicated bare metal—weights never leave your disk

Generic Linux GPU VPS paths are a poor fit: ds4’s supported macOS story is Metal. Pair ds4 with our parallel agent workflow post—use a 64GB cloud Mac as the control plane and a 128GB+ box as the heavy inference worker.

For teams that need stable Metal inference without a six-figure CapEx line, MESHLAUNCH high-RAM Mac mini / M4 Pro / Max bare-metal rental is usually the pragmatic path: day-rent Flash, month-lock long-context production, burst PRO on demand—all inside your dedicated instance, not a third-party model API. See the pricing page and help center.

FAQ

Not on the supported path—Flash q2 needs 96GB UMA minimum. Day-rent 128GB first, then decide on hardware.

No—ds4-server runs on your rented instance; point your IDE base URL there. We do not proxy model payloads.

Yes, but avoid loading two large models at full duty. Reserve 96GB+ for ds4; keep small models on Ollama—memory tables in the help center.