A systematic benchmark tool for agentic workloads on Apple Silicon hardware. It evaluates 80+ configurations across nine inference backends, eight models, six quantization formats, seven KV cache strategies, and four context tiers.
The project exists to answer production questions that short-prompt benchmarks hide: which runtime keeps time-to-first-token under control at long context, which optimizations actually move end-to-end latency, and which setups stay viable once a multi-turn agent starts accumulating context and tool state.
Key features
- Full server lifecycle management per configuration (start, health check, warmup, measurement, teardown)
- Wall-clock TTFT and decode throughput measurement at realistic context depths (up to 128k tokens)
- Process-tree memory monitoring via recursive psutil sampling
- YAML-driven configuration with constraint validation
- OpenAI SSE and Ollama JSON-line streaming parsers
- Backend version tracking across runs for longitudinal comparison
- Speculative decoding support (llama.cpp draft model pipeline)
- Download-bench-purge workflow for large GGUF files (conserves disk space)
- 900+ row dataset released for reproducibility
- Focus on deployment trade-offs for agentic workloads, not only headline tokens-per-second figures
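
The streaming parsers and the wall-clock TTFT measurement listed above can be illustrated with a minimal sketch. This is not the project's code: the function names (`parse_openai_sse`, `parse_ollama_jsonl`, `measure_stream`) are hypothetical, and the sketch counts streamed chunks, which is only an approximation of tokens because a server may pack several tokens into one chunk.

```python
import json
import time
from typing import Callable, Iterable, Iterator, Optional


def parse_openai_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def parse_ollama_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from Ollama-style newline-delimited JSON."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        obj = json.loads(line)
        # The generate endpoint uses "response"; the chat endpoint nests
        # the text under "message" -> "content".
        text = obj.get("response") or obj.get("message", {}).get("content")
        if text:
            yield text
        if obj.get("done"):
            break


def measure_stream(
    pieces: Iterable[str],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Compute TTFT and chunk rate; `start` is taken just before the request is sent."""
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in pieces:
        last = clock()
        if first is None:
            first = last  # arrival time of the first non-empty chunk
        count += 1
    if first is None:
        return {"ttft_s": None, "chunks": 0, "decode_chunks_per_s": None}
    span = last - first
    # The first chunk marks the end of prefill, so it is excluded from the rate.
    rate = (count - 1) / span if count > 1 and span > 0 else None
    return {"ttft_s": first - start, "chunks": count, "decode_chunks_per_s": rate}
```

In use, the caller would record `start = time.perf_counter()` immediately before issuing the HTTP request and pass the parser generator as `pieces`, so that prefill time is attributed to TTFT rather than to decode. The injectable `clock` exists only to make the timing logic testable.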
Backends tested
| Backend | Format | Notes |
|---|---|---|
| mlx-lm | MLX (4bit, 8bit, MXFP4, NVFP4) | Fastest MLX decode for Qwen models |
| Ollama | GGUF, MLX (since 0.19) | Prefix caching, easiest setup |
| llama.cpp (llama-server) | GGUF | Speculative decoding, portable |
| mlx-vlm | MLX | VLM architectures (Gemma 4) |
| vllm-mlx | MLX | Highest raw decode, poor TTFT scaling |
| oMLX | MLX + TurboQuant | KV cache compression, SSD offload |
| LM Studio | GGUF / MLX | GUI wrapper |
| Docker Model Runner | GGUF | Container-based |
Key findings
- TTFT diverges by 100x across backends at 32k context
- Prefix caching achieves 626x TTFT reduction
- MoE architecture is required for 128k-token viability on 64 GB
- Framework-level optimization yields a larger gain (2.4x) than quantization changes (<3%)
- Speculative decoding (llama.cpp): 206 t/s decode via Gemma4-26B + E2B draft (3.2x over standalone)
- Unsloth UD-Q2_K_XL matches or beats Q4_K_M on decode speed (less memory bandwidth per token)
- llama.cpp wins on Gemma 4; MLX wins on Qwen (architecture-specific kernel optimizations)
Speculative decoding
Pairing Gemma4-E2B as a draft model with Gemma4-26B via llama.cpp produced the fastest decode speed measured: 206 tokens per second. The draft model proposes tokens that the main model verifies in parallel.
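
The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant for illustration, not llama.cpp's implementation: `speculative_step`, `generate`, and the toy `target` and `draft` functions are hypothetical, and a real runtime scores all drafted positions in one batched forward pass of the main model instead of looping.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function over integer token IDs.
NextToken = Callable[[List[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], k: int
) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if expected != tok:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All k proposals matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(target, draft, out, k)
        rounds += 1
    return out[: len(prompt) + n_new], rounds


def target(ctx: List[int]) -> int:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 2."""
    return 9 if ctx[-1] == 2 else target(ctx)
```

Because every accepted token is either confirmed or supplied by the main model, the output is identical to plain greedy decoding with the main model alone. The speedup comes from needing fewer main-model passes than generated tokens, which depends on how often the draft agrees.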
Backend comparison
Which backend is fastest depends on the model architecture. llama.cpp leads on Gemma 4, while MLX backends (mlx-lm, Ollama 0.19+) lead on Qwen models.
llama.cpp decode speed across all models
29 GGUF configurations tested via llama-server b9020, spanning MoE and dense architectures at multiple Unsloth Dynamic quantization levels.
Interactive results
All 900+ measurements, sortable by any column and filterable by model, backend, quantization, and context depth.