LLM Inference Benchmark Harness

A systematic benchmark tool designed for the agentic workload profile. Evaluates 80+ configurations across nine inference backends, eight models, six quantization formats, seven KV cache strategies, and four context tiers on Apple Silicon hardware.

The project exists to answer production questions that short-prompt benchmarks hide: which runtime keeps time-to-first-token under control at long context, which optimizations actually move end-to-end latency, and which setups stay viable once a multi-turn agent starts accumulating context and tool state.

Key features

  • Full server lifecycle management per configuration (start, health check, warmup, measurement, teardown)
  • Wall-clock TTFT and decode throughput measurement at realistic context depths (up to 128k tokens)
  • Process-tree memory monitoring via recursive psutil sampling
  • YAML-driven configuration with constraint validation
  • OpenAI SSE and Ollama JSON-line streaming parsers
  • Backend version tracking across runs for longitudinal comparison
  • Speculative decoding support (llama.cpp draft model pipeline)
  • Download-bench-purge workflow for large GGUF files (conserves disk space)
  • 900+ row dataset released for reproducibility
  • Focus on deployment trade-offs for agentic workloads, not only headline tokens-per-second figures
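As a rough illustration of the streaming parsers and wall-clock measurement listed above, the sketch below pulls text chunks out of OpenAI-style SSE lines and Ollama-style JSON lines, then times the first chunk and the remaining stream. The function names (`iter_openai_sse`, `iter_ollama_jsonl`, `measure_stream`) are hypothetical and not taken from the harness, and the sketch counts streamed chunks, which only approximates a token count.

```python
import json
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def iter_openai_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


def iter_ollama_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from Ollama-style JSON lines (one object per line)."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        obj = json.loads(line)
        # /api/generate puts text in "response"; /api/chat nests it in "message".
        text = obj.get("response") or (obj.get("message") or {}).get("content")
        if text:
            yield text
        if obj.get("done"):
            return


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Time a chunk stream: TTFT, chunk count, and chunks/s after the first chunk."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        return {"ttft_s": None, "chunks": 0, "chunks_per_s": None}
    window = last - first
    # Decode rate excludes the first chunk, whose latency is dominated by prefill.
    rate = (count - 1) / window if count > 1 and window > 0 else None
    return {"ttft_s": first - start, "chunks": count, "chunks_per_s": rate}
```

In use, the parser would wrap the line iterator of a streaming HTTP response, for example `measure_stream(iter_openai_sse(response_lines))`, so the clock is read as each chunk arrives. The injectable `clock` argument is there only to make the timing logic testable without a live server.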

Backends tested

| Backend | Format | Notes |
|---|---|---|
| mlx-lm | MLX (4bit, 8bit, MXFP4, NVFP4) | Fastest MLX decode for Qwen models |
| Ollama | GGUF, MLX (since 0.19) | Prefix caching, easiest setup |
| llama.cpp (llama-server) | GGUF | Speculative decoding, portable |
| mlx-vlm | MLX | VLM architectures (Gemma4) |
| vllm-mlx | MLX | Highest raw decode, poor TTFT scaling |
| oMLX | MLX + TurboQuant | KV cache compression, SSD offload |
| LM Studio | GGUF / MLX | GUI wrapper |
| Docker Model Runner | GGUF | Container-based |
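The Format column doubles as a compatibility constraint of the kind a configuration validator could enforce before a run starts. The sketch below is one possible shape for such a check, not the harness's actual validator: the config keys (`backend`, `format`, `draft_model`) and the function `validate_config` are assumptions for illustration, and the input is a plain dict such as a YAML loader would produce.

```python
from typing import Dict, List, Set

# Model formats per backend, transcribed from the table above.
# oMLX is listed as "MLX + TurboQuant"; only the model format is recorded here.
SUPPORTED_FORMATS: Dict[str, Set[str]] = {
    "mlx-lm": {"MLX"},
    "Ollama": {"GGUF", "MLX"},
    "llama.cpp": {"GGUF"},
    "mlx-vlm": {"MLX"},
    "vllm-mlx": {"MLX"},
    "oMLX": {"MLX"},
    "LM Studio": {"GGUF", "MLX"},
    "Docker Model Runner": {"GGUF"},
}


def validate_config(cfg: dict) -> List[str]:
    """Return a list of constraint violations (empty means the config is valid)."""
    errors: List[str] = []
    backend = cfg.get("backend")
    fmt = cfg.get("format")

    if backend not in SUPPORTED_FORMATS:
        errors.append(f"unknown backend: {backend!r}")
    elif fmt not in SUPPORTED_FORMATS[backend]:
        allowed = ", ".join(sorted(SUPPORTED_FORMATS[backend]))
        errors.append(f"{backend} does not accept format {fmt!r} (allowed: {allowed})")

    # Assumed rule: the source only describes a draft-model pipeline for llama.cpp.
    if cfg.get("draft_model") and backend != "llama.cpp":
        errors.append("draft_model is only accepted with llama.cpp in this sketch")

    return errors
```

Returning a list of messages instead of raising on the first problem lets a run report every invalid entry in a large configuration file at once.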

Key findings

  • TTFT diverges by 100x across backends at 32k context
  • Prefix caching achieves 626x TTFT reduction
  • MoE architecture is required for 128k-token viability on 64 GB
  • Framework choice matters more than quantization: framework-level optimization yields 2.4x, while quantization changes move results by less than 3%
  • Speculative decoding (llama.cpp): 206 t/s decode via Gemma4-26B + E2B draft (3.2x over standalone)
  • Unsloth UD-Q2_K_XL matches or beats Q4_K_M on decode speed (less bandwidth per token)
  • llama.cpp wins on Gemma 4; MLX wins on Qwen (architecture-specific kernel optimizations)

Speculative decoding

Pairing Gemma4-E2B as a draft model with Gemma4-26B via llama.cpp produced the fastest decode speed measured: 206 tokens per second. The draft model proposes tokens that the main model verifies in parallel.
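The propose-and-verify loop can be sketched with toy models. This is a simplified greedy variant for illustration, not llama.cpp's implementation: `draft` and `target` are hypothetical callables that return the next token for a given sequence, and the verification loop stands in for what a real engine does in a single batched forward pass of the main model.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position. Real engines score all
    #    k + 1 positions in one batched pass; this loop emulates that.
    accepted: List[int] = []
    for i in range(k + 1):
        expected = target(prefix + proposed[:i])
        accepted.append(expected)  # always keep the target's own token
        if i == k or proposed[i] != expected:
            break  # stop at the first mismatch (or after the bonus token)
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def toy_target(seq: Sequence[int]) -> int:
    """Toy main model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    """Toy draft model: agrees with the target except after token 2."""
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10
```

With these toy models, `generate([0], toy_draft, toy_target, max_new=8)` yields the same tokens as greedy decoding with `toy_target` alone while needing fewer verification rounds than tokens produced. The achievable speedup depends on how often the draft's proposals are accepted.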

Bar chart comparing Gemma4-26B standalone at 64 t/s to speculative decoding at 197-206 t/s, a 3.1x to 3.2x speedup.

Backend comparison

Which backend is fastest depends on the model architecture. llama.cpp leads on Gemma 4, while MLX backends (mlx-lm, Ollama 0.19+) lead on Qwen models.

Grouped bar chart comparing llama.cpp, mlx-lm, mlx-vlm, and Ollama decode speeds across five models at 4-bit quantization.

llama.cpp decode speed across all models

29 GGUF configurations tested via llama-server b9020, spanning MoE and dense architectures at multiple Unsloth Dynamic quantization levels.

Horizontal bar chart showing llama.cpp decode speed for all 29 GGUF configurations, from dense models at 16 t/s to speculative decoding at 197 t/s.

Interactive results

All 900+ measurements, sortable by any column, filterable by model, backend, quantization, and context depth.
