LLM Inference Benchmark Harness

A systematic benchmark harness for agentic workload profiles. It evaluates 248 configurations across 13 inference backends, 32 models, 6 model formats, and 17 KV cache strategies on Apple Silicon hardware, for 679 measurement rows in total (last updated 2026-07-07). All numbers on this page are derived from the live dataset at build time.

The project exists to answer production questions that short-prompt benchmarks hide: which runtime keeps time-to-first-token under control at long context, which optimizations actually move end-to-end latency, and which setups stay viable once a multi-turn agent starts accumulating context and tool state.

Key features

Full server lifecycle management per configuration (start, health check, warmup, measurement, teardown)
Wall-clock TTFT and decode throughput measurement at realistic context depths (up to 128k tokens)
Process-tree memory monitoring via recursive psutil sampling
YAML-driven configuration with constraint validation
OpenAI SSE and Ollama JSON-line streaming parsers
Backend version tracking across runs for longitudinal comparison
Speculative decoding support (llama.cpp draft model pipeline)
Download-bench-purge workflow for large GGUF files (conserves disk space)
Full results dataset released for reproducibility (248 configs, 679 measurement rows)
Focus on deployment trade-offs for agentic workloads, not only headline tokens-per-second figures
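
As a rough illustration of the streaming parsers and the wall-clock TTFT measurement listed above, the sketch below parses OpenAI-style SSE lines and Ollama-style JSON lines, then times a stream of text pieces. It is not the harness's actual code: the function names (parse_openai_sse, parse_ollama_jsonl, measure_stream) and the result keys are hypothetical, and it counts streamed chunks as a stand-in for tokens, which is only an approximation.

```python
import json
import time
from typing import Callable, Iterable, Iterator, Optional


def parse_openai_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat-completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def parse_ollama_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from Ollama-style newline-delimited JSON."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        obj = json.loads(line)
        # /api/generate carries text in "response"; /api/chat nests it
        # under "message.content".
        text = obj.get("response") or (obj.get("message") or {}).get("content")
        if text:
            yield text
        if obj.get("done"):
            break


def measure_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a stream: TTFT plus a chunk-based decode rate (assumed proxy)."""
    start = clock()  # taken before the first pull from the stream
    first: Optional[float] = None
    last: Optional[float] = None
    chunks = 0
    for _ in pieces:
        now = clock()
        if first is None:
            first = now
        last = now
        chunks += 1
    ttft = None if first is None else first - start
    rate = None
    if chunks > 1 and last > first:
        # Exclude the first chunk so prefill time does not dilute decode speed.
        rate = (chunks - 1) / (last - first)
    return {"ttft_s": ttft, "chunks": chunks, "decode_chunks_per_s": rate}
```

Passing the clock in as a parameter keeps the timing logic testable without a live server. In a real run the pieces iterable would wrap the HTTP response, and the start timestamp would need to be taken before the request is sent.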

Backends tested

Backend	Format	Notes
mlx-lm	MLX (4bit, 8bit, MXFP4, NVFP4)	Fastest MLX decode for Qwen models
Ollama	GGUF, MLX (since 0.19)	Prefix caching, easiest setup
llama.cpp (llama-server)	GGUF	Speculative decoding, portable
mlx-vlm	MLX	VLM architectures (Gemma4)
vllm-mlx	MLX	Highest raw decode, poor TTFT scaling
oMLX	MLX + TurboQuant	KV cache compression, SSD offload
LM Studio	GGUF / MLX	GUI wrapper
Docker Model Runner	GGUF	Container-based

The table above lists 8 of the 13 backends in the dataset; the interactive bench has the full list.

Key findings

TTFT diverges by 100x across backends at 32k context
Prefix caching achieves 626x TTFT reduction
MoE architecture is required for 128k-token viability on 64 GB
Framework-level optimization (2.4x) exceeds quantization improvements (<3%)
Speculative decoding (llama.cpp): 206 t/s decode via Gemma4-26B + E2B draft (3.2x over standalone)
Unsloth UD-Q2_K_XL matches or beats Q4_K_M on decode speed (less bandwidth per token)
llama.cpp wins on Gemma 4; MLX wins on Qwen (architecture-specific kernel optimizations)

Paper

The initial 57-configuration study is written up formally: Beyond Tokens per Second: Time-to-First-Token Scaling and Memory Constraints Across Eight Inference Backends on Apple Silicon (PDF).

Speculative decoding

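Pairing Gemma4-E2B as a draft model with Gemma4-26B via llama.cpp produced one of the fastest decode speeds measured: 206 tokens per second, 3.2x the standalone Gemma4-26B result. The draft model proposes several tokens ahead, and the main model verifies them in a single parallel pass, so each main-model step can emit more than one token.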
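
The propose-and-verify loop can be sketched with toy models. The code below is a minimal greedy variant written for illustration, not llama.cpp's implementation: both "models" are plain Python functions mapping a token sequence to the next token, the names speculative_step and speculative_generate are hypothetical, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Sequence

# A toy "model": maps the tokens so far to the next token (greedy decoding).
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], k: int
) -> List[int]:
    """Run one propose/verify round and return the tokens it emits."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position. A real engine
    #    scores all k positions in a single parallel pass.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if expected != tok:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the same pass yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> List[int]:
    """Generate max_new tokens; output matches target-only greedy decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new]


def greedy_generate(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per emitted token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

Each round emits between 1 and k + 1 tokens per target pass, so the speedup depends on how often the draft agrees with the target. In this greedy form the output is identical to decoding with the target model alone; only the number of target passes changes.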

Backend comparison

Which backend is fastest depends on the model architecture. llama.cpp leads on Gemma 4, while MLX backends (mlx-lm, Ollama 0.19+) lead on Qwen models.

Figure: grouped bar chart comparing llama.cpp, mlx-lm, mlx-vlm, and Ollama decode speeds across five models at 4-bit quantization.

llama.cpp decode speed across all models

Twenty-nine GGUF configurations were tested via llama-server b9020, spanning MoE and dense architectures at multiple Unsloth Dynamic quantization levels.

Figure: horizontal bar chart showing llama.cpp decode speed for all 29 GGUF configurations, from dense models at 16 t/s to speculative decoding at 197 t/s.

Interactive results

The full dataset (248 configurations, 679 measurement rows) lives in the interactive bench, sortable by any column and filterable by model, backend, quantization, and context depth. The table below shows the fastest decode result per model for the five fastest models:

Model	Backend	Quant	Decode t/s	Peak RSS
DiffusionGemma-26B-A4B	mlx-vlm	4bit	299.2	15.3 GB
LFM2.5-8B	llama.cpp	Q4_K_M	233.0	6.4 GB
Gemma4-26B-A4B	llama.cpp + spec-e2b	UD-Q4_K_M	206.4	31.0 GB
Gemma4-E2B-it	litert-lm	default	135.9	—
LFM2-24B	llama.cpp	Q4_K_M	122.0	16.0 GB

Fastest decode per model, top 5 of 32. Apple M3 Max · 64 GB, median of 3 runs, data updated 2026-07-07.
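
A table like the one above can be derived from raw measurement rows by taking the median over repeated runs and then keeping the best configuration per model. The sketch below shows one way to do that. The row schema (model, backend, quant, decode_tps) and the function name fastest_per_model are assumptions for illustration, not the dataset's actual column names.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def fastest_per_model(rows: Iterable[dict]) -> Dict[str, dict]:
    """Median decode speed per configuration, then the best config per model."""
    # Group the repeated runs of each (model, backend, quant) configuration.
    runs: Dict[Tuple[str, str, str], List[float]] = defaultdict(list)
    for row in rows:
        key = (row["model"], row["backend"], row["quant"])
        runs[key].append(row["decode_tps"])

    best: Dict[str, dict] = {}
    for (model, backend, quant), values in runs.items():
        med = median(values)  # the median damps a single outlier run
        if model not in best or med > best[model]["decode_tps"]:
            best[model] = {"backend": backend, "quant": quant, "decode_tps": med}
    return best


def top_n(best: Dict[str, dict], n: int) -> List[Tuple[str, dict]]:
    """Rank models by their best median decode speed, fastest first."""
    ranked = sorted(best.items(), key=lambda kv: kv[1]["decode_tps"], reverse=True)
    return ranked[:n]
```

Using the median rather than the maximum keeps one unusually fast or slow run from deciding the ranking, which matters when each configuration has only three runs.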

Explore all 679 results →