Apple Silicon VLM Benchmark Roundup

[YOUR VOICE] The Claim

The Apple Silicon VLM ecosystem is maturing faster than the documentation. Models that didn’t work six months ago now run reliably through MLX. But there’s no single source that tells you which models to try, on which hardware, with which settings.


The Mechanism

We run the vLLM-MLX fork as a local serving backend on Apple Silicon. The benchmark roundup covers:

  • Hardware tested – M1, M2, M3 (and variants: Pro, Max, Ultra where available)
  • Models tested – MISSING – full list with versions
  • Serving configuration – MISSING – quantization formats, context lengths, batch sizes
  • Metrics – tokens/sec, time-to-first-token, memory usage, stability over long sessions
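The two latency metrics listed above can both be derived from the wall-clock arrival times of tokens in a streamed response. A minimal sketch of that derivation; the `StreamStats` helper and its field names are illustrative, not taken from the roundup's code:

```python
from dataclasses import dataclass

@dataclass
class StreamStats:
    """Metrics derived from per-token arrival timestamps of one streamed reply."""
    ttft_s: float        # time-to-first-token, in seconds
    tokens_per_s: float  # decode throughput after the first token

def stats_from_timestamps(request_t: float, token_ts: list[float]) -> StreamStats:
    """request_t: when the request was sent; token_ts: arrival time of each token."""
    if not token_ts:
        raise ValueError("no tokens received")
    ttft = token_ts[0] - request_t
    decode_window = token_ts[-1] - token_ts[0]
    # Count tokens after the first, over the decode window only, so that
    # TTFT (prefill time) does not dilute the throughput number.
    tps = (len(token_ts) - 1) / decode_window if decode_window > 0 else float("inf")
    return StreamStats(ttft_s=ttft, tokens_per_s=tps)
```

Separating prefill (TTFT) from decode throughput matters on Apple Silicon, where the two phases stress memory bandwidth very differently.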

MISSING – Methodology: how tests were run, how many iterations, warm-up protocol
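Whatever the final protocol turns out to be, a typical harness discards warm-up iterations and reports the median rather than the mean. A hedged sketch of one reasonable shape for it; the defaults and names here are placeholders, not the roundup's actual settings:

```python
import statistics
import time
from typing import Callable

def benchmark(run: Callable[[], None], warmup: int = 2, iters: int = 5) -> dict:
    """Time `run` after discarding warm-up iterations; report median and spread."""
    for _ in range(warmup):
        run()  # let caches, lazily loaded weights, compiled kernels, etc. settle
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run()
        samples.append(time.perf_counter() - t0)
    return {
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "samples": samples,
    }
```

The median is the usual choice here because a single GC pause or thermal event can skew a mean badly over only a handful of iterations.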


The Evidence

MISSING – Full benchmark results table

MISSING – Stability observations: which models degrade over time, which maintain consistent performance
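Once the stability data exists, one simple way to quantify "degrades over time" is to fit a least-squares slope to tokens/sec across successive runs: a clearly negative slope flags degradation, while a flat one indicates consistent performance. A sketch using a hypothetical helper, not code from the benchmark:

```python
def throughput_trend(tps_per_run: list[float]) -> float:
    """Least-squares slope of tokens/sec across successive runs.

    Returns tokens/sec gained (or, if negative, lost) per run.
    """
    n = len(tps_per_run)
    if n < 2:
        raise ValueError("need at least two runs to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(tps_per_run) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, tps_per_run))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

For example, a session whose throughput falls from 10 to 7 tokens/sec over four runs has a slope of -1.0 tokens/sec per run.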


[YOUR VOICE] Implications

MISSING – Practical recommendations: what to run for different use cases (fast chat, UI agents, code generation, multimodal reasoning).


Reference Documents

Document                    | What it covers
vLLM-MLX fork _docs/        | MISSING – Benchmark methodology and raw data
Client compatibility matrix | MISSING – Which clients work with which models