When working on distributed serving systems, one of the key questions is: how do different hardware platforms compare for inference workloads? To answer this, I recently ran a series of benchmarks comparing GPU acceleration (via Modal’s cloud GPUs) against CPU inference using MLX on Apple Silicon.

The results were interesting, and I think they highlight some important trade-offs for developers choosing between cloud GPU services and local CPU inference.

Results Summary

| Platform | Hardware | Model | Throughput | Total Tokens | Time |
| --- | --- | --- | --- | --- | --- |
| Modal | A10G GPU | Qwen/Qwen3-0.6B | 3,037.56 tok/s | 133,966 | 44.10s |
| MLX | M1 CPU | mlx-community/Qwen3-0.6B-4bit | 137.98 tok/s | 140,435 | 1,017.77s |

The GPU setup delivered roughly 22x the throughput of the CPU setup, as expected. However, the MLX CPU implementation remains a cost-effective option for local development and testing.

Modal GPU Benchmark

Setup:

uv pip install modal
modal token new

Optionally add your Hugging Face token to tests/sysperf/modal/.env:

HF_TOKEN=your_token_here
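
How the token actually reaches the GPU container depends on how modal_app.py is wired; one common pattern (an assumption here, not something shown in this post) is Modal's dotenv-backed secrets, where the entries in that .env file become environment variables inside the remote function:

import modal

app = modal.App("sysperf-modal")  # placeholder app name

# Hypothetical wiring: Secret.from_dotenv() reads the .env file next to the
# calling module and exposes its entries (here HF_TOKEN) as environment
# variables inside the container.
@app.function(gpu="A10G", secrets=[modal.Secret.from_dotenv()])
def show_token_presence():
    import os
    print("HF_TOKEN set:", "HF_TOKEN" in os.environ)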

Run:

modal run tests/sysperf/modal/modal_app.py

The benchmark processes 256 sequences with random input/output lengths, achieving 3,037.56 tok/s on an A10G GPU.
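
The actual modal_app.py isn't reproduced here, and at 3,000+ tok/s it presumably batches sequences through a serving engine rather than generating them one at a time. Still, a minimal sketch of the scaffolding, assuming plain Hugging Face transformers for generation and hypothetical app/image names, looks roughly like this:

import time

import modal

# Hypothetical image and app names; the real modal_app.py may differ.
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")
app = modal.App("qwen-gpu-benchmark", image=image)

@app.function(gpu="A10G", timeout=1800)
def benchmark(num_seqs: int = 256) -> float:
    import random
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen3-0.6B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

    total_tokens = 0
    start = time.time()
    for _ in range(num_seqs):
        # Random input and output lengths, mirroring the description above;
        # prompts are synthetic token IDs rather than real text.
        prompt_len = random.randint(16, 256)
        max_new = random.randint(16, 256)
        input_ids = torch.randint(0, tokenizer.vocab_size, (1, prompt_len), device="cuda")
        output = model.generate(input_ids, max_new_tokens=max_new, do_sample=False)
        total_tokens += output.shape[1]  # prompt + generated tokens
    elapsed = time.time() - start
    throughput = total_tokens / elapsed
    print(f"{total_tokens} tokens in {elapsed:.2f}s -> {throughput:.2f} tok/s")
    return throughput

@app.local_entrypoint()
def main():
    benchmark.remote()

Running modal run on a file shaped like this builds the image, provisions the A10G, and streams the printed throughput back to the local terminal.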

MLX CPU Benchmark

Setup:

cd tests/sysperf/mlx
uv sync

Run:

uv run benchmark.py

The MLX benchmark uses a 4-bit quantized model (mlx-community/Qwen3-0.6B-4bit) and runs on Apple Silicon CPUs. It achieved 137.98 tok/s on an M1 chip.
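
benchmark.py itself isn't shown here, but the core of an mlx-lm throughput measurement, assuming the mlx_lm load/generate helpers and an illustrative single prompt, looks roughly like the following; the real script presumably aggregates many sequences rather than one:

import time

import mlx.core as mx
from mlx_lm import load, generate

# The numbers above are CPU numbers, so pin MLX to the CPU device;
# removing this line lets MLX use its default Apple Silicon GPU device.
mx.set_default_device(mx.cpu)

# 4-bit quantized community conversion of Qwen3-0.6B.
model, tokenizer = load("mlx-community/Qwen3-0.6B-4bit")

prompt = "Explain the difference between throughput and latency."  # illustrative
start = time.time()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.time() - start

generated = len(tokenizer.encode(text))  # rough count: re-tokenize the output
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.2f} tok/s")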

Conclusion

The ~22x performance difference between GPU and CPU is significant, but the two platforms serve different purposes. For production workloads requiring high throughput, Modal’s GPU acceleration is clearly the way to go. However, for local development, prototyping, and cost-sensitive applications, MLX on Apple Silicon provides a surprisingly capable alternative.

The fact that we can run a 4-bit quantized model locally at ~138 tok/s opens up interesting possibilities for on-device inference, especially for applications where higher latency is acceptable but cost and privacy are priorities.

For more details on the implementation, check out the benchmark PR.