Technical documentation for the model database and calculation engine.
70B is the threshold where deployment becomes an infrastructure challenge, requiring multiple consumer GPUs or datacenter hardware. 700B excludes experimental ultra-large MoE models that aren't production-ready.
We focus on tier-1 organizations (Google, Anthropic, OpenAI, Qwen, DeepSeek, NVIDIA, Apple, XiaomiMiMo) with proven track records. This excludes community fine-tunes and smaller research labs to maintain quality standards.
Automatically, three times per week (Monday, Wednesday, and Friday at 02:00 UTC) via GitHub Actions. Data is fetched from the Hugging Face Hub API and benchmarks are validated against Artificial Analysis.
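For illustration, a minimal TypeScript sketch of what the scheduled fetch step might look like, assuming the public Hugging Face Hub endpoint `GET https://huggingface.co/api/models/{id}`; the `HubModel` shape, the optional `safetensors.total` field, and the tracked model IDs are assumptions for the example, not the project's actual code.

```typescript
// fetch-models.ts — illustrative sketch of the scheduled refresh job.
// The response fields accessed below are treated as optional because
// their exact shape is not guaranteed here.

interface HubModel {
  id: string;
  downloads?: number;
  lastModified?: string;
  safetensors?: { total?: number }; // total parameter count, when published
}

async function fetchModel(modelId: string): Promise<HubModel> {
  const res = await fetch(`https://huggingface.co/api/models/${modelId}`);
  if (!res.ok) throw new Error(`Hub request failed: ${res.status}`);
  return (await res.json()) as HubModel;
}

// Example: refresh a tracked model list (IDs are placeholders).
const tracked = ["deepseek-ai/DeepSeek-V3", "Qwen/Qwen2.5-72B-Instruct"];
for (const id of tracked) {
  const model = await fetchModel(id);
  console.log(id, model.safetensors?.total ?? "parameter count not published");
}
```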
VRAM: Model weights + KV cache + workspace overhead
Bandwidth: Memory read/write per token (decode phase is memory-bound)
Performance: Based on GPU memory bandwidth and model size
Default scenario: 8K context, batch size 1, INT8/BF16 precision
Calculations are physics-based estimates using transformer architecture formulas, not empirical benchmarks. Actual performance varies based on inference framework, kernel optimizations, and hardware configuration. Typical accuracy: ±10-15% for VRAM, ±20-30% for performance.
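To make the formulas above concrete, here is a hedged TypeScript sketch of a physics-based estimator covering weight memory, KV cache, and decode throughput. The architecture numbers for the Llama-70B-like example and the constants (2-byte KV entries, ~20% workspace overhead, ~70% achievable bandwidth, 3350 GB/s H100 SXM peak) are illustrative assumptions, not the engine's exact values.

```typescript
// vram-estimate.ts — physics-based estimate, not an empirical benchmark.
// All constants below are illustrative assumptions; the real engine may differ.

interface ModelArch {
  params: number;   // total parameters
  layers: number;   // transformer layers
  kvHeads: number;  // KV heads (GQA) — equals attention heads for MHA
  headDim: number;  // dimension per head
}

const BYTES_PER_PARAM = { bf16: 2, int8: 1 } as const;
const KV_BYTES = 2;   // assume BF16 KV cache entries
const OVERHEAD = 1.2; // ~20% workspace/activation overhead (assumption)

function weightBytes(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM): number {
  return arch.params * BYTES_PER_PARAM[precision];
}

// KV cache: 2 (K and V) * layers * kvHeads * headDim * bytes, per token per sequence
function kvCacheBytes(arch: ModelArch, contextLen: number, batch: number): number {
  return 2 * arch.layers * arch.kvHeads * arch.headDim * KV_BYTES * contextLen * batch;
}

function totalVramBytes(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM,
                        contextLen = 8192, batch = 1): number {
  return (weightBytes(arch, precision) + kvCacheBytes(arch, contextLen, batch)) * OVERHEAD;
}

// Decode is memory-bound: each generated token re-reads the weights and KV cache.
function tokensPerSecond(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM,
                         gpuBandwidthGBs: number, contextLen = 8192): number {
  const bytesPerToken = weightBytes(arch, precision) + kvCacheBytes(arch, contextLen, 1);
  const effective = gpuBandwidthGBs * 1e9 * 0.7; // assume ~70% of peak bandwidth
  return effective / bytesPerToken;
}

// Example: a Llama-70B-like dense architecture on an H100 SXM (3350 GB/s peak).
const llama70b: ModelArch = { params: 70e9, layers: 80, kvHeads: 8, headDim: 128 };
console.log("VRAM (INT8, 8K ctx):", (totalVramBytes(llama70b, "int8") / 1e9).toFixed(0), "GB");
console.log("Decode tok/s (INT8):", tokensPerSecond(llama70b, "int8", 3350).toFixed(1));
```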
Priority order: (1) Safetensors metadata, (2) Model card statements, (3) Physics-based estimation from architecture, (4) Manual verification. Each model displays its data source.
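A sketch of how that priority order could be applied when resolving a model's parameter count; the function and field names here are hypothetical, not the project's actual API.

```typescript
// Hypothetical sketch of the source-priority rule described above.
type DataSource = "safetensors" | "model_card" | "estimated" | "manual";

interface ResolvedParamCount {
  value: number;      // parameter count
  source: DataSource; // displayed alongside the model
}

function resolveParamCount(
  safetensorsTotal?: number,     // (1) safetensors metadata
  modelCardClaim?: number,       // (2) stated on the model card
  architectureEstimate?: number, // (3) physics-based estimate from the config
  manualOverride?: number        // (4) manually verified value
): ResolvedParamCount {
  if (safetensorsTotal) return { value: safetensorsTotal, source: "safetensors" };
  if (modelCardClaim) return { value: modelCardClaim, source: "model_card" };
  if (architectureEstimate) return { value: architectureEstimate, source: "estimated" };
  if (manualOverride) return { value: manualOverride, source: "manual" };
  throw new Error("No data source available for this model");
}
```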
VRAM requirements depend on model size and precision. For open source LLM deployment, a 70B model typically needs 140GB in BF16 or 70GB in INT8. Use our LLM VRAM calculator to estimate requirements based on your specific context length and batch size. Consumer GPUs like RTX 4090 (24GB) can run smaller models, while datacenter GPUs like H100 (80GB) handle larger deployments.
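The 140GB/70GB figures follow directly from bytes-per-parameter arithmetic; a minimal worked example (weights only, ignoring KV cache and overhead):

```typescript
// Weight memory only: params × bytes per parameter (KV cache and overhead excluded).
const params = 70e9;
const bf16GB = (params * 2) / 1e9; // 140 GB — needs multiple H100-class GPUs
const int8GB = (params * 1) / 1e9; //  70 GB — weights fit on a single 80 GB H100,
                                   //  but leave little room for KV cache at long context
console.log({ bf16GB, int8GB });
```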
DeepSeek models range from 70B to 685B parameters, and GPU requirements at this scale vary: DeepSeek 70B needs 2-4x A100 or H100 GPUs, while the 685B MoE architecture requires 8-16x H100 GPUs for efficient open source LLM deployment. The LLM VRAM calculator provides estimates based on your workload configuration.
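Because an MoE model keeps every expert's weights resident in VRAM even though only a subset is active per token, the weight footprint scales with total (not active) parameters. A rough GPU-count sketch, assuming INT8 weights, 80GB per H100, and ~25% headroom for KV cache and workspace (assumed figures, not the calculator's exact output):

```typescript
// GPU count for weight residency: total params × bytes/param, plus headroom,
// divided by per-GPU memory. Constants are assumptions for illustration.
const totalParams = 685e9;  // MoE: all experts must fit in VRAM
const bytesPerParam = 1;    // INT8
const headroom = 1.25;      // ~25% for KV cache + workspace (assumption)
const gpuMemoryGB = 80;     // H100 80 GB

const weightGB = (totalParams * bytesPerParam) / 1e9;             // 685 GB
const gpusNeeded = Math.ceil((weightGB * headroom) / gpuMemoryGB); // ≈ 11
console.log({ weightGB, gpusNeeded });
```

Eleven GPUs sits inside the 8-16x H100 range quoted above; higher precision, longer contexts, or larger batches push the count toward the upper end.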
Yes, you can run an LLM locally on consumer hardware with quantization. Llama 70B in INT8 requires approximately 70GB of VRAM, which in practice means 3-4x RTX 4090 GPUs (24GB each), since KV cache and overhead leave a 3-GPU setup with little headroom. For running an LLM locally on a single GPU, consider Llama 30B or smaller models. Our GPU requirements calculator helps you determine the optimal quantization level for your hardware.
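A small sketch of the kind of fit check involved, assuming ~10% of each GPU is reserved for framework overhead and a few GB of KV cache (both assumptions for the example):

```typescript
// Does a model fit across N consumer GPUs at a given quantization?
function fitsOnGpus(paramCount: number, bytesPerParam: number,
                    gpuCount: number, gpuVramGB: number,
                    kvCacheGB = 3): boolean {
  const usableGB = gpuCount * gpuVramGB * 0.9; // reserve ~10% per GPU (assumption)
  const requiredGB = (paramCount * bytesPerParam) / 1e9 + kvCacheGB;
  return requiredGB <= usableGB;
}

// Llama-70B-class model in INT8 on RTX 4090s (24 GB each):
console.log(fitsOnGpus(70e9, 1, 3, 24)); // false — ~73 GB needed vs ~65 GB usable (marginal)
console.log(fitsOnGpus(70e9, 1, 4, 24)); // true  — ~86 GB usable
```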
FP16/BF16 (16-bit) precision provides maximum quality but doubles VRAM requirements for open source LLM deployment compared to INT8. INT8 quantization reduces memory by 50% with minimal quality loss (~1-3% degradation). For running an LLM locally on limited hardware, INT8 is recommended. The LLM VRAM calculator shows both configurations so you can balance quality against GPU requirements for LLM inference.
GPU requirements for LLM inference include VRAM capacity, memory bandwidth, and compute throughput. Our LLM VRAM calculator estimates: (1) Total VRAM based on model size and context, (2) Bandwidth needs for token generation, (3) Expected tokens/second based on your GPU. For running an LLM locally in production, you'll also need to factor in concurrent users and response-time targets.
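For production sizing, decode throughput is roughly effective bandwidth divided by bytes read per step, and batching concurrent users amortizes the weight read. A back-of-envelope sketch with assumed constants (70B INT8 weights, ~2.7GB KV cache per 8K-context sequence, ~70% of H100 SXM peak bandwidth):

```typescript
// Back-of-envelope production sizing: decode is memory-bound, and batching
// amortizes the weight read across concurrent requests. Constants are assumptions.
const weightBytes = 70e9;       // 70B model, INT8
const kvBytesPerSeq = 2.7e9;    // KV cache per 8K-context sequence (see earlier sketch)
const bandwidth = 3350e9 * 0.7; // H100 SXM peak × ~70% achievable (assumption)

function aggregateTokensPerSecond(batchSize: number): number {
  // One decode step reads the weights once plus every sequence's KV cache,
  // and produces `batchSize` tokens (memory capacity permitting; large
  // batches may require tensor parallelism across GPUs).
  const bytesPerStep = weightBytes + batchSize * kvBytesPerSeq;
  return (bandwidth / bytesPerStep) * batchSize;
}

for (const users of [1, 8, 32]) {
  const total = aggregateTokensPerSecond(users);
  console.log(`${users} concurrent users: ~${total.toFixed(0)} tok/s total, ` +
              `~${(total / users).toFixed(1)} tok/s per user`);
}
```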
No. All calculations run entirely in your browser using JavaScript. No data is sent to servers, no analytics, no tracking. The site is 100% client-side.