Technical documentation for the model database and calculation engine.
70B is the threshold where deployment becomes an infrastructure challenge, requiring multiple consumer GPUs or datacenter hardware. 700B excludes experimental ultra-large MoE models that aren't production-ready.
We focus on tier-1 organizations (Google, Anthropic, OpenAI, Qwen, DeepSeek, NVIDIA, Apple, XiaomiMiMo) with proven track records. This excludes community fine-tunes and smaller research labs to maintain quality standards.
Automatically, three times per week (Monday, Wednesday, and Friday at 02:00 UTC) via GitHub Actions. Data is fetched from the Hugging Face Hub API and benchmarks are validated against Artificial Analysis.
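For illustration, a minimal TypeScript sketch of what the scheduled fetch step might look like, assuming the public Hugging Face Hub endpoint `GET https://huggingface.co/api/models/{id}`; the `HubModel` shape, the optional `safetensors.total` field, and the tracked model IDs are assumptions for the example, not the project's actual code.

```typescript
// fetch-models.ts — illustrative sketch of the scheduled refresh job.
// The response fields accessed below are treated as optional because
// their exact shape is not guaranteed here.

interface HubModel {
  id: string;
  downloads?: number;
  lastModified?: string;
  safetensors?: { total?: number }; // total parameter count, when published
}

async function fetchModel(modelId: string): Promise<HubModel> {
  const res = await fetch(`https://huggingface.co/api/models/${modelId}`);
  if (!res.ok) throw new Error(`Hub request failed: ${res.status}`);
  return (await res.json()) as HubModel;
}

// Example: refresh a tracked model list (IDs are placeholders).
const tracked = ["deepseek-ai/DeepSeek-V3", "Qwen/Qwen2.5-72B-Instruct"];
for (const id of tracked) {
  const model = await fetchModel(id);
  console.log(id, model.safetensors?.total ?? "parameter count not published");
}
```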
VRAM: Model weights + KV cache + workspace overhead
Bandwidth: Memory read/write per token (decode phase is memory-bound)
Performance: Based on GPU memory bandwidth and model size
Default scenario: 8K context, batch size 1, INT8/BF16 precision
Calculations are physics-based estimates using transformer architecture formulas, not empirical benchmarks. Actual performance varies based on inference framework, kernel optimizations, and hardware configuration. Typical accuracy: ±10-15% for VRAM, ±20-30% for performance.
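To make the formulas above concrete, here is a hedged TypeScript sketch of a physics-based estimator covering weight memory, KV cache, and decode throughput. The architecture numbers for the Llama-70B-like example and the constants (2-byte KV entries, ~20% workspace overhead, ~70% achievable bandwidth, 3350 GB/s H100 SXM peak) are illustrative assumptions, not the engine's exact values.

```typescript
// vram-estimate.ts — physics-based estimate, not an empirical benchmark.
// All constants below are illustrative assumptions; the real engine may differ.

interface ModelArch {
  params: number;   // total parameters
  layers: number;   // transformer layers
  kvHeads: number;  // KV heads (GQA) — equals attention heads for MHA
  headDim: number;  // dimension per head
}

const BYTES_PER_PARAM = { bf16: 2, int8: 1 } as const;
const KV_BYTES = 2;   // assume BF16 KV cache entries
const OVERHEAD = 1.2; // ~20% workspace/activation overhead (assumption)

function weightBytes(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM): number {
  return arch.params * BYTES_PER_PARAM[precision];
}

// KV cache: 2 (K and V) * layers * kvHeads * headDim * bytes, per token per sequence
function kvCacheBytes(arch: ModelArch, contextLen: number, batch: number): number {
  return 2 * arch.layers * arch.kvHeads * arch.headDim * KV_BYTES * contextLen * batch;
}

function totalVramBytes(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM,
                        contextLen = 8192, batch = 1): number {
  return (weightBytes(arch, precision) + kvCacheBytes(arch, contextLen, batch)) * OVERHEAD;
}

// Decode is memory-bound: each generated token re-reads the weights and KV cache.
function tokensPerSecond(arch: ModelArch, precision: keyof typeof BYTES_PER_PARAM,
                         gpuBandwidthGBs: number, contextLen = 8192): number {
  const bytesPerToken = weightBytes(arch, precision) + kvCacheBytes(arch, contextLen, 1);
  const effective = gpuBandwidthGBs * 1e9 * 0.7; // assume ~70% of peak bandwidth
  return effective / bytesPerToken;
}

// Example: a Llama-70B-like dense architecture on an H100 SXM (3350 GB/s peak).
const llama70b: ModelArch = { params: 70e9, layers: 80, kvHeads: 8, headDim: 128 };
console.log("VRAM (INT8, 8K ctx):", (totalVramBytes(llama70b, "int8") / 1e9).toFixed(0), "GB");
console.log("Decode tok/s (INT8):", tokensPerSecond(llama70b, "int8", 3350).toFixed(1));
```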
Priority order: (1) Safetensors metadata, (2) Model card statements, (3) Physics-based estimation from architecture, (4) Manual verification. Each model displays its data source.
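A sketch of how that priority order could be applied when resolving a model's parameter count; the function and field names here are hypothetical, not the project's actual API.

```typescript
// Hypothetical sketch of the source-priority rule described above.
type DataSource = "safetensors" | "model_card" | "estimated" | "manual";

interface ResolvedParamCount {
  value: number;      // parameter count
  source: DataSource; // displayed alongside the model
}

function resolveParamCount(
  safetensorsTotal?: number,     // (1) safetensors metadata
  modelCardClaim?: number,       // (2) stated on the model card
  architectureEstimate?: number, // (3) physics-based estimate from the config
  manualOverride?: number        // (4) manually verified value
): ResolvedParamCount {
  if (safetensorsTotal) return { value: safetensorsTotal, source: "safetensors" };
  if (modelCardClaim) return { value: modelCardClaim, source: "model_card" };
  if (architectureEstimate) return { value: architectureEstimate, source: "estimated" };
  if (manualOverride) return { value: manualOverride, source: "manual" };
  throw new Error("No data source available for this model");
}
```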
VRAM requirements depend on model size and precision. For open source LLM deployment, a 70B model typically needs 140GB in BF16 or 70GB in INT8. Use our LLM VRAM calculator to estimate requirements based on your specific context length and batch size. Consumer GPUs like RTX 4090 (24GB) can run smaller models, while datacenter GPUs like H100 (80GB) handle larger deployments.
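The 140GB/70GB figures follow directly from bytes-per-parameter arithmetic; a minimal worked example (weights only, ignoring KV cache and overhead):

```typescript
// Weight memory only: params × bytes per parameter (KV cache and overhead excluded).
const params = 70e9;
const bf16GB = (params * 2) / 1e9; // 140 GB — needs multiple H100-class GPUs
const int8GB = (params * 1) / 1e9; //  70 GB — weights fit on a single 80 GB H100,
                                   //  but leave little room for KV cache at long context
console.log({ bf16GB, int8GB });
```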
DeepSeek models range from 70B to 685B parameters, and GPU requirements at this scale vary: DeepSeek 70B needs 2-4x A100 or H100 GPUs, while the 685B MoE architecture requires 8-16x H100 GPUs for efficient open source LLM deployment. The LLM VRAM calculator provides estimates based on your workload configuration.
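Because an MoE model keeps every expert's weights resident in VRAM even though only a subset is active per token, the weight footprint scales with total (not active) parameters. A rough GPU-count sketch, assuming INT8 weights, 80GB per H100, and ~25% headroom for KV cache and workspace (assumed figures, not the calculator's exact output):

```typescript
// GPU count for weight residency: total params × bytes/param, plus headroom,
// divided by per-GPU memory. Constants are assumptions for illustration.
const totalParams = 685e9;  // MoE: all experts must fit in VRAM
const bytesPerParam = 1;    // INT8
const headroom = 1.25;      // ~25% for KV cache + workspace (assumption)
const gpuMemoryGB = 80;     // H100 80 GB

const weightGB = (totalParams * bytesPerParam) / 1e9;             // 685 GB
const gpusNeeded = Math.ceil((weightGB * headroom) / gpuMemoryGB); // ≈ 11
console.log({ weightGB, gpusNeeded });
```

Eleven GPUs sits inside the 8-16x H100 range quoted above; higher precision, longer contexts, or larger batches push the count toward the upper end.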
Yes, you can run an LLM locally on consumer hardware with quantization. Llama 70B in INT8 requires approximately 70GB of VRAM, which in practice means 3-4x RTX 4090 GPUs (24GB each), since KV cache and overhead leave a 3-GPU setup with little headroom. For running an LLM locally on a single GPU, consider Llama 30B or smaller models. Our GPU requirements calculator helps you determine the optimal quantization level for your hardware.
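A small sketch of the kind of fit check involved, assuming ~10% of each GPU is reserved for framework overhead and a few GB of KV cache (both assumptions for the example):

```typescript
// Does a model fit across N consumer GPUs at a given quantization?
function fitsOnGpus(paramCount: number, bytesPerParam: number,
                    gpuCount: number, gpuVramGB: number,
                    kvCacheGB = 3): boolean {
  const usableGB = gpuCount * gpuVramGB * 0.9; // reserve ~10% per GPU (assumption)
  const requiredGB = (paramCount * bytesPerParam) / 1e9 + kvCacheGB;
  return requiredGB <= usableGB;
}

// Llama-70B-class model in INT8 on RTX 4090s (24 GB each):
console.log(fitsOnGpus(70e9, 1, 3, 24)); // false — ~73 GB needed vs ~65 GB usable (marginal)
console.log(fitsOnGpus(70e9, 1, 4, 24)); // true  — ~86 GB usable
```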
FP16/BF16 (16-bit) precision provides maximum quality but doubles VRAM requirements for open source LLM deployment compared to INT8. INT8 quantization reduces memory by 50% with minimal quality loss (~1-3% degradation). For running an LLM locally on limited hardware, INT8 is recommended. The LLM VRAM calculator shows both configurations so you can balance quality against GPU requirements for LLM inference.
GPU requirements for LLM inference include VRAM capacity, memory bandwidth, and compute throughput. Our LLM VRAM calculator estimates: (1) Total VRAM based on model size and context, (2) Bandwidth needs for token generation, (3) Expected tokens/second based on your GPU. For running an LLM locally in production, you'll also need to factor in concurrent users and response-time targets.
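For production sizing, decode throughput is roughly effective bandwidth divided by bytes read per step, and batching concurrent users amortizes the weight read. A back-of-envelope sketch with assumed constants (70B INT8 weights, ~2.7GB KV cache per 8K-context sequence, ~70% of H100 SXM peak bandwidth):

```typescript
// Back-of-envelope production sizing: decode is memory-bound, and batching
// amortizes the weight read across concurrent requests. Constants are assumptions.
const weightBytes = 70e9;       // 70B model, INT8
const kvBytesPerSeq = 2.7e9;    // KV cache per 8K-context sequence (see earlier sketch)
const bandwidth = 3350e9 * 0.7; // H100 SXM peak × ~70% achievable (assumption)

function aggregateTokensPerSecond(batchSize: number): number {
  // One decode step reads the weights once plus every sequence's KV cache,
  // and produces `batchSize` tokens (memory capacity permitting; large
  // batches may require tensor parallelism across GPUs).
  const bytesPerStep = weightBytes + batchSize * kvBytesPerSeq;
  return (bandwidth / bytesPerStep) * batchSize;
}

for (const users of [1, 8, 32]) {
  const total = aggregateTokensPerSecond(users);
  console.log(`${users} concurrent users: ~${total.toFixed(0)} tok/s total, ` +
              `~${(total / users).toFixed(1)} tok/s per user`);
}
```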
No. All calculations run entirely in your browser using JavaScript. No data is sent to servers, no analytics, no tracking. The site is 100% client-side.