LLM Model Size → Task Guide

A practitioner's reference for matching local LLM parameter counts to real-world jobs. What each tier can and can't do — with offensive security context.

All VRAM estimates assume Q4 quantization unless otherwise noted.
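As a rough rule of thumb, Q4 weights take about half a byte per parameter, plus some runtime overhead. A quick sketch of that estimate (the 1.2 overhead factor is an assumption for quantization metadata and buffers, not a measured constant):

```python
def estimate_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM needed for model weights alone.

    params_b: parameter count in billions (e.g. 8 for an 8B model).
    bits: bits per weight (4 for Q4).
    overhead: fudge factor for quantization block metadata and runtime
    buffers -- an assumption, not a measured constant.
    """
    weight_gb = params_b * 1e9 * bits / 8 / 1e9  # bytes -> GB
    return weight_gb * overhead

# An 8B model at Q4 lands around 4.8 GB, in line with the ~4-6 GB tier figure.
print(round(estimate_vram_gb(8), 1))
```

Note this covers weights only; the KV cache adds more as context grows.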

Nano
1B – 3B Parameters
Fastest inference · Runs on anything · Very limited reasoning
~1–3 GB VRAM
Min GPU: Integrated / Any 4GB+
30 Series: Any (all fit)
40 Series: Any (all fit)
50 Series: Any (all fit)
CPU-Only: Yes, usable
Tokens/sec: 50–120+
Context: 4K–8K typical

Well Suited For

  • Text autocomplete and inline code completion
  • Simple classification (spam, sentiment, log-level tagging)
  • Grammar and spelling correction
  • Basic entity extraction from short text
  • Template-based text formatting and reformatting
  • Single-turn Q&A on narrow, well-defined topics
  • Fast local embedding generation (with embedding models)
  • Edge device / IoT deployment where latency matters

Not Suited For

  • Multi-step reasoning or chain-of-thought
  • Code generation beyond single functions
  • Summarization of long documents
  • Anything requiring factual accuracy on niche topics
  • Structured JSON output (unreliable adherence)
  • Tool-use / function-calling workflows
  • Multi-turn conversation with context retention

Offensive Security Context

Essentially unusable for security tooling: it can't reliably parse scan output, hallucinates flags and tool syntax, and can't chain reasoning steps. The only viable use is as a fast classifier for log triage or alert categorization, and only after fine-tuning on your own labeled data.
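That one viable use reduces to a tightly constrained classification call. A minimal sketch, assuming a local Ollama server on its default port; the model tag is an example, and the label set is illustrative:

```python
import json
import urllib.request

LABELS = ["benign", "suspicious", "malicious"]

def build_prompt(log_line: str) -> str:
    # Constrain the task to a single-label answer -- the only job a
    # 1-3B model does reliably, and only after fine-tuning on your data.
    return (
        "Classify the following log line as exactly one of: "
        + ", ".join(LABELS) + ".\nRespond with the label only.\n\n"
        + log_line + "\n\nLabel:"
    )

def parse_label(raw: str) -> str:
    # Nano models drift even on one-word answers; fall back to a
    # safe bucket rather than trusting free-form output.
    word = raw.strip().lower().split()[0] if raw.strip() else ""
    return word if word in LABELS else "suspicious"

def classify(log_line: str, model: str = "llama3.2:1b",
             url: str = "http://localhost:11434/api/generate") -> str:
    # Ollama's /api/generate endpoint; non-streaming responses carry
    # the generated text in the "response" field.
    body = json.dumps({"model": model, "prompt": build_prompt(log_line),
                       "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_label(json.loads(resp.read())["response"])
```

The fallback-to-"suspicious" choice is deliberate: a misrouted benign alert costs analyst time, a misrouted malicious one costs more.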

Example models: Qwen 2.5 1.5B · Gemma 2 2B · Phi-3 Mini 3.8B · Llama 3.2 1B/3B · TinyLlama 1.1B · StableLM 2 1.6B
Small
7B – 8B Parameters
The "sweet spot" entry point · Fast · Decent at focused tasks
~4–6 GB VRAM
Min GPU: 8 GB VRAM
30 Series: 3060 12GB ✅ · 3070 8GB ⚠️
40 Series: 4060 8GB ⚠️ · 4060 Ti 16GB ✅ · 4070+ ✅
50 Series: 5060 8GB ⚠️ · 5060 Ti 16GB ✅ · 5070+ ✅
CPU-Only: Slow but works
Tokens/sec: 30–80
Context: 8K–32K

Well Suited For

  • Single-function code generation and debugging
  • Short document summarization (under ~2K words)
  • Translation between common languages
  • Chatbot / conversational assistant (limited context)
  • Simple RAG queries against a knowledge base
  • Regex generation and string manipulation
  • Commit message and changelog generation
  • Basic structured output (JSON) with careful prompting
  • Log parsing and pattern extraction (single format)

Not Suited For

  • Complex multi-file code generation or refactoring
  • Multi-step planning or autonomous agent loops
  • Long-form content (reports, writeups, documentation)
  • Nuanced analysis requiring broad domain knowledge
  • Reliable tool-use chains (50-60% JSON compliance)
  • Cross-referencing multiple data sources in a single prompt
  • Complex reasoning about system architecture

Offensive Security Context

Marginal for security work. Can assist with simple tasks like parsing a single nmap scan or explaining a known CVE, but struggles with anything requiring judgment — choosing between attack paths, chaining findings, or generating reliable exploit code. Fine-tuned coding variants (e.g., CodeLlama 7B, DeepSeek Coder 6.7B) are better for script assistance but still hallucinate tool flags regularly. Generally not usable for agentic security loops — expect ~50-60% JSON reliability.
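At ~50-60% JSON compliance, structured output from this tier is workable only with validation and retries wrapped around every call. A minimal sketch of that pattern; the `generate` callable is a placeholder for any local backend (Ollama, llama.cpp, etc.):

```python
import json

def generate_json(generate, prompt: str, required_keys: set,
                  max_attempts: int = 3):
    """Call an LLM until it returns valid JSON containing the expected
    keys. `generate` is any prompt -> text callable."""
    last_error = "no attempt made"
    for _ in range(max_attempts):
        raw = generate(prompt)
        # Models at this tier often wrap JSON in prose or code fences;
        # slice out the outermost braces before parsing.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            last_error = "no JSON object found"
            continue
        try:
            obj = json.loads(raw[start:end + 1])
        except json.JSONDecodeError as exc:
            last_error = str(exc)
            continue
        if required_keys <= obj.keys():
            return obj
        last_error = f"missing keys: {required_keys - obj.keys()}"
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

At 50-60% per-call reliability, three attempts still fail roughly one call in ten, which is why this tier stays below the bar for unattended loops.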

Example models: Llama 3.1 8B · Mistral 7B · Gemma 2 9B · Qwen 2.5 7B · DeepSeek Coder 6.7B · CodeLlama 7B · Phi-4 Mini
Medium
13B – 14B Parameters
Noticeable quality jump · Reasonable reasoning · Good single-GPU fit
~8–12 GB VRAM
Min GPU: 12 GB VRAM
30 Series: 3060 12GB ⚠️ · 3080 12GB ⚠️ · 3090 24GB ✅
40 Series: 4070 12GB ⚠️ · 4070 Ti Super 16GB ✅ · 4080+ ✅
50 Series: 5070 12GB ⚠️ · 5070 Ti 16GB ✅ · 5080+ ✅
CPU-Only: Painful
Tokens/sec: 20–50
Context: 8K–32K

Well Suited For

  • Multi-function code generation with context
  • Moderate-length content writing (blog posts, emails)
  • Document summarization up to ~5K words
  • Structured output with reasonable reliability (~70-80%)
  • Simple tool-use and function-calling patterns
  • RAG with moderate complexity retrieval
  • Data transformation and ETL script generation
  • Basic chain-of-thought reasoning (2-3 steps)
  • Multi-turn conversations with decent coherence
  • Configuration file generation and validation

Not Suited For

  • Autonomous multi-step agent workflows
  • Complex code refactoring across large codebases
  • Deep technical analysis requiring expert knowledge
  • Reliable exploit development or payload crafting
  • Long document analysis (legal, research papers)
  • Tasks requiring strong self-correction / reflection

Offensive Security Context

Starting to become useful. Can parse nmap/Nessus output with decent accuracy, generate basic enumeration scripts, and explain CVEs with acceptable fidelity. Can handle simple structured output for tool integration. Still struggles with multi-step attack path reasoning and will confidently suggest wrong flags or non-existent tool features. Fine-tuned variants on security data could be viable for focused tasks like log analysis or IOC extraction. Considered a bare minimum for agentic security workflows — handles simple targets with some retries.

Example models: Llama 2 13B · Phi-4 14B · Qwen 2.5 14B · Mistral Nemo 12B · DeepSeek Coder V2 Lite 16B
Large
27B – 32B Parameters
Strong reasoning · Reliable structured output · The local workhorse
~16–24 GB VRAM
Min GPU: 24 GB VRAM
30 Series: 3090 / 3090 Ti 24GB ✅
40 Series: 4090 24GB ✅
50 Series: 5090 32GB ✅
CPU-Only: Not practical
Tokens/sec: 15–35
Context: 16K–128K

Well Suited For

  • Complex code generation, debugging, and refactoring
  • Multi-step reasoning and chain-of-thought (4-6 steps)
  • Reliable structured JSON output (~85-90%)
  • Autonomous agent loops with tool-use
  • Long-form technical writing and documentation
  • Code review and vulnerability pattern recognition
  • Multi-source data analysis and correlation
  • Complex RAG with reasoning over retrieved context
  • Script generation for multi-tool workflows
  • Technical report generation with structured findings
  • Multi-turn planning conversations with memory

Not Suited For

  • Deep novel research requiring frontier knowledge
  • Very long context analysis (100K+ tokens)
  • Tasks requiring near-perfect accuracy on first attempt
  • Complex creative writing at publishable quality
  • Highly nuanced ethical/legal reasoning

Offensive Security Context

This is the practical sweet spot for local security tooling. Can reliably parse complex scan output, generate working enumeration scripts, reason about attack paths across multiple findings, and maintain context in agentic loops. AD attack chain reasoning becomes viable — the model can connect Kerberoasting results to delegation abuse opportunities with reasonable accuracy. Exploit code generation is functional but should be validated. Fits a single RTX 3090 at Q4, making it ideal for homelab setups. The recommended tier for local agentic security platforms.
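The agentic loops this tier can sustain reduce to a simple pattern: the model emits a JSON action, the harness executes it, and the result goes back into context. A minimal sketch under those assumptions; the tool names and the `generate` callable are placeholders, not a specific framework's API:

```python
import json

def agent_loop(generate, tools: dict, goal: str, max_steps: int = 8):
    """Drive a model through a tool-use loop until it answers.

    `generate` maps the transcript to the model's next action, expected
    as JSON like {"tool": "...", "args": {...}} or {"answer": "..."}.
    """
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = json.loads(generate("\n".join(transcript)))
        if "answer" in action:
            return action["answer"]
        # A KeyError here means the model hallucinated a tool name --
        # still a real failure mode even at 27-32B.
        tool = tools[action["tool"]]
        result = tool(**action.get("args", {}))
        transcript.append(f"Tool {action['tool']} returned: {result}")
    raise RuntimeError("step budget exhausted")
```

The hard step budget matters in practice: a model that loses the thread mid-engagement should fail loudly, not keep firing tools.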

Example models: Qwen 2.5 32B · Gemma 2 27B · DeepSeek Coder V2 33B · Command R 35B · Mistral Small 24B · Codestral 22B
XL
70B+ Parameters
Near-frontier local quality · Requires serious hardware · Best local reasoning
~40–48+ GB VRAM
Min GPU: 48 GB+ VRAM (multi-GPU)
30 Series: 2× 3090 48GB ⚠️
40 Series: 2× 4090 48GB ⚠️
50 Series: 2× 5090 64GB ✅
CPU-Only: No
Tokens/sec: 8–20
Context: 32K–128K

Well Suited For

  • Complex multi-step reasoning (6+ steps)
  • Highly reliable structured output (~95%+)
  • Sophisticated autonomous agent workflows
  • Full codebase understanding and refactoring
  • Technical writing approaching professional quality
  • Nuanced analysis with competing considerations
  • Long document comprehension and synthesis
  • Complex tool orchestration and planning
  • Cross-domain knowledge application
  • Detailed code review with security implications
  • Advanced data science and statistical reasoning

Not Suited For

  • Real-time / low-latency applications
  • Deployment without significant GPU infrastructure
  • Tasks where cloud API would be faster and cheaper
  • Scenarios requiring the absolute latest training data

Offensive Security Context

Best local option for serious autonomous security tooling. Complex AD attack chain reasoning, multi-step exploit development, and nuanced vulnerability analysis are all viable. Can reason about environmental context during engagements — understanding network topology, making pivot decisions, and adapting exploitation strategy based on partial information. A dual RTX 3090 setup can run this tier comfortably at Q4. The tradeoff is speed: inference is significantly slower, so real-time interactive use during a live engagement can feel sluggish. Best deployed for pre/post-engagement analysis or overnight autonomous scanning.

Example models: Llama 3.1 70B · Qwen 2.5 72B · DeepSeek V2.5 · Mixtral 8×22B (MoE) · Command R+ 104B · Llama 3.1 405B (multi-node)
Frontier
Cloud / Frontier Models
Highest capability · API-dependent · Undisclosed parameter counts
N/A (cloud-hosted)
Min GPU: None (API)
Latency: Network-bound
Tokens/sec: 30–100+ (provider-dependent)
Context: 128K–1M+

Well Suited For

  • State-of-the-art reasoning and analysis
  • Complex agentic workflows with high reliability
  • Very long context processing (100K+ tokens)
  • Professional-grade code generation across languages
  • Nuanced creative and technical writing
  • Multi-modal analysis (code + images + documents)
  • Complex tool orchestration with near-perfect JSON
  • Research-grade analysis and synthesis
  • Production-grade report and document generation
  • Real-time interactive analysis during engagements
  • Fine-grained instruction following

Not Suited For

  • Air-gapped or classified environments
  • Scenarios where data cannot leave your network
  • High-volume batch processing (cost scales linearly)
  • Offline or unreliable network conditions
  • Processing client data under strict NDA without API DPA

Offensive Security Context

Maximum capability, but with operational constraints. Frontier models like Claude Opus, GPT-4, and Gemini Pro excel at everything from complex exploit chain reasoning to report generation. The critical consideration is data sensitivity: sending client network data, credentials, or engagement details through a cloud API requires careful evaluation of your ROE, client agreements, and the provider's data handling policies. Best suited for: pre-engagement planning, methodology development, tooling creation, report writing (with sanitized data), training/learning, and CTF work where data sensitivity isn't a concern. For live engagement data, a local model or a provider with a signed DPA is more appropriate.
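One way to reduce that exposure is a redaction pass before anything leaves your network. A rough sketch; the patterns are illustrative examples only, nowhere near a complete DLP solution, and a real pass would be tailored to the engagement:

```python
import re

# Hypothetical redaction rules for report text headed to a cloud API.
PATTERNS = [
    # IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    # Internal hostnames on common private suffixes
    (re.compile(r"\b[\w.-]+\.(?:corp|internal|local)\b", re.I), "<HOSTNAME>"),
    # Inline credentials like "password: hunter2"
    (re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
]

def sanitize(text: str) -> str:
    """Apply each redaction rule in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Even with a pass like this, sanitized-data report writing is where cloud models fit; raw engagement data still belongs on local models or behind a signed DPA, as noted above.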

Example models: Claude Opus 4.6 · Claude Sonnet 4.6 · GPT-4o · GPT-4 Turbo · Gemini 2.5 Pro · DeepSeek V3 · Grok

Key Considerations

These tiers are guidelines, not hard rules. Several factors shift performance significantly:

GPU Compatibility Matrix (Q4 Quantization)

✅ = Comfortable fit with KV cache headroom   ⚠️ = Fits but tight at longer contexts   ❌ = Won't fit / requires offloading

GPU                  VRAM       1-3B      7-8B      13-14B    27-32B    70B

NVIDIA 30 SERIES
RTX 3060             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 3070             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 3070 Ti          8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 3080             10/12 GB   ✅        ✅        ⚠️        ❌        ❌
RTX 3080 Ti          12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 3090             24 GB      ✅        ✅        ✅        ✅        ❌
RTX 3090 Ti          24 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 3090          48 GB      ✅        ✅        ✅        ✅        ⚠️

NVIDIA 40 SERIES
RTX 4060             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 4060 Ti (8GB)    8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 4060 Ti (16GB)   16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4070             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 4070 Ti          12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 4070 Ti Super    16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4080             16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4080 Super       16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4090             24 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 4090          48 GB      ✅        ✅        ✅        ✅        ⚠️

NVIDIA 50 SERIES
RTX 5060             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 5060 Ti (8GB)    8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 5060 Ti (16GB)   16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5070             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 5070 Ti          16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5080             16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5090             32 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 5090          64 GB      ✅        ✅        ✅        ✅        ✅

QUICK VRAM REFERENCE
Weights only (Q4)               ~1-2 GB   ~4-6 GB   ~8-10 GB  ~16-22 GB ~40-48 GB

Note: VRAM estimates are for model weights only at Q4 quantization. The KV cache grows with context length and will consume additional VRAM during inference. A card marked ⚠️ may work fine with short prompts but degrade or offload to CPU during longer sessions. The 50 series specs reflect announced/released configurations as of early 2026.
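The KV cache growth in that note can be estimated directly: 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element. A sketch using Llama 3.1 8B's published dimensions (32 layers, 8 KV heads via GQA, head dim 128) as the example:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens.

    bytes_per_elem=2 assumes an fp16 cache; many runtimes can quantize
    the cache to 8-bit or lower to reclaim VRAM.
    """
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Llama 3.1 8B at a 32K context: roughly 4.3 GB on top of the ~4-6 GB
# of Q4 weights -- enough to push an 8 GB card from ⚠️ into offloading.
print(round(kv_cache_gb(32, 8, 128, 32768), 1))
```

This is why a card that comfortably holds a model's weights can still stall at long contexts: the cache scales linearly with tokens, and the weights don't shrink to make room.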