LLM Model Size → Task Guide

A practitioner's reference for matching local LLM parameter counts to real-world jobs. What each tier can and can't do — with offensive security context.

All VRAM estimates assume Q4 quantization unless otherwise noted.
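As a rough rule of thumb, Q4 weights take about half a byte per parameter, plus some runtime overhead. A quick sketch of that estimate (the 1.2 overhead factor is an assumption for quantization metadata and buffers, not a measured constant):

```python
def estimate_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM needed for model weights alone.

    params_b: parameter count in billions (e.g. 8 for an 8B model).
    bits: bits per weight (4 for Q4).
    overhead: fudge factor for quantization block metadata and runtime
    buffers -- an assumption, not a measured constant.
    """
    weight_gb = params_b * 1e9 * bits / 8 / 1e9  # bytes -> GB
    return weight_gb * overhead

# An 8B model at Q4 lands around 4.8 GB, in line with the ~4-6 GB tier figure.
print(round(estimate_vram_gb(8), 1))
```

Note this covers weights only; the KV cache adds more as context grows.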

Nano
1B – 3B Parameters
Fastest inference · Runs on anything · Very limited reasoning
~1–3 GB VRAM
Min GPU: Integrated / Any 4GB+
30 Series: Any (all fit)
40 Series: Any (all fit)
50 Series: Any (all fit)
CPU-Only: Yes, usable
Tokens/sec: 50–120+
Context: 4K–8K typical

Well Suited For

  • Text autocomplete and inline code completion
  • Simple classification (spam, sentiment, log-level tagging)
  • Grammar and spelling correction
  • Basic entity extraction from short text
  • Template-based text formatting and reformatting
  • Single-turn Q&A on narrow, well-defined topics
  • Fast local embedding generation (with embedding models)
  • Edge device / IoT deployment where latency matters

Not Suited For

  • Multi-step reasoning or chain-of-thought
  • Code generation beyond single functions
  • Summarization of long documents
  • Anything requiring factual accuracy on niche topics
  • Structured JSON output (unreliable adherence)
  • Tool-use / function-calling workflows
  • Multi-turn conversation with context retention

Offensive Security Context

Essentially unusable for security tooling: it can't reliably parse scan output, hallucinates flags and tool syntax, and can't chain reasoning steps. The only viable use is as a fast classifier for log triage or alert categorization, and only after fine-tuning on your own labeled data.
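That one viable use reduces to a tightly constrained classification call. A minimal sketch, assuming a local Ollama server on its default port; the model tag is an example, and the label set is illustrative:

```python
import json
import urllib.request

LABELS = ["benign", "suspicious", "malicious"]

def build_prompt(log_line: str) -> str:
    # Constrain the task to a single-label answer -- the only job a
    # 1-3B model does reliably, and only after fine-tuning on your data.
    return (
        "Classify the following log line as exactly one of: "
        + ", ".join(LABELS) + ".\nRespond with the label only.\n\n"
        + log_line + "\n\nLabel:"
    )

def parse_label(raw: str) -> str:
    # Nano models drift even on one-word answers; fall back to a
    # safe bucket rather than trusting free-form output.
    word = raw.strip().lower().split()[0] if raw.strip() else ""
    return word if word in LABELS else "suspicious"

def classify(log_line: str, model: str = "llama3.2:1b",
             url: str = "http://localhost:11434/api/generate") -> str:
    # Ollama's /api/generate endpoint; non-streaming responses carry
    # the generated text in the "response" field.
    body = json.dumps({"model": model, "prompt": build_prompt(log_line),
                       "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_label(json.loads(resp.read())["response"])
```

The fallback-to-"suspicious" choice is deliberate: a misrouted benign alert costs analyst time, a misrouted malicious one costs more.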

Example models: Qwen 2.5 1.5B · Gemma 2 2B · Phi-3 Mini 3.8B · Llama 3.2 1B/3B · TinyLlama 1.1B · StableLM 2 1.6B
Small
7B – 8B Parameters
The "sweet spot" entry point · Fast · Decent at focused tasks
~4–6 GB VRAM
Min GPU: 8 GB VRAM
30 Series: 3060 12GB ✅ · 3070 8GB ⚠️
40 Series: 4060 8GB ⚠️ · 4060 Ti 16GB ✅ · 4070+ ✅
50 Series: 5060 8GB ⚠️ · 5060 Ti 16GB ✅ · 5070+ ✅
CPU-Only: Slow but works
Tokens/sec: 30–80
Context: 8K–32K

Well Suited For

  • Single-function code generation and debugging
  • Short document summarization (under ~2K words)
  • Translation between common languages
  • Chatbot / conversational assistant (limited context)
  • Simple RAG queries against a knowledge base
  • Regex generation and string manipulation
  • Commit message and changelog generation
  • Basic structured output (JSON) with careful prompting
  • Log parsing and pattern extraction (single format)

Not Suited For

  • Complex multi-file code generation or refactoring
  • Multi-step planning or autonomous agent loops
  • Long-form content (reports, writeups, documentation)
  • Nuanced analysis requiring broad domain knowledge
  • Reliable tool-use chains (50-60% JSON compliance)
  • Cross-referencing multiple data sources in a single prompt
  • Complex reasoning about system architecture

Offensive Security Context

Marginal for security work. Can assist with simple tasks like parsing a single nmap scan or explaining a known CVE, but struggles with anything requiring judgment — choosing between attack paths, chaining findings, or generating reliable exploit code. Fine-tuned coding variants (e.g., CodeLlama 7B, DeepSeek Coder 6.7B) are better for script assistance but still hallucinate tool flags regularly. Generally not usable for agentic security loops — expect ~50-60% JSON reliability.
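At ~50-60% JSON compliance, structured output from this tier is workable only with validation and retries wrapped around every call. A minimal sketch of that pattern; the `generate` callable is a placeholder for any local backend (Ollama, llama.cpp, etc.):

```python
import json

def generate_json(generate, prompt: str, required_keys: set,
                  max_attempts: int = 3):
    """Call an LLM until it returns valid JSON containing the expected
    keys. `generate` is any prompt -> text callable."""
    last_error = "no attempt made"
    for _ in range(max_attempts):
        raw = generate(prompt)
        # Models at this tier often wrap JSON in prose or code fences;
        # slice out the outermost braces before parsing.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            last_error = "no JSON object found"
            continue
        try:
            obj = json.loads(raw[start:end + 1])
        except json.JSONDecodeError as exc:
            last_error = str(exc)
            continue
        if required_keys <= obj.keys():
            return obj
        last_error = f"missing keys: {required_keys - obj.keys()}"
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

At 50-60% per-call reliability, three attempts still fail roughly one call in ten, which is why this tier stays below the bar for unattended loops.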

Example models: Llama 3.1 8B · Mistral 7B · Gemma 2 9B · Qwen 2.5 7B · DeepSeek Coder 6.7B · CodeLlama 7B · Phi-4 Mini
Medium
13B – 14B Parameters
Noticeable quality jump · Reasonable reasoning · Good single-GPU fit
~8–12 GB VRAM
Min GPU: 12 GB VRAM
30 Series: 3060 12GB ⚠️ · 3080 12GB ⚠️ · 3090 24GB ✅
40 Series: 4070 12GB ⚠️ · 4070 Ti Super 16GB ✅ · 4080+ ✅
50 Series: 5070 12GB ⚠️ · 5070 Ti 16GB ✅ · 5080+ ✅
CPU-Only: Painful
Tokens/sec: 20–50
Context: 8K–32K

Well Suited For

  • Multi-function code generation with context
  • Moderate-length content writing (blog posts, emails)
  • Document summarization up to ~5K words
  • Structured output with reasonable reliability (~70-80%)
  • Simple tool-use and function-calling patterns
  • RAG with moderate complexity retrieval
  • Data transformation and ETL script generation
  • Basic chain-of-thought reasoning (2-3 steps)
  • Multi-turn conversations with decent coherence
  • Configuration file generation and validation

Not Suited For

  • Autonomous multi-step agent workflows
  • Complex code refactoring across large codebases
  • Deep technical analysis requiring expert knowledge
  • Reliable exploit development or payload crafting
  • Long document analysis (legal, research papers)
  • Tasks requiring strong self-correction / reflection

Offensive Security Context

Starting to become useful. Can parse nmap/Nessus output with decent accuracy, generate basic enumeration scripts, and explain CVEs with acceptable fidelity. Can handle simple structured output for tool integration. Still struggles with multi-step attack path reasoning and will confidently suggest wrong flags or non-existent tool features. Fine-tuned variants on security data could be viable for focused tasks like log analysis or IOC extraction. Considered a bare minimum for agentic security workflows — handles simple targets with some retries.

Example models: Llama 2 13B · Phi-4 14B · Qwen 2.5 14B · Mistral Nemo 12B · DeepSeek Coder V2 Lite 16B
Large
27B – 32B Parameters
Strong reasoning · Reliable structured output · The local workhorse
~16–24 GB VRAM
Min GPU: 24 GB VRAM
30 Series: 3090 / 3090 Ti 24GB ✅
40 Series: 4090 24GB ✅
50 Series: 5090 32GB ✅
CPU-Only: Not practical
Tokens/sec: 15–35
Context: 16K–128K

Well Suited For

  • Complex code generation, debugging, and refactoring
  • Multi-step reasoning and chain-of-thought (4-6 steps)
  • Reliable structured JSON output (~85-90%)
  • Autonomous agent loops with tool-use
  • Long-form technical writing and documentation
  • Code review and vulnerability pattern recognition
  • Multi-source data analysis and correlation
  • Complex RAG with reasoning over retrieved context
  • Script generation for multi-tool workflows
  • Technical report generation with structured findings
  • Multi-turn planning conversations with memory

Not Suited For

  • Deep novel research requiring frontier knowledge
  • Very long context analysis (100K+ tokens)
  • Tasks requiring near-perfect accuracy on first attempt
  • Complex creative writing at publishable quality
  • Highly nuanced ethical/legal reasoning

Offensive Security Context

This is the practical sweet spot for local security tooling. Can reliably parse complex scan output, generate working enumeration scripts, reason about attack paths across multiple findings, and maintain context in agentic loops. AD attack chain reasoning becomes viable — the model can connect Kerberoasting results to delegation abuse opportunities with reasonable accuracy. Exploit code generation is functional but should be validated. Fits a single RTX 3090 at Q4, making it ideal for homelab setups. The recommended tier for local agentic security platforms.
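The agentic loops this tier can sustain reduce to a simple pattern: the model emits a JSON action, the harness executes it, and the result goes back into context. A minimal sketch under those assumptions; the tool names and the `generate` callable are placeholders, not a specific framework's API:

```python
import json

def agent_loop(generate, tools: dict, goal: str, max_steps: int = 8):
    """Drive a model through a tool-use loop until it answers.

    `generate` maps the transcript to the model's next action, expected
    as JSON like {"tool": "...", "args": {...}} or {"answer": "..."}.
    """
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = json.loads(generate("\n".join(transcript)))
        if "answer" in action:
            return action["answer"]
        # A KeyError here means the model hallucinated a tool name --
        # still a real failure mode even at 27-32B.
        tool = tools[action["tool"]]
        result = tool(**action.get("args", {}))
        transcript.append(f"Tool {action['tool']} returned: {result}")
    raise RuntimeError("step budget exhausted")
```

The hard step budget matters in practice: a model that loses the thread mid-engagement should fail loudly, not keep firing tools.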

Example models: Qwen 2.5 32B · Gemma 2 27B · DeepSeek Coder V2 33B · Command R 35B · Mistral Small 24B · Codestral 22B
XL
70B+ Parameters
Near-frontier local quality · Requires serious hardware · Best local reasoning
~40–48+ GB VRAM
Min GPU: 48 GB+ VRAM (multi-GPU)
30 Series: 2× 3090 48GB ⚠️
40 Series: 2× 4090 48GB ⚠️
50 Series: 2× 5090 64GB ✅
CPU-Only: No
Tokens/sec: 8–20
Context: 32K–128K

Well Suited For

  • Complex multi-step reasoning (6+ steps)
  • Highly reliable structured output (~95%+)
  • Sophisticated autonomous agent workflows
  • Full codebase understanding and refactoring
  • Technical writing approaching professional quality
  • Nuanced analysis with competing considerations
  • Long document comprehension and synthesis
  • Complex tool orchestration and planning
  • Cross-domain knowledge application
  • Detailed code review with security implications
  • Advanced data science and statistical reasoning

Not Suited For

  • Real-time / low-latency applications
  • Deployment without significant GPU infrastructure
  • Tasks where cloud API would be faster and cheaper
  • Scenarios requiring the absolute latest training data

Offensive Security Context

Best local option for serious autonomous security tooling. Complex AD attack chain reasoning, multi-step exploit development, and nuanced vulnerability analysis are all viable. Can reason about environmental context during engagements — understanding network topology, making pivot decisions, and adapting exploitation strategy based on partial information. A dual RTX 3090 setup can run this tier comfortably at Q4. The tradeoff is speed: inference is significantly slower, so real-time interactive use during a live engagement can feel sluggish. Best deployed for pre/post-engagement analysis or overnight autonomous scanning.

Example models: Llama 3.1 70B · Qwen 2.5 72B · DeepSeek V2.5 · Mixtral 8×22B (MoE) · Command R+ 104B · Llama 3.1 405B (multi-node)
Frontier
Cloud / Frontier Models
Highest capability · API-dependent · Undisclosed parameter counts
N/A (cloud-hosted)
Min GPU: None (API)
Latency: Network-bound
Tokens/sec: 30–100+ (provider-dependent)
Context: 128K–1M+

Well Suited For

  • State-of-the-art reasoning and analysis
  • Complex agentic workflows with high reliability
  • Very long context processing (100K+ tokens)
  • Professional-grade code generation across languages
  • Nuanced creative and technical writing
  • Multi-modal analysis (code + images + documents)
  • Complex tool orchestration with near-perfect JSON
  • Research-grade analysis and synthesis
  • Production-grade report and document generation
  • Real-time interactive analysis during engagements
  • Fine-grained instruction following

Not Suited For

  • Air-gapped or classified environments
  • Scenarios where data cannot leave your network
  • High-volume batch processing (cost scales linearly)
  • Offline or unreliable network conditions
  • Processing client data under strict NDA without API DPA

Offensive Security Context

Maximum capability, but with operational constraints. Frontier models like Claude Opus, GPT-4, and Gemini Pro excel at everything from complex exploit chain reasoning to report generation. The critical consideration is data sensitivity: sending client network data, credentials, or engagement details through a cloud API requires careful evaluation of your ROE, client agreements, and the provider's data handling policies. Best suited for: pre-engagement planning, methodology development, tooling creation, report writing (with sanitized data), training/learning, and CTF work where data sensitivity isn't a concern. For live engagement data, a local model or a provider with a signed DPA is more appropriate.
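One way to reduce that exposure is a redaction pass before anything leaves your network. A rough sketch; the patterns are illustrative examples only, nowhere near a complete DLP solution, and a real pass would be tailored to the engagement:

```python
import re

# Hypothetical redaction rules for report text headed to a cloud API.
PATTERNS = [
    # IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    # Internal hostnames on common private suffixes
    (re.compile(r"\b[\w.-]+\.(?:corp|internal|local)\b", re.I), "<HOSTNAME>"),
    # Inline credentials like "password: hunter2"
    (re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
]

def sanitize(text: str) -> str:
    """Apply each redaction rule in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Even with a pass like this, sanitized-data report writing is where cloud models fit; raw engagement data still belongs on local models or behind a signed DPA, as noted above.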

Example models: Claude Opus 4.6 · Claude Sonnet 4.6 · GPT-4o · GPT-4 Turbo · Gemini 2.5 Pro · DeepSeek V3 · Grok

Key Considerations

These tiers are guidelines, not hard rules. Several factors shift performance significantly:

GPU Compatibility Matrix (Q4 Quantization)

✅ = Comfortable fit with KV cache headroom   ⚠️ = Fits but tight at longer contexts   ❌ = Won't fit / requires offloading

GPU                  VRAM       1-3B      7-8B      13-14B    27-32B    70B

NVIDIA 30 SERIES
RTX 3060             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 3070             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 3070 Ti          8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 3080             10/12 GB   ✅        ✅        ⚠️        ❌        ❌
RTX 3080 Ti          12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 3090             24 GB      ✅        ✅        ✅        ✅        ❌
RTX 3090 Ti          24 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 3090          48 GB      ✅        ✅        ✅        ✅        ⚠️

NVIDIA 40 SERIES
RTX 4060             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 4060 Ti (8GB)    8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 4060 Ti (16GB)   16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4070             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 4070 Ti          12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 4070 Ti Super    16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4080             16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4080 Super       16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 4090             24 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 4090          48 GB      ✅        ✅        ✅        ✅        ⚠️

NVIDIA 50 SERIES
RTX 5060             8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 5060 Ti (8GB)    8 GB       ✅        ⚠️        ❌        ❌        ❌
RTX 5060 Ti (16GB)   16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5070             12 GB      ✅        ✅        ⚠️        ❌        ❌
RTX 5070 Ti          16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5080             16 GB      ✅        ✅        ✅        ⚠️        ❌
RTX 5090             32 GB      ✅        ✅        ✅        ✅        ❌
2× RTX 5090          64 GB      ✅        ✅        ✅        ✅        ✅

QUICK VRAM REFERENCE
Weights only (Q4)               ~1-2 GB   ~4-6 GB   ~8-10 GB  ~16-22 GB ~40-48 GB

Note: VRAM estimates are for model weights only at Q4 quantization. The KV cache grows with context length and will consume additional VRAM during inference. A card marked ⚠️ may work fine with short prompts but degrade or offload to CPU during longer sessions. The 50 series specs reflect announced/released configurations as of early 2026.
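The KV cache growth in that note can be estimated directly: 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element. A sketch using Llama 3.1 8B's published dimensions (32 layers, 8 KV heads via GQA, head dim 128) as the example:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens.

    bytes_per_elem=2 assumes an fp16 cache; many runtimes can quantize
    the cache to 8-bit or lower to reclaim VRAM.
    """
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Llama 3.1 8B at a 32K context: roughly 4.3 GB on top of the ~4-6 GB
# of Q4 weights -- enough to push an 8 GB card from ⚠️ into offloading.
print(round(kv_cache_gb(32, 8, 128, 32768), 1))
```

This is why a card that comfortably holds a model's weights can still stall at long contexts: the cache scales linearly with tokens, and the weights don't shrink to make room.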