Advanced AI coding
If you enjoyed my last blog, Getting Started in AI Coding, you're probably wondering how to take things further. One common question is: what models can you run locally, and what hardware do you need?
For example, you can run GPT-20B on a single 16GB GPU, which is a strong everyday coding model. With dual cards (32GB VRAM) or a workstation with 128GB of unified RAM, you can handle much larger models. The goal here is to help you decide which models are best for your workflow, since the landscape is changing fast. Before diving into model sizing and benchmarks, let’s cover subscription vs. non-subscription access and a few advanced usage notes.
Subscription vs. Non-Subscription Access
Non-Subscription (Pay-as-You-Go / OpenRouter Style)
Non-subscription providers like OpenRouter often offer free access to high-quality, fast models.
⚠️ Privacy note: Many free models use your data for training, so sensitive code might be at risk.
Occasionally, new anonymous models appear on OpenRouter that are surprisingly capable. Many tools and VSCode extensions can use OpenRouter; my favorite is Kilo Code (other popular options: Cline, Roo Code).
OpenRouter also lets you set spending caps: you define a dollar limit and pay only for what you use. This flexibility lets you switch to a new model the moment it appears, without waiting out a subscription cycle.
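If you want to script against OpenRouter directly instead of going through an editor extension, it exposes an OpenAI-compatible endpoint. Here's a minimal sketch using the official openai Python client; the API key is a placeholder and the model slug is illustrative, so check openrouter.ai for current model IDs:

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol at this base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-YOUR-KEY",  # placeholder: your OpenRouter key
)

resp = client.chat.completions.create(
    model="qwen/qwen3-coder",  # illustrative slug; pick any model on the site
    messages=[{"role": "user", "content": "Refactor this function for clarity."}],
)
print(resp.choices[0].message.content)
```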
Example pay-as-you-go pricing:
Qwen3 Next 80B: $0.14/M input, $1.40/M output
Qwen3 Coder 480B: $0.22/M input, $0.95/M output
Claude: starts at $3/M input, $15/M output
For my typical workflow (5–10M input tokens/day, ~100k output tokens/day), this works out to roughly $25–70/month depending on model.
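To sanity-check that estimate, here's the arithmetic as a quick script, using the midpoint of my daily usage and the prices listed above:

```python
# Rough monthly bill from per-token prices (figures from the text above)
INPUT_MTOK_DAY = 7.5    # midpoint of 5-10M input tokens/day
OUTPUT_MTOK_DAY = 0.1   # ~100k output tokens/day
DAYS = 30

def monthly(in_price, out_price):
    """in_price/out_price are dollars per million tokens."""
    return (INPUT_MTOK_DAY * in_price + OUTPUT_MTOK_DAY * out_price) * DAYS

print(f"Qwen3 Next 80B:   ${monthly(0.14, 1.40):>6.0f}")   # ~$36
print(f"Qwen3 Coder 480B: ${monthly(0.22, 0.95):>6.0f}")   # ~$52
print(f"Claude (entry):   ${monthly(3.00, 15.00):>6.0f}")  # ~$720
```

The Qwen-class models land inside the $25–70 band (the spread comes from the 5–10M input range); Claude-class pricing is an order of magnitude higher, which is where subscriptions start to make sense.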
You can also mix in free big models as they appear to reduce costs further. Most OpenRouter providers run on datacenter-grade hardware and are very fast.
Speeds Matter
Here’s a rough guide from my testing (tokens per second, TPS):
| Model | Speed | Notes |
| --- | --- | --- |
| GPT-20B | ~60 TPS | Fast, excellent for interactive coding |
| Qwen3-30B | ~55 TPS | Reliable for coding; a 30B model will not fit on 16GB VRAM |
| GPT-120B | ~20 TPS | Slower, stronger reasoning |
| Qwen3-235B | ~11 TPS | Heavy, often requires a multi-GPU setup |
| LLaMA-4 109B | ~14 TPS | Good reasoning, slower than 30B |
Larger models are generally more capable but slower on the same hardware. A beefy workstation can run 120B+, but the marginal gains may not be worth the speed and cost penalty.
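If you want to reproduce these numbers on your own hardware, you can time a streamed response. A rough sketch, assuming a local OpenAI-compatible server (llama.cpp's llama-server, LM Studio, and similar tools expose one); the port and model name are whatever your setup uses:

```python
import time
from openai import OpenAI

# Local OpenAI-compatible endpoint (port and model name are setup-specific)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start, chunks = time.time(), 0
stream = client.chat.completions.create(
    model="qwen3-30b",
    messages=[{"role": "user", "content": "Write 200 words about B-trees."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed chunk

print(f"~{chunks / (time.time() - start):.0f} TPS (approximate)")
```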
Model Sizing: Trade-offs in Practice
20B–30B (~50–80 TPS)
Near-instant for 200–500 token outputs. Great for interactive coding, bug fixes, and quick Q&A. Cheap enough to iterate frequently. Best when speed and cost matter more than perfect correctness.
109B (~15 TPS)
About 3× slower than 30B. Stronger reasoning, fewer mistakes. For many developers, too slow for daily interactive use.
235B (~10–11 TPS)
Often requires multi-GPU setups (e.g., two 96GB cards). Useful for cross-file reasoning or long-context analysis. Not practical for rapid back-and-forth coding.
480B (~5 TPS)
Very slow for interactive use. Best for final audits or overnight analysis. The sketch below shows how these speeds translate into wait times.
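Here is the pure decode time for a typical reply at each speed, using the TPS figures measured above (prompt processing adds more on top):

```python
# Seconds to decode a 400-token reply at each model's measured speed
for label, tps in [("30B", 55), ("109B", 15), ("235B", 11), ("480B", 5)]:
    print(f"{label:>4}: {400 / tps:3.0f}s per 400-token reply")
```

Seven seconds feels interactive; eighty does not, which is why I reserve the biggest models for audits rather than back-and-forth coding.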
The Qwen3 Family
Qwen is trained by Alibaba. Chinese labs are certainly leading the way with open-weight models you can run at home, and they are doing amazing work.
4B — Runs on most machines; fine for simple tasks.
8B / 14B — Can handle basic coding; limited context.
30B — Coding becomes reliably useful. Great for iterative workflows.
235B — Fantastic all-rounder, but slow and heavy. Two 96GB pro cards recommended. For most users, pay-as-you-go models are more practical.
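A quick way to sanity-check what fits on your card: weight memory is roughly parameter count times bits per weight. A back-of-the-envelope sketch (weights only; the KV cache and runtime overhead add several GB more):

```python
# Approximate memory for the weights alone at a given quantization
def weight_gb(params_billion, bits):
    return params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB

for size in (4, 8, 14, 30, 235):
    print(f"Qwen3 {size:>3}B @ 4-bit: ~{weight_gb(size, 4):.0f} GB")
```

At 4-bit, a 30B model's weights alone are ~15GB, which is why it won't fit comfortably on a 16GB card once the KV cache is added, and why the 235B calls for two 96GB cards.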
Benchmarks: Take Them With a Grain of Salt
Benchmarks matter, but they’re not the whole story. Some models are tuned for benchmark datasets and may overperform on scores relative to real-world utility.
The Aider leaderboard is my favourite benchmark; it really shows the best of the best, though you have to scroll down the board to find models you can run at home.
LiveCodeBench shows tight clustering at the top: e.g., GPT-5-High ~84.6%, Gemini-2.5-Pro ~80.1%.
Some lower-scoring models may still be highly practical for day-to-day coding.
Examples from private testing:
Magistral Small 1.2 — ~20 TPS, ~72% score
GPT-20B — ~57%, strong value for a 16GB GPU
Bottom line: speed, cost, privacy, and workload matter as much as raw scores.
Recommendations
Starting out: Buy hardware with at least 16GB of VRAM and start with GPT-20B. It's strong for everyday coding, and that much VRAM opens access to many other models.
More capability without huge speed penalties: Dual cards (32GB VRAM total) let you run 30B models, which are great for coding.
Deep, long-context work: Invest in a multi-GPU workstation (128GB unified RAM or multiple pro cards) for 109B–235B depending on latency tolerance.
Don’t rush hardware: Models and hardware options keep evolving. Qwen3 Next 80B is said to rival the 235B in capability on certain setups, but work is still ongoing to get it running as a GGUF.