
Nvidia Nemotron-Cascade-2 was run at home on a GeForce RTX 3090 at up to 150 tokens/s


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

A local-LLM enthusiast demonstrated that the 30-billion-parameter Nemotron-Cascade-2 can be used at home on a single GeForce RTX 3090. In his configuration, the model delivered 120–150 tokens per second and handled not only coding but also physics and biology problems and web agent scenarios.

Why Nemotron Was Chosen

The author was looking for more than a local model to experiment with: he wanted a permanent assistant for daily work. The requirements were practical: high response speed, a long and stable context, and reasoning that could be trusted without rechecking every step. The home setup was fairly typical for an advanced enthusiast: a compact PC with 64 GB of RAM, Windows 11, WSL2, and an external GeForce RTX 3090 with 24 GB of video memory.

Against this backdrop, Nemotron-Cascade-2-30B-A3B-AWQ turned out to be a compromise that actually works. The choice comes down to its hybrid Mamba + MoE architecture: the Mamba layers help process long requests faster, while the mixture-of-experts layers activate only part of the weights for each token, which keeps generation speed high. The model was served through vLLM, which made it possible to store the KV cache in FP8 and get noticeably more out of a home graphics card than simpler local deployment setups allow.
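A launch of this kind might look roughly like the sketch below, which only assembles the `vllm serve` command line. The article does not give the author's exact command, so the model path is a placeholder and the context length and memory fraction are illustrative assumptions; the only flags taken from the article are the FP8 KV cache and the `deepseek_r1` reasoning parser, and their availability depends on the installed vLLM version.

```python
import shlex

# Placeholder: the article does not give the exact repository id or local
# path of the AWQ checkpoint, so substitute your own.
MODEL_ID = "path/to/Nemotron-Cascade-2-30B-A3B-AWQ"


def build_vllm_command(
    model: str,
    max_model_len: int = 32768,            # assumed value, not from the article
    gpu_memory_utilization: float = 0.90,  # assumed value, not from the article
) -> list[str]:
    """Assemble a `vllm serve` command with an FP8 KV cache."""
    return [
        "vllm", "serve", model,
        # Store the KV cache in FP8 to leave more VRAM for context.
        "--kv-cache-dtype", "fp8",
        # Split the model's thinking from the final answer (see the article).
        "--reasoning-parser", "deepseek_r1",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


command = build_vllm_command(MODEL_ID)
print(shlex.join(command))
```

The printed line can be pasted into a WSL2 shell, or the list can be passed to `subprocess.run` directly.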

  • Qwen 3.5-35B did not fit in 24 GB of memory with a comfortable context margin
  • GGUF variants run through llama.cpp and LM Studio turned out to be noticeably slower
  • A suitable AWQ configuration for NIM could not be found
  • Nemotron-Cascade-2 in quantized form provided the best balance of speed and quality
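The memory argument in the first bullet can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes 4-bit weights for both models and nominal parameter counts of 30 and 35 billion; it ignores quantization scales, layers kept in higher precision, and runtime overhead, so the real headroom is smaller than these figures.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 24.0  # GeForce RTX 3090

# Nominal parameter counts and 4-bit weights are assumptions for illustration.
for name, params_billion in [("30B model", 30.0), ("35B model", 35.0)]:
    weights = weight_memory_gb(params_billion, bits_per_weight=4)
    headroom = VRAM_GB - weights
    print(f"{name}: ~{weights:.1f} GB weights, ~{headroom:.1f} GB left for KV cache and overhead")

# 30B model: ~15.0 GB weights, ~9.0 GB left for KV cache and overhead
# 35B model: ~17.5 GB weights, ~6.5 GB left for KV cache and overhead
```

Even under these optimistic assumptions, the larger model leaves a few gigabytes less for context, which is consistent with the author's experience.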

What the Tests Showed

To evaluate the model, the author ran it through a series of tasks in AnythingLLM connected to vLLM. The set was not a synthetic benchmark but a mix of real-world scenarios: a thermodynamics calculation, a biology task on DNA strand direction, writing a numpy function to calculate diffraction angles, and web agent requests via Playwright. Such a mix shows whether a local LLM is suitable for everyday work rather than just short chat answers.

Nemotron-Cascade-2 performed best where it had to maintain a chain of reasoning rather than just recall a fact. In the ice problem, the model correctly separated heating the ice, melting it, and then heating the resulting water, and in the biology test it caught an error in its own intermediate logic and corrected it mid-response. In the Python task, it did not resort to slow nested loops but immediately proposed a vectorized numpy solution that accounted for rounding errors.
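The article does not reproduce the model's code, so the following is only a sketch of what a vectorized solution with rounding protection can look like. It assumes the task used the grating equation d·sin θ = m·λ at normal incidence; the function name, parameters, and tolerance are hypothetical.

```python
import numpy as np


def diffraction_angles(wavelength_m, grating_period_m, orders, tol=1e-12):
    """Angles in degrees for each diffraction order; NaN where no order exists.

    Assumes the grating equation d * sin(theta) = m * lambda at normal incidence.
    """
    orders = np.atleast_1d(np.asarray(orders, dtype=float))
    # One vectorized expression instead of a Python loop over orders.
    sin_theta = orders * wavelength_m / grating_period_m

    # Rounding guard: a value such as 1.0000000000000002 should count as 1.0,
    # otherwise arcsin would reject a physically valid grazing order.
    near_one = np.abs(np.abs(sin_theta) - 1.0) < tol
    sin_theta = np.where(near_one, np.sign(sin_theta), sin_theta)

    # Orders with |sin(theta)| > 1 do not propagate; mark them as NaN.
    valid = np.abs(sin_theta) <= 1.0
    angles = np.full(sin_theta.shape, np.nan)
    angles[valid] = np.degrees(np.arcsin(sin_theta[valid]))
    return angles


# 500 nm light on a grating with a 1 micrometre period, orders 0 to 3.
angles = diffraction_angles(500e-9, 1e-6, [0, 1, 2, 3])
print(angles)
```

In this example orders 0 to 2 give angles of roughly 0, 30, and 90 degrees, while order 3 would need sin θ = 1.5 and is returned as NaN.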

Even web agent scenarios worked, though noticeably slower than typical Q&A.

Where Limitations Appeared

The main technical problem turned out to be not memory or speed but the reasoning mode. When trying to disable internal reasoning for cleaner output, the model sharply lost quality on complex tasks. This was especially apparent where it needed to maintain several logical steps at once, for example in biology and agent tasks.

"Don't do that. The model instantly becomes 'dumber'."

As a result, the best solution was not to cut out the thinking blocks but to parse them correctly. The author first wrote a simple Python proxy for this, then found a cleaner option: the `--reasoning-parser deepseek_r1` parameter in vLLM, which made the extra layer unnecessary. The final result for the home setup looks strong: 120–150 tokens per second in generation and up to 210+ tokens per second including reasoning. Trying to push the context further with `--enforce-eager`, however, has the opposite effect: speed drops so much that the mode loses its purpose.
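The parsing step can be pictured with a small helper like the one below. This is not the author's proxy, which the article does not show; it is a minimal sketch that assumes the model wraps its reasoning in `<think>` … `</think>` tags, the convention the `deepseek_r1` parser is built around. The function name is hypothetical.

```python
def split_reasoning(text: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split raw model output into (reasoning, answer).

    Assumes the reasoning comes first and ends with `close_tag`. The opening
    tag is optional, since some chat templates emit it as part of the prompt.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", text.strip()
    reasoning = head.strip()
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):].strip()
    return reasoning, tail.strip()


raw = "<think>Ice must be heated, melted, then heated as water.</think>Three stages are needed."
reasoning, answer = split_reasoning(raw)
print(answer)  # Three stages are needed.
```

The model still thinks at full length, so quality is preserved; only the presentation changes, because the client can show `answer` and hide or collapse `reasoning`.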

What It Means

The case shows that local 30B models are ceasing to be toys for enthusiasts with a few GPUs. If you correctly select the architecture, quantization, and runtime stack, a single RTX 3090 is already capable of providing a working tool for code, RAG, scientific tasks, and simple agent scenarios without a cloud subscription.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
