Habr AI→ original

Saiga Llama 3 8B on 10 GB VRAM: How Habr Achieved 93% Accuracy on War and Peace

Saiga Llama 3 8B was successfully run on an RTX 3080 with 10 GB VRAM and condensed two volumes of War and Peace into an 18,000-word summary. The main…

AI-processed from Habr AI; edited by Hamidun News
Saiga Llama 3 8B on 10 GB VRAM: How Habr Achieved 93% Accuracy on War and Peace
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

On Habr AI, a practical breakdown of running Saiga Llama 3 8B on a home RTX 3080 with 10GB VRAM for summarizing the first two volumes of "War and Peace" was published. The experiment showed that the main problem with local LLM in such a task is not only limited memory, but also hallucinations at the level of facts, names, and chronology.

Running on 10GB

The author built a pipeline around IlyaGusev/saiga_llama3_8b with 4-bit quantization and ran the model on a home RTX 3080 with 10GB VRAM. The full text of two volumes couldn't fit in memory, so the novel had to be cut by chapters and the size of each fragment had to be limited. After a series of runs, a working compromise became approximately 7500 characters per chunk: less was lost with too much context, more grew the risk of crashes and VRAM overflow.

The stack used transformers and bitsandbytes, and the author checked the accuracy of summaries through Gemini. Along the way, unexpected side effects emerged: Qwen2.5-7B-Instruct once produced a long piece of Python code with library recommendations instead of a summary.

The idea of a "sliding window," where the model summarizes an already prepared summary, was quickly abandoned: quality degraded according to the broken telephone principle, and processing time ultimately became noticeably longer.

Where Did Hallucinations Come From

A naive prompt initially seemed to work: the model produced short summaries of 3-5 sentences, but quickly began confusing surnames, family relationships, and chronology. Pierre Bezukhov could suddenly become the son of the Rostovs, and Prince Vasily Kuragin—his father. When a character database with strict rules was added to the system prompt, errors didn't disappear; they shifted: the network began more confidently formulating factually incorrect conclusions about individual chapters.

The most striking failure occurred with Nikolai Rostov. In the episode after the Schengrabern battle, the model decided that the hero had died, although in the text he was only wounded and later continues the plot. The author explains this as a skew in probabilities: Tolstoy long describes pain, blood, and the feeling of impending death, while the brief confirmation that Rostov is alive appears later and weighs less for the model.

Checking logits showed that the prompt could indeed radically shift the choice of the next token.

"Do not kill the heroes!

Nikolai Rostov survives at Schengrabern".

What Actually Helped

In the working version of the pipeline, the rules became extremely direct: match surnames with the character database, don't invent romantic lines, remember that the action takes place in 1805, and honestly write if an excerpt ends before the resolution. In parallel, the author reduced generation parameters—temperature 0.1, top_p 0.85, and repetition_penalty 1.15. The idea was simple: less creativity, less temptation to continue Tolstoy on your own. And the more stable the answer.

  • 4-bit quantization instead of full-size loading
  • Text cutting by chapters with a limit of about 7500 characters
  • Hard system prompt with character database
  • Low temperature and limited top_p
  • Post-processing rare errors in surnames

Such a set of measures did not make the system error-free, but sharply reduced the number of critical hallucinations. The final evaluation through Gemini 3 Flash gave an average factual accuracy of about 93%, with most chapters staying in the 90-98% range. The most striking mistakes remained at the level of tokens and morphemes: in one place "Pierre Bezdarovsky" appeared, a hybrid of the surname Bezukhov and the word "untalented." The author believes that such rare failures are easier to catch in post-processing than to further complicate the prompt.

What This Means

This case shows an important thing for local LLMs: even on a consumer graphics card, you can build a useful pipeline for long texts, but success depends not only on the model and amount of VRAM. Often hard instructions, generation control, and post-processing decide—that is, engineering around LLM, not one magical button "read the book for me."

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…