How DeepSeek and Wordstat turned manual keyword collection into a multi-agent SEO system
What started as an attempt to stop copying data from Wordstat into Excel turned into an SEO pipeline with DeepSeek, SERP Veto, and ensemble voting. The…
AI-processed from Habr AI; edited by Hamidun News
An author on Habr demonstrated how the routine transfer of data from Wordstat to Excel turned into a full-fledged SEO tool with multiple agents, model voting, and a separate arbiter. As a result, the system processes about 3000 keywords in 20–30 minutes, leaving only about 5% of disputed queries for manual review.
How the pipeline grew
Initially, the task was straightforward: stop copying keywords manually. The first version of the script simply pulled data from Bukvarix and saved it to a file. Then the author realized that for advertising campaigns this wasn't enough: frequency data in such sources could become outdated within months, meaning budget decisions were based on old data.
So the author added XMLRiver as a fresher source of data from the Yandex ecosystem, and with it three frequency measures: basic, exact, and refined. Later, the project moved away from simple parsing toward full semantic processing. For clustering, the author connected SentenceTransformers, but quickly ran into a typical NLP problem: semantically similar queries don't always belong on one page.
To avoid merging, for example, geo-dependent queries such as repairs in Moscow and repairs in Voronezh, the author layered SERP Veto on top of the embeddings: a check for URL overlap in the search results. Before that step, the list was cleaned of garbage with regex and collapsed through fuzzy deduplication, which removed 30–40% of duplicates before the expensive SERP requests.
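The two cheap filters described above can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the word-order normalization, the `0.9` similarity threshold, and the `min_shared=3` URL cutoff are all assumptions, and the standard-library `difflib` stands in for whatever fuzzy matcher the original pipeline used.

```python
from difflib import SequenceMatcher


def normalize(query: str) -> str:
    """Lowercase and sort the words so that word order does not matter (assumed rule)."""
    return " ".join(sorted(query.lower().split()))


def fuzzy_dedupe(queries: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first query of every near-duplicate group.

    Pairwise comparison is O(n^2), which is acceptable for a sketch.
    """
    kept: list[str] = []
    kept_norm: list[str] = []
    for query in queries:
        norm = normalize(query)
        if any(SequenceMatcher(None, norm, seen).ratio() >= threshold for seen in kept_norm):
            continue  # close enough to something already kept
        kept.append(query)
        kept_norm.append(norm)
    return kept


def serp_veto(urls_a: list[str], urls_b: list[str], min_shared: int = 3) -> bool:
    """Return True when two queries share too few result URLs to sit on one page."""
    return len(set(urls_a) & set(urls_b)) < min_shared


queries = ["apartment repair moscow", "repair apartment moscow", "phone repair"]
unique_queries = fuzzy_dedupe(queries)

# Hypothetical top results for two geo-dependent queries.
moscow_urls = ["m1.example", "m2.example", "m3.example", "agg.example"]
voronezh_urls = ["v1.example", "v2.example", "v3.example", "agg.example"]
split_pages = serp_veto(moscow_urls, voronezh_urls)
```

Running deduplication first means fewer queries reach the paid SERP lookup, which is the ordering the article describes.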
Why one LLM isn't enough
Once the tool could collect keywords, attach frequencies, and cluster them, the most unpleasant part remained: filtering out garbage. This means job vacancies, aggregators, queries with informational intent, and other edge cases that are difficult to remove with simple negative keywords. The author tried giving the task to a single DeepSeek model with a simple prompt, but quickly discovered that without context the model guesses the niche too freely.
To the model, the word "repair" could mean apartments, phones, or an engine. To reduce the chaos, the author put a PlannerAgent in front of the classification step. It receives a description of the niche and generates guidelines for the next step: who the target customer is, which examples to consider relevant, which traps to cut off, and how to handle geography.
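One way to picture the planner's output is as a small structured object that gets rendered into the classifier prompt. The sketch below is hypothetical: the `Guidelines` class, its field names, and the prompt wording are assumptions made for illustration, since the source does not show the actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Guidelines:
    """Hypothetical shape of what a planner step could hand to the classifier."""

    target_customer: str
    relevant_examples: list[str] = field(default_factory=list)
    traps: list[str] = field(default_factory=list)
    geo_rule: str = ""

    def to_prompt(self) -> str:
        """Render the guidelines as a plain-text block for the classifier prompt."""
        lines = [
            f"Target customer: {self.target_customer}",
            "Relevant examples: " + "; ".join(self.relevant_examples),
            "Traps to reject: " + "; ".join(self.traps),
            f"Geography: {self.geo_rule}",
        ]
        return "\n".join(lines)


guidelines = Guidelines(
    target_customer="homeowners ordering apartment renovation",
    relevant_examples=["apartment repair turnkey", "bathroom renovation price"],
    traps=["phone repair", "engine repair", "repairman vacancies"],
    geo_rule="keep only queries without a city or with the target city",
)
prompt_block = guidelines.to_prompt()
```

Keeping the guidelines in a structure like this also makes it possible to log them per run, which helps when the same keyword lands in different categories on different runs.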
In parallel, the author optimized cost: instead of returning full rows, the model began answering with keyword IDs only. This cut the response from about 400 to about 80 tokens per batch and saved 30–40% on large runs.
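The ID-only trick can be sketched like this: keywords are numbered in the prompt, the model is asked to reply with numbers alone, and the script maps them back. The prompt wording, the function names, and the convention that the reply lists rejected IDs are assumptions for illustration.

```python
import re


def build_batch_prompt(keywords: list[str]) -> str:
    """Number the keywords so the model can refer to them by ID."""
    numbered = "\n".join(f"{i}: {kw}" for i, kw in enumerate(keywords, start=1))
    return "Return only the IDs of irrelevant keywords, comma-separated.\n" + numbered


def parse_ids(reply: str, batch_size: int) -> set[int]:
    """Extract integer IDs from the reply and ignore any outside the batch."""
    ids = {int(token) for token in re.findall(r"\d+", reply)}
    return {i for i in ids if 1 <= i <= batch_size}


def split_batch(keywords: list[str], reply: str) -> tuple[list[str], list[str]]:
    """Split a batch into (kept, rejected) using the IDs in the model's reply."""
    rejected = parse_ids(reply, len(keywords))
    kept = [kw for i, kw in enumerate(keywords, start=1) if i not in rejected]
    dropped = [kw for i, kw in enumerate(keywords, start=1) if i in rejected]
    return kept, dropped


batch = ["apartment repair turnkey", "repairman vacancies", "how to repair a wall yourself"]
prompt = build_batch_prompt(batch)
# A made-up reply; the stray 17 is out of range and gets ignored.
kept, dropped = split_batch(batch, "2, 3, 17")
```

The overall saving is presumably smaller than the drop in output tokens because the prompt itself is still billed on every batch; the source does not break this down.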
Why voting was needed
Even after these improvements, three runs over the same set of 671 keywords produced completely stable decisions for only 37.7% of them. The cause turned out to be not the temperature but the process itself: PlannerAgent slightly changed the few-shot examples on every run, and edge queries ended up in different categories. The author then built Ensemble Voting: each batch of 20 keywords runs three times in parallel, and the result is determined by majority vote. If all three answers disagree, the query is sent to the "Review" list and is later analyzed by a separate arbitration agent.
- classification stability increased from 37.7% to roughly 85%
- manual review was reduced to about 5% of queries
- 3000 keywords are processed in 20–30 minutes instead of 3–4 hours
- the cost of a run with three votes is about $0.30
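The voting scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: `classify` is a hypothetical callable that wraps the LLM request and returns one label per keyword, and the thread pool is just one simple way to run the three calls in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Labels = dict[str, str]  # keyword -> category


def majority_label(votes: list[str], review_label: str = "Review") -> str:
    """Return the label with a strict majority, or the review label if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else review_label


def ensemble_vote(
    batch: list[str],
    classify: Callable[[list[str]], Labels],
    runs: int = 3,
) -> Labels:
    """Classify the same batch several times in parallel and vote per keyword."""
    with ThreadPoolExecutor(max_workers=runs) as pool:
        results = list(pool.map(lambda _: classify(batch), range(runs)))
    return {kw: majority_label([result[kw] for result in results]) for kw in batch}


# Stand-in for the model: three canned answers, one per call.
canned = [
    {"repair turnkey": "commercial", "repair forum": "info"},
    {"repair turnkey": "commercial", "repair forum": "garbage"},
    {"repair turnkey": "info", "repair forum": "commercial"},
]


def fake_classify(batch: list[str]) -> Labels:
    return canned.pop()


final = ensemble_vote(["repair turnkey", "repair forum"], fake_classify)
```

With three runs, a keyword goes to review only when all three labels differ, which matches the rule in the article; anything sent there would then be handed to the arbitration step.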
"Fix three lines in the prompt.
Three lines. And I spent a week building the architecture before that."
This phrase captures the author's main conclusion well. After all the architectural work, it turned out that one obvious commercial query consistently ended up in the disputed category simply because the prompt rules didn't include the phrase "turnkey." In other words, the complex multi-agent scheme did indeed improve quality, but didn't eliminate the basic need to validate the prompt itself on edge cases before decking it out with ensembles and arbiters.
What this means
This case is useful not only for SEO specialists. It shows how applied automation driven by real user pain can quickly grow into an agent system if you hit the limits of quality, cost, and output instability. And simultaneously, it reminds us of a more uncomfortable truth: sometimes the main gain comes not from a new agent, but from three well-written lines in the prompt.