DeepSeek and Qwen tried to get through "Everlasting Summer," but got stuck in the labyrinth
Local LLMs were tested in "Everlasting Summer": the Ren’Py game was connected to Ollama via a TCP bridge, and the models were made to choose dialogue options…
AI-processed from Habr AI; edited by Hamidun News
A Selectel blog post on Habr described an experiment in which local large language models were connected to the visual novel "Infinite Summer" and forced to make plot decisions instead of the player. Technically the integration worked, but in practice even strong models often got confused by the answers, slowed down on long context, and led the story to unsuccessful endings.
How the test was set up
They chose "Infinite Summer" specifically because the visual novel consists almost entirely of text, which means it plays to the strong suit of LLMs. The game has 13 endings, and relationships with characters change depending on dialogue and actions, so this format turned out to be a convenient testing ground for checking how the model behaves in a long plot dialogue. An additional advantage was that some of the local models didn't know this game beforehand and couldn't simply "remember" the correct moves.
The technical side was built around Ren'Py, the engine the game uses. The author added a bridge.rpy file to the project, launched a TCP server inside the game, and intercepted key functions: dialogue output through say, choice menus through display_menu, and map interactions through store. This way the game began sending all dialogue outward, while an external coordinator made decisions instead of a human. Models ran locally on a cloud server with 12 vCPU, 128 GB RAM, one H100, and 300 GB SSD through av/harbor, Docker, and Ollama. They had to work around a plot card mini-game separately so the model didn't have to learn additional mechanics unrelated to text-based choices.
Where things broke
After the patch, the game began being controlled externally through Ollama. The coordinator collected dialogue, tagged it with roles system, tool, user, and assistant, and sent the model a simple request: you're faced with a choice, suggest the right option and answer with one digit. On paper the scheme looked straightforward, but even in the first scenes the models started answering like humans: adding explanations, repeating the list of options, choosing a non-existent number, or issuing phrases in the wrong format. Because of this they had to introduce an additional request that separately extracted the answer number from the text.
The second problem turned out to be even more painful: context grew too quickly. In the prologue alone there were 134 lines, on the first day — 862, and the whole game contains tens of thousands of lines. After the first third of the playthrough, each branch point could take 5–7 minutes to process. The solution turned out crude but workable: old messages began to be collapsed into brief summaries in batches of one hundred so that the active dialogue would contain no more than two hundred messages. This noticeably sped up responses and reduced the proportion of strange reactions.
How the test runs ended
After calibration, five local models were sent to the final test: DeepSeek-R1:70b, Qwen3.5:9b, Qwen2.5:3b, gpt-oss:20b, and Gemma3:27b. They all played through the game from the beginning without access to ready-made playthroughs, and the coordinator recorded the choices made, intermediate answers, and reasoning.
The idea was simple: test not the theory but the model's real ability to maintain a plot, navigate branches, and bring a long story to a coherent ending.
- DeepSeek-R1:70b reached the main bad ending in tests but got stuck in a loop in the maze.
- gpt-oss:20b consistently reached the main bad ending without notable surprises.
- Qwen3.5:9b moved quickly but spent more than twenty minutes on one choice.
- Qwen2.5:3b managed to reach a bad ending on Lena's route.
- Gemma3:27b got lost in the maze twice and came to Alice's bad ending in tests.
"The most expensive pseudorandom number generator," the author
described the system after the runs.
The overall result turned out weak not only because of the endings themselves. The key problem manifested in the maze, where it was necessary to account for turns already made and not repeat the same choice. That's where models most often got stuck on the old pattern and reproduced the previous answer even when it already led to a dead end. Given that the game has 13 endings and many storylines break with a single wrong decision near the finale, even a formally working agent remains too unreliable a player.
What this means
The experiment showed that local LLMs can already be connected quite quickly to a text-based game through Ren'Py, Ollama, and a simple network bridge. But this isn't yet a story about an autonomous agent that confidently understands a long plot and strategically plays through a visual novel: without strict answer normalization, context compression, and manual workarounds, such models easily fall into loops, hesitate on choices, and more often come to bad outcomes than good ones.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.