Habr AI→ original

Veai showed how to test AI agent in JetBrains IDE without model dependency

Veai explained how it tests an AI agent inside JetBrains IDE without binding UI test stability to LLM behavior. The team separated test runs into smoke and…

AI-processed from Habr AI; edited by Hamidun News
Veai showed how to test AI agent in JetBrains IDE without model dependency
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

Veai shared how it built UI tests for its AI agent inside JetBrains IDE so that tests would not depend on the whims of the language model. The team separated testing of the interface, business logic, and the LLM itself, then assembled a stable pipeline for PRs and overnight runs.

Why this is difficult

The main problem with such products is that the same request to an LLM can return different phrasings, answer length, and even different output formats. If you test an AI plugin like a regular chatbot and expect specific text, UI tests quickly become a lottery.

At the same time, a significant part of the product is quite deterministic: user settings, transitions between chat states, buttons, skill badges, conversation history, and the plugin's internal business logic. This is exactly what Veai decided to separate into a dedicated automation layer.

The team operates on a simple principle: model answer quality and interface functionality are separate concerns, and mixing them in a single test is not advisable. The project already had checks at other levels, including IDE model tests, scenarios for different LLM providers, and a separate agent benchmark. Therefore, UI automation became the top of the test pyramid, rather than an attempt to replace everything else.

Such an approach is important for an IDE plugin: the agent responds both in the chat window and through the terminal, which means the number of places where a test might randomly fail is significantly higher than with a regular web interface.

How they structured the tests

To avoid confusion between goals, Veai divided test runs into several modes. Smoke tests run with every pull request and check basic interface functionality without actual LLM calls. Full runs start at night, work with real servers, and go through the entire chain from plugin to model response. There is also a separate benchmark: not UI tests, but an evaluation of user scenarios and agent quality using the LLM-as-a-Judge principle.

As a result, the team catches UI regressions faster during the day and doesn't lose end-to-end product verification at night.

  • Smoke with every PR: UI and basic logic check without LLM load
  • Full at night: work with real server and wait for response in interface
  • Benchmark separately: evaluation of agent scenarios and result quality
  • Parallel runs across different IDEs: broader coverage without increasing manual effort

In a full run, developers use a very short prompt — 2+2=? Keep it brief. But the goal of the test is not to see the number 4 specifically. The point is different: after necessary preconditions, the agent must reach a Ready state, receive a response from the server, properly stream tokens, and display the result in the interface. This scenario doesn't test model creativity, but rather whether the combination of IDE plugin, license server, LLM server, and internal libraries hasn't broken after the latest change.

"We're not aiming to get answer 4"

What helped in CI

From a technical perspective, Veai relies on JetBrains Starter and Driver libraries. Starter prepares the IDE, configures the test project, and collects run artifacts, while Driver works with the real interface and allows describing elements using Page Object and DSL approaches. If locators are insufficient, the team adds accessibleName to the product code or adjusts obfuscation rules to keep elements discoverable.

Some state is prepared in advance through XML plugin settings, so tests don't have to go through the welcome screen and onboarding each time. Another important element is access to the IDE's internal state through JMX. This allows not only clicking the interface but also verifying what's actually written to the agent's chat, which provider is selected, and what the JSON state looks like from the plugin's perspective.

For CI, the team maintains a matrix of different IntelliJ Platform versions, multiple JetBrains IDEs, feature flags, and obfuscated builds. Heavy checks run at night, Xvfb is used on Linux, and to reduce noise, limited retries are enabled: no more than one test restart and no more than three failures per run.

The practical effect is already visible: dozens of UI tests cover plugin settings, transitions between chat states, skill calls, conversation history, and new user login. Within the first few months, these checks found quite practical bugs: after adding file drag and drop, copy and paste broke; the context progress bar didn't account for all LLM providers; agent selection on the UI didn't work for all configurations; and the error recovery scenario started visually flickering along with the tests. Overnight runs even highlighted instability in one of the LLM server test configurations.

What this means

For products with AI features, this is a good guideline: you shouldn't force UI tests to judge model quality when it's sufficient to check interface resilience and the integration chain. Separating deterministic and non-deterministic parts of the system makes tests faster, more useful, and notably more honest for both developers and QA.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…