
PSB outlined its approach to RAG in fintech: architecture, metrics, and the testing cycle

PSB shared its approach to evaluating RAG in fintech and showed that the fight against hallucinations starts not with the prompt, but with architecture and…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

PSB has published a practical breakdown of how it evaluates and tests retrieval-augmented generation (RAG) in tasks where the cost of an error is particularly high. Rather than relying on the model's "intelligence", the bank bets on a combination of its own knowledge base, vector search, quality metrics, and regular manual verification.

How RAG Works

PSB notes that the main problem with LLMs is not just weak answers but confidently stated errors. This is where RAG comes in: the system first searches a trusted body of data for relevant information, and only then does the model generate an answer. The knowledge base can be almost anything: documents, a website, an internal repository, or a structured database.

For that search to be fast, the materials must first be split into fragments (chunks) and converted into vectors by an embedding model. How the text is chunked often determines whether the system succeeds. HTML and plain text can be split by paragraph, formalized documents by punctuation, and complex data sets by token count.
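As an illustration of two of these strategies, here is a minimal Python sketch of paragraph-based and token-count-based chunking. It is not PSB's code: the function names are hypothetical, and whitespace splitting is only a stand-in for a real tokenizer, which would come from the embedding model in use.

```python
import re
from typing import Callable, List


def split_by_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_by_token_count(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Group tokens into consecutive chunks of at most max_tokens tokens.

    ASSUMPTION: whitespace splitting stands in for a real tokenizer.
    A production system would pass the embedding model's own tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


# Invented sample text, for illustration only.
doc = "Card fees apply monthly.\n\nTransfers are free up to a limit."
paragraph_chunks = split_by_paragraphs(doc)
token_chunks = split_by_token_count("a b c d e", max_tokens=2)
```

Paragraph splitting keeps each chunk semantically whole, while a token budget gives predictable chunk sizes; real pipelines often add an overlap between neighboring chunks, which this sketch leaves out.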

The article stresses that a token is neither a character nor a word but a unit of text defined by the specific model's tokenizer. After vectorization, the system retrieves the relevant fragments from the index, adds them to the context, and only then asks the model to generate an answer.
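The retrieve-then-generate flow can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `embed` function is a toy word-count vector rather than a real embedding model, the "index" is a plain list, and the prompt wording is invented. The final call to an LLM is deliberately left out.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: lowercase word counts (ASSUMPTION, not a real model)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    """Put the retrieved fragments into the context ahead of the question."""
    context = "\n".join(f"- {text}" for _, text in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Invented sample chunks, for illustration only.
chunks = [
    "the card fee is charged monthly",
    "transfers between own accounts are free",
    "branches are closed on sunday",
]
top = retrieve("what is the card fee", chunks, k=1)
prompt = build_prompt("what is the card fee", chunks, k=1)
```

The resulting `prompt` string is what would be sent to the model; in a real system the linear scan over `chunks` would be replaced by a vector database query.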

Measuring Quality

PSB suggests evaluating RAG not with a single metric but along three dimensions: search quality, answer accuracy, and presentation quality. If the system fails to find the necessary document, no LLM, however strong, will save the result. If the document is found, the next question is whether the model understood it correctly and added nothing that is not in it. Only after that does it make sense to evaluate how readable, useful, and relevant the answer is to the user's question.

  • Hit Rate: whether the system finds relevant documents at all
  • MRR (Mean Reciprocal Rank): how high the best document ranks in the results
  • Factual Accuracy: how many of the statements in the answer are factually correct
  • Usefulness and clarity: whether the answer solves the task without unnecessary digressions
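The two retrieval metrics in the list have standard definitions that are easy to compute once each test question has a ranked result list and a set of known relevant documents. The sketch below uses the common textbook formulas (a hit within the top k results, and the reciprocal rank of the first relevant document); the source article does not say which exact variants PSB uses, and the document IDs are invented.

```python
from typing import List, Set


def hit_rate(results: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Share of queries with at least one relevant document in the top k."""
    if not results:
        return 0.0
    hits = sum(
        1
        for ranked, rel in zip(results, relevant)
        if any(doc_id in rel for doc_id in ranked[:k])
    )
    return hits / len(results)


def mrr(results: List[List[str]], relevant: List[Set[str]]) -> float:
    """Mean of 1/rank of the first relevant document (0 if none is found)."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


# Invented ranked results for three queries and their relevant documents.
results = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]

hit_at_3 = hit_rate(results, relevant, k=3)  # two of three queries hit
mean_rr = mrr(results, relevant)             # (1/2 + 1 + 0) / 3
```

Hit Rate says whether retrieval works at all; MRR additionally penalizes cases where the right document is found but buried low in the list.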

To verify accuracy, PSB combines automatic scoring with comparison against a "gold standard" of answers prepared by humans. Another layer of control is an LLM arbiter: a separate model that evaluates the main model's output. In fintech, however, automation runs into limits: personal data must be removed from the knowledge base, and automatic detection of such data offers no 100% guarantee. That is why manual verification remains a mandatory part of the process.
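One way to picture how the gold standard, the arbiter, and manual review fit together is a scoring function with a pluggable judge. This is a hedged sketch, not PSB's implementation: the judge here is a stub that does an exact match against invented gold facts, whereas a real arbiter would be a call to a separate LLM, and the rule for escalating to manual review is an assumption.

```python
from typing import Callable, List


def factual_accuracy(statements: List[str], is_supported: Callable[[str], bool]) -> float:
    """Share of statements in an answer that the judge accepts as correct."""
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s))
    return supported / len(statements)


# Invented gold-standard facts, for illustration only.
gold_facts = {
    "the first year of service is free",
    "the card can be blocked in the mobile app",
}


def gold_judge(statement: str) -> bool:
    """Stub judge: exact match against the gold facts.

    ASSUMPTION: a real LLM arbiter would replace this function.
    """
    return statement.strip().lower() in gold_facts


answer_statements = [
    "The first year of service is free",
    "The card can be blocked in the mobile app",
    "Cashback applies to every purchase",  # not in the gold standard
]

score = factual_accuracy(answer_statements, gold_judge)
# Hypothetical policy: anything short of a perfect score goes to a human.
needs_manual_review = score < 1.0
```

Keeping the judge as a parameter makes it possible to compare an automatic arbiter against human labels on the same set of statements.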

"RAG is technology, not magic."

How Testing Works at PSB

For testing, PSB applies the classic testing pyramid to RAG, adjusted for the architecture of such systems. At the bottom level, it checks not individual units of code but components: the LLM itself, the vector database, the retrieval settings, and the document chunking. The next level holds API tests, which cover load, responses, the volume of returned chunks, and token counts.
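An API-level check of this kind might look like the sketch below. The response shape (a `chunks` list of objects with a `text` field), the limits, and the function name are all hypothetical, and whitespace splitting again stands in for a real tokenizer.

```python
from typing import Dict, List


def check_retrieval_response(
    response: Dict, max_chunks: int, max_context_tokens: int
) -> List[str]:
    """Return a list of problems found in a retrieval API response.

    ASSUMPTION: the response looks like {"chunks": [{"text": "..."}]};
    tokens are approximated by whitespace-separated words.
    """
    problems: List[str] = []
    chunks = response.get("chunks", [])
    if not chunks:
        problems.append("no chunks returned")
    if len(chunks) > max_chunks:
        problems.append(f"too many chunks: {len(chunks)} > {max_chunks}")
    total_tokens = sum(len(chunk["text"].split()) for chunk in chunks)
    if total_tokens > max_context_tokens:
        problems.append(f"context too long: {total_tokens} > {max_context_tokens}")
    return problems


# Invented responses, for illustration only.
ok_response = {
    "chunks": [
        {"text": "card fee is charged monthly"},
        {"text": "first year is free"},
    ]
}
bad_response = {"chunks": [{"text": "a b c"}] * 4}

ok_problems = check_retrieval_response(ok_response, max_chunks=3, max_context_tokens=50)
bad_problems = check_retrieval_response(bad_response, max_chunks=3, max_context_tokens=10)
```

Returning a list of problems instead of raising on the first one makes the same function usable both in automated tests and in a monitoring report.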

Above those sit E2E scenarios, which examine how the system behaves on real user queries. Manual testing forms a separate layer and remains unavoidable in sensitive domains. The evaluation cycle itself is described as a continuous process.

First, a test dataset is assembled: an LLM can help generate anywhere from hundreds to thousands of questions. The RAG system is then run against this set, the answers and retrieved documents are saved, metrics are calculated, bottlenecks are identified, and the system is refined. For automatic evaluation PSB currently uses RAGAS and is considering its own tooling in the future, including dashboards, CI/CD integration, A/B comparison of versions, and heatmaps of problem areas.
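The cycle described here (run the dataset, save answers and retrieved documents, compute metrics, find bottlenecks) can be outlined in plain Python. This sketch does not use RAGAS and makes no claim about its API; the pipeline is a canned stub, and the dataset fields `question` and `expected_doc` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

# A pipeline takes a question and returns (answer, retrieved document IDs).
Pipeline = Callable[[str], Tuple[str, List[str]]]


def run_evaluation(dataset: List[Dict[str, str]], pipeline: Pipeline) -> List[Dict]:
    """Run every test question through the pipeline and keep the evidence."""
    records = []
    for case in dataset:
        answer, retrieved_ids = pipeline(case["question"])
        records.append(
            {
                "question": case["question"],
                "answer": answer,
                "retrieved": retrieved_ids,
                "hit": case["expected_doc"] in retrieved_ids,
            }
        )
    return records


def summarize(records: List[Dict]) -> Dict[str, float]:
    """Aggregate saved records into metrics (only hit rate in this sketch)."""
    if not records:
        return {"hit_rate": 0.0}
    return {"hit_rate": sum(r["hit"] for r in records) / len(records)}


def failed_questions(records: List[Dict]) -> List[str]:
    """Questions where retrieval missed: candidates for a closer look."""
    return [r["question"] for r in records if not r["hit"]]


def fake_pipeline(question: str) -> Tuple[str, List[str]]:
    """Stub standing in for a real RAG system (ASSUMPTION)."""
    canned = {
        "How do I block a card?": ("Use the mobile app.", ["doc_cards"]),
        "What is the transfer limit?": ("See the tariff.", ["doc_faq"]),
    }
    return canned.get(question, ("I don't know.", []))


# Invented test cases, for illustration only.
dataset = [
    {"question": "How do I block a card?", "expected_doc": "doc_cards"},
    {"question": "What is the transfer limit?", "expected_doc": "doc_tariffs"},
]

records = run_evaluation(dataset, fake_pipeline)
summary = summarize(records)
misses = failed_questions(records)
```

Saving full records rather than only the aggregate is what makes the "identify bottlenecks" step possible: each miss can be traced back to a concrete question and the documents that were returned for it.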

The point of this approach is not academic rigor but the ability to track regressions and improvements over time.
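Tracking degradations between versions comes down to comparing metric snapshots. A minimal sketch, assuming that higher values are better for every metric and that a small tolerance absorbs noise; the function name, the tolerance, and all numbers are invented for illustration.

```python
from typing import Dict, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> Dict[str, Tuple[float, float]]:
    """Return metrics where the candidate is worse than baseline by more than tolerance.

    ASSUMPTION: higher is better for every metric in both snapshots.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }


# Invented metric snapshots for two versions of a system.
baseline = {"hit_rate": 0.90, "mrr": 0.75, "factual_accuracy": 0.88}
candidate = {"hit_rate": 0.92, "mrr": 0.70, "factual_accuracy": 0.875}

regressions = find_regressions(baseline, candidate)
```

A check like this could gate a release in CI/CD: an empty result lets the new version through, while any entry points to the metric that got worse.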

What This Means

For companies not ready to spend large budgets on fine-tuning models, RAG remains the most practical way to quickly improve the accuracy of corporate AI services. But PSB's article makes an important point: retrieval alone guarantees nothing. It takes discipline in data preparation, clear metrics, regular tests, and a human in the loop, especially where a wrong answer can affect money, compliance, or client security.

