RAGAS and RAG metrics: how to stop guessing and start measuring quality
RAG systems often introduce hallucinations or lose relevant context. RAGAS provides four automated metrics for catching this: Faithfulness, Answer Relevance, Context Precision, and Context Recall.

RAG systems are gaining popularity, but they often produce incorrect answers, add fabricated facts, or ignore relevant context. In the third part of our quality engineering cycle, we'll explore how to measure RAG instead of guessing, and how to use RAGAS — a framework that replaces manual verification with automation.
Why RAG metrics are harder than LLM metrics
A standard LLM can be evaluated on benchmark datasets: MMLU, HumanEval, TruthfulQA. RAG adds a retrieval layer, and with it new failure modes. A question like "What was Gates' salary in 1997?" can go wrong in several ways:
- The retriever didn't find a relevant document and returned noise instead
- The model found the document but ignored the relevant fact (inattention)
- The model found the fact but added a hallucination on top, mixing source and fabrication
- The context was relevant, but the answer doesn't address the question (a reasoning error)
To manage these scenarios, specialized metrics are needed. Manual verification of each answer is expensive and doesn't scale to thousands of queries.
RAGAS: four metrics for all cases
RAGAS is an open-source framework for automatic RAG evaluation. Here are its core metrics:
- Faithfulness: is the generated answer grounded in the context? The judge checks whether the LLM added facts that are absent from the sources. Scores range from 0 to 1.
- Answer Relevance: does the answer actually address the question? RAGAS generates questions back from the answer and compares them semantically with the original.
- Context Precision: are the retrieved fragments relevant? It catches a retriever that drifted off topic or returned noise.
- Context Recall: is the context complete? It checks whether the retrieved documents contained all the information needed for a full answer (this metric requires a reference answer).
Each metric points to a bottleneck: poor retrieval, poor generation, or both.
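To make this concrete, here is a minimal sketch of an evaluation run with the ragas Python package. The interface below matches the documented 0.1.x releases (newer versions changed the dataset API), the sample strings are placeholders, and an OpenAI key is expected by default:

```python
# Minimal RAGAS evaluation sketch (ragas 0.1.x-style API; newer
# releases renamed parts of this interface). Requires OPENAI_API_KEY
# by default, since the judge is itself an LLM.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One evaluation sample: the question, the pipeline's answer, the
# chunks the retriever returned, and a reference answer (needed by
# context_recall to judge completeness). Strings are placeholders.
dataset = Dataset.from_dict({
    "question": ["What was Gates' salary in 1997?"],
    "answer": ["The 1997 proxy statement lists a salary of ..."],
    "contexts": [["Excerpt from Microsoft's 1997 proxy statement ..."]],
    "ground_truth": ["The reference answer goes here ..."],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in the 0..1 range
```

A low score on the retrieval metrics points at the search side; a low score on the generation metrics points at the model or the prompt.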
How RAGAS evaluates internally
There's no magic here: RAGAS uses an LLM itself as the judge. For Faithfulness, it takes the generated answer and the context, asks the model to extract the verifiable factual claims from the answer, then checks each claim against the context one by one. The score is the share of claims that the sources actually support.
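Schematically, the procedure looks like this (the prompts and the llm() helper are illustrative placeholders, not ragas internals):

```python
# Schematic Faithfulness computation: decompose, then verify.
# llm() stands for any chat-completion call; prompts are illustrative.
def faithfulness_score(answer: str, context: str, llm) -> float:
    # Step 1: break the answer into atomic, verifiable claims.
    claims = [
        line.strip()
        for line in llm(
            f"List every factual claim in this answer, one per line:\n{answer}"
        ).splitlines()
        if line.strip()
    ]
    if not claims:
        return 1.0  # nothing to verify, nothing to hallucinate

    # Step 2: check each claim strictly against the retrieved context.
    supported = sum(
        llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Reply 'yes' only if the context supports the claim."
        ).strip().lower().startswith("yes")
        for claim in claims
    )

    # The score is the share of claims the context supports.
    return supported / len(claims)
```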
For Answer Relevance, it generates hypothetical questions from the answer (the task in reverse), then computes the cosine similarity (semantic proximity) to the original question. The closer the match, the more relevant the answer.
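The same idea as a sketch; embed() and llm() are placeholders for an embedding model and a generator, not the framework's actual internals:

```python
import numpy as np

# Schematic Answer Relevance: reconstruct questions from the answer,
# then measure how close they land to the real question in embedding
# space.
def answer_relevance(question: str, answer: str, llm, embed, n: int = 3) -> float:
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    q_vec = embed(question)
    generated = [
        llm(f"Write one question that this text answers:\n{answer}")
        for _ in range(n)
    ]
    # Average similarity between the original question and the
    # questions reconstructed from the answer.
    return sum(cosine(q_vec, embed(g)) for g in generated) / n
```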
"If your LLM knows how to lie, it knows how to detect lies," — the
framework's logic.
The entire process runs on LLM calls (at least two or three per evaluated answer), so RAGAS is considered expensive in tokens. But the alternative, hiring people to label answers manually, is even more expensive and slower.
What this means
RAGAS makes RAG engineering reproducible and trackable. Instead of a qualitative "seems to work," you get quantitative metrics that show how each change (new documents, a new model, a new prompt) affects quality.
For small pet projects, RAGAS might be overkill. For enterprise solutions, where mistakes cost money and client trust, it's the periodic table that RAG engineers have been missing for a long time.