Evaluations | ISM CyberRAG

Sprint Comparison

Each sprint was evaluated on the same set of 100 queries (90 in-scope ISM questions, 10 out-of-scope). Scores are RAGAS metric averages from the final evaluation notebooks.

Metric	Sprint 1	Sprint 2	Sprint 3
Faithfulness	0.6834	0.7341	0.8351
Answer Relevancy	0.7216	0.7678	0.9078
Context Precision	0.7885	0.8598	0.8590
Context Recall	0.8224	0.8659	0.9249
Answer Similarity	N/A	0.9057	0.9179

Sprint 1 answer similarity was not measured. Sprint 3 improves faithfulness, answer relevancy, context recall, and answer similarity over Sprint 2, while overall context precision is essentially unchanged (0.8598 to 0.8590); context precision on in-scope queries alone is 0.9545.
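The sprint-over-sprint changes can be recomputed directly from the table. The sketch below is illustrative only: the `scores` dictionary and the `delta` helper are names introduced here, not part of the evaluation notebooks.

```python
# Scores copied from the sprint comparison table (Sprint 1, Sprint 2, Sprint 3).
# None marks a metric that was not measured in that sprint.
scores = {
    "Faithfulness": (0.6834, 0.7341, 0.8351),
    "Answer Relevancy": (0.7216, 0.7678, 0.9078),
    "Context Precision": (0.7885, 0.8598, 0.8590),
    "Context Recall": (0.8224, 0.8659, 0.9249),
    "Answer Similarity": (None, 0.9057, 0.9179),
}


def delta(before, after):
    """Return the change between two scores, or None if either is missing."""
    if before is None or after is None:
        return None
    return round(after - before, 4)


# Map each metric to (Sprint 1 -> 2 change, Sprint 2 -> 3 change).
deltas = {
    metric: (delta(s1, s2), delta(s2, s3))
    for metric, (s1, s2, s3) in scores.items()
}

for metric, (d12, d23) in deltas.items():
    print(metric, d12, d23)
```

The Sprint 2 to Sprint 3 change for answer relevancy comes out at 0.14, and the change for context precision at -0.0008.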

Answer Relevancy

0.9078

Up 0.1400 vs Sprint 2

Context Recall

0.9249

Improved with multi-query retrieval

In-Scope Precision

0.9545

Retrieval quality on answerable questions

OOS Refusals

10 / 10

Blocked by guardrail path
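The gap between in-scope precision (0.9545) and overall context precision (0.8590) can be checked with simple arithmetic. The sketch below assumes the overall score is an unweighted mean over all 100 queries, which the page does not state explicitly; `implied_oos_mean` is a hypothetical helper name.

```python
def implied_oos_mean(overall, in_scope, n_in=90, n_oos=10):
    """Solve overall = (n_in * in_scope + n_oos * oos) / (n_in + n_oos) for oos.

    Assumes the overall score is a plain average across all queries.
    """
    total = n_in + n_oos
    return (overall * total - in_scope * n_in) / n_oos


# Sprint 3 figures from the table and the in-scope precision card.
oos_precision = implied_oos_mean(overall=0.8590, in_scope=0.9545)
print(round(oos_precision, 4))
```

Under that assumption the implied out-of-scope mean is zero to within rounding error, which would be consistent with refused queries contributing no relevant context. The overall figure therefore understates retrieval quality on answerable questions.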

What We Built

Each sprint added new components to the RAG pipeline, improving retrieval quality and system reliability.

Sprint 1

Baseline RAG

Fixed-size chunking (1,000 characters per chunk, 900 chunks in total), vector-only search, Llama 3.1 8B via Groq.
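A minimal sketch of fixed-size chunking follows. Only the 1,000-character size comes from the page; the lack of overlap between chunks and the function name `chunk_fixed` are assumptions made for illustration.

```python
def chunk_fixed(text, size=1000):
    """Split text into consecutive, non-overlapping windows of `size` characters.

    The final chunk may be shorter than `size`. Overlap is not modelled
    because the source does not say whether the baseline used any.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[start:start + size] for start in range(0, len(text), size)]


# Example: a 2,500-character document yields three chunks.
sample_chunks = chunk_fixed("x" * 2500)
print([len(chunk) for chunk in sample_chunks])  # [1000, 1000, 500]
```

Because the split ignores document structure, a chunk can start or end mid-sentence, which is the weakness that structure-aware chunking is meant to address.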

Sprint 2

Improved Pipeline

ISM-aware chunking (643 chunks), hybrid search (BM25 + vector + RRF), cross-encoder reranking, FastAPI web app.
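Reciprocal rank fusion (RRF) combines the BM25 and vector rankings by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic illustration, not the project's code: the function name `rrf_fuse`, the document IDs, and the constant k = 60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, with ranks starting at 1. Higher fused scores rank first.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a deterministic order.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical rankings from a keyword retriever and a vector retriever.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
print(rrf_fuse([bm25_ranking, vector_ranking]))  # ['b', 'a', 'c']
```

RRF uses only rank positions, so the BM25 and vector scores do not need to be normalised onto a common scale before fusion.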

Sprint 3

Production Ready

Multi-query expansion, two-stage OOS guardrail, pipeline explorer, CI/CD, Hugging Face Spaces deployment.
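Multi-query expansion can be sketched as retrieving for the original query plus several rewrites and merging the results. The page does not describe how variants are generated or merged, so everything below is an assumption: `expand` and `retrieve` are hypothetical callables supplied by the caller, and the merge simply deduplicates in first-seen order.

```python
def multi_query_retrieve(query, expand, retrieve, top_k=5):
    """Retrieve for the query and its variants, then merge without duplicates.

    expand(query)   -> list of alternative phrasings (hypothetical; e.g. an LLM call)
    retrieve(query) -> ranked list of chunk IDs (hypothetical retriever)
    """
    seen = set()
    merged = []
    for variant in [query, *expand(query)]:
        for chunk_id in retrieve(variant):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged[:top_k]


# Toy stand-ins so the sketch runs without a model or an index.
toy_index = {
    "password policy": ["c1", "c2"],
    "passphrase requirements": ["c2", "c3"],
}
result = multi_query_retrieve(
    "password policy",
    expand=lambda q: ["passphrase requirements"],
    retrieve=lambda q: toy_index.get(q, []),
)
print(result)  # ['c1', 'c2', 'c3']
```

In the toy example, chunk `c3` is found only through the rewritten query, which illustrates why expansion tends to raise context recall.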

Evaluation Charts

The table above gives exact scores. The selected charts below illustrate sprint progression, guardrail behaviour, category performance, and latency.