Each sprint was evaluated on the same set of 100 queries (90 in-scope ISM questions, 10 out-of-scope). Scores are RAGAS metric averages from the final evaluation notebooks.
| Metric | Sprint 1 | Sprint 2 | Sprint 3 |
|---|---|---|---|
| Faithfulness | 0.6834 | 0.7341 | 0.8351 |
| Answer Relevancy | 0.7216 | 0.7678 | 0.9078 |
| Context Precision | 0.7885 | 0.8598 | 0.8590 |
| Context Recall | 0.8224 | 0.8659 | 0.9249 |
| Answer Similarity | N/A | 0.9057 | 0.9179 |
Answer similarity was not measured in Sprint 1. Sprint 3 improves faithfulness, answer relevancy, context recall, and answer similarity over Sprint 2, while overall context precision is essentially unchanged (0.8598 to 0.8590); on the 90 in-scope queries alone, Sprint 3 context precision is 0.9545.
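
To make the sprint-over-sprint comparison concrete, the sketch below recomputes the Sprint 2 to Sprint 3 deltas from the table. It also checks one possible reading of the gap between overall and in-scope context precision: if the 10 out-of-scope queries contribute 0 to context precision (an assumption, not something the evaluation notebooks are confirmed to do), the in-scope value of 0.9545 implies an overall value of about 0.859. The `scores` dictionary and `delta` helper are illustrative names, not part of the project code.

```python
# Scores copied from the table above; None marks a metric that was not measured.
scores = {
    "Faithfulness":      [0.6834, 0.7341, 0.8351],
    "Answer Relevancy":  [0.7216, 0.7678, 0.9078],
    "Context Precision": [0.7885, 0.8598, 0.8590],
    "Context Recall":    [0.8224, 0.8659, 0.9249],
    "Answer Similarity": [None, 0.9057, 0.9179],
}


def delta(metric, frm, to):
    """Change in a metric between two sprints (1-indexed), or None if unmeasured."""
    a, b = scores[metric][frm - 1], scores[metric][to - 1]
    if a is None or b is None:
        return None
    return round(b - a, 4)


for name in scores:
    print(f"{name:18s} S2->S3: {delta(name, 2, 3):+.4f}")

# Assumption (not stated in the source): the 10 out-of-scope queries
# score 0 on context precision, so only the 90 in-scope queries contribute.
in_scope_cp = 0.9545
implied_overall = in_scope_cp * 90 / 100
print(f"implied overall context precision: {implied_overall:.3f}")
```

Under that assumption the implied overall value agrees with the reported 0.8590 to three decimal places, which would explain why overall context precision looks flat even though in-scope precision is high.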
Each sprint added new components to the RAG pipeline, improving retrieval quality and system reliability.
Sprint 1: fixed-size chunking (1,000 characters per chunk, 900 chunks), vector-only search, and Llama 3.1 8B served via Groq.
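
As a minimal sketch of fixed-size chunking, the function below splits text into consecutive 1,000-character pieces. It assumes no overlap between chunks and no sentence-boundary handling, since the source specifies neither; `chunk_fixed` is a hypothetical name, not the project's actual function.

```python
def chunk_fixed(text: str, size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters.

    Assumption: chunks do not overlap and boundaries ignore sentence
    or section structure.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Usage: a 2,500-character document yields two full chunks and one partial chunk.
chunks = chunk_fixed("a" * 2500)
print(len(chunks), [len(c) for c in chunks])
```

Because boundaries fall wherever the character count lands, a chunk can cut a control or paragraph in half, which is the kind of weakness structure-aware chunking is meant to address.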
Sprint 2: ISM-aware chunking (643 chunks), hybrid search (BM25 + vector, fused with reciprocal rank fusion, RRF), cross-encoder reranking, and a FastAPI web app.
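
The fusion step in hybrid search can be sketched as follows. Reciprocal rank fusion scores each chunk as the sum of 1 / (k + rank) over every ranking it appears in. The constant `k = 60` is a commonly used default and an assumption here, as are the function name `rrf_fuse` and the example chunk IDs; the project's actual parameters may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank)) over the rankings that contain it,
    with ranks starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from the two retrievers.
bm25_ranking = ["c2", "c7", "c1"]
vector_ranking = ["c7", "c3", "c2"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)  # ['c7', 'c2', 'c3', 'c1']
```

Chunks that both retrievers return (`c7`, `c2`) rise above chunks found by only one, without requiring BM25 and vector similarity scores to be on the same scale.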
Sprint 3: multi-query expansion, a two-stage out-of-scope (OOS) guardrail, a pipeline explorer, CI/CD, and Hugging Face Spaces deployment.