Engineering reliable RAG systems: lessons from a Hackathon

In October 2025, we organized the Argusa AI Challenge, a competition bringing together students and data enthusiasts to tackle a realistic Retrieval-Augmented Generation use-case. The event highlighted how quickly interest in RAG is growing, yet also how easily early initiatives underestimate the engineering depth required for reliable, secure, and business-aligned systems.
This work presents insights from a rigorous development process: we first built and stress-tested the challenge internally across five iterations, exploring multiple architectures and development tools, which in turn required developing a formal evaluation method in parallel. This phase proved the challenge's feasibility while exposing critical constraints, particularly around automated evaluation, and established a formal framework for RAG development in enterprise environments. The resulting process was then stress-tested through a public competition with dozens of external participants during the Argusa AI Challenge.
The hidden complexity of enterprise RAG
Although RAG promises grounded answers and faster decision-making, real-world conditions quickly complicate implementation. Enterprise data is noisy, contradictory, and scattered across heterogeneous systems governed by strict permissions. Most early efforts focus narrowly on retrieval or prompting, overlooking how ingestion, preprocessing, indexing, retrieval, and evaluation form an interdependent chain. Weakness in any stage propagates downstream. The experiment surfaced this systemic nature immediately. Incomplete ingestion logic, inconsistent metadata, suboptimal chunking, and retrieval fragility each undermined answer reliability. The core lesson is that RAG must be treated as an engineered system, not an isolated component.
Building a realistic yet safe corpus
Because we could not publicly expose any internal data for this hackathon, we built a synthetic corpus representing a fictional company with coherent processes, staff, and products so that these challenges could be explored safely. This dataset reproduced real enterprise complexity: PDFs, slide decks, structured and semi-structured files, emails, logs, archives, and images. It intentionally included ambiguity, redundancy, and partial inconsistencies. This forced teams to design ingestion pipelines capable of handling diverse formats, recovering text reliably, and preserving traceability. The process revealed a critical insight: automated ingestion cannot be fully trusted. Human oversight remains necessary to enforce consistency and correct deviations before data becomes part of the system’s semantic memory.
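To illustrate what preserving traceability and keeping a human in the loop can look like, here is a minimal ingestion sketch in Python. It is not the pipeline any team used; the IngestedDoc structure and the list of handled extensions are our own assumptions. It only extracts text from plain-text formats and routes everything else to a review queue, reflecting the point that automated ingestion cannot be fully trusted.

```python
from dataclasses import dataclass
from pathlib import Path
import hashlib

@dataclass
class IngestedDoc:
    source: str    # original file path, kept so every chunk stays traceable
    checksum: str  # lets us detect later modifications of the source file
    text: str

def ingest(root: Path) -> tuple[list[IngestedDoc], list[Path]]:
    """Walk a corpus folder, extract text where it is safe to do so,
    and send everything else to a human review queue."""
    docs: list[IngestedDoc] = []
    needs_review: list[Path] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in {".txt", ".md", ".csv", ".log"}:
            raw = path.read_bytes()
            docs.append(IngestedDoc(
                source=str(path),
                checksum=hashlib.sha256(raw).hexdigest(),
                text=raw.decode("utf-8", errors="replace"),
            ))
        else:
            # PDFs, slide decks, archives, images, ... need dedicated parsers
            # and, per the lesson above, a human check before entering the index.
            needs_review.append(path)
    return docs, needs_review
```

In a real pipeline the review queue would feed dedicated parsers for PDFs, slide decks, or images, with the same provenance fields attached to whatever text they recover.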
Exploring multiple RAG architectures during internal testing
Internally, while preparing the hackathon, we implemented end-to-end pipelines using different technology stacks, including cloud-native toolkits, vector databases, and custom Python orchestrations. No single approach proved universally optimal. Performance depended on data characteristics, chunking quality, retrieval strategy, and embedding choices. Accuracy scores varied between 65% and 80%, even under controlled conditions. This confirmed that RAG performance emerges from compounded engineering decisions rather than any isolated technology choice. It also highlighted the need for flexible architectures that adapt to data, infrastructure, and business constraints.
The RAG Pipeline as an Engineering Workflow
This exploratory work on RAG solutions revealed six foundational steps that form the unavoidable scaffold of any performant system: regardless of the tools or technical stack, every effective RAG pipeline ultimately rests on these stages, even if implementations introduce refinements or added complexity.
- Ingestion converts heterogeneous documents into text while ensuring full coverage and complete traceability. Missing files at this stage result in permanent knowledge gaps.
- Preprocessing normalises content, segments it into coherent units, and enriches each segment with metadata. Over-fragmentation dilutes meaning, while weak metadata reduces interpretability and harms retrieval.
- Indexation embeds chunks into vector space. Choices around embedding models and chunk size directly influence retrieval quality, cost, and system scalability.
- Retrieval determines what information the model sees. Balancing semantic similarity, literal matching, and chunk selection is critical to avoid missing relevant context or overwhelming the model with noise.
- Generation synthesises retrieved information into grounded answers while avoiding hallucination, maintaining traceability, and signalling uncertainty where necessary.
- Auto-evaluation closes the loop by identifying defects in retrieval, reasoning, or stability before systems reach end users.
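To make the middle of this scaffold concrete, here is a deliberately simplified Python sketch of preprocessing, indexation, retrieval, and prompt assembly. The bag-of-words "embedding", the chunking parameters, and the function names are illustrative assumptions, not the implementation of any participating team; a production system would swap in a real embedding model, a vector database, and an LLM call, but the shape of the stages stays the same.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Preprocessing: split a document into overlapping word windows.
    Chunks that are too small dilute meaning; too large, and the relevant
    passage gets buried in noise."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Indexation: one (source, chunk, vector) entry per chunk, so every
    retrieved passage stays traceable to its original document."""
    return [(src, c, embed(c)) for src, text in docs.items() for c in chunk(text)]

def retrieve(index: list[tuple[str, str, Counter]], question: str, k: int = 4):
    """Retrieval: rank chunks by similarity to the question, keep the top k."""
    q = embed(question)
    return sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)[:k]

def build_prompt(question: str, hits) -> str:
    """Generation input: ground the model on retrieved chunks, ask for
    source citations, and allow it to signal uncertainty."""
    context = "\n\n".join(f"[{src}] {text}" for src, text, _ in hits)
    return ("Answer using only the context below and cite the bracketed sources. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

Even in this toy form, the levers discussed above are explicit: chunk size and overlap in preprocessing, the number of retrieved chunks k, and the grounding instructions passed to generation.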
This blog provides a high-level synthesis, and readers seeking a detailed breakdown of each of the six steps can download the full white paper at the end.
Understanding failure modes to improve retrieval
Some question types consistently exposed structural weaknesses, such as:
- Exhaustive list questions (“List all the projects that…”) revealed retrieval blind spots, where only the most salient items surfaced.
- Queries requiring implicit understanding exposed semantic mismatches in embedding-based search.
- Reasoning traps showed how missing context invites confident but incorrect conclusions.
These patterns underline retrieval as one of the main performance levers.
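One common mitigation, in line with the balance between semantic similarity and literal matching mentioned earlier, is hybrid scoring. The sketch below is an illustration rather than a prescribed fix: it blends a semantic score (from whatever vector index is in use) with literal keyword overlap, which helps exhaustive "list all ..." questions where purely semantic ranking surfaces only the most salient items. The function names and the alpha weighting are our own.

```python
import re

def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_recall(question: str, chunk_text: str) -> float:
    """Fraction of the question's terms that literally appear in the chunk."""
    q = _terms(question)
    return len(q & _terms(chunk_text)) / len(q) if q else 0.0

def hybrid_score(semantic_score: float, question: str, chunk_text: str,
                 alpha: float = 0.6) -> float:
    """Blend the semantic score from the vector index with literal overlap.
    The literal term keeps exact matches (project names, codes, ...) from
    being drowned out by purely semantic ranking."""
    return alpha * semantic_score + (1 - alpha) * keyword_recall(question, chunk_text)
```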
Systematic evaluation to ensure trust
Evaluating RAG systems is intrinsically difficult, because it requires converting open-ended answers into quantitative signals that can be audited and trusted. No single metric captures accuracy, completeness, and reasoning quality, and LLM-based evaluators introduce variability. Self-consistency checks across prompt variants and dual-model scoring help stabilise evaluations but cannot remove ambiguity. Automated scoring provides scale and structure, yet remains insufficient on its own. It must be paired with targeted human oversight to interpret edge cases and correct misjudgements. Enterprise trust arises from this interplay: automation ensures repeatability, while human review safeguards against hidden failure modes and maintains governance.
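As a rough illustration of how dual-model scoring and self-consistency can be combined with targeted human oversight, the sketch below aggregates scores from several automated judges and flags unstable cases for review. The Judge type, the 0-1 scale, and the disagreement threshold are assumptions made for the example; each judge would in practice wrap an LLM call with a particular model and prompt variant.

```python
from statistics import mean
from typing import Callable

# A "judge" is any callable scoring (question, reference_answer, candidate_answer)
# on a 0-1 scale. In practice each entry wraps an LLM call with a specific model
# and prompt phrasing, e.g. two models x three prompt variants = six judges.
Judge = Callable[[str, str, str], float]

def evaluate_answer(question: str, reference: str, answer: str,
                    judges: list[Judge],
                    disagreement_threshold: float = 0.25) -> dict:
    """Aggregate several automated scores and flag unstable cases.

    The mean gives a single auditable number; the spread between the most
    and least generous judge routes ambiguous cases to human review,
    since automated scoring alone cannot resolve them."""
    scores = [judge(question, reference, answer) for judge in judges]
    spread = max(scores) - min(scores)
    return {
        "score": mean(scores),
        "spread": spread,
        "needs_human_review": spread > disagreement_threshold,
    }
```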
Conclusion
Reliable RAG systems require rigorous pipeline engineering and structured evaluation methods that scale beyond proof-of-concept. Working through this process built expertise in both technical implementation and the governance complexities that emerge in enterprise production, constraints that isolated experiments typically miss. Furthermore, the ability to generate realistic synthetic corpora proved strategically valuable, enabling stress-testing without exposing sensitive data.
All these aspects are presented in more detail in our white paper. Download the full paper for a deeper dive into the architecture, technical insights, and lessons learned across the entire development process.
Author
Solange Flatt
White Paper - RAG Hackathon

