Alexandre Courtiol

What production RAG actually cost us

Published 30 June 2026 · AI · LLMs · RAG · Cost

Everyone asks the wrong first question about generative AI. They ask which model. The interesting question is where the money and the effort actually go once the thing is in production. We ran production retrieval-augmented generation at Chantelle two ways, first on a hosted API and then on a model we trained and hosted ourselves, and both times the model was the cheapest part of the bill.

Here is the short answer, because it is the whole point: the dominant costs of production RAG are ingestion, data quality, evaluation, and integration. The model, whether a metered API call or a self-hosted fine-tune, is the smallest recurring line. That held on the hosted API, and it still held after we chose to own the model: the real cost was never the model.

Where does the money actually go in production RAG?

Into getting documents in, keeping them clean, evaluating the answers, and wiring the thing into the systems people already use. Raw model inference is usually the smallest recurring cost, not the largest.

A demo makes the model look like the product. Production reveals it as a component, and a cheap one, sitting on top of a much more expensive apparatus that nobody put on the slide. Getting documents in was the real project: source formats are inconsistent, content is duplicated or stale or contradicts itself, and deciding what to ingest, how to chunk it, and how to keep it fresh as the underlying data changes is where the engineer-months went. The integrations were the second sink. A retrieval system that cannot reach the tools people use is a science project. Orchestrating the flows in n8n, reaching into Magento, handling failures and retries: that unglamorous plumbing was most of the work.
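To make the ingestion decisions above concrete, here is a minimal sketch of two of them: dropping duplicated documents and splitting text into overlapping chunks. It is an illustration, not the pipeline we ran. The function names, the whitespace-and-case normalisation, and the default window sizes are all assumptions chosen for the example.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared on normalised content."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def chunk(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows that overlap, so text near a boundary
    appears in both neighbouring chunks. Sizes are illustrative defaults."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Exact-hash deduplication only catches copies that differ in spacing or case. Near-duplicates, stale versions, and documents that contradict each other need human rules, which is exactly why this part of the budget does not shrink.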

We started on a hosted API, and the model was the cheapest line

In 2025 the stack was deliberately boring: PostgreSQL with pgvector for retrieval, Gemini for generation, n8n to orchestrate ingestion and workflows, Valkey for caching. Inference was a metered API call. You can forecast it, cap it, and cache around it, and we put Valkey in front of the workload so we were not paying to re-answer the same questions. It flattened the bill and sped up responses at the same time.
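The caching idea can be sketched as a cache-aside wrapper around the generation call. This is a hedged illustration, not our production code: `InMemoryCache` is a hypothetical stand-in for a Valkey client so the example runs on its own, `generate` is a placeholder for whatever calls the model, and the one-hour TTL is an arbitrary assumption.

```python
import hashlib
import time
from typing import Callable, Optional


class InMemoryCache:
    """Hypothetical stand-in for a Valkey client: string keys and values, per-key TTL."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)


def cache_key(question: str) -> str:
    """Hash the question after normalising case and whitespace."""
    normalised = " ".join(question.lower().split())
    return "answer:" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cached_answer(
    question: str,
    cache: InMemoryCache,
    generate: Callable[[str], str],
    ttl_seconds: float = 3600.0,  # assumed TTL, tune to how fast the data goes stale
) -> str:
    """Return a cached answer if present, otherwise pay for one generation and store it."""
    key = cache_key(question)
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = generate(question)
    cache.set(key, answer, ttl_seconds)
    return answer
```

This only matches questions that are identical after normalisation, so paraphrases still reach the model. The TTL is the knob that trades freshness against spend.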

Set that against ingestion and data quality, which are human costs, recurring, and much harder to cap. An engineer maintaining an ingestion pipeline does not get cheaper the way an API call does. If you are budgeting an AI project and the model line is the big one, you have either not shipped yet or you are measuring the wrong thing.

So why build and host our own model?

Not because the API was expensive. It was cheap. We moved to our own fine-tuned Gemma for three reasons, none of which the API could give us. Quality: a model fine-tuned on our own ideal task examples beat the generic endpoint on our actual questions. Control: we owned the latency and the versioning, and we were not exposed to someone else’s model changing under us. And, honestly, to prove what was possible in-house, which matters when you are trying to move an organisation.

Picking the model was its own small exercise in not chasing size. I benchmarked Gemma against Qwen and against larger Gemmas on the job we actually had, navigating multi-language e-commerce sites, and the 12B won: its multilingual behaviour was stronger than Qwen's, and the bigger models were not worth their price for our workload. Size is the easiest thing to over-buy in this field.

Owning a model is not free. You trade a metered API call for a GPU bill and the work of serving it, in our case fine-tuning with QLoRA and serving with vLLM on a spot-instance fleet. On its own, for one assistant, that trade is hard to justify. What justified it was that we stopped treating the model as one app’s dependency and started treating it as a platform.

Owning the model turned it into a platform

Once the fine-tuned Gemma was ours, the assistant was only the first thing we ran on it. We pointed Playwright agents at it to walk the critical user-journey paths across all our production sites, the way a customer would, and tell us when something broke. And when a ticket moved into QA in ClickUp, an n8n workflow had the model draft the test case and Playwright run it against staging, then posted the results back onto the ticket as a comment, what it did and what it found, before a human tester ever picked it up. One model we owned, several workloads, all sharing the same GPU fleet we had already paid for.

That is the part you cannot rent by the call. A metered API priced per request does not get cheaper when you find a second and third use for it; owning the model does. The economics flip the moment the thing you own is a shared capability instead of a single feature. That is also the honest answer to “was self-hosting worth it”: for one chatbot, probably not; for a platform three teams build on, easily.

[Figure: three workloads on one owned model. Assistant (Google Chat · n8n, RAG on pgvector, CAG on Valkey · MCP); Synthetic monitoring (Playwright agents on production user journeys); QA agent (ClickUp ticket, test draft, Playwright run on staging). All three run on the fine-tuned Gemma 12B (QLoRA · served with vLLM) on a GPU spot fleet (G4 / G5 / G6 · ~€150 / month).]
One owned model, three workloads, one spot fleet. The amortisation a per-call API cannot match.

What actually made it work?

Retrieval quality and honest evaluation, not prompt cleverness. This did not change when the model did.

A mediocre prompt over excellent retrieval beats a brilliant prompt over mediocre retrieval, every time. If the system pulls the right passages, the model does its job; if it pulls the wrong ones, no amount of prompt engineering saves you, and you have built a confident machine for being wrong. Evaluation was the other half. We measured retrieval and answer quality against a test set instead of trusting the vibe of a good demo, and we ran that evaluation in CI so a change that made answers worse failed the build. Vibes are how AI projects get greenlit and how they quietly fail six months later.
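The evaluation gate described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our harness: recall@k stands in for whatever retrieval metric you prefer, `retrieve` is a placeholder for your retriever, and the test-set shape (question plus the set of relevant document ids) and the 0.8 threshold are hypothetical.

```python
from typing import Callable

# Assumed test-set shape: (question, ids of the documents that should be retrieved).
TestCase = tuple[str, set[str]]


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        raise ValueError("each test case needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    test_set: list[TestCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Average recall@k over the whole test set."""
    if not test_set:
        raise ValueError("test set is empty")
    scores = [recall_at_k(retrieve(question), relevant, k) for question, relevant in test_set]
    return sum(scores) / len(scores)


def check_retrieval(
    test_set: list[TestCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
    threshold: float = 0.8,  # assumed bar; set it from your own baseline
) -> float:
    """Exit non-zero when quality drops below the threshold, which fails a CI job."""
    score = mean_recall_at_k(test_set, retrieve, k)
    if score < threshold:
        raise SystemExit(f"mean recall@{k} = {score:.3f} is below threshold {threshold:.3f}")
    return score
```

Run as a CI step, a change that degrades retrieval stops the build instead of reaching users. Answer quality needs its own scored test set; this covers only the retrieval half.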

So what did it really cost?

Even self-hosted, the model was not the expensive part. We ran the fine-tuned Gemma 12B around the clock for about €150 a month. Three things kept it there: spot instances only, in an AWS region with deep G4, G5 and G6 capacity; a Packer image built to run on any of those GPU generations, so we almost always caught cheap spot capacity; and multi-token prediction to speed up generation. €150 a month, for a model that served the assistant and the test agents alike. The expensive part was still ingestion, data quality, evaluation, and integration, plus the engineering around them. Owning the model added a small, predictable GPU bill and removed the per-call cost, and it paid off because we ran more than one thing on it. The line everyone obsesses over stayed the small one in both regimes.
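The amortisation argument is simple arithmetic, sketched here. Only the €150 monthly GPU bill comes from our numbers. The per-request price and the request volumes are hypothetical inputs you would replace with your own, and the sketch deliberately ignores the engineering time spent serving the model, which is real.

```python
GPU_BILL_EUR_PER_MONTH = 150.0  # fixed monthly spot-fleet bill, from the text


def owned_cost_per_workload(
    n_workloads: int,
    gpu_bill: float = GPU_BILL_EUR_PER_MONTH,
) -> float:
    """A fixed bill shared across workloads: each extra workload lowers the share."""
    if n_workloads < 1:
        raise ValueError("need at least one workload")
    return gpu_bill / n_workloads


def metered_cost(monthly_requests: dict[str, int], price_per_request_eur: float) -> float:
    """A metered bill grows with every request, whichever workload sends it.
    The per-request price is a hypothetical input, not a quoted rate."""
    return sum(monthly_requests.values()) * price_per_request_eur


def break_even_requests(
    price_per_request_eur: float,
    gpu_bill: float = GPU_BILL_EUR_PER_MONTH,
) -> float:
    """Monthly request volume at which the metered bill equals the fixed GPU bill."""
    if price_per_request_eur <= 0:
        raise ValueError("price must be positive")
    return gpu_bill / price_per_request_eur
```

With three workloads, `owned_cost_per_workload(3)` puts each one at a €50 share of the same bill, while `metered_cost` only ever goes up as workloads are added.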

If you are a founder or an engineering leader being asked to “do something with AI,” budget your effort and your money where they actually go. Assume the model is the cheapest part. Put your people on ingestion, data quality, evaluation, and integration, because that is what decides whether you ship something dependable or something that only shines in a demo. And if you do decide to own a model, do it because you have more than one use for it, because a model you own is a platform and a model you rent is a feature.

That discipline is what I bring to the teams I advise, whether as their engineering leader or as a second opinion when they are deciding what an AI project is really going to cost them.