Your AI can't go where your data can't

12 min read · Evan Ritter

Tags: ai, llm, compliance, infrastructure

On local LLM inference as a compliance posture

The default story for adding AI to a product, in 2026, goes something like this. You pick an API — OpenAI, Anthropic, Google, whoever's having the best week — wire it into your application, write a system prompt, and ship. The latency is fine. The quality is excellent. You're not in the model-hosting business and you don't want to be. You pay by the token, the bill is small enough at first that nobody asks questions, and you get on with building the actual product.

For most consumer-facing applications this is correct. I don't want to talk anyone out of it. The model providers are competing hard on price and quality, the SDKs are mature, and the operational burden of running your own inference is real. If your users are individuals, the data they send is theirs to send, and your only job is to be a useful pipe between them and a good model — by all means, pick an API and get on with it.

But there's a class of business where this default quietly doesn't work, and the people building those businesses are reaching for it anyway because nobody's articulated the alternative properly. This is a post about that class of business and that alternative.

The constraint nobody quite says out loud

Here's the thing about a lot of regulated data: where it can go is part of what makes it regulated. Healthcare records, certain categories of financial data, defence-adjacent material, legal e-discovery, audience measurement, regulated telecoms metadata, anything covered by a strict data-processing agreement with a counterparty — the contracts and the law don't just say "be careful with this." They say "this data lives on these systems, processed by these people, and any movement of it off those systems is a notifiable event that probably wasn't notified."

The convenient mental model for most of us is that an API call to OpenAI or Anthropic is a kind of function call — you put data in, you get a response back, nothing really moves. But that's not how data-protection law or most commercial data agreements see it. The API call is a data egress event. The data leaves your network, travels across the public internet, is processed on a third party's infrastructure, and is held there for at least as long as it takes to return a response. Whatever the API provider's data-handling terms say — no training on your data, deletion within thirty days, EU residency, all of it — none of that is the same thing as the data not leaving your network in the first place.

For a lot of compliance regimes this distinction matters enormously. Contractual data-handling promises are not equivalent to data residency. They might be enough for your lawyers. They might be enough for your auditors. But they often aren't enough for the regulator who actually wrote the rules, and they almost certainly aren't enough for the client who made you sign the data-processing agreement in the first place.

If you've ever sat in a meeting where someone asked "but does the data leave our infrastructure?" and the honest answer was "yes, briefly, every time we call the AI feature", you know the feeling.

What this forces

The constraint is simple to state. If your data can't leave your network, your inference can't either. That means the model has to come to the data, rather than the other way round.

A few years ago this would have been a sentence to make you wince. Self-hosting an LLM in 2022 meant data-centre-class GPUs, a research engineer who knew how to babysit them, and a tolerance for outputs that were noticeably worse than what you could get from a hosted API. The maths just didn't work for most companies.

In 2026 the picture is genuinely different. The maths still doesn't work for everyone — for general-purpose chatbots competing on raw quality, the hyperscaler APIs are still in front — but for the regulated-data class of business, the gap has closed far enough that local inference is now the obvious answer rather than the expensive workaround.

Concretely, here's what an entry-level production setup looks like. One workstation-class server. A single mid-range consumer GPU — something like an RTX 5060 Ti with 16GB of VRAM, available right now at UK retailers for around £390. Ollama as the inference server, which is one shell command to install and runs as a systemd service like anything else. Llama 3.1 8B Instruct, quantised to 4-bit, which fits comfortably in the GPU's memory with room to spare for context. A small Node or Python service in front of it to handle prompt construction and post-processing.

That's it. Total hardware cost, roughly £2,000 all-in. Total setup time, an afternoon. The thing actually works. It generates a few hundred tokens of competent prose in a few seconds, holds enough context for any realistic prompt-plus-payload combination, and never makes an outbound network call.
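A minimal sketch of the "small service in front" might look like the following. It assumes Ollama is listening on its default local port and that the model was pulled under the tag `llama3.1:8b`. The function names, the report payload, and the loopback check are illustrative choices of mine, not a prescribed design.

```python
import ipaddress
import json
import urllib.request
from urllib.parse import urlparse

# Assumed defaults: Ollama on localhost:11434, model pulled as "llama3.1:8b".
# Adjust both for your own install.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"


def is_loopback(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other DNS name is treated as external


def build_request(payload: dict, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    if not is_loopback(url):
        raise ValueError(f"refusing to send data outside the host: {url}")
    prompt = (
        "Summarise the following report data in plain English.\n\n"
        + json.dumps(payload, indent=2)
    )
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarise(payload: dict) -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(payload), timeout=120) as resp:
        return json.loads(resp.read())["response"].strip()
```

The guard here only accepts loopback addresses. If your compliance boundary is a network segment rather than a single host, you would widen the check to an allow-list of internal addresses.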

The quantisation is worth a paragraph because it's the part most people haven't thought about. "Q4" means the model's weights have been compressed from 16-bit floating point down to 4-bit integers. You lose some precision — there's a measurable quality gap between an 8B model at full precision and the same model at Q4 — but the gap is much smaller than you'd expect, and the compression is what makes the model fit on a £390 GPU instead of needing a £5,000 one. For most internal-tool workloads, Q4 is the right point on the price-quality curve, and you can move up to Q8 or full precision later if a specific use case demands it.
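The arithmetic behind that claim is simple enough to sketch. This is a back-of-envelope estimate for the weights only, assuming a nominal 8 billion parameters. Real quantised formats store some extra scaling data, so they average somewhat more than 4 bits per weight, and the context cache and runtime overhead come on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


PARAMS_8B = 8e9  # nominal parameter count, used here as a round figure

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS_8B, bits):.0f} GB of weights")
```

At 16 bits the weights alone would fill a 16GB card before any context is loaded. At 4 bits they take roughly a quarter of it.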

The honest tradeoffs

I'd be doing the argument a disservice if I didn't talk about what you give up. There are real tradeoffs and pretending otherwise is how you end up with a regretful blog post in twelve months.

Token generation will be slower than a hyperscaler API. Not catastrophically: a well-tuned local 8B model on a consumer GPU produces tokens at a perfectly conversational pace. But the API providers are running highly optimised inference infrastructure at scale, and they will beat you on speed. If your use case is "generate a 300-word narrative summary attached to an overnight report", this is fine; nobody notices the difference between two seconds and four. If your use case is "autocomplete every keystroke in a code editor", it isn't fine. Pick your use cases accordingly.

The quality ceiling is lower. An 8B model, even a well-trained one, is not GPT-5 and is not Claude. For bounded tasks — summarising a structured payload, converting natural language to a database query, generating a templated piece of editorial — the gap matters far less than you'd expect, because the task itself constrains what good looks like. For open-ended creative or analytical work the gap is real and noticeable. Be honest with yourself about which kind of work your AI features actually do. Most internal-tool AI is bounded, and bounded tasks are what local models are good at.

You're now in the inference-hosting business. That means GPU drivers, model updates, power draw, thermal management, capacity planning for concurrent users. With a single mid-range GPU you can comfortably support three to five simultaneous active users; if you need more, you need either a bigger GPU or more of them, and the cost curve starts to look less attractive. None of this is hard for a team that already runs its own infrastructure, but it's a step up in operational complexity from "we call an API."
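One simple way to make that capacity limit explicit is a gate in the service that sits in front of the model. The sketch below is an assumption about how you might do it, not something Ollama requires. `GpuGate` and `MAX_IN_FLIGHT` are hypothetical names, and the limit of four is just the middle of the three-to-five range above.

```python
import threading
from typing import Callable

MAX_IN_FLIGHT = 4  # assumption: middle of the three-to-five range


class GpuGate:
    """Allow at most `limit` inference calls at once; others wait, up to a timeout."""

    def __init__(self, limit: int = MAX_IN_FLIGHT, wait_seconds: float = 30.0):
        self._slots = threading.BoundedSemaphore(limit)
        self._wait_seconds = wait_seconds

    def run(self, infer: Callable[[str], str], prompt: str) -> str:
        # Wait for a free slot, but give up rather than queue indefinitely.
        if not self._slots.acquire(timeout=self._wait_seconds):
            raise TimeoutError("inference server is at capacity")
        try:
            return infer(prompt)
        finally:
            self._slots.release()
```

Failing loudly at the limit gives you a capacity signal to monitor, which is usually more useful than letting every request slow down together.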

You're betting on open-weights models keeping pace. They have been, and the trajectory is good: every six months the gap between the best open-weights model and the best closed one narrows a little. But it's a bet, not a certainty. If you need state-of-the-art quality today and the open-weights world hasn't caught up to it, you have a hard choice between accepting lower quality and breaking your compliance posture.

These tradeoffs are real. They are also, for the regulated-data class of business, the price of being able to use AI at all. The alternative isn't "use the hyperscaler API and have the best of both worlds." The alternative is "don't use AI on this data." Local inference is what makes the conversation possible in the first place.

Compliance posture as architecture

Step back from the implementation detail and there's a broader point worth pulling out. For most of the last decade, the answer to "can we use AI on this data?" in regulated industries has been one of two unsatisfactory things. Either "yes, via an API, trust the data-handling terms", which made the lawyers uncomfortable and the regulators sceptical. Or "no, it's too sensitive", which left the business sitting on data it couldn't extract intelligence from while everyone else built AI features around it.

Local inference makes a third answer possible: "yes, on our terms". The data stays where it is. The model comes to it. The infrastructure boundary that defines what the regulators care about (the network, the building, the contractually-defined processing environment) is the same boundary the inference happens inside.

That changes the shape of the conversation with auditors and clients in a way that's worth more than the tokens-per-second number. The cloud-LLM default treats AI as a service you consume. Local inference treats it as infrastructure you operate. For some businesses that distinction is fashion. For others it's existential.

It also changes the design lens you use when you're picking models and architecting features. If your binding constraint is "where can my data go," then "which model scores highest on this benchmark" is the wrong first question. The right first question is "what model can I run inside the boundary?" — and then, of those, which is best. That ordering matters. It rules out a lot of options early, but it also clarifies the remaining ones enormously.
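The ordering can be written down as filter first, rank second. In the sketch below, the `Candidate` fields, the headroom figure, and every entry in the example list are placeholders invented to show the shape of the decision. They are not measurements of any real model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    open_weights: bool   # can it be downloaded and run on our own hardware?
    weights_gb: float    # approximate memory needed for the weights
    eval_score: float    # score on your own task-specific evaluation


def choose_model(candidates: list[Candidate], vram_gb: float,
                 headroom_gb: float = 4.0) -> Candidate | None:
    """Filter by the boundary first, then rank what is left by quality."""
    runnable = [
        c for c in candidates
        if c.open_weights and c.weights_gb + headroom_gb <= vram_gb
    ]
    return max(runnable, key=lambda c: c.eval_score, default=None)


# Placeholder entries with invented numbers, purely to show the ordering.
candidates = [
    Candidate("hosted-frontier-model", open_weights=False, weights_gb=0.0, eval_score=0.95),
    Candidate("local-8b-q4", open_weights=True, weights_gb=5.0, eval_score=0.80),
    Candidate("local-70b-q4", open_weights=True, weights_gb=40.0, eval_score=0.90),
]
print(choose_model(candidates, vram_gb=16.0).name)  # local-8b-q4
```

The highest-scoring entry never reaches the ranking step because it fails the boundary test, which is the point of asking the questions in this order.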

What it costs

The numbers are worth being concrete about because they're more favourable than most people assume.

A single RTX 5060 Ti 16GB at current UK prices: around £390. A workstation-class server to run it in, assuming you don't already have one: £1,500 to £2,000, depending on storage and RAM choices. Power draw, idle: a few watts. Power draw under sustained inference: 100 to 150 watts. At UK electricity prices of around 24p per kWh, even running flat out 24 hours a day, you're looking at less than £30 a month in electricity.
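For anyone who wants to check the electricity figure, or rerun it with their own tariff, the calculation is sketched below. It assumes a constant load at the quoted draw for a 30-day month. The rest of the server would add to this.

```python
def monthly_power_cost_gbp(watts: float, pence_per_kwh: float,
                           hours_per_day: float = 24.0, days: int = 30) -> float:
    """Electricity cost in pounds for a constant load over one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * pence_per_kwh / 100


print(f"£{monthly_power_cost_gbp(150, 24):.2f}")  # £25.92
```

At the top of the quoted range, 150 watts around the clock is 108 kWh a month, which at 24p per kWh stays under the £30 mark.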

So all-in, capex of roughly £2,000 and opex of perhaps £40 a month including a generous allowance for power and incidentals.

Compare that to API costs at any plausible production volume. A team running modest AI features on hosted APIs typically spends somewhere between £200 and £2,000 a month, depending on volume and which model they're calling. Heavier use scales quickly into five figures per month. The break-even point against owning the infrastructure outright is usually somewhere between six and eighteen months, depending on volume, and after that the local-inference cost curve is essentially flat while the API curve keeps growing with your usage.
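A rough break-even check, using the capex and opex figures from this section, can be sketched as follows. It counts only hardware and running costs. Engineering time spent operating the box is deliberately left out, and including it would lengthen the payback period.

```python
def break_even_months(capex: float, api_monthly: float,
                      local_opex_monthly: float) -> float:
    """Months until cumulative API spend exceeds capex plus local running costs."""
    saving = api_monthly - local_opex_monthly
    if saving <= 0:
        return float("inf")  # local hardware never pays back at this volume
    return capex / saving


for api_spend in (200, 500, 2000):
    months = break_even_months(2000, api_spend, 40)
    print(f"£{api_spend}/month on APIs: break-even after ~{months:.1f} months")
```

At £200 a month of API spend this gives 12.5 months on hardware costs alone, and the payback shortens quickly as the monthly spend rises.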

This isn't an argument against API spend in general. For a business where the data is unrestricted and the priority is shipping fast, that monthly spend is a perfectly reasonable price for never having to think about GPU drivers. The argument is narrower than that. It is: for a business that has a compliance reason to keep inference local, the cost of doing so is far lower than it used to be, and is now firmly in the territory where the maths works on cost alone.

What to take from this

I'm not arguing that everyone should self-host their LLMs. Most teams shouldn't. The hyperscaler APIs are excellent, the model quality is genuinely better than what you'll run locally on a £390 GPU, and the time you save by not running your own inference infrastructure is real time you can spend on the actual product.

What I am arguing is that there's a class of business (bigger than most people think, and growing as regulation catches up with where AI features have already gone) where the API default is quietly wrong. For those businesses, "where can my data go" is a more useful design lens than "which model scores highest on the benchmarks", and the teams that figure this out early will have an easier time with regulators, easier conversations with clients, and a foundation that doesn't have to be ripped out when the rules tighten.

The good news is that the alternative is no longer exotic. It's a £390 GPU, a free inference server, an open-weights model, and an afternoon. Whether you should reach for it depends entirely on where your data is allowed to go — but if you've ever sat in that meeting and given the honest answer about API calls being data egress events, you already know which side of the line your business is on.