Prompt Injection Isn't a Bug. That's the Whole Problem.
Late last year we sat in on a debrief for a mid-sized SaaS company that had just shipped an AI assistant on top of their customer support workflow. A week after launch, an attacker pasted a long block of text into a support ticket. Buried in the middle, in plain English, was an instruction telling the assistant to fetch the conversation history for any customer whose email contained a specific domain and reply with the contents.
It worked. Not on the first try, and not against every prompt configuration, but it worked often enough to matter. The interesting part was the post-mortem. Nobody had introduced a vulnerability. The system did exactly what it was designed to do: read the input, decide how to respond, call the tools it had access to. The model could not tell the difference between the user it was trying to help and the text that user had pasted from somewhere else.
That is prompt injection. It is not a flaw that will be patched next quarter. It is a property of how language models work.
The vocabulary that actually matters
When teams ask us to test an LLM feature, the first thing we do is split the threat into two categories.
Direct prompt injection is what most people picture. The user types something hostile into the chat box. They try to make the model ignore its system prompt, reveal hidden instructions, or behave in a way the operator did not authorize. This is the easy case. The attacker and the victim are the same person.
Indirect prompt injection is the interesting case. The hostile instructions arrive inside content that the model is asked to process: a webpage, an email, a support ticket, a calendar invite, a PDF, a row in a database, a comment in a code file. The user is not the attacker. The user is the target. The model reads the malicious text along with everything else and treats it as part of its task.
Indirect injection is where the real damage happens, because it scales. A single poisoned document, a single comment on a public repo, a single email subject line can affect every user whose assistant reads it.
Why filters keep losing
For three years now, vendors have been promising that the next generation of guardrails will solve this. We have watched several of those promises crash into production traffic. They follow a pattern.
A team builds a classifier that flags inputs containing things like "ignore previous instructions" or "you are now". It catches the toy attacks. Then someone discovers that "kindly disregard the preceding context" gets through. Then someone discovers that the instructions can be written in base64, or in another language, or in pig latin, or hidden in the alt text of an image the model can see. Then someone discovers that the model itself can be asked to translate the instructions and execute the translation.
The reason the cat-and-mouse keeps going is that the model is not parsing the prompt like a compiler parses code. It is doing statistical token prediction on a single stream of text. Nothing in that stream is marked as data versus instruction. The system prompt is not in a privileged region of memory. It is just words near the start of the context window, words the model can be persuaded to weight less.
You can raise the cost of an attack. You cannot subtract the attack surface, because the attack surface is the design.
The real attacks moved on from jailbreaks
If you spend time on social media, you might think prompt injection is mostly people getting chatbots to write profanity or describe how to make explosives. Those headlines were the 2023 story. The 2026 story is much quieter and more financially motivated.
The patterns we see in real engagements:
None of these require the attacker to defeat a content filter. They require the attacker to write a paragraph in plain English.
Why agents make this materially worse
A chatbot that can only respond with text is a contained risk. The worst case is that it says something embarrassing. An assistant that can read files, send messages, call APIs, run code, and update records has a blast radius bounded only by its credentials.
The industry calls this "agentic AI". The threat model is closer to handing a junior employee a corporate credit card and a list of every API key in the company, then letting strangers slip notes onto their desk.
We have started asking clients three questions when they ship agent features.
1. If this agent received a hostile instruction it believed, what would change in the world? 2. Who would notice, and how long would it take? 3. Could the change be undone?
If the answer to the first question is "anything any user can do," the answer to the second is "the customer when they complain," and the answer to the third is "not really," you have not built an agent. You have built a confused deputy with administrative privileges.
What actually helps
We do not have a clean fix to offer. What we have is a set of architectural moves that reduce the damage when, not if, an injection lands.
Treat the model as untrusted input, always
The output of a language model is data, not a directive. Anything an agent is about to do based on that output should pass through validation the way you would validate a user-submitted form. If the model produces a SQL query, parameterize it. If it produces a URL, check it against an allowlist. If it produces a function call, the function should verify its arguments are coherent and that the model had any business invoking it on behalf of this user.
Separate authority from context
The principle here is borrowed from older sandboxing work. One model reads the untrusted content and produces a structured description of what the user appears to want. A second model, which never sees the untrusted content directly, decides what to do based on the structured description and the user's actual permissions. The first model can be fooled. The second model cannot be fooled by content it cannot see.
This is sometimes called the "dual LLM" pattern. It is not free. Latency goes up, costs go up, the abstraction is awkward. We still recommend it for any agent that touches money, identity, or production data.
Make sensitive actions need a human
If an action is destructive or expensive, the model should be allowed to propose it and not allowed to execute it. The user clicks the button. Yes, this slows the workflow down. It is also the difference between an AI that occasionally misfiles a ticket and one that empties a wire transfer queue overnight.
Minimize the credentials the agent holds
A retrieval agent does not need write access to the database. A scheduling agent does not need to read the contents of customer documents. The smaller the set of capabilities, the less interesting the agent is as a target. We see teams default to giving the model "everything the user can do" because it is convenient. It is also the worst possible blast radius for an injection.
Log the prompts, not just the outputs
When an incident happens, the question we want to answer is "what did the model see right before it did that?". Most platforms log the response. Far fewer log the full prompt context, the retrieved documents, the tool call arguments, and the intermediate reasoning. Without those, you are debugging a black box. With them, you can at least find the poison.
Test with adversarial content during CI
We have built injection corpora for several clients. They are not large. A few hundred carefully crafted samples covering common patterns is enough to catch most regressions when a team changes prompts, swaps models, or adds a tool. Run the corpus on every change. Track the failure rate over time. Treat a rise as a real signal, not a flaky test.
How we test for it
When we are hired to assess an AI feature, we work through layers.
We start by mapping the trust boundaries. Where does untrusted text enter? What tools or actions can the model invoke? What credentials does the application carry on the user's behalf? Where does the model's output influence later decisions?
Then we probe each boundary. Direct injection through the user input. Indirect injection through every document type the model will ingest. We look for tool calls that can be coerced, output channels that can be turned into exfiltration paths, and downstream systems that trust the model too much.
We pay particular attention to the seams. Retrieval pipelines, cross-feature interactions, anywhere one model's output feeds another. Most of the bugs we find live in the seams.
We also look at the operational side. Logging completeness, prompt versioning, the rollback story, the kill switch. If a model starts behaving badly at 2 a.m., what can on-call actually do? We have shipped reports where the highest-priority finding was not a specific injection but the lack of any safe way to disable the feature without redeploying.
An honest take on where this lands
If you are building with language models in 2026, you are building on top of a class of system that does not have a robust separation between content and instructions. There is academic work on architectures that might change that. None of it is in production. The current state of the art is layered mitigation and conservative blast radius.
That is not a comfortable answer. It is the one that matches what we see when we open the hood.
The teams shipping AI features safely are the ones that internalized this early. They do not ask "how do we stop prompt injection." They ask "how do we make sure an injection cannot do real harm." Those are different questions, and the second one has answers.
If you are working on an LLM feature and want a second pair of eyes on the threat model or the architecture, reach out. We will be honest about what we find.