AI Security for Defenders

Guide · Beginner · AI Security · 18 min

By Dragons Community AI Security Team · Updated June 13, 2026 · ai-security · defenders · prompt-injection

Machine learning and large language models (LLMs) are now woven into products, internal tooling, and security operations themselves, and that shift quietly rewrites the attack surface your team has to defend. The classic CIA triad still applies, but AI systems add new failure modes: behavior driven by untrusted natural-language input, learned weights that can be poisoned, and probabilistic outputs that are easy to over-trust. This guide gives defenders a practical mental model rather than a math-heavy ML deep dive. You will learn the major AI-specific risk categories at a conceptual level, the governance frameworks that organize them, and concrete first steps a security team can take this quarter. Everything here is defensive: we describe how attacks work in principle so you can recognize and mitigate them, not reproduce them.

How AI changes the attack surface

Traditional software has a clear boundary between code (trusted instructions) and data (untrusted input). LLM-based systems erode that boundary because the same channel carries both the developer instructions and the user-supplied content, and the model interprets all of it as language. This means a string of text that arrives as data can effectively become an instruction. Conventional applications face a similar problem in injection flaws such as SQL injection, but they can fix it structurally with parameterized queries; LLMs currently have no equivalent mechanism that reliably separates instructions from data.

AI systems also introduce new trust dependencies that were never part of your software bill of materials: pre-trained model weights, third-party datasets, embedding models, vector databases, and plugin or tool integrations. Each of these is a place where an attacker can influence behavior without ever touching your source code. A defender who maps an AI feature should therefore think about three layers: the model itself, the data that flows into and out of it, and the actions the system is allowed to take on the model's behalf.

Finally, AI outputs are probabilistic and confident-sounding, which encourages humans and downstream code to trust them more than they should. Over-trust is itself a vulnerability: an analyst who pastes an LLM-suggested command into a terminal, or an app that feeds model output straight into a database query, has handed control to an untrusted component. Treat the model as a useful but unreliable advisor, never as an authority.

▸Inventory every place an LLM or ML model touches your environment
▸Identify which inputs reach the model and whether any are attacker-controlled
▸Map what actions the AI system can take (read, write, execute, send)
▸Flag any path where model output is consumed without human or code validation
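The checklist above can be captured as a small, reviewable inventory. The following is a minimal sketch, not a standard schema: the `AIFeature` fields, the `HIGH_IMPACT_ACTIONS` set, and the example feature names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AIFeature:
    """One place where a model touches the environment (hypothetical schema)."""
    name: str
    attacker_controlled_inputs: list[str] = field(default_factory=list)
    actions: set[str] = field(default_factory=set)  # e.g. {"read", "write", "execute", "send"}
    output_validated: bool = False


# Actions that change state or leave the environment; an assumption for this sketch.
HIGH_IMPACT_ACTIONS = {"write", "execute", "send"}


def review_flags(feature: AIFeature) -> list[str]:
    """Return human-readable reasons this feature deserves a closer look."""
    flags = []
    risky_actions = feature.actions & HIGH_IMPACT_ACTIONS
    if feature.attacker_controlled_inputs and risky_actions:
        flags.append(
            "attacker-controlled input can reach actions: "
            + ", ".join(sorted(risky_actions))
        )
    if not feature.output_validated:
        flags.append("model output is consumed without validation")
    return flags


# Example inventory with made-up features.
inventory = [
    AIFeature("support-chatbot", ["chat message", "retrieved web page"], {"read", "send"}),
    AIFeature("log-summarizer", [], {"read"}, output_validated=True),
]

for feature in inventory:
    print(feature.name, review_flags(feature))
```

Even a spreadsheet works for this; the point is that inputs, actions, and output validation are recorded per feature so the risky combinations stand out.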

Prompt injection: the number one LLM risk

Prompt injection is the technique of smuggling adversarial instructions into the text an LLM processes so that the model follows the attacker's intent instead of the developer's. It tops the OWASP Top 10 for LLM Applications because it is easy to attempt, hard to fully prevent, and applies to nearly every LLM feature. Conceptually it works because the model cannot reliably tell the difference between the trusted system prompt and the untrusted content sitting next to it in the same context window.

Direct prompt injection happens when a user types adversarial instructions straight into a chat or input box, attempting to override the system prompt, extract hidden instructions, or coax disallowed behavior. Indirect prompt injection is often more dangerous because it is less obvious: the malicious instructions live inside content the model retrieves on its own, such as a web page, a PDF, an email, or a document pulled into a retrieval-augmented generation (RAG) pipeline. The user never sees the payload, yet the model may follow it when it reads the poisoned source.

There is no single setting that eliminates prompt injection, so defense is layered. The most important principle is to assume injection will sometimes succeed and to limit the blast radius: restrict what the model can do, validate its output before acting on it, and require human approval for anything sensitive. We expand on these techniques in the intermediate guide on securing LLM applications.

▸Treat all text the model reads, including retrieved documents, as untrusted
▸Never grant an LLM more privilege than you would grant the source of the least-trusted input it processes
▸Require human confirmation before the model triggers irreversible actions
▸Test your own features for both direct and indirect injection
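One way to limit the blast radius is to put a small authorization gate between the model and its tools. The sketch below assumes a hypothetical application where the model requests tools by name; the tool names, the two sets, and the `confirm` callback are illustrative, not part of any particular agent framework.

```python
from typing import Callable

# Hypothetical tool names; which tools belong in each set is a policy decision.
READ_ONLY_TOOLS = {"search_docs", "get_ticket"}
IRREVERSIBLE_TOOLS = {"send_email", "delete_record"}


def authorize(tool: str, arguments: dict, confirm: Callable[[str, dict], bool]) -> bool:
    """Decide whether a model-requested tool call may run.

    Read-only tools run directly, irreversible tools need a human "yes",
    and anything not explicitly listed is denied.
    """
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in IRREVERSIBLE_TOOLS:
        return bool(confirm(tool, arguments))
    return False  # default deny for unknown tools


def always_decline(tool: str, arguments: dict) -> bool:
    """Stand-in for a real approval prompt shown to a human."""
    return False


print(authorize("search_docs", {"query": "vpn setup"}, always_decline))
print(authorize("send_email", {"to": "someone@example.com"}, always_decline))
print(authorize("run_shell", {"cmd": "whoami"}, always_decline))
```

The gate does not try to detect injection. It assumes the model's request may already be attacker-influenced and decides based only on what the action could do.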

Training-data poisoning and model backdoors

A model is only as trustworthy as the data it learned from. Training-data poisoning is the act of injecting carefully crafted examples into a training or fine-tuning set so that the resulting model behaves badly under specific conditions while appearing normal the rest of the time. Because modern models train on huge, loosely curated corpora and on user feedback, the opportunity to slip in tainted data is real, especially for systems that learn continuously from production traffic.

A backdoor is a particularly nasty outcome of poisoning: the model learns a hidden trigger such that a specific phrase, token, or input pattern flips it into attacker-chosen behavior, like misclassifying malware as benign or emitting a harmful response. Backdoors are hard to detect with normal accuracy testing because the model performs correctly on everything except the secret trigger. This is why provenance of both data and weights matters so much.

Defenders cannot usually inspect a third-party model's weights, so the practical defense is supply-chain hygiene and behavioral monitoring. Source models and datasets from reputable providers, verify integrity with checksums and signatures, isolate fine-tuning data, and watch production behavior for anomalous shifts that could indicate a triggered backdoor or a drifting, poisoned model.

▸Source models and datasets only from vetted, reputable providers
▸Verify model and dataset integrity with hashes or signatures before use
▸Curate and review any data used for fine-tuning or continuous learning
▸Monitor production model behavior for sudden anomalous changes
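Integrity verification can be as simple as comparing a downloaded artifact against a digest published by the provider. This is a minimal sketch using Python's standard `hashlib`; it assumes the provider publishes a SHA-256 hex digest through a channel you trust, and the function names are made up for this example.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model weights do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 matches the published digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A matching hash only proves the file is the one the provider published. It says nothing about whether that model was trained on clean data, which is why provenance and behavioral monitoring still matter.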

Adversarial and evasion inputs

Adversarial inputs are deliberately perturbed inputs designed to make a model produce a wrong answer with high confidence. In classic ML this might be an image altered with changes imperceptible to a human that cause a classifier to mislabel it; in a security context it could be malware crafted to evade an ML-based detector or text manipulated to slip past a content filter. The point is that ML decision boundaries can be probed and gamed by an attacker who studies the model's responses.

Evasion is the runtime cousin of poisoning: instead of corrupting training, the attacker shapes the input at inference time to dodge detection or trigger a misclassification. Security teams should care because so many defensive tools, from spam filters to EDR to fraud scoring, now have an ML component that can become a target. An attacker who can repeatedly query your model is effectively running a reconnaissance campaign against its blind spots.

Robustness is the defensive goal here. You cannot make a model immune, but you can raise the cost of evasion. Adversarial training and input preprocessing harden the model itself, ensemble or layered detection ensures that no single model is a single point of failure, and rate limiting plus monitoring help detect the high-volume probing that often precedes a successful evasion.

▸Do not rely on a single ML model as your only line of detection
▸Rate-limit and monitor query volume against ML-backed security tools
▸Use layered or ensemble detection to reduce single-model blind spots
▸Track false-negative trends that may signal active evasion attempts
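Rate limiting against probing can be sketched as a sliding-window counter per client. The class name, thresholds, and client identifier below are assumptions for illustration; a production system would more likely enforce this at an API gateway and feed rejections into alerting.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Sliding-window limiter for queries against an ML-backed tool (sketch)."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> timestamps of allowed queries

    def allow(self, client_id: str, now: float) -> bool:
        """Return False once a client exceeds max_queries within the window."""
        timestamps = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False  # a good place to raise an alert as well
        timestamps.append(now)
        return True


monitor = QueryMonitor(max_queries=3, window_seconds=60.0)
for second in (0.0, 1.0, 2.0, 3.0):
    print(second, monitor.allow("client-a", now=second))
```

The timestamp is passed in explicitly to keep the sketch easy to test; in practice you would supply something like `time.monotonic()`.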

Data leakage and sensitive output

LLMs can leak sensitive information in several ways, and defenders should treat each as a distinct risk. Models may regurgitate fragments of their training data, surface confidential content that was placed in their context window, or be coaxed by an attacker into revealing system prompts and internal configuration. When an organization fine-tunes a model on internal documents, that proprietary data may become recoverable through clever prompting unless access to the model is tightly controlled.

Another leakage path is operational: employees pasting source code, customer records, or secrets into public AI tools, where the data may be retained or used for training. This is a governance and data-handling problem as much as a technical one, and it is one of the most common kinds of real-world AI incident. A clear acceptable-use policy and sanctioned, contractually protected tooling reduce this risk far more than any model-level control.

The defensive playbook has three parts: minimize the sensitive data that ever reaches the model, scrub or tokenize that data when it must be sent, and filter outputs for sensitive patterns such as secrets or personal data before they are returned. Classify data first so you know what must never enter a prompt, and prefer self-hosted or enterprise-contracted models when handling regulated information.

▸Classify data and define what may never be sent to a model
▸Provide sanctioned AI tools with contractual data-protection terms
▸Minimize, mask, or tokenize sensitive data before it enters a prompt
▸Filter model outputs for secrets and personal data before returning them
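Output filtering can start with a handful of regular expressions applied before text is returned to the user. The patterns below are a deliberately small, illustrative set (an AWS-style access key ID, an email address, a PEM private key header) and are assumptions for this sketch; real deployments usually rely on a dedicated secret-scanning or DLP tool with far broader coverage.

```python
import re

# Illustrative patterns only; this is not a complete secret or PII detector.
REDACTION_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace matches with a labeled placeholder and report which patterns fired."""
    findings = []
    for label, pattern in REDACTION_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            findings.append(label)
    return text, findings


clean, found = redact("Contact alice@example.com, key AKIAABCDEFGHIJKLMNOP")
print(clean)
print(found)
```

The same function can run on prompts before they leave your environment, which covers the "minimize, mask, or tokenize" step as well as the output filter.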

The AI supply chain

Adopting AI means adopting a sprawling supply chain that extends well beyond your code. It includes base models and their weights, fine-tuning and training datasets, embedding models, vector stores, prompt templates, agent frameworks, and the growing ecosystem of plugins and tool integrations that let models reach external systems. Any compromised or malicious component in that chain can undermine an otherwise well-built application.

Plugins and tool integrations deserve special attention because they convert language into action. A plugin that lets a model send email, query a database, or call an API expands capability and risk in equal measure, and a plugin with excessive permissions turns a successful prompt injection into a real-world breach. Third-party model marketplaces and open repositories can also host trojanized or misrepresented models, so what you download is not always what you think it is.

The same disciplines you already apply to software supply chain defense translate directly: maintain an inventory (an AI bill of materials), vet and pin versions of models and components, verify integrity, and grant every plugin and tool the minimum permissions it needs. Watch for new advisories the way you would track CVEs, because the AI tooling space moves quickly.

▸Maintain an AI bill of materials covering models, data, and plugins
▸Vet and pin specific versions of models and dependencies
▸Grant plugins and tools least-privilege access to systems
▸Track vulnerabilities and advisories across your AI dependencies
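An AI bill of materials becomes more useful when it can be audited automatically. The sketch below assumes a hypothetical `Component` record with a pinned version, an integrity hash, and granted versus required permission scopes; the field names, the `UNPINNED` markers, and the scope strings are invented for illustration and do not follow any published AI BOM format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One entry in an AI bill of materials (hypothetical schema)."""
    name: str
    kind: str  # "model", "dataset", "plugin", ...
    version: str  # exact version or revision
    sha256: str = ""
    granted_scopes: frozenset = frozenset()
    required_scopes: frozenset = frozenset()


# Version strings treated as floating references; an assumption for this sketch.
UNPINNED = {"", "latest", "main"}


def audit(component: Component) -> list[str]:
    """List supply-chain hygiene issues for a single component."""
    issues = []
    if component.version.strip().lower() in UNPINNED:
        issues.append("version is not pinned")
    if component.kind in {"model", "dataset"} and not component.sha256:
        issues.append("no integrity hash recorded")
    excess = component.granted_scopes - component.required_scopes
    if excess:
        issues.append("excess permissions: " + ", ".join(sorted(excess)))
    return issues


mail_plugin = Component(
    "mail-plugin", "plugin", "2.1.0",
    granted_scopes=frozenset({"mail.send", "mail.read", "files.write"}),
    required_scopes=frozenset({"mail.send"}),
)
print(audit(mail_plugin))
```

Running a check like this in CI turns the inventory from documentation into a control: an unpinned model or an over-permissioned plugin fails the build instead of waiting for a review.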

Insecure output handling

One of the most preventable AI vulnerabilities comes from treating model output as trusted. When an application takes whatever the model returns and feeds it straight into a database query, a shell command, an HTML page, or another system, it inherits all the classic injection vulnerabilities, now triggerable through the model. A prompt injection upstream can thus become SQL injection, cross-site scripting, or remote code execution downstream.

The mental shift is simple but crucial: model output is untrusted user input. Anything generated by an LLM should pass through the same validation, encoding, and escaping you would apply to data submitted by an anonymous internet user. If the output will be rendered as HTML, encode it; if it becomes part of a query, use parameterization; if it is interpreted as a command, constrain it to an allow-list.

This single principle blocks a large share of high-impact AI vulnerabilities, and it is entirely within the developer's control. It pairs with least-privilege design: even if malformed output slips through, a tightly scoped downstream system limits the damage. Make insecure output handling a standard item on your AI code reviews.

▸Treat every model output as untrusted user input
▸Encode or escape model output based on its downstream context
▸Use parameterized queries instead of model-built SQL strings
▸Constrain model-driven commands to explicit allow-lists
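The three rules above (encode for HTML, parameterize queries, allow-list commands) can be sketched with the Python standard library. The table name, command names, and function names are hypothetical; the `html.escape` call and the `?` placeholder style of `sqlite3` are standard-library behavior.

```python
import html
import sqlite3

# Hypothetical commands the application is willing to run on the model's suggestion.
ALLOWED_COMMANDS = {"status", "list_tickets"}


def render_as_html(model_output: str) -> str:
    """Escape model output before placing it in an HTML page."""
    return "<p>" + html.escape(model_output) + "</p>"


def find_customer(conn: sqlite3.Connection, model_supplied_name: str) -> list:
    """Pass model output as a bound parameter, never by string concatenation."""
    cursor = conn.execute(
        "SELECT id, name FROM customers WHERE name = ?",
        (model_supplied_name,),
    )
    return cursor.fetchall()


def pick_command(model_output: str) -> str:
    """Accept only commands on the allow-list; reject everything else."""
    command = model_output.strip()
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {command!r}")
    return command


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Ada",))

print(render_as_html("<script>alert(1)</script>"))
print(find_customer(conn, "Ada' OR '1'='1"))
print(pick_command(" status "))
```

None of this is specific to AI. These are the same controls you would apply to form input, which is exactly the point of treating model output as untrusted user input.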

Governance frameworks and first steps

You do not have to invent AI security from scratch; several mature frameworks organize the landscape. The NIST AI Risk Management Framework provides a governance structure built around the functions of Govern, Map, Measure, and Manage, helping you treat AI risk as an organizational program rather than a one-off review. The OWASP Top 10 for LLM Applications enumerates the most critical application-level risks, including prompt injection and insecure output handling, in language developers understand.

For threat modeling and detection, MITRE ATLAS catalogs real-world adversary tactics and techniques against AI systems, mirroring the familiar ATT&CK structure and giving blue teams a shared vocabulary for AI threats. Reading these three together gives you governance, application security, and threat intelligence coverage without redundant effort. They are updated regularly, so revisit them as the field evolves.

Practical first steps for a team adopting AI: build an inventory of where AI already lives in your environment, write an acceptable-use policy for AI tools, add AI risks to your existing threat-modeling process, and apply least privilege and untrusted-output handling to every AI feature. Start small, measure, and expand. The intermediate companion guide drills into hardening LLM applications specifically.

▸Adopt NIST AI RMF to frame AI risk as a governance program
▸Use the OWASP LLM Top 10 as an application security checklist
▸Map AI threats with MITRE ATLAS during threat modeling
▸Publish an AI acceptable-use policy and an AI asset inventory

References