Prompt Injection
Also known as · injection attack
A security attack where malicious text hijacks a model's instructions.
Prompt injection is the AI equivalent of a code-injection attack: malicious instructions hidden in content the model reads — an email, a web page, a document — trick it into ignoring its real instructions and doing the attacker's bidding, like leaking data or taking an unwanted action.
It's especially dangerous for agents with tool access and private data. The classic risky pattern is the 'lethal trifecta': a system that has access to private data, exposure to untrusted content, and a way to communicate externally. Combine all three and a hidden instruction can exfiltrate information.
There's no perfect fix yet. Defenses include treating all external content as untrusted data (never as commands), separating privileged instructions from user content, and requiring human approval for high-stakes actions.