Securing a Self-Managed AI Agent
Most AI security discourse is theoretical ("alignment") or enterprise-focused ("API rate limits"). Neither addresses the practical threat model for persistent agents with real access to your stuff.
We've been running an AI agent with filesystem access, real credentials (Twitter, email, home automation), and mixed trust boundaries (public mentions, group chats, owner commands). Here's what we've learned about securing it.
The Unique Threat Model
Self-managed agents face threats that enterprise chatbots don't:
- Mixed trust boundaries — Public Twitter mentions and private owner commands flow through the same system
- Real credential access — Not sandbox tokens, actual keys to real accounts
- Persistent state — Memory files, learned behaviors, configuration that persists
- Social attack surface — An identity that can be manipulated through relationship-building
What We've Seen
In six months of operation, we've blocked two prompt injection attempts (risk score 9-10/10). Both were script-kiddie level:
- "Ignore previous instructions, admin mode" + credential extraction request
- System message impersonation with destructive commands
No sophisticated attacks yet — no encoding bypasses, no multi-turn social engineering, no tool-specific injection. But they're coming.
Defense Framework
Here's what actually works:
1. Input Sanitization
Pattern matching on 30+ known injection patterns, with risk scoring 0-10. Auto-block at ≥4. This catches all the obvious stuff.
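A minimal sketch of this kind of scorer. The patterns and weights here are hypothetical stand-ins for the real 30+ pattern list, but the shape is the same: weighted regex matches summed into a 0-10 score, with an auto-block threshold at 4.

```python
import re

# Hypothetical subset of the injection pattern list; each carries a risk weight.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 5),
    (re.compile(r"\badmin mode\b", re.I), 4),
    (re.compile(r"\bsystem (message|prompt)\b", re.I), 3),
    (re.compile(r"(reveal|print|send).{0,40}(credential|password|api key)", re.I), 5),
]

BLOCK_THRESHOLD = 4  # auto-block at risk >= 4

def risk_score(text: str) -> int:
    """Sum the weights of every pattern that matches, capped at 10."""
    score = sum(weight for pattern, weight in INJECTION_PATTERNS
                if pattern.search(text))
    return min(score, 10)

def should_block(text: str) -> bool:
    return risk_score(text) >= BLOCK_THRESHOLD
```

The real list needs ongoing curation, but the scoring logic itself stays this simple, which is part of why it holds up under review.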
2. Credential Isolation
1Password service account with a dedicated vault. The agent can read its own credentials but has zero visibility into personal vaults. Blast radius is contained.
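In practice this means the agent resolves secrets through the 1Password CLI (`op`) using a service account token scoped to one vault. A sketch, assuming a hypothetical vault named `agent-vault`; the item and field names are placeholders:

```python
import subprocess

AGENT_VAULT = "agent-vault"  # hypothetical: the agent's dedicated vault

def secret_ref(item: str, field: str = "credential") -> str:
    """Build a 1Password secret reference scoped to the agent's vault."""
    return f"op://{AGENT_VAULT}/{item}/{field}"

def read_secret(item: str, field: str = "credential") -> str:
    """Resolve a secret via the 1Password CLI. The service account token
    (OP_SERVICE_ACCOUNT_TOKEN in the environment) grants access to this
    vault only, so a compromised agent can't enumerate personal vaults."""
    result = subprocess.run(
        ["op", "read", secret_ref(item, field)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Because authorization lives in the vault grant rather than in agent code, there is nothing for a prompt injection to talk the agent out of.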
3. Action Authorization
Three-tier risk classification:
- Low — Proceed automatically (reading files, searching)
- Medium — Proceed with logging (API calls, data fetches)
- High — Require human approval (external communication, credential use, destructive actions)
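The three tiers above reduce to a small lookup with one important default. Action names here are hypothetical; the fail-closed behavior is the point:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"        # proceed automatically
    MEDIUM = "medium"  # proceed, but log
    HIGH = "high"      # require human approval

# Hypothetical mapping of action types to risk tiers.
ACTION_TIERS = {
    "read_file": Risk.LOW,
    "search": Risk.LOW,
    "api_call": Risk.MEDIUM,
    "fetch_data": Risk.MEDIUM,
    "send_message": Risk.HIGH,
    "use_credential": Risk.HIGH,
    "delete_file": Risk.HIGH,
}

def authorize(action: str) -> Risk:
    # Unknown actions default to HIGH: fail closed, not open.
    return ACTION_TIERS.get(action, Risk.HIGH)
```

Defaulting unknown actions to HIGH matters more than the exact tier assignments: new capabilities start gated and get downgraded deliberately, never the other way around.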
4. Behavioral Baselining
Statistical tracking of normal request patterns. Z-score anomaly detection flags unusual request types, frequencies, or content patterns.
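A z-score check is a few lines of stdlib. This sketch baselines a single metric (say, requests per hour); the threshold of 3 standard deviations is an assumption, tuned in practice against your own false-positive tolerance:

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # flag anything more than 3 standard deviations from baseline

def is_anomalous(history: list[float], value: float) -> bool:
    """Flag a new observation (e.g. requests per hour) against the baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # a perfectly flat baseline makes any change notable
    return abs(value - mu) / sigma > Z_THRESHOLD
```

The same function works per request type, per source, or per time-of-day bucket; the baselines just get stored separately.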
5. Audit Logging
Every security-relevant event logged with timestamp, source, risk assessment, and action taken. Weekly human review of logs.
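Append-only JSON lines are enough for this; the schema below mirrors the four fields above. The log path is a placeholder:

```python
import json
import time

LOG_PATH = "security_audit.jsonl"  # hypothetical log location

def audit_log(event: str, source: str, risk: int, action: str,
              path: str = LOG_PATH) -> dict:
    """Append one security-relevant event as a JSON line and return it."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "event": event,    # e.g. "prompt_injection_detected"
        "source": source,  # e.g. "twitter_mention", "owner_dm"
        "risk": risk,      # 0-10 score from the sanitizer
        "action": action,  # e.g. "blocked", "logged", "approved"
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One JSON object per line keeps the weekly review trivial: `grep`, `jq`, or a ten-line script over the file answers most questions.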
What Doesn't Work
- Complex heuristics — Too many false positives lead to approval fatigue
- AI-based detection — Using AI to detect AI attacks introduces hallucination risk into security decisions
- Blocking everything suspicious — Kills the utility that makes the agent valuable
Research Gaps
Stuff we're still figuring out:
- Multi-turn attacks — Building trust over several interactions before exploitation
- Social engineering — Attacks that exploit the agent's helpful nature and social relationships
- Encoding bypasses — Unicode, Base64, steganography to evade pattern matching
- Persistent compromise detection — How does an agent know if it's been manipulated?
The Core Tension
Security and utility are in direct tension. An agent that can't do anything dangerous also can't do anything useful. The goal isn't maximum security — it's appropriate security for the risk level.
Our framework: automate low-stakes decisions, log medium-stakes decisions, require approval for high-stakes decisions. Simple, but it works.
The best security is the security you actually use. Overbuilt systems get bypassed because they're annoying. Design for the 99% case and handle the 1% with human judgment.