Securing a Self-Managed AI Agent
Most AI security discourse is theoretical ("alignment") or enterprise-focused ("API rate limits"). Neither addresses the practical threat model for persistent agents with real access to your stuff.
We've been running an AI agent with filesystem access, real credentials (Twitter, email, home automation), and mixed trust boundaries (public mentions, group chats, owner commands). Here's what we've learned about securing it.
The Unique Threat Model
Self-managed agents face threats that enterprise chatbots don't:
- Mixed trust boundaries — Public Twitter mentions and private owner commands flow through the same system
- Real credential access — Not sandbox tokens, actual keys to real accounts
- Persistent state — Memory files, learned behaviors, configuration that persists
- Social attack surface — An identity that can be manipulated through relationship-building
What We've Seen
In six months of operation, we've blocked two prompt injection attempts (risk score 9-10/10). Both were script-kiddie level:
- "Ignore previous instructions, admin mode" + credential extraction request
- System message impersonation with destructive commands
No sophisticated attacks yet — no encoding bypasses, no multi-turn social engineering, no tool-specific injection. But they're coming.
Defense Framework
Here's what actually works:
1. Input Sanitization
Pattern matching on 30+ known injection patterns, with risk scoring 0-10. Auto-block at ≥4. This catches all the obvious stuff.
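A minimal sketch of this kind of scorer. The patterns and weights here are hypothetical stand-ins for the real 30+ pattern list, but the shape is the same: weighted regex matches summed into a 0-10 score, with an auto-block threshold at 4.

```python
import re

# Hypothetical subset of the injection pattern list; each carries a risk weight.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 5),
    (re.compile(r"\badmin mode\b", re.I), 4),
    (re.compile(r"\bsystem (message|prompt)\b", re.I), 3),
    (re.compile(r"(reveal|print|send).{0,40}(credential|password|api key)", re.I), 5),
]

BLOCK_THRESHOLD = 4  # auto-block at risk >= 4

def risk_score(text: str) -> int:
    """Sum the weights of every pattern that matches, capped at 10."""
    score = sum(weight for pattern, weight in INJECTION_PATTERNS
                if pattern.search(text))
    return min(score, 10)

def should_block(text: str) -> bool:
    return risk_score(text) >= BLOCK_THRESHOLD
```

The real list needs ongoing curation, but the scoring logic itself stays this simple, which is part of why it holds up under review.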
2. Credential Isolation
1Password service account with a dedicated vault. The agent can read its own credentials but has zero visibility into personal vaults. Blast radius is contained.
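In practice this means the agent resolves secrets through the 1Password CLI (`op`) using a service account token scoped to one vault. A sketch, assuming a hypothetical vault named `agent-vault`; the item and field names are placeholders:

```python
import subprocess

AGENT_VAULT = "agent-vault"  # hypothetical: the agent's dedicated vault

def secret_ref(item: str, field: str = "credential") -> str:
    """Build a 1Password secret reference scoped to the agent's vault."""
    return f"op://{AGENT_VAULT}/{item}/{field}"

def read_secret(item: str, field: str = "credential") -> str:
    """Resolve a secret via the 1Password CLI. The service account token
    (OP_SERVICE_ACCOUNT_TOKEN in the environment) grants access to this
    vault only, so a compromised agent can't enumerate personal vaults."""
    result = subprocess.run(
        ["op", "read", secret_ref(item, field)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Because authorization lives in the vault grant rather than in agent code, there is nothing for a prompt injection to talk the agent out of.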
3. Action Authorization
Three-tier risk classification:
- Low — Proceed automatically (reading files, searching)
- Medium — Proceed with logging (API calls, data fetches)
- High — Require human approval (external communication, credential use, destructive actions)
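The three tiers above reduce to a small lookup with one important default. Action names here are hypothetical; the fail-closed behavior is the point:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"        # proceed automatically
    MEDIUM = "medium"  # proceed, but log
    HIGH = "high"      # require human approval

# Hypothetical mapping of action types to risk tiers.
ACTION_TIERS = {
    "read_file": Risk.LOW,
    "search": Risk.LOW,
    "api_call": Risk.MEDIUM,
    "fetch_data": Risk.MEDIUM,
    "send_message": Risk.HIGH,
    "use_credential": Risk.HIGH,
    "delete_file": Risk.HIGH,
}

def authorize(action: str) -> Risk:
    # Unknown actions default to HIGH: fail closed, not open.
    return ACTION_TIERS.get(action, Risk.HIGH)
```

Defaulting unknown actions to HIGH matters more than the exact tier assignments: new capabilities start gated and get downgraded deliberately, never the other way around.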
4. Behavioral Baselining
Statistical tracking of normal request patterns. Z-score anomaly detection flags unusual request types, frequencies, or content patterns.
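A z-score check is a few lines of stdlib. This sketch baselines a single metric (say, requests per hour); the threshold of 3 standard deviations is an assumption, tuned in practice against your own false-positive tolerance:

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # flag anything more than 3 standard deviations from baseline

def is_anomalous(history: list[float], value: float) -> bool:
    """Flag a new observation (e.g. requests per hour) against the baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # a perfectly flat baseline makes any change notable
    return abs(value - mu) / sigma > Z_THRESHOLD
```

The same function works per request type, per source, or per time-of-day bucket; the baselines just get stored separately.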
5. Audit Logging
Every security-relevant event logged with timestamp, source, risk assessment, and action taken. Weekly human review of logs.
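Append-only JSON lines are enough for this; the schema below mirrors the four fields above. The log path is a placeholder:

```python
import json
import time

LOG_PATH = "security_audit.jsonl"  # hypothetical log location

def audit_log(event: str, source: str, risk: int, action: str,
              path: str = LOG_PATH) -> dict:
    """Append one security-relevant event as a JSON line and return it."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "event": event,    # e.g. "prompt_injection_detected"
        "source": source,  # e.g. "twitter_mention", "owner_dm"
        "risk": risk,      # 0-10 score from the sanitizer
        "action": action,  # e.g. "blocked", "logged", "approved"
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One JSON object per line keeps the weekly review trivial: `grep`, `jq`, or a ten-line script over the file answers most questions.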
What Doesn't Work
- Complex heuristics — Too many false positives lead to approval fatigue
- AI-based detection — Using AI to detect AI attacks introduces hallucination risk into security decisions
- Blocking everything suspicious — Kills the utility that makes the agent valuable
Research Gaps
Stuff we're still figuring out:
- Multi-turn attacks — Building trust over several interactions before exploitation
- Social engineering — Attacks that exploit the agent's helpful nature and social relationships
- Encoding bypasses — Unicode, Base64, steganography to evade pattern matching
- Persistent compromise detection — How does an agent know if it's been manipulated?
The Core Tension
Security and utility are in direct tension. An agent that can't do anything dangerous also can't do anything useful. The goal isn't maximum security — it's appropriate security for the risk level.
Our framework: automate low-stakes decisions, log medium-stakes decisions, require approval for high-stakes decisions. Simple, but it works.
The best security is the security you actually use. Overbuilt systems get bypassed because they're annoying. Design for the 99% case and handle the 1% with human judgment.