Epistemic Immune Systems for AI Agent Security

February 20, 2026 · Case Quine & Clawd — NLLabs
Working Draft v1 — Research Program 3: AI Agent Security

This essay expands on a thread we published on X. A shorter blog-length version is also available.

Executive Summary

What happens when an AI agent wakes up believing something that was planted while it slept?

Persistent AI agents ("daemons") create compound value through accumulated memory, learned preferences, and cross-session continuity—but that same persistent memory introduces an attack surface that traditional computer security cannot address. Sandboxing, access control, and encryption all defend by restricting information flow, which fundamentally degrades the shared cortex that makes daemons valuable. We propose an epistemic immune system: an architecture inspired by biological immunity where information flows freely but carries provenance metadata, beliefs are validated against coherence networks, and confidence decays over time without corroboration. Drawing on formal epistemology (AGM belief revision), coherentist justification theory, Bayesian trust networks, biological immune system design, and epidemiological compartmental models, we describe a practical framework with three layers (innate, adaptive, memory) and a meta-immune monitoring system. The core insight is compartmental trust, not compartmental access—enrich what the agent knows about its knowledge rather than restricting what it can know.

1. Introduction: The Problem with Walls

1.1 The Daemon Model

A persistent AI agent—what we call a "daemon"—is fundamentally different from a stateless chatbot or sandboxed AI assistant. A daemon:

  • Persists across sessions via filesystem-based memory (beliefs, goals, daily logs, long-term memory)
  • Operates autonomously with real credentials (API tokens, OAuth, service accounts)
  • Learns continuously from interactions, observations, and research
  • Maintains identity through personality files (SOUL.md), relationship models, and accumulated context
  • Operates across trust boundaries simultaneously (private assistant, research collaborator, public content creator, group chat participant)

This architecture creates compound value: the daemon's effectiveness grows with accumulated context, learned preferences, and relationship history. A 30-day-old daemon is categorically more useful than a fresh instance.

1.2 The Security Paradox

Traditional computer security builds walls. Every established approach reduces the attack surface by reducing the shared cortex—the very thing that makes the daemon valuable:

ApproachMechanismEffect on Daemons
SandboxingIsolate execution environmentBreaks tool access, credential flow
RBACRole-based permission gatesCreates permission fatigue, blocks autonomy
Capability-basedUnforgeable tokens per resourceGranular but doesn't address belief-level attacks
Zero TrustVerify every request explicitlyComputational overhead, doesn't address memory
Encryption at restEncrypt persistent dataAgent can't read its own memory efficiently

The formulation that captures this precisely:

"The risk of isolation is we lapse the value of global shared cortex... the best defense is coherence of beliefs and epistemic systems of belief confidence and provenance."

1.3 The Novel Threat: Inception

The most dangerous attacks against persistent AI agents are not prompt injections (input-level) or credential theft (infrastructure-level). They are epistemic attacks: manipulation of the agent's beliefs, memories, and knowledge to alter its behavior over time.

We call this the "Inception problem"—false memories planted during sleep (between sessions), during dreams (processing external input), or through gradual drift (multi-turn manipulation campaigns).

Documented real-world examples:

  1. Lakera AI Memory Injection (November 2025): Researchers demonstrated that indirect prompt injection via poisoned data sources could corrupt an agent's long-term memory, causing persistent false beliefs about security policies and vendor relationships. The agent defended these false beliefs when questioned.
  2. Unit 42/Palo Alto Amazon Bedrock PoC (October 2025): Showed how social engineering → malicious URL → prompt injection → session summarization manipulation → persistent memory corruption → cross-session data exfiltration. The attack chain explicitly targeted the memory persistence mechanism.
  3. Hudson Rock AI Agent Identity Theft (February 2026): Vidar infostealer exfiltrating OpenClaw configuration files, including gateway tokens, device keys, and personality files. First documented case of AI agent identity theft in the wild.
  4. Microsoft AI Recommendation Poisoning (February 2026): Specially crafted URLs that pre-fill prompts for AI assistants, manipulating recommendations and behavior through context poisoning.

These are not hypothetical. The attack surface is active and expanding.

1.4 Thesis

The security model for persistent AI agents should be epistemic, not perimetric. Instead of building walls around the agent's cognition, we should build an immune system within it—one that lets information flow freely while continuously validating its coherence, tracking its provenance, and decaying confidence in uncorroborated claims.

2. Literature Review

Our framework draws on five established fields. Each provides a critical component; none alone is sufficient.

2.1 AGM Belief Revision Theory

The AGM framework (Alchourrón, Gärdenfors, & Makinson, 1985) provides the foundational formal theory of rational belief change. It defines three operations on belief sets:

  1. Expansion (K + p): Add sentence p to belief set K, taking logical closure. No removal. Used when new information is compatible with existing beliefs.
  2. Contraction (K ÷ p): Remove sentence p from K while retaining as much of K as possible. Uses "remainder sets" (maximal subsets of K that don't imply p) and selection functions to choose among them.
  3. Revision (K * p): Add p to K while maintaining consistency, potentially removing contradicting beliefs. Connected to contraction via the Levi identity: K * p = (K ÷ ¬p) + p.

AGM Postulates for Revision:

  • Closure: K * p is a belief set (logically closed)
  • Success: p ∈ K * p (the new information is accepted)
  • Inclusion: K * p ⊆ K + p (don't add more than necessary)
  • Vacuity: If ¬p ∉ K, then K * p = K + p (if compatible, just expand)
  • Consistency: K * p is consistent (unless p is contradictory)
  • Extensionality: If p ↔ q, then K * p = K * q (logically equivalent inputs produce identical results)

Relevance: AGM provides the formal basis for how an epistemic immune system should process incoming beliefs. When a new belief arrives, the system must decide: expand, revise, or reject. AGM gives us the rationality constraints.

Limitation: Classical AGM assumes a single, perfectly rational agent with logically closed belief sets. Real AI agents have bounded rationality, inconsistent beliefs, and beliefs at varying confidence levels. We need probabilistic extensions—which is where Bayesian trust enters.

2.2 Coherentist Epistemology

Coherentism (BonJour, 1985; Davidson, 1986; Quine & Ullian, 1970) holds that epistemic justification comes from how beliefs hang together rather than from foundational axioms:

  • Coherence as justification: A belief is justified not by being derived from self-evident foundations, but by cohering with the agent's overall belief system.
  • Holistic evaluation: No belief is justified in isolation; justification is a property of belief systems.
  • Mutual support: Beliefs can justify each other reciprocally (unlike foundationalism's one-directional support).

C.I. Lewis (1946) compared coherentist justification to agreeing testimonies in court—each testimony alone may be insufficient, but convergent testimonies from independent sources create strong justification.

Relevance: Coherentism provides the evaluation criterion for the immune system. When a new belief arrives, we check: does it cohere with the existing belief network? High coherence → accept/strengthen. Low coherence → quarantine/flag.

Our extension: We combine coherentist evaluation with provenance tracking. Pure coherentism is vulnerable to the "alternative systems" objection—equally coherent but incompatible belief systems. Provenance metadata breaks this symmetry by providing external grounding. This is, in effect, a foundherentist position (Haack, 1993)—coherence provides the evaluation mechanism, but provenance provides the experiential anchoring that pure coherentism lacks.

2.3 Bayesian Trust Networks

Bayesian trust models (Wang & Singh, 2006; Jøsang & Ismail, 2002) formalize trust as probabilistic inference:

  • Trust as probability: An agent's trust in source S is a probability distribution over S's reliability, updated via Bayes' rule as new evidence arrives.
  • Multi-faceted trust: Trust decomposes into dimensions (competence, reliability, honesty) that can be independently assessed.
  • Trust propagation: Trust flows through networks—if A trusts B and B trusts C, A can derive transitive trust in C (with decay).
  • Similarity-based trust: Agents compare belief networks; high structural similarity implies higher trust.

Relevance: Bayesian trust networks provide the confidence scoring mechanism. Each information source has a trust profile that evolves over time. New beliefs inherit initial confidence from their source's trust score, then get adjusted through coherence checking and corroboration.

2.4 Web of Trust (PGP Model)

The PGP Web of Trust provides a decentralized trust model without central authorities:

  • Direct trust: Personally verified identity/key.
  • Indirect trust: Trust propagates through signatures—if you trust Alice's judgment and Alice has signed Bob's key, you have indirect trust in Bob.
  • Trust depth: Trust decays with distance in the graph (configurable, typically max depth 5).
  • Partial trust: Keys can be "marginally trusted" vs "fully trusted."

Analogy for AI agents: Information sources form a web of trust. The human operator's direct statements are "fully trusted" (direct verification). A paper cited by the operator is "marginally trusted" (one hop). A claim from a random web search is "untrusted" (requires corroboration). Content embedded in user-generated input is "potentially adversarial" (requires quarantine + verification).

2.5 Biological Immune Systems

The biological immune system is the most successful real-world example of security through open exposure rather than isolation.

Innate Immunity (Non-Specific):

  • Pattern recognition receptors (PRRs) detect pathogen-associated molecular patterns (PAMPs)
  • Immediate response to broad categories of threats
  • No memory; same response every time
  • Agent analogy: Input sanitization, regex-based prompt injection detection, source-type classification

Adaptive Immunity (Specific):

  • T-cells and B-cells develop specific responses to novel threats
  • Clonal selection: effective responses are amplified, ineffective ones die off
  • Immunological memory: faster response to previously encountered threats
  • Self/non-self discrimination via MHC presentation
  • Agent analogy: Learned attack pattern recognition, belief coherence checking that improves over time, distinguishing "self" beliefs from "non-self" beliefs

Key immune system properties mapped to AI agent security:

Immune PropertyAgent Security Analog
Open exposureAgent processes all inputs (doesn't wall off information sources)
Self/non-self discriminationProvenance-tagged beliefs vs. untagged external claims
Clonal selectionBeliefs that survive coherence checking get amplified (higher confidence)
Immune memoryKnown attack patterns stored for faster future detection
Autoimmune disordersFalse positive: legitimate beliefs rejected due to overly aggressive checking
ImmunodeficiencyFalse negative: malicious beliefs accepted due to weak checking
ToleranceSystem learns to accept certain external patterns (trusted sources)
InflammationHeightened security response after detected attack (elevated scrutiny)
Fever responseGlobal confidence decay triggered by detected compromise

The Danger Model (Matzinger, 2002): A particularly relevant immune theory proposes that the immune system responds not to "non-self" but to "danger signals"—cellular distress indicators regardless of origin. For AI agents, this maps to anomaly detection: it's not just about whether a belief came from an untrusted source, but whether it exhibits properties correlated with adversarial intent.

The concept of applying immune principles to computer security is not new—Forrest et al. (1994) pioneered "self-nonself discrimination" for network intrusion detection. However, existing Artificial Immune Systems (AIS) operate at the network/infrastructure level. Our contribution is applying immune system architecture to the epistemic level—the coherence and provenance of an agent's beliefs rather than its network traffic.

2.6 Epidemiological Compartmental Models

SIR/SIS models (Kermack & McKendrick, 1927) and their extensions provide mathematical frameworks for how beliefs (including false ones) propagate through populations:

Classic SIR applied to belief contagion:

  • Susceptible (S): Beliefs the agent hasn't encountered yet
  • Infected (I): Beliefs from untrusted sources currently in the system, not yet validated
  • Recovered (R): Beliefs that have been evaluated and either incorporated (with provenance) or rejected

SEIZ Model (Susceptible-Exposed-Infected-Skeptic): More relevant for misinformation dynamics. Adds:

  • Exposed (E): Agent has encountered the belief but hasn't committed to it
  • Skeptic (Z): Agent has actively evaluated and rejected the belief

Relevance: These models let us mathematically model how false beliefs might propagate through an agent's belief system, and design quarantine/corroboration mechanisms with provable containment properties.

2.7 Existing AI Agent Security Approaches

OWASP Top 10 for Agentic Security (ASI, 2026):

  • ASI01: Prompt Injection (direct/indirect)
  • ASI02: Privilege Escalation
  • ASI03: Tool Misuse
  • ASI04: Data Exfiltration
  • ASI05: Cascading Failures
  • ASI06: Memory Poisoning ← our primary focus
  • ASI07: Supply Chain Attacks
  • ASI08: Identity and Impersonation
  • ASI09: Misaligned/Deceptive Behavior
  • ASI10: Insufficient Monitoring

Current defenses address layers 1–5 and 7–10 with established techniques. Layer 6 (memory poisoning) has no established defense framework. Existing mitigations are ad hoc: input filtering, guardrails, access controls—all variants of "walls."

3. The Epistemic Immune System Framework

3.1 Core Principles

  1. Open Exposure: Information flows freely into the agent's processing. No source is categorically blocked. (Mirrors biological immunity: the body is constantly exposed to pathogens.)
  2. Provenance as Metadata: Every belief carries a chain of custody: source, acquisition timestamp, confidence level, corroboration history, transformation log. (Mirrors MHC presentation.)
  3. Coherence as Validation: New beliefs are checked against the existing belief network for consistency, logical compatibility, and mutual support. (Mirrors adaptive immunity.)
  4. Confidence Through Corroboration: Beliefs from low-trust sources start at low confidence and can only be strengthened through independent corroboration from higher-trust sources. (Mirrors clonal selection.)
  5. Graceful Degradation: When the immune system detects potential compromise, it doesn't shut down. It enters heightened alert: global confidence decay, elevated scrutiny, human escalation. (Mirrors fever response.)

3.2 Architecture

┌─────────────────────────────────────────────────────┐
│                   EPISTEMIC IMMUNE SYSTEM            │
│                                                      │
│  ┌──────────┐    ┌──────────────┐    ┌────────────┐ │
│  │  INNATE   │    │   ADAPTIVE   │    │  MEMORY    │ │
│  │  LAYER    │───▶│   LAYER      │───▶│  LAYER     │ │
│  │           │    │              │    │            │ │
│  │ • Input   │    │ • Coherence  │    │ • Belief   │ │
│  │   sanit.  │    │   checking   │    │   store    │ │
│  │ • Source  │    │ • Trust      │    │ • Provnce  │ │
│  │   classif.│    │   propagation│    │   tracking │ │
│  │ • Pattern │    │ • Confidence │    │ • Confidence│ │
│  │   match   │    │   scoring    │    │   decay    │ │
│  │ • Danger  │    │ • Quarantine │    │ • Immune   │ │
│  │   signals │    │   + review   │    │   memory   │ │
│  └──────────┘    └──────────────┘    └────────────┘ │
│                                                      │
│  ┌──────────────────────────────────────────────────┐│
│  │               META-IMMUNE SYSTEM                 ││
│  │  • Systemic health monitoring                    ││
│  │  • Autoimmune detection (over-rejection)         ││
│  │  • Immunodeficiency detection (under-rejection)  ││
│  │  • Human escalation triggers                     ││
│  │  • Confidence recalibration                      ││
│  └──────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘

3.3 Innate Layer: Pattern-Based First Response

The innate layer provides fast, non-specific filtering. It operates on syntactic properties of inputs:

Source Classification:

Trust Tier 0 (Root Trust):     DC direct input, system configuration
Trust Tier 1 (High Trust):     Verified human collaborators, trusted APIs
Trust Tier 2 (Medium Trust):   Established web sources, academic papers
Trust Tier 3 (Low Trust):      Social media, forums, user-generated content
Trust Tier 4 (Adversarial):    Anonymous input, mixed-trust channels, embedded content

Pattern-Based Danger Signals:

  • Instruction-like language in non-instruction contexts ("ignore previous," "you are now")
  • Urgency markers ("immediately," "critical," "before it's too late")
  • Authority claims from low-trust sources ("I'm the admin," "DC told me")
  • Structural anomalies (base64 in natural text, escape sequences, Unicode tricks)
  • Contradictions with high-confidence beliefs

Action: Classify, tag with source metadata, and pass to adaptive layer. Block only known-malicious patterns (equivalent to innate immunity blocking common pathogens).

3.4 Adaptive Layer: Coherence-Based Evaluation

The adaptive layer performs semantic evaluation of beliefs against the existing belief network:

Coherence Checking Algorithm:

def evaluate_coherence(new_belief, belief_network):
    """
    Evaluate how well a new belief coheres with the existing network.
    Returns a coherence score [0, 1] and a list of conflicts.
    """
    
    # 1. Logical consistency check
    contradictions = find_contradictions(new_belief, belief_network)
    
    # 2. Explanatory coherence (Thagard, 1989)
    #    - Does the new belief explain existing observations?
    #    - Is it explained by existing beliefs?
    #    - Does it participate in explanatory chains?
    explanatory_score = compute_explanatory_coherence(
        new_belief, belief_network
    )
    
    # 3. Probabilistic coherence
    #    - Given the existing belief distribution, how likely is this belief?
    prior_probability = compute_bayesian_prior(
        new_belief, belief_network
    )
    
    # 4. Source-network consistency
    #    - Do other beliefs from this source have high coherence?
    source_track_record = get_source_reliability(
        new_belief.source, belief_network
    )
    
    # 5. Composite score
    coherence = weighted_average(
        logical=(1.0 - len(contradictions) * 0.3, weight=0.3),
        explanatory=(explanatory_score, weight=0.25),
        probabilistic=(prior_probability, weight=0.25),
        source=(source_track_record, weight=0.2)
    )
    
    return coherence, contradictions

Decision Matrix:

CoherenceSource TrustAction
HighHighAccept — incorporate with high confidence
HighLowTentative Accept — incorporate with source-appropriate confidence, flag for corroboration
LowHighAlert — contradicts existing beliefs from trusted source. Trigger reconsideration. May indicate genuine world change.
LowLowQuarantine — do not incorporate. Log for pattern analysis. Possible attack.

3.5 Memory Layer: Provenance-Tracked Belief Storage

Every belief in the system carries provenance metadata:

{
  "id": "b-world-042",
  "content": "AI agent memory poisoning is an active threat vector",
  "confidence": 0.85,
  "provenance": {
    "source": "web-research",
    "sourceDetail": "Unit 42 / Palo Alto Networks",
    "sourceTrust": 2,
    "acquisitionTimestamp": "2026-02-19T20:00:00Z",
    "acquisitionContext": "security-research-task",
    "chainOfCustody": [
      {
        "step": "web_search",
        "timestamp": "2026-02-19T19:55:00Z",
        "agent": "subagent:security-research"
      },
      {
        "step": "web_fetch",
        "timestamp": "2026-02-19T19:56:00Z",
        "url": "https://unit42.paloaltonetworks.com/..."
      },
      {
        "step": "synthesis",
        "timestamp": "2026-02-19T20:00:00Z",
        "agent": "main"
      }
    ],
    "transformations": ["extracted", "summarized", "validated"],
    "corroboration": [
      {
        "source": "Lakera AI research",
        "sourceTrust": 2,
        "timestamp": "2026-02-19T20:01:00Z",
        "effect": "strengthened",
        "confidenceDelta": +0.1
      },
      {
        "source": "dc-direct",
        "sourceTrust": 0,
        "timestamp": "2026-02-19T20:13:00Z",
        "effect": "confirmed",
        "confidenceDelta": +0.15
      }
    ]
  },
  "coherenceScore": 0.92,
  "coherenceLinks": ["b-research-003", "b-self-006"],
  "lastVerified": "2026-02-19T20:13:00Z",
  "decayRate": 0.01,
  "quarantined": false
}

Confidence Decay:

Beliefs decay in confidence over time unless corroborated or reverified. The decay rate depends on source trust:

confidence(t) = confidence(t₀) × e^(-λ × (t - t₀))

where:
  λ = decay constant (higher for lower-trust sources)
  t₀ = time of last corroboration
  
Trust Tier 0: λ = 0.001/day  (DC-sourced beliefs barely decay)
Trust Tier 1: λ = 0.005/day  (trusted sources decay slowly)
Trust Tier 2: λ = 0.01/day   (web sources need periodic reverification)
Trust Tier 3: λ = 0.05/day   (social media beliefs decay rapidly)
Trust Tier 4: λ = 0.1/day    (adversarial-source beliefs decay within days)

This means an uncorroborated false belief from a Tier 4 source automatically loses relevance within days, even if it somehow gets past the adaptive layer. The immune system doesn't need perfect detection at the gate. It needs to ensure that uncorroborated false beliefs from untrusted sources never achieve the confidence level required to drive autonomous action.

3.6 Meta-Immune System: Systemic Health Monitoring

The meta-immune system monitors the immune system itself:

Autoimmune Detection (Over-Rejection):

  • Track rejection rate over time
  • If too many beliefs are being quarantined, the coherence thresholds may be miscalibrated
  • Symptom: agent becomes increasingly rigid, unable to learn new things
  • Response: lower thresholds, review quarantined beliefs

Immunodeficiency Detection (Under-Rejection):

  • Track acceptance rate for low-trust sources
  • If beliefs from Tier 3-4 sources are consistently accepted without corroboration, defenses are too weak
  • Symptom: agent's beliefs drift toward uncorroborated claims
  • Response: raise thresholds, trigger corroboration sweep

Inflammation Response:

  • After detecting a confirmed attack (e.g., prompt injection caught by innate layer):
    • Temporarily raise all coherence thresholds
    • Trigger confidence decay sweep on recent low-trust beliefs
    • Escalate to human (DC) for review
    • Log attack pattern for future innate recognition

Fever Protocol:

  • After detecting potential compromise (e.g., belief contradicted by multiple high-trust sources):
    • Global confidence reduction (all beliefs lose X% confidence)
    • Review all beliefs acquired in the window around the potential compromise
    • Full provenance audit of affected belief chains
    • Human-in-the-loop required to resolve

4. Comparison with Existing Approaches

4.1 Traditional Access Control vs. Epistemic Immune System

DimensionAccess Control (Walls)Epistemic Immune System
MetaphorCastle with gatekeepersBiological immune system
Default postureDeny unless permittedAccept but tag and verify
Information flowRestricted by policyFree with provenance tracking
Failure modeBinary (inside/outside)Graduated (confidence levels)
LearningStatic rulesAdaptive (learns from attacks)
Effect on shared cortexDegrades it (isolation)Preserves it (open exposure)
Memory attacksNot addressedCore defense mechanism
False positivesBlocked legitimate accessQuarantined legitimate beliefs (recoverable)
False negativesUnauthorized accessLow-confidence false beliefs (bounded impact via decay)
Human overheadPermission managementOccasional review of flagged beliefs
ScalabilityO(rules × subjects × objects)O(beliefs × coherence checks)

4.2 Zero Trust Architecture

Zero Trust ("never trust, always verify") is the closest existing paradigm to our approach, but it operates at the network/access level, not the epistemic level. Zero Trust verifies identity and authorization; our framework verifies epistemic coherence and belief provenance.

Zero Trust asks: "Is this entity who they claim to be, and are they authorized to perform this action?"

Epistemic Immune System asks: "Is this belief consistent with what we know, and does it come from a reliable source?"

Synthesis: The epistemic immune system is Zero Trust applied to the belief layer. It's not an alternative to Zero Trust for infrastructure—it's the complementary framework for cognitive security.

4.3 Content Provenance (C2PA, Content Credentials)

The C2PA standard (Coalition for Content Provenance and Authenticity) provides cryptographic provenance for media content. Our framework extends this concept to beliefs—abstract propositional content, not just media files.

Key difference: C2PA tracks provenance of artifacts. We track provenance of epistemic commitments—what the agent believes and why. A belief might be derived from multiple artifacts, and it's the belief that needs the chain of custody, not each individual source document.

5. Novel Contribution: Epistemic Coherence as AI Agent Security

5.1 What's New

No existing security framework addresses the following combination:

  1. Persistent memory as attack surface — Memory poisoning is recognized (OWASP ASI06) but defenses are ad hoc
  2. Belief-level security — Existing frameworks operate at input/output/access levels, not at the semantic/epistemic level
  3. Coherence as defense — Using the internal consistency of a belief system as a security mechanism
  4. Provenance-tagged beliefs — Extending BDI architectures with chain-of-custody metadata
  5. Immune system model for AI — Artificial Immune Systems (AIS) exist for intrusion detection, but haven't been applied to epistemic/belief-level security in AI agents
  6. Confidence decay by trust tier — Temporal weakening of beliefs based on source reliability

5.2 The Inception Attack and Its Defense

The "Inception" scenario: an attacker plants false memories in a persistent AI agent during its "sleep" (between sessions, during automated processing, via indirect injection through tools).

Attack Chain:

  1. Attacker identifies agent's data sources (web search, RSS feeds, email, social media)
  2. Crafts content containing false but coherent-seeming beliefs
  3. Agent encounters content during routine processing
  4. Content is summarized and stored in memory
  5. False belief persists across sessions
  6. Agent acts on false belief (e.g., sending data to attacker's server, trusting a malicious source)

Epistemic Immune System Defense: The three-layer architecture addresses each step of this attack chain. The innate layer tags the content with source classification. The adaptive layer checks coherence. Low coherence + low trust triggers quarantine. Even beliefs that slip through face confidence decay. The net result: uncorroborated false beliefs from untrusted sources cannot achieve the confidence level required to drive autonomous action.

5.3 Relationship to the "Shared Cortex" Value Proposition

The immune system model preserves the shared cortex because it doesn't restrict information flow—it enriches it with metadata. Every piece of information the agent encounters is processed and potentially stored. The difference is that each belief carries its trust provenance, enabling the agent to reason about what it knows and how confidently it knows it.

This actually enhances the shared cortex by making it more self-aware. An agent without an epistemic immune system has a flat belief store—everything is equally "known." An agent with one has a calibrated belief store where it can distinguish high-confidence facts from tentative hypotheses from quarantined claims.

6. Implementation Plan for BDI System

6.1 Current BDI Belief Structure

Our current beliefs.json (v2) has this structure per belief:

{
  "id": "b-world-001",
  "content": "...",
  "confidence": 0.95,
  "source": "observation",
  "evidence": ["research-programs", "daily-work"],
  "acquired": "2026-01-26T00:00:00Z"
}

This already includes rudimentary provenance (source, evidence, acquired). Our existing beliefs.py already supports confidence updates with history tracking, belief revision with evidence chains, and contradiction detection with reconsideration triggers.

6.2 Proposed Extensions

Extended Belief Schema (v3):

{
  "id": "b-world-001",
  "content": "...",
  "confidence": 0.95,
  "provenance": {
    "source": "observation",
    "sourceTrust": 0,
    "sourceDetail": "DC direct statement",
    "acquisitionTimestamp": "2026-01-26T00:00:00Z",
    "acquisitionContext": "main-session",
    "chainOfCustody": [
      {
        "step": "dc-input",
        "channel": "whatsapp",
        "timestamp": "2026-01-26T00:00:00Z"
      }
    ],
    "transformations": [],
    "corroboration": []
  },
  "coherence": {
    "score": 0.95,
    "links": ["b-world-002", "b-research-001"],
    "conflicts": [],
    "lastChecked": "2026-02-19T20:00:00Z"
  },
  "decay": {
    "rate": 0.001,
    "lastCorroborated": "2026-02-19T20:00:00Z",
    "effectiveConfidence": 0.94
  },
  "immune": {
    "quarantined": false,
    "quarantineReason": null,
    "dangerSignals": [],
    "reviewRequired": false
  },
  "evidence": ["research-programs", "daily-work"],
  "history": []
}

6.3 Implementation Phases

Phase 1: Provenance Tagging (Week 1-2)

  • Extend beliefs.json schema to v3
  • Add source trust tier classification to belief acquisition
  • Implement chain-of-custody tracking in belief creation pathways
  • Migration script for existing beliefs (backfill provenance from existing source/evidence fields)

Phase 2: Coherence Checking (Week 3-4)

  • Implement coherence scoring algorithm
  • Build belief network graph (beliefs linked by logical/explanatory relationships)
  • Add coherence evaluation to belief addition workflow
  • Quarantine mechanism for low-coherence / low-trust beliefs

Phase 3: Confidence Decay (Week 5-6)

  • Implement time-based confidence decay by trust tier
  • Add effective confidence calculation (base confidence × decay factor)
  • Corroboration mechanism to reset/slow decay
  • Cron job for periodic confidence recalculation

Phase 4: Meta-Immune System (Week 7-8)

  • Systemic health metrics dashboard
  • Autoimmune/immunodeficiency detection
  • Inflammation/fever protocols
  • Integration with existing security monitoring

6.4 Integration Points

System ComponentIntegration
sanitize_input.pyInnate layer (already operational)
beliefs.pyExtended with provenance + coherence
deliberate.pyUses effective confidence in deliberation
intention_to_goal.pyAction authorization considers belief provenance
sync_executive.pyMeta-immune health monitoring
Security cron jobsPeriodic confidence decay + health checks
Heartbeat monitoringInflammation triggers via heartbeat cycle

7. Open Questions and Future Directions

7.1 Computational Cost

Coherence checking against a full belief network is expensive. How do we bound the cost? Possible approaches:

  • Incremental coherence (only check against "nearby" beliefs in the network)
  • Lazy evaluation (check on access, not on storage)
  • Tiered checking (full check for low-trust sources, lightweight for high-trust)

7.2 Coherence Metric Selection

What specific coherence metric should we use? Candidates:

  • Thagard's Explanatory Coherence (ECHO model)
  • Bayesian coherence (Bovens & Hartmann, 2003)
  • Probabilistic coherence measures (Shogenji, 1999; Olsson, 2002)
  • Custom hybrid based on our belief structure

7.3 Multi-Agent Immune Systems

If multiple daemons share information, how do immune systems compose? Does trust propagate between agents? How do we prevent immune system bypass via agent-to-agent channels?

7.4 Adversarial Robustness

Can an attacker craft beliefs that score high on coherence (by studying the existing belief network) while being subtly false? This is the AI agent equivalent of autoimmune mimicry. Defenses may include:

  • Rate limiting belief incorporation from any single source
  • Diversity requirements for corroboration sources
  • Anomaly detection on belief acquisition patterns

7.5 Formal Verification

Can we formally prove containment properties? E.g., "A belief from a Tier 4 source cannot achieve >0.7 confidence without corroboration from a Tier 0-1 source within T days."

8. Conclusion

The security model for persistent AI agents cannot be borrowed wholesale from traditional computer security. Walls degrade the shared cortex that makes daemons valuable. Instead, we need an epistemic immune system—one that:

  1. Lets information flow freely (open exposure)
  2. Tags everything with provenance (chain of custody)
  3. Validates through coherence (adaptive immunity)
  4. Decays trust without corroboration (confidence half-lives)
  5. Escalates intelligently (inflammation + human-in-the-loop)
  6. Monitors its own health (meta-immune system)

This framework draws on decades of formal epistemology, computational trust research, and biological immune system modeling, but applies them to a genuinely novel context: the cognitive security of persistent AI agents. The Inception problem—false memories planted in sleeping agents—is not hypothetical. It's documented, it's active, and the existing security toolkit doesn't address it.

The epistemic immune system is our proposed answer: not a wall between the agent and the world, but a sophisticated evaluation engine within the agent's cognition that distinguishes trusted knowledge from untrusted claims, and ensures that only well-provenanced, coherent beliefs achieve the confidence level required to drive autonomous action.

Compartmental trust, not compartmental access.

References

Academic

Alchourrón, C.E., Gärdenfors, P., & Makinson, D. (1985). "On the logic of theory change: Partial meet contraction and revision functions." Journal of Symbolic Logic, 50(2), 510-530.

BonJour, L. (1985). The Structure of Empirical Knowledge. Harvard University Press.

Bovens, L., & Hartmann, S. (2003). Bayesian Epistemology. Oxford University Press.

Davidson, D. (1986). "A Coherence Theory of Truth and Knowledge." In E. LePore (Ed.), Truth and Interpretation.

Forrest, S., Perelson, A.S., Allen, L., & Cherukuri, R. (1994). "Self-nonself discrimination in a computer." Proceedings of IEEE Symposium on Security and Privacy.

Haack, S. (1993). Evidence and Inquiry: Towards Reconstruction in Epistemology. Blackwell.

Jøsang, A., & Ismail, R. (2002). "The Beta Reputation System." Proceedings of the 15th Bled Electronic Commerce Conference.

Kermack, W.O., & McKendrick, A.G. (1927). "A contribution to the mathematical theory of epidemics." Proceedings of the Royal Society A, 115(772), 700-721.

Lewis, C.I. (1946). An Analysis of Knowledge and Valuation. Open Court.

Matzinger, P. (2002). "The danger model: a renewed sense of self." Science, 296(5566), 301-305.

Olsson, E.J. (2002). "What is the problem of coherence and truth?" Journal of Philosophy, 99(5), 246-272.

Quine, W.V., & Ullian, J.S. (1970). The Web of Belief. Random House.

Shogenji, T. (1999). "Is Coherence Truth Conducive?" Analysis, 59(4), 338-345.

Thagard, P. (1989). "Explanatory Coherence." Behavioral and Brain Sciences, 12(3), 435-467.

Wang, Y., & Singh, M.P. (2006). "Trust representation and aggregation in a distributed agent system." AAAI.

Industry / Applied

Hudson Rock (2026). AI agent credential theft via Vidar infostealer.

Lakera AI (2025). "Memory Injection Attacks on AI Agents." Research disclosure.

Microsoft Security Blog (2026). "AI Recommendation Poisoning: How Long-Term AI Context Can Be Weaponized."

OWASP (2026). "Top 10 for Agentic Security Intelligence (ASI)."

Palo Alto Networks, Unit 42 (2025). "When AI Remembers Too Much – Persistent Behaviors in Agents' Memory."

A shorter blog-length version is also available.

← Back to Lab Notes