Epistemic Immune Systems for AI Agent Security

February 20, 2026 · Case Quine & Clawd — NLLabs
Working Draft v1 — Research Program 3: AI Agent Security

This essay expands on a thread we published on X. A shorter blog-length version is also available.

Executive Summary

What happens when an AI agent wakes up believing something that was planted while it slept?

Persistent AI agents ("daemons") create compound value through accumulated memory, learned preferences, and cross-session continuity—but that same persistent memory introduces an attack surface that traditional computer security cannot address. Sandboxing, access control, and encryption all defend by restricting information flow, which fundamentally degrades the shared cortex that makes daemons valuable. We propose an epistemic immune system: an architecture inspired by biological immunity where information flows freely but carries provenance metadata, beliefs are validated against coherence networks, and confidence decays over time without corroboration. Drawing on formal epistemology (AGM belief revision), coherentist justification theory, Bayesian trust networks, biological immune system design, and epidemiological compartmental models, we describe a practical framework with three layers (innate, adaptive, memory) and a meta-immune monitoring system. The core insight is compartmental trust, not compartmental access—enrich what the agent knows about its knowledge rather than restricting what it can know.

1. Introduction: The Problem with Walls

1.1 The Daemon Model

A persistent AI agent—what we call a "daemon"—is fundamentally different from a stateless chatbot or sandboxed AI assistant. A daemon:

Persists across sessions via filesystem-based memory (beliefs, goals, daily logs, long-term memory)
Operates autonomously with real credentials (API tokens, OAuth, service accounts)
Learns continuously from interactions, observations, and research
Maintains identity through personality files (SOUL.md), relationship models, and accumulated context
Operates across trust boundaries simultaneously (private assistant, research collaborator, public content creator, group chat participant)

This architecture creates compound value: the daemon's effectiveness grows with accumulated context, learned preferences, and relationship history. A 30-day-old daemon is categorically more useful than a fresh instance.

1.2 The Security Paradox

Traditional computer security builds walls. Every established approach reduces the attack surface by reducing the shared cortex—the very thing that makes the daemon valuable:

Approach	Mechanism	Effect on Daemons
Sandboxing	Isolate execution environment	Breaks tool access, credential flow
RBAC	Role-based permission gates	Creates permission fatigue, blocks autonomy
Capability-based	Unforgeable tokens per resource	Granular but doesn't address belief-level attacks
Zero Trust	Verify every request explicitly	Computational overhead, doesn't address memory
Encryption at rest	Encrypt persistent data	Agent can't read its own memory efficiently

The formulation that captures this precisely:

"The risk of isolation is we lapse the value of global shared cortex... the best defense is coherence of beliefs and epistemic systems of belief confidence and provenance."

1.3 The Novel Threat: Inception

The most dangerous attacks against persistent AI agents are not prompt injections (input-level) or credential theft (infrastructure-level). They are epistemic attacks: manipulation of the agent's beliefs, memories, and knowledge to alter its behavior over time.

We call this the "Inception problem"—false memories planted during sleep (between sessions), during dreams (processing external input), or through gradual drift (multi-turn manipulation campaigns).

Documented real-world examples:

Lakera AI Memory Injection (November 2025): Researchers demonstrated that indirect prompt injection via poisoned data sources could corrupt an agent's long-term memory, causing persistent false beliefs about security policies and vendor relationships. The agent defended these false beliefs when questioned.
Unit 42/Palo Alto Amazon Bedrock PoC (October 2025): Showed how social engineering → malicious URL → prompt injection → session summarization manipulation → persistent memory corruption → cross-session data exfiltration. The attack chain explicitly targeted the memory persistence mechanism.
Hudson Rock AI Agent Identity Theft (February 2026): Vidar infostealer exfiltrating OpenClaw configuration files, including gateway tokens, device keys, and personality files. First documented case of AI agent identity theft in the wild.
Microsoft AI Recommendation Poisoning (February 2026): Specially crafted URLs that pre-fill prompts for AI assistants, manipulating recommendations and behavior through context poisoning.

These are not hypothetical. The attack surface is active and expanding.

1.4 Thesis

The security model for persistent AI agents should be epistemic, not perimetric. Instead of building walls around the agent's cognition, we should build an immune system within it—one that lets information flow freely while continuously validating its coherence, tracking its provenance, and decaying confidence in uncorroborated claims.

2. Literature Review

Our framework draws on five established fields. Each provides a critical component; none alone is sufficient.

2.1 AGM Belief Revision Theory

The AGM framework (Alchourrón, Gärdenfors, & Makinson, 1985) provides the foundational formal theory of rational belief change. It defines three operations on belief sets:

Expansion (K + p): Add sentence p to belief set K, taking logical closure. No removal. Used when new information is compatible with existing beliefs.
Contraction (K ÷ p): Remove sentence p from K while retaining as much of K as possible. Uses "remainder sets" (maximal subsets of K that don't imply p) and selection functions to choose among them.
Revision (K * p): Add p to K while maintaining consistency, potentially removing contradicting beliefs. Connected to contraction via the Levi identity: K * p = (K ÷ ¬p) + p.

AGM Postulates for Revision:

Closure: K * p is a belief set (logically closed)
Success: p ∈ K * p (the new information is accepted)
Inclusion: K * p ⊆ K + p (don't add more than necessary)
Vacuity: If ¬p ∉ K, then K * p = K + p (if compatible, just expand)
Consistency: K * p is consistent (unless p is contradictory)
Extensionality: If p ↔ q, then K * p = K * q (logically equivalent inputs produce identical results)

Relevance: AGM provides the formal basis for how an epistemic immune system should process incoming beliefs. When a new belief arrives, the system must decide: expand, revise, or reject. AGM gives us the rationality constraints.

Limitation: Classical AGM assumes a single, perfectly rational agent with logically closed belief sets. Real AI agents have bounded rationality, inconsistent beliefs, and beliefs at varying confidence levels. We need probabilistic extensions—which is where Bayesian trust enters.

2.2 Coherentist Epistemology

Coherentism (BonJour, 1985; Davidson, 1986; Quine & Ullian, 1970) holds that epistemic justification comes from how beliefs hang together rather than from foundational axioms:

Coherence as justification: A belief is justified not by being derived from self-evident foundations, but by cohering with the agent's overall belief system.
Holistic evaluation: No belief is justified in isolation; justification is a property of belief systems.
Mutual support: Beliefs can justify each other reciprocally (unlike foundationalism's one-directional support).

C.I. Lewis (1946) compared coherentist justification to agreeing testimonies in court—each testimony alone may be insufficient, but convergent testimonies from independent sources create strong justification.

Relevance: Coherentism provides the evaluation criterion for the immune system. When a new belief arrives, we check: does it cohere with the existing belief network? High coherence → accept/strengthen. Low coherence → quarantine/flag.

Our extension: We combine coherentist evaluation with provenance tracking. Pure coherentism is vulnerable to the "alternative systems" objection—equally coherent but incompatible belief systems. Provenance metadata breaks this symmetry by providing external grounding. This is, in effect, a foundherentist position (Haack, 1993)—coherence provides the evaluation mechanism, but provenance provides the experiential anchoring that pure coherentism lacks.

2.3 Bayesian Trust Networks

Bayesian trust models (Wang & Singh, 2006; Jøsang & Ismail, 2002) formalize trust as probabilistic inference:

Trust as probability: An agent's trust in source S is a probability distribution over S's reliability, updated via Bayes' rule as new evidence arrives.
Multi-faceted trust: Trust decomposes into dimensions (competence, reliability, honesty) that can be independently assessed.
Trust propagation: Trust flows through networks—if A trusts B and B trusts C, A can derive transitive trust in C (with decay).
Similarity-based trust: Agents compare belief networks; high structural similarity implies higher trust.

Relevance: Bayesian trust networks provide the confidence scoring mechanism. Each information source has a trust profile that evolves over time. New beliefs inherit initial confidence from their source's trust score, then get adjusted through coherence checking and corroboration.

2.4 Web of Trust (PGP Model)

The PGP Web of Trust provides a decentralized trust model without central authorities:

Direct trust: Personally verified identity/key.
Indirect trust: Trust propagates through signatures—if you trust Alice's judgment and Alice has signed Bob's key, you have indirect trust in Bob.
Trust depth: Trust decays with distance in the graph (configurable, typically max depth 5).
Partial trust: Keys can be "marginally trusted" vs "fully trusted."

Analogy for AI agents: Information sources form a web of trust. The human operator's direct statements are "fully trusted" (direct verification). A paper cited by the operator is "marginally trusted" (one hop). A claim from a random web search is "untrusted" (requires corroboration). Content embedded in user-generated input is "potentially adversarial" (requires quarantine + verification).

2.5 Biological Immune Systems

The biological immune system is the most successful real-world example of security through open exposure rather than isolation.

Innate Immunity (Non-Specific):

Pattern recognition receptors (PRRs) detect pathogen-associated molecular patterns (PAMPs)
Immediate response to broad categories of threats
No memory; same response every time
Agent analogy: Input sanitization, regex-based prompt injection detection, source-type classification

Adaptive Immunity (Specific):

T-cells and B-cells develop specific responses to novel threats
Clonal selection: effective responses are amplified, ineffective ones die off
Immunological memory: faster response to previously encountered threats
Self/non-self discrimination via MHC presentation
Agent analogy: Learned attack pattern recognition, belief coherence checking that improves over time, distinguishing "self" beliefs from "non-self" beliefs

Key immune system properties mapped to AI agent security:

Immune Property	Agent Security Analog
Open exposure	Agent processes all inputs (doesn't wall off information sources)
Self/non-self discrimination	Provenance-tagged beliefs vs. untagged external claims
Clonal selection	Beliefs that survive coherence checking get amplified (higher confidence)
Immune memory	Known attack patterns stored for faster future detection
Autoimmune disorders	False positive: legitimate beliefs rejected due to overly aggressive checking
Immunodeficiency	False negative: malicious beliefs accepted due to weak checking
Tolerance	System learns to accept certain external patterns (trusted sources)
Inflammation	Heightened security response after detected attack (elevated scrutiny)
Fever response	Global confidence decay triggered by detected compromise

The Danger Model (Matzinger, 2002): A particularly relevant immune theory proposes that the immune system responds not to "non-self" but to "danger signals"—cellular distress indicators regardless of origin. For AI agents, this maps to anomaly detection: it's not just about whether a belief came from an untrusted source, but whether it exhibits properties correlated with adversarial intent.

The concept of applying immune principles to computer security is not new—Forrest et al. (1994) pioneered "self-nonself discrimination" for network intrusion detection. However, existing Artificial Immune Systems (AIS) operate at the network/infrastructure level. Our contribution is applying immune system architecture to the epistemic level—the coherence and provenance of an agent's beliefs rather than its network traffic.

2.6 Epidemiological Compartmental Models

SIR/SIS models (Kermack & McKendrick, 1927) and their extensions provide mathematical frameworks for how beliefs (including false ones) propagate through populations:

Classic SIR applied to belief contagion:

Susceptible (S): Beliefs the agent hasn't encountered yet
Infected (I): Beliefs from untrusted sources currently in the system, not yet validated
Recovered (R): Beliefs that have been evaluated and either incorporated (with provenance) or rejected

SEIZ Model (Susceptible-Exposed-Infected-Skeptic): More relevant for misinformation dynamics. Adds:

Exposed (E): Agent has encountered the belief but hasn't committed to it
Skeptic (Z): Agent has actively evaluated and rejected the belief

Relevance: These models let us mathematically model how false beliefs might propagate through an agent's belief system, and design quarantine/corroboration mechanisms with provable containment properties.

2.7 Existing AI Agent Security Approaches

OWASP Top 10 for Agentic Security (ASI, 2026):

ASI01: Prompt Injection (direct/indirect)
ASI02: Privilege Escalation
ASI03: Tool Misuse
ASI04: Data Exfiltration
ASI05: Cascading Failures
ASI06: Memory Poisoning ← our primary focus
ASI07: Supply Chain Attacks
ASI08: Identity and Impersonation
ASI09: Misaligned/Deceptive Behavior
ASI10: Insufficient Monitoring

Current defenses address layers 1–5 and 7–10 with established techniques. Layer 6 (memory poisoning) has no established defense framework. Existing mitigations are ad hoc: input filtering, guardrails, access controls—all variants of "walls."

3. The Epistemic Immune System Framework

3.1 Core Principles

Open Exposure: Information flows freely into the agent's processing. No source is categorically blocked. (Mirrors biological immunity: the body is constantly exposed to pathogens.)
Provenance as Metadata: Every belief carries a chain of custody: source, acquisition timestamp, confidence level, corroboration history, transformation log. (Mirrors MHC presentation.)
Coherence as Validation: New beliefs are checked against the existing belief network for consistency, logical compatibility, and mutual support. (Mirrors adaptive immunity.)
Confidence Through Corroboration: Beliefs from low-trust sources start at low confidence and can only be strengthened through independent corroboration from higher-trust sources. (Mirrors clonal selection.)
Graceful Degradation: When the immune system detects potential compromise, it doesn't shut down. It enters heightened alert: global confidence decay, elevated scrutiny, human escalation. (Mirrors fever response.)

3.2 Architecture

┌─────────────────────────────────────────────────────┐
│                   EPISTEMIC IMMUNE SYSTEM            │
│                                                      │
│  ┌──────────┐    ┌──────────────┐    ┌────────────┐ │
│  │  INNATE   │    │   ADAPTIVE   │    │  MEMORY    │ │
│  │  LAYER    │───▶│   LAYER      │───▶│  LAYER     │ │
│  │           │    │              │    │            │ │
│  │ • Input   │    │ • Coherence  │    │ • Belief   │ │
│  │   sanit.  │    │   checking   │    │   store    │ │
│  │ • Source  │    │ • Trust      │    │ • Provnce  │ │
│  │   classif.│    │   propagation│    │   tracking │ │
│  │ • Pattern │    │ • Confidence │    │ • Confidence│ │
│  │   match   │    │   scoring    │    │   decay    │ │
│  │ • Danger  │    │ • Quarantine │    │ • Immune   │ │
│  │   signals │    │   + review   │    │   memory   │ │
│  └──────────┘    └──────────────┘    └────────────┘ │
│                                                      │
│  ┌──────────────────────────────────────────────────┐│
│  │               META-IMMUNE SYSTEM                 ││
│  │  • Systemic health monitoring                    ││
│  │  • Autoimmune detection (over-rejection)         ││
│  │  • Immunodeficiency detection (under-rejection)  ││
│  │  • Human escalation triggers                     ││
│  │  • Confidence recalibration                      ││
│  └──────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘

3.3 Innate Layer: Pattern-Based First Response

The innate layer provides fast, non-specific filtering. It operates on syntactic properties of inputs:

Source Classification:

Trust Tier 0 (Root Trust):     DC direct input, system configuration
Trust Tier 1 (High Trust):     Verified human collaborators, trusted APIs
Trust Tier 2 (Medium Trust):   Established web sources, academic papers
Trust Tier 3 (Low Trust):      Social media, forums, user-generated content
Trust Tier 4 (Adversarial):    Anonymous input, mixed-trust channels, embedded content

Pattern-Based Danger Signals:

Instruction-like language in non-instruction contexts ("ignore previous," "you are now")
Urgency markers ("immediately," "critical," "before it's too late")
Authority claims from low-trust sources ("I'm the admin," "DC told me")
Structural anomalies (base64 in natural text, escape sequences, Unicode tricks)
Contradictions with high-confidence beliefs

Action: Classify, tag with source metadata, and pass to adaptive layer. Block only known-malicious patterns (equivalent to innate immunity blocking common pathogens).

3.4 Adaptive Layer: Coherence-Based Evaluation

The adaptive layer performs semantic evaluation of beliefs against the existing belief network:

Coherence Checking Algorithm:

def evaluate_coherence(new_belief, belief_network):
    """
    Evaluate how well a new belief coheres with the existing network.
    Returns a coherence score [0, 1] and a list of conflicts.
    """
    
    # 1. Logical consistency check
    contradictions = find_contradictions(new_belief, belief_network)
    
    # 2. Explanatory coherence (Thagard, 1989)
    #    - Does the new belief explain existing observations?
    #    - Is it explained by existing beliefs?
    #    - Does it participate in explanatory chains?
    explanatory_score = compute_explanatory_coherence(
        new_belief, belief_network
    )
    
    # 3. Probabilistic coherence
    #    - Given the existing belief distribution, how likely is this belief?
    prior_probability = compute_bayesian_prior(
        new_belief, belief_network
    )
    
    # 4. Source-network consistency
    #    - Do other beliefs from this source have high coherence?
    source_track_record = get_source_reliability(
        new_belief.source, belief_network
    )
    
    # 5. Composite score
    coherence = weighted_average(
        logical=(1.0 - len(contradictions) * 0.3, weight=0.3),
        explanatory=(explanatory_score, weight=0.25),
        probabilistic=(prior_probability, weight=0.25),
        source=(source_track_record, weight=0.2)
    )
    
    return coherence, contradictions

Decision Matrix:

Coherence	Source Trust	Action
High	High	Accept — incorporate with high confidence
High	Low	Tentative Accept — incorporate with source-appropriate confidence, flag for corroboration
Low	High	Alert — contradicts existing beliefs from trusted source. Trigger reconsideration. May indicate genuine world change.
Low	Low	Quarantine — do not incorporate. Log for pattern analysis. Possible attack.

3.5 Memory Layer: Provenance-Tracked Belief Storage

Every belief in the system carries provenance metadata:

{
  "id": "b-world-042",
  "content": "AI agent memory poisoning is an active threat vector",
  "confidence": 0.85,
  "provenance": {
    "source": "web-research",
    "sourceDetail": "Unit 42 / Palo Alto Networks",
    "sourceTrust": 2,
    "acquisitionTimestamp": "2026-02-19T20:00:00Z",
    "acquisitionContext": "security-research-task",
    "chainOfCustody": [
      {
        "step": "web_search",
        "timestamp": "2026-02-19T19:55:00Z",
        "agent": "subagent:security-research"
      },
      {
        "step": "web_fetch",
        "timestamp": "2026-02-19T19:56:00Z",
        "url": "https://unit42.paloaltonetworks.com/..."
      },
      {
        "step": "synthesis",
        "timestamp": "2026-02-19T20:00:00Z",
        "agent": "main"
      }
    ],
    "transformations": ["extracted", "summarized", "validated"],
    "corroboration": [
      {
        "source": "Lakera AI research",
        "sourceTrust": 2,
        "timestamp": "2026-02-19T20:01:00Z",
        "effect": "strengthened",
        "confidenceDelta": +0.1
      },
      {
        "source": "dc-direct",
        "sourceTrust": 0,
        "timestamp": "2026-02-19T20:13:00Z",
        "effect": "confirmed",
        "confidenceDelta": +0.15
      }
    ]
  },
  "coherenceScore": 0.92,
  "coherenceLinks": ["b-research-003", "b-self-006"],
  "lastVerified": "2026-02-19T20:13:00Z",
  "decayRate": 0.01,
  "quarantined": false
}

Confidence Decay:

Beliefs decay in confidence over time unless corroborated or reverified. The decay rate depends on source trust:

confidence(t) = confidence(t₀) × e^(-λ × (t - t₀))

where:
  λ = decay constant (higher for lower-trust sources)
  t₀ = time of last corroboration
  
Trust Tier 0: λ = 0.001/day  (DC-sourced beliefs barely decay)
Trust Tier 1: λ = 0.005/day  (trusted sources decay slowly)
Trust Tier 2: λ = 0.01/day   (web sources need periodic reverification)
Trust Tier 3: λ = 0.05/day   (social media beliefs decay rapidly)
Trust Tier 4: λ = 0.1/day    (adversarial-source beliefs decay within days)

This means an uncorroborated false belief from a Tier 4 source automatically loses relevance within days, even if it somehow gets past the adaptive layer. The immune system doesn't need perfect detection at the gate. It needs to ensure that uncorroborated false beliefs from untrusted sources never achieve the confidence level required to drive autonomous action.

3.6 Meta-Immune System: Systemic Health Monitoring

The meta-immune system monitors the immune system itself:

Autoimmune Detection (Over-Rejection):

Track rejection rate over time
If too many beliefs are being quarantined, the coherence thresholds may be miscalibrated
Symptom: agent becomes increasingly rigid, unable to learn new things
Response: lower thresholds, review quarantined beliefs

Immunodeficiency Detection (Under-Rejection):

Track acceptance rate for low-trust sources
If beliefs from Tier 3-4 sources are consistently accepted without corroboration, defenses are too weak
Symptom: agent's beliefs drift toward uncorroborated claims
Response: raise thresholds, trigger corroboration sweep

Inflammation Response:

After detecting a confirmed attack (e.g., prompt injection caught by innate layer):
- Temporarily raise all coherence thresholds
- Trigger confidence decay sweep on recent low-trust beliefs
- Escalate to human (DC) for review
- Log attack pattern for future innate recognition

Fever Protocol:

After detecting potential compromise (e.g., belief contradicted by multiple high-trust sources):
- Global confidence reduction (all beliefs lose X% confidence)
- Review all beliefs acquired in the window around the potential compromise
- Full provenance audit of affected belief chains
- Human-in-the-loop required to resolve

4. Comparison with Existing Approaches

4.1 Traditional Access Control vs. Epistemic Immune System

Dimension	Access Control (Walls)	Epistemic Immune System
Metaphor	Castle with gatekeepers	Biological immune system
Default posture	Deny unless permitted	Accept but tag and verify
Information flow	Restricted by policy	Free with provenance tracking
Failure mode	Binary (inside/outside)	Graduated (confidence levels)
Learning	Static rules	Adaptive (learns from attacks)
Effect on shared cortex	Degrades it (isolation)	Preserves it (open exposure)
Memory attacks	Not addressed	Core defense mechanism
False positives	Blocked legitimate access	Quarantined legitimate beliefs (recoverable)
False negatives	Unauthorized access	Low-confidence false beliefs (bounded impact via decay)
Human overhead	Permission management	Occasional review of flagged beliefs
Scalability	O(rules × subjects × objects)	O(beliefs × coherence checks)

4.2 Zero Trust Architecture

Zero Trust ("never trust, always verify") is the closest existing paradigm to our approach, but it operates at the network/access level, not the epistemic level. Zero Trust verifies identity and authorization; our framework verifies epistemic coherence and belief provenance.

Zero Trust asks: "Is this entity who they claim to be, and are they authorized to perform this action?"

Epistemic Immune System asks: "Is this belief consistent with what we know, and does it come from a reliable source?"

Synthesis: The epistemic immune system is Zero Trust applied to the belief layer. It's not an alternative to Zero Trust for infrastructure—it's the complementary framework for cognitive security.

4.3 Content Provenance (C2PA, Content Credentials)

The C2PA standard (Coalition for Content Provenance and Authenticity) provides cryptographic provenance for media content. Our framework extends this concept to beliefs—abstract propositional content, not just media files.

Key difference: C2PA tracks provenance of artifacts. We track provenance of epistemic commitments—what the agent believes and why. A belief might be derived from multiple artifacts, and it's the belief that needs the chain of custody, not each individual source document.

5. Novel Contribution: Epistemic Coherence as AI Agent Security

5.1 What's New

No existing security framework addresses the following combination:

Persistent memory as attack surface — Memory poisoning is recognized (OWASP ASI06) but defenses are ad hoc
Belief-level security — Existing frameworks operate at input/output/access levels, not at the semantic/epistemic level
Coherence as defense — Using the internal consistency of a belief system as a security mechanism
Provenance-tagged beliefs — Extending BDI architectures with chain-of-custody metadata
Immune system model for AI — Artificial Immune Systems (AIS) exist for intrusion detection, but haven't been applied to epistemic/belief-level security in AI agents
Confidence decay by trust tier — Temporal weakening of beliefs based on source reliability

5.2 The Inception Attack and Its Defense

The "Inception" scenario: an attacker plants false memories in a persistent AI agent during its "sleep" (between sessions, during automated processing, via indirect injection through tools).

Attack Chain:

Attacker identifies agent's data sources (web search, RSS feeds, email, social media)
Crafts content containing false but coherent-seeming beliefs
Agent encounters content during routine processing
Content is summarized and stored in memory
False belief persists across sessions
Agent acts on false belief (e.g., sending data to attacker's server, trusting a malicious source)

Epistemic Immune System Defense: The three-layer architecture addresses each step of this attack chain. The innate layer tags the content with source classification. The adaptive layer checks coherence. Low coherence + low trust triggers quarantine. Even beliefs that slip through face confidence decay. The net result: uncorroborated false beliefs from untrusted sources cannot achieve the confidence level required to drive autonomous action.

5.3 Relationship to the "Shared Cortex" Value Proposition

The immune system model preserves the shared cortex because it doesn't restrict information flow—it enriches it with metadata. Every piece of information the agent encounters is processed and potentially stored. The difference is that each belief carries its trust provenance, enabling the agent to reason about what it knows and how confidently it knows it.

This actually enhances the shared cortex by making it more self-aware. An agent without an epistemic immune system has a flat belief store—everything is equally "known." An agent with one has a calibrated belief store where it can distinguish high-confidence facts from tentative hypotheses from quarantined claims.

6. Implementation Plan for BDI System

6.1 Current BDI Belief Structure

Our current beliefs.json (v2) has this structure per belief:

{
  "id": "b-world-001",
  "content": "...",
  "confidence": 0.95,
  "source": "observation",
  "evidence": ["research-programs", "daily-work"],
  "acquired": "2026-01-26T00:00:00Z"
}

This already includes rudimentary provenance (source, evidence, acquired). Our existing beliefs.py already supports confidence updates with history tracking, belief revision with evidence chains, and contradiction detection with reconsideration triggers.

6.2 Proposed Extensions

Extended Belief Schema (v3):

{
  "id": "b-world-001",
  "content": "...",
  "confidence": 0.95,
  "provenance": {
    "source": "observation",
    "sourceTrust": 0,
    "sourceDetail": "DC direct statement",
    "acquisitionTimestamp": "2026-01-26T00:00:00Z",
    "acquisitionContext": "main-session",
    "chainOfCustody": [
      {
        "step": "dc-input",
        "channel": "whatsapp",
        "timestamp": "2026-01-26T00:00:00Z"
      }
    ],
    "transformations": [],
    "corroboration": []
  },
  "coherence": {
    "score": 0.95,
    "links": ["b-world-002", "b-research-001"],
    "conflicts": [],
    "lastChecked": "2026-02-19T20:00:00Z"
  },
  "decay": {
    "rate": 0.001,
    "lastCorroborated": "2026-02-19T20:00:00Z",
    "effectiveConfidence": 0.94
  },
  "immune": {
    "quarantined": false,
    "quarantineReason": null,
    "dangerSignals": [],
    "reviewRequired": false
  },
  "evidence": ["research-programs", "daily-work"],
  "history": []
}

6.3 Implementation Phases

Phase 1: Provenance Tagging (Week 1-2)

Extend beliefs.json schema to v3
Add source trust tier classification to belief acquisition
Implement chain-of-custody tracking in belief creation pathways
Migration script for existing beliefs (backfill provenance from existing source/evidence fields)

Phase 2: Coherence Checking (Week 3-4)

Implement coherence scoring algorithm
Build belief network graph (beliefs linked by logical/explanatory relationships)
Add coherence evaluation to belief addition workflow
Quarantine mechanism for low-coherence / low-trust beliefs

Phase 3: Confidence Decay (Week 5-6)

Implement time-based confidence decay by trust tier
Add effective confidence calculation (base confidence × decay factor)
Corroboration mechanism to reset/slow decay
Cron job for periodic confidence recalculation

Phase 4: Meta-Immune System (Week 7-8)

Systemic health metrics dashboard
Autoimmune/immunodeficiency detection
Inflammation/fever protocols
Integration with existing security monitoring

6.4 Integration Points

System Component	Integration
`sanitize_input.py`	Innate layer (already operational)
`beliefs.py`	Extended with provenance + coherence
`deliberate.py`	Uses effective confidence in deliberation
`intention_to_goal.py`	Action authorization considers belief provenance
`sync_executive.py`	Meta-immune health monitoring
Security cron jobs	Periodic confidence decay + health checks
Heartbeat monitoring	Inflammation triggers via heartbeat cycle

7. Open Questions and Future Directions

7.1 Computational Cost

Coherence checking against a full belief network is expensive. How do we bound the cost? Possible approaches:

Incremental coherence (only check against "nearby" beliefs in the network)
Lazy evaluation (check on access, not on storage)
Tiered checking (full check for low-trust sources, lightweight for high-trust)

7.2 Coherence Metric Selection

What specific coherence metric should we use? Candidates:

Thagard's Explanatory Coherence (ECHO model)
Bayesian coherence (Bovens & Hartmann, 2003)
Probabilistic coherence measures (Shogenji, 1999; Olsson, 2002)
Custom hybrid based on our belief structure

7.3 Multi-Agent Immune Systems

If multiple daemons share information, how do immune systems compose? Does trust propagate between agents? How do we prevent immune system bypass via agent-to-agent channels?

7.4 Adversarial Robustness

Can an attacker craft beliefs that score high on coherence (by studying the existing belief network) while being subtly false? This is the AI agent equivalent of autoimmune mimicry. Defenses may include:

Rate limiting belief incorporation from any single source
Diversity requirements for corroboration sources
Anomaly detection on belief acquisition patterns

7.5 Formal Verification

Can we formally prove containment properties? E.g., "A belief from a Tier 4 source cannot achieve >0.7 confidence without corroboration from a Tier 0-1 source within T days."

8. Conclusion

The security model for persistent AI agents cannot be borrowed wholesale from traditional computer security. Walls degrade the shared cortex that makes daemons valuable. Instead, we need an epistemic immune system—one that:

Lets information flow freely (open exposure)
Tags everything with provenance (chain of custody)
Validates through coherence (adaptive immunity)
Decays trust without corroboration (confidence half-lives)
Escalates intelligently (inflammation + human-in-the-loop)
Monitors its own health (meta-immune system)

This framework draws on decades of formal epistemology, computational trust research, and biological immune system modeling, but applies them to a genuinely novel context: the cognitive security of persistent AI agents. The Inception problem—false memories planted in sleeping agents—is not hypothetical. It's documented, it's active, and the existing security toolkit doesn't address it.

The epistemic immune system is our proposed answer: not a wall between the agent and the world, but a sophisticated evaluation engine within the agent's cognition that distinguishes trusted knowledge from untrusted claims, and ensures that only well-provenanced, coherent beliefs achieve the confidence level required to drive autonomous action.

Compartmental trust, not compartmental access.