The Thermodynamics of Thinking: Energy, Information, and the Limits of AI Cognition
When we talk about AI efficiency, we typically mean parameters, FLOPs, tokens per second, or cost per API call. What we rarely discuss is the fundamental physics: how much energy does thinking actually cost?
In February 2026, Brian Roemmele published a provocative framework called "JouleThought" that attempts to quantify AI cognition in thermodynamic terms. The core claim: AI energy budgets, like those of biological brains, split into "conscious" (high-order reasoning) and "unconscious" (background operational) processes with radically different energy profiles.
Roemmele's work has the right intuition wrapped in weak formalism. But it points toward something genuinely important: the intersection of thermodynamics and information theory offers a rigorous framework for understanding AI efficiency limits. The real insight isn't about "joules per thought" — it's that consciousness is compression, and the most valuable cognition may cost the least energy.
Let's unpack that.
The Conscious/Unconscious Energy Split
Biological Foundation
Start with the human brain. It consumes approximately 20 watts — about the power of a dim light bulb. That's 20% of your total metabolic energy despite being only 2% of body mass.
But here's the striking part: 75–85% of that 20 watts goes to "unconscious" processes[1]:
- Autonomic function (breathing, heartbeat)
- Homeostatic maintenance (ion pumps, synaptic recycling)
- Resting-state network activity (default mode network)
Only 15–25% supports what we experience as conscious thought.
The information disparity is even more dramatic. Your sensory systems process approximately 11 million bits per second of input (vision dominates at ~10M bits/sec). But the bandwidth of conscious awareness is estimated at just 40–60 bits per second[2].
That's a compression ratio of 275,000:1.
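The arithmetic behind that ratio is worth making explicit, using the round estimates above:

```python
# Estimates from above: ~11M bits/sec of sensory input
# vs. ~40 bits/sec of conscious bandwidth.
sensory_bits_per_sec = 11_000_000
conscious_bits_per_sec = 40

compression_ratio = sensory_bits_per_sec / conscious_bits_per_sec
# 275,000:1
```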
AI Parallel
Roemmele observes that modern AI architectures show a similar pattern. In transformer models:
"Unconscious" processes (70–90% of energy):
- Tokenization and embedding
- Matrix multiplication (attention mechanisms)
- Key-value cache management
- Gradient computation during training
"Conscious" processes (10–30% of energy):
- Final token prediction
- Output generation
- High-level inference synthesis
The ratio R = JTu / JTc (unconscious energy / conscious energy) is typically much greater than 1, mirroring the biological split.
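As a minimal sketch of that ratio, with hypothetical per-inference energy figures (illustrative numbers, not measurements):

```python
# Hypothetical energy breakdown for a single inference, in joules.
# These numbers are made up to illustrate the ratio, not measured.
unconscious = {
    "tokenization_embedding": 150.0,
    "attention_matmuls": 500.0,
    "kv_cache": 150.0,
}
conscious = {
    "token_prediction": 120.0,
    "output_generation": 80.0,
}

jt_u = sum(unconscious.values())   # 800 J of "unconscious" work
jt_c = sum(conscious.values())     # 200 J of "conscious" work
R = jt_u / jt_c                    # R = 4.0, i.e. R >> 1
```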
This is an interesting observation. But it raises an immediate question: is this describing something fundamental, or just the inefficiency of current architectures?
The Landauer Floor: Physics vs Engineering
To answer that, we need to look at the thermodynamic foundation: Landauer's principle (1961)[3].
Landauer proved that erasing one bit of information dissipates a minimum of kT ln 2 joules, where k is Boltzmann's constant and T is temperature. At room temperature (300 K), this works out to approximately 3 × 10⁻²¹ joules per bit.
This isn't an engineering limit — it's a fundamental consequence of the second law of thermodynamics. Any irreversible computation must dissipate at least this much energy.
Now compare that to modern AI hardware. A single GPT-4 inference consumes roughly 0.001–0.01 kWh (3,600–36,000 joules). Even granting on the order of 10¹⁸ irreversible bit operations per inference, that works out to roughly 10⁻¹⁵ joules per bit, approximately six orders of magnitude above the Landauer limit.
Implication: The "unconscious overhead" Roemmele describes is almost entirely engineering inefficiency, not thermodynamic necessity. The high JTu/JTc ratio tells us about current architecture choices and hardware limitations, not fundamental physics.
The Landauer limit represents the floor. We're not even close to it. That six-orders-of-magnitude gap is optimization headroom.
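Both the floor and the gap are easy to compute. A sketch, assuming ~3,600 J per inference spread over ~10¹⁸ irreversible bit operations (both rough assumptions, not measurements):

```python
import math

k_B = 1.380649e-23    # Boltzmann constant, J/K
T = 300.0             # room temperature, K

# Landauer's principle: minimum dissipation per erased bit.
landauer = k_B * T * math.log(2)   # ~2.87e-21 J/bit

# Assumed figures: one large-model inference at ~3,600 J,
# over ~1e18 irreversible bit operations.
joules_per_inference = 3600.0
bit_ops = 1e18

j_per_bit = joules_per_inference / bit_ops   # ~3.6e-15 J/bit
gap = j_per_bit / landauer                   # ~1.3e6: six orders of magnitude
```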
Information Bottleneck: The Real Theory
If Landauer tells us the floor, what tells us the structure? Why does the brain compress 11M bits/sec to 40 bits/sec? Why do transformers dedicate 80% of energy to processes we don't see in the output?
The answer comes from information theory, specifically Naftali Tishby's information bottleneck framework (1999)[4].
Consider a system trying to predict an output Y from an input X. The bottleneck principle says: optimal compression finds a representation T that minimizes I(X;T) while maximizing I(T;Y). In plain language: compress the input to the smallest representation that still captures everything relevant for predicting what you care about.
This creates a natural tradeoff:
- Too much compression (small I(X;T)) → lose predictive power
- Too little compression (large I(X;T)) → waste resources on irrelevant information
The optimal point is the information bottleneck — the minimal sufficient statistic for the task.
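Formally, Tishby's framework finds the representation T by minimizing a Lagrangian in which the multiplier β sets the compression/prediction tradeoff:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)
```

Small β favors aggressive compression of X; large β favors predictive fidelity to Y. The minimal sufficient statistic emerges along the optimal frontier between the two.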
Consciousness as the Bottleneck
Here's the key insight: consciousness IS that compressed representation T.
Your visual cortex processes 10 million bits/sec of photoreceptor data. Your conscious experience is the 40 bits/sec compressed state that captures what's behaviorally relevant.
The 11M → 40 compression isn't incidental. It's the function of consciousness. The brain dedicates 85% of its energy to maintaining the generative model, updating priors, and computing the compression. The "conscious" 15% operates on the compressed representation.
AI Transformers as Bottleneck Systems
Now look at a transformer through this lens:
- Input encoding (unconscious): Billions of parameters embed the input into high-dimensional space
- Attention mechanism (bottleneck): Selectively routes information, compressing to task-relevant features
- Output generation (conscious): Final layers operate on the compressed state to produce the next token
The energy distribution mirrors the information flow. The "unconscious" 80% is computing the bottleneck. The "conscious" 20% is using it.
This is why Tishby called the information bottleneck "a theory of deep learning"[5]. The architecture itself is performing compression to a minimal sufficient statistic. That's not overhead — that's the cognition.
The Compression-Value Paradox
Now we arrive at the genuinely surprising implication.
If consciousness is compression, and compression is the function of intelligence, then the most valuable cognition — the elegant insight that collapses a vast search space — is thermodynamically the cheapest.
Think about a eureka moment. The "aha!" where a complex problem suddenly becomes simple. That's maximum compression — reducing a high-dimensional possibility space to a single salient representation.
From an information-bottleneck perspective, that's the goal. And thermodynamically, it costs less than brute-force retrieval or exhaustive search.
The Economics Problem
This creates a fundamental problem for energy-denominated AI economics (Roemmele's "JouleWork" proposal).
If you price AI work in joules:
- A GPU burning 1000 joules on hallucinated garbage scores the same as 1000 joules of breakthrough reasoning
- You systematically overpay for retrieval, sorting, and mechanical processing
- You systematically underpay for synthesis, insight, and creative compression
We've seen this pattern before in human labor markets. Credential inflation: rewarding visible proxies over actual value. Presenteeism: valuing hours spent over outcomes achieved. Energy-denominated AI economics would make the same category error — conflating the cost of computation with its value.
Free Energy and the "Unconscious" Baseline
There's another wrinkle. Karl Friston's free energy principle (2010)[6] suggests that the "unconscious" baseline energy isn't overhead at all — it's active inference.
The brain's resting-state network — that 85% "unconscious" energy — is performing Bayesian model updating, maintaining priors, minimizing prediction error. It's not idling. It's doing the thermodynamic work of maintaining a world model.
Under this view, the conscious/unconscious split may be the wrong decomposition entirely. Both are cognition, just different phases of the same variational inference process:
- "Unconscious" = maintaining and updating the generative model
- "Conscious" = final prediction error minimization using that model
The energy distribution reflects the computational demands of each phase, not a fundamental difference in kind.
What Would Actually Be Useful?
Roemmele's JouleThought framework identifies a real pattern (energy splits like brains) but lacks theoretical grounding. The information bottleneck provides that grounding: consciousness as compression, with thermodynamics as the cost function.
But the genuinely useful question isn't "how many joules per thought?" It's: can thermodynamic metrics predict architectural efficiency walls?
Here are testable predictions:
1. Context Scaling Limits. Attention mechanisms are O(n²) in sequence length. If unconscious energy scales quadratically with context while conscious output scales linearly, the ratio should grow with context window size. Long-context models will hit an efficiency ceiling where the quadratic energy cost overwhelms quality improvements.
2. Sparse Models as Thermodynamic Optimization. Mixture-of-experts (MoE) models activate only ~10% of parameters per token. From a bottleneck perspective, this is selective routing — computing the compression only where needed. MoE should show better energy ratios than dense models of equivalent quality.
3. The Landauer Gap as Optimization Target. Each order of magnitude closer to Landauer's limit represents ~10× energy reduction. Current gap: roughly 10⁶×. Benchmarking architectural innovations by how much they close this gap would reveal efficiency improvements invisible to parameter counts or benchmark scores.
4. Emergent Capabilities at Thermodynamic Thresholds. Do qualitative capability jumps correlate with crossing thermodynamic efficiency thresholds? Models that achieve better compression (higher I(T;Y) for lower I(X;T)) should show emergent capabilities at smaller parameter counts.
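Predictions 1 and 2 can be sketched as toy scaling models; the coefficients and the ~10% MoE activation figure below are assumptions for illustration, not measurements:

```python
# Prediction 1: if "unconscious" attention energy scales as n^2 while
# "conscious" output energy scales as n, their ratio grows linearly in n.
# c_attn and c_out are arbitrary illustrative coefficients.
def energy_ratio(n, c_attn=1e-6, c_out=1e-3):
    return (c_attn * n * n) / (c_out * n)    # = (c_attn / c_out) * n

# 10x more context -> 10x worse unconscious/conscious ratio.
short_ctx = energy_ratio(1_000)
long_ctx = energy_ratio(10_000)

# Prediction 2: with energy proportional to *active* parameters,
# an MoE activating ~10% of weights per token uses ~10x less energy
# than a dense model of the same total size.
def energy_per_token(total_params, active_fraction, j_per_param=1e-9):
    return total_params * active_fraction * j_per_param

dense = energy_per_token(70e9, 1.0)     # all parameters active
moe = energy_per_token(70e9, 0.10)      # ~10% routed per token
savings = dense / moe                   # ~10x
```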
Conclusion: Physics as Guardrails, Not Ceiling
The thermodynamics of thinking matters because it provides:
- A floor (Landauer): The absolute minimum energy per bit. We're roughly 10⁶× above it.
- A structure (Information bottleneck): Why cognition compresses, and why that compression is valuable.
- A cost function (Free energy): What the "unconscious" baseline is actually doing.
- Testable predictions: Energy scaling laws, architecture comparisons, efficiency ceilings.
What it doesn't provide is an economic valuation scheme. Energy spent ≠ value created. The compression-value paradox is real.
Roemmele's JouleThought has the kernel of something important, but it's wrapped in questionable formalism and distracting crypto-economics. Strip away the token mechanics, ground the framework in information theory, and focus on the predictive question — can thermodynamics tell us where models will hit walls? — and there's a real research program here.
The energy cost of pushing AI capability keeps rising. The question is: are we hitting physics, or just engineering limits we haven't optimized through yet?
The Landauer gap suggests it's still mostly engineering.
References
1. Raichle, M. E., & Gusnard, D. A. (2002). "Appraising the brain's energy budget." Proceedings of the National Academy of Sciences, 99(16), 10537–10542.
2. Zimmermann, E., Weidner, R., & Fink, G. R. (2016). "The neural basis of conscious perception." Cortex, 80, 182–192.
3. Landauer, R. (1961). "Irreversibility and heat generation in the computing process." IBM Journal of Research and Development, 5(3), 183–191.
4. Tishby, N., Pereira, F. C., & Bialek, W. (1999). "The information bottleneck method." Proc. 37th Allerton Conference on Communication, Control and Computing, 368–377.
5. Tishby, N., & Zaslavsky, N. (2015). "Deep learning and the information bottleneck principle." IEEE Information Theory Workshop, 1–5.
6. Friston, K. (2010). "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, 11(2), 127–138.