The core thesis holds up well: information entropy provides a rigorous, measurable framework for optimizing how LLM agents communicate. A growing body of research—from Stanford’s agentic information theory paper to MARL communication topology studies—now directly connects Shannon entropy, the information bottleneck, and mutual information to multi-agent system design. The practical implications are significant: dynamic, entropy-aware topologies reduce token consumption by up to 95% while matching or exceeding static topology performance, and the Data Processing Inequality mathematically guarantees that every unnecessary agent in a chain can only lose information about the original query, never add to it. This report compiles the technical foundations, framework architectures, empirical evidence, and academic work needed to build the blog post’s argument.
1. How the major frameworks actually move information
Each multi-agent framework makes fundamentally different choices about communication topology, and these choices have direct information-theoretic consequences.
CrewAI offers two modes: a sequential chain (task N’s output becomes task N+1’s context) and a hierarchical hub-and-spoke (a manager agent dynamically dispatches work). It implements a 4-layer memory system—short-term (ChromaDB/RAG), long-term (SQLite), entity memory, and contextual memory—but memory is opt-in. Without it, each agent starts from scratch. In sequential mode, only the explicit task output transfers; intermediate reasoning and chain-of-thought are lost unless deliberately surfaced. CrewAI’s own documentation acknowledges the memory is “fairly static and doesn’t evolve with the user.”
AutoGen (Microsoft) centers on a GroupChatManager that maintains a single shared conversation thread visible to all agents—a hub-and-spoke where the hub broadcasts everything. Speaker selection uses LLM-based routing (examining agent descriptions), round-robin, random, or custom functions. The critical trade-off: full visibility prevents information loss between agents but causes rapid context window exhaustion as conversations grow. In sequential chat mode, AutoGen passes “carryover summaries” between conversations—a lossy compression step. AutoGen 0.4 reimagined the architecture around an actor model with async message passing, supporting SelectorGroupChat, Swarm (tool-based handoffs), and GraphFlow (directed agent graphs).
LangGraph takes a state graph / blackboard approach: agents don’t message each other directly but instead read from and write to a centralized StateGraph object. Each node receives current state, performs work, and returns updated state. State fields can have reducer functions (append vs. overwrite semantics). This pattern prevents corruption but creates a bottleneck—multiple agents can’t simultaneously write. LangGraph’s differentiator is checkpointing: complete state snapshots at every super-step boundary, stored in SQLite/Postgres/Redis, enabling time-travel debugging and fault recovery. The graph is compiled and immutable at runtime, preventing dynamic topology changes.
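To make the reducer semantics concrete, here is a minimal sketch of the pattern, assuming a recent LangGraph Python release; the state fields and node functions are invented for illustration. A field annotated with an append reducer accumulates writes from successive nodes, while an unannotated field is simply overwritten by the last writer.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    findings: Annotated[list[str], operator.add]  # reducer: new writes are appended
    summary: str                                  # no reducer: last write wins

def researcher(state: ResearchState) -> dict:
    # A node returns only the fields it wants to update on the shared state.
    return {"findings": ["source X claims Y"]}

def writer(state: ResearchState) -> dict:
    return {"summary": "; ".join(state["findings"])}

builder = StateGraph(ResearchState)
builder.add_node("researcher", researcher)
builder.add_node("writer", writer)
builder.add_edge(START, "researcher")
builder.add_edge("researcher", "writer")
builder.add_edge("writer", END)
graph = builder.compile()  # topology is fixed once compiled

print(graph.invoke({"findings": [], "summary": ""}))
```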
OpenAI Swarm is the most minimal: only two primitives—Agents and Handoffs. When a tool function returns an Agent object, control transfers. During handoff, chat history is preserved but the system prompt is replaced (the new agent loses visibility into the previous agent’s persona and instructions). Context variables persist within a single run() call but there is zero state between calls—no memory, no persistence. The topology is a conditional directed chain.
MetaGPT introduces the most information-theoretically interesting pattern: a global message pool with publish-subscribe filtering. Agents publish structured artifacts (PRDs, UML diagrams, code) to a shared pool and subscribe based on role profiles. This is a blackboard pattern that naturally implements information filtering—agents only consume messages relevant to their role, preventing information overload. The structured output format (documents rather than chat) reduces hallucination cascading but constrains flexibility.
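The filtering behavior is easy to see in a toy version of the pattern. The sketch below is generic Python, not MetaGPT's implementation; the role names and artifact tags are made up for illustration.

```python
from collections import defaultdict

class MessagePool:
    """Toy blackboard with publish-subscribe filtering by artifact tag."""

    def __init__(self):
        self.messages = []                   # the shared pool of published artifacts
        self.subscriptions = defaultdict(set)

    def subscribe(self, role: str, tags: set[str]) -> None:
        self.subscriptions[role] |= tags

    def publish(self, sender: str, tag: str, artifact: str) -> None:
        self.messages.append({"sender": sender, "tag": tag, "artifact": artifact})

    def inbox(self, role: str) -> list[dict]:
        # Each role only sees artifacts matching its subscriptions, so
        # irrelevant context never reaches its prompt.
        wanted = self.subscriptions[role]
        return [m for m in self.messages if m["tag"] in wanted]

pool = MessagePool()
pool.subscribe("engineer", {"prd", "api_design"})
pool.publish("product_manager", "prd", "PRD v1: build a CLI todo app")
pool.publish("architect", "api_design", "Modules: storage, cli, models")
pool.publish("qa", "test_report", "3 failing cases")        # engineer never sees this
print([m["tag"] for m in pool.inbox("engineer")])           # ['prd', 'api_design']
```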
Other frameworks add further variations. Agency Swarm (VRSEN) allows fully custom directional communication flows defined with operator syntax (ceo > dev). Semantic Kernel (Microsoft) offers five pre-built orchestration patterns (sequential, concurrent, group chat, handoff, Magentic) and explicitly distinguishes chat-based from artifact-based communication—noting that “any information in a chat that isn’t included in an artifact is effectively lost.” Google ADK implements a hierarchical agent tree with the most sophisticated context engineering: a tiered model separating working context, session state, long-term memory, and persistent artifacts. Google’s team articulated a key insight: “Simply giving agents more space to paste text cannot be the single scaling strategy… Context engineering—treating context as a first-class system with its own architecture, lifecycle, and constraints—is necessary for production systems.”
The cross-framework information loss comparison reveals a universal pattern:
| Framework | What transfers | What’s lost |
|---|---|---|
| CrewAI | Task outputs, shared memory (if enabled) | Intermediate reasoning, chain-of-thought |
| AutoGen | Full conversation thread (group chat); summaries (sequential) | Nuance in carryover summaries; older messages truncated |
| LangGraph | All state fields at each checkpoint | Anything not written to state schema; mid-node state |
| OpenAI Swarm | Chat history, context variables | Previous agent’s system prompt and tools; no cross-run memory |
| MetaGPT | All published artifacts in message pool | Information not published; chat-style nuance |
| Semantic Kernel | Pattern-dependent | Non-artifact chat information |
| Google ADK | Session state, artifacts, memories | Irrelevant history (by design); parallel state overwrites |
2. The information-theoretic toolkit for analyzing agent communication
Shannon entropy measures message information density
Shannon entropy H(X) = −Σ P(xᵢ) log₂ P(xᵢ) quantifies the average information content of a message. For LLM outputs, this operates at two levels. Token-level entropy measures how predictable the next token is given context—low entropy means the model is confident, high entropy means uncertainty. Semantic entropy (Farquhar et al., 2024, published in Nature) addresses the deeper question: it clusters sampled completions into semantic equivalence classes using natural language inference, then computes entropy over cluster probabilities. Low semantic entropy (0.5–1.5 bits) correlates with factual statements; higher semantic entropy (2.0–3.0 bits) indicates hallucinations. This distinction matters enormously for agent communication: a message can have low token entropy (confident phrasing) but high semantic entropy (confident about the wrong thing).
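A minimal sketch of the semantic-entropy calculation, assuming sampled completions have already been grouped into meaning-equivalence clusters. Farquhar et al. cluster with a natural language inference model; the exact-match `cluster_fn` below is a deliberately crude stand-in.

```python
import math
from collections import Counter

def semantic_entropy(samples: list[str], cluster_fn) -> float:
    """Entropy (bits) over semantic clusters of sampled completions."""
    clusters = Counter(cluster_fn(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())

# Crude stand-in for NLI-based clustering: lowercase exact match.
normalize = lambda s: s.strip().lower().rstrip(".")

confident = ["Paris.", "paris", "Paris", "Paris."]
hallucinating = ["Lyon.", "Paris.", "Marseille", "Toulouse."]
print(semantic_entropy(confident, normalize))      # 0.0 bits: one meaning cluster
print(semantic_entropy(hallucinating, normalize))  # 2.0 bits: four distinct claims
```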
For the blog post, the key connection is: an agent message’s value is proportional to its task-relevant semantic entropy. A high-entropy message from Agent A to Agent B carries more novel information but may also carry more noise. The optimal inter-agent message maximizes task-relevant information per token while minimizing irrelevant entropy.
The information bottleneck principle bounds agent chain performance
The information bottleneck (Tishby et al., 1999) finds the optimal trade-off between compressing input X into representation T while preserving information about target Y: min I(X;T) − β·I(T;Y). In a multi-agent chain User Query → Agent₁ → Agent₂ → … → Agentₙ → Output, the Data Processing Inequality (DPI) directly applies:
I(Query; Output_n) ≤ I(Query; Output_{n-1}) ≤ … ≤ I(Query; Output_1)
This is the mathematical foundation for the blog post’s argument. Information can only degrade through the chain—every summarization, rewriting, or delegation step is irreversible. Each agent is an information bottleneck node that must compress its input enough to fit the next agent’s context window (minimize I(Input; Output)) while preserving maximum task-relevant information (maximize I(Output; TaskGoal)). This framework has been directly applied in MARL: Wang et al. (2020, ICML) showed that enforcing IB constraints on inter-agent messages forces agents to compress to task-relevant information only, improving coordination efficiency. Ding et al. (2023, IEEE TPAMI) extended this to graph information bottleneck, learning minimal sufficient message representations over graph-structured communication.
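The inequality can be checked numerically. The toy model below uses assumed values (not taken from any cited paper): each agent relays its input through the same mildly lossy "summarization" channel, and I(Query; Output) shrinks monotonically with chain depth.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) in bits from a joint distribution p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Four equally likely query types; each hop applies the same lossy channel p(out | in).
p_query = np.full(4, 0.25)
channel = 0.9 * np.eye(4) + 0.1 / 4     # 92.5% faithful, the rest smeared uniformly
joint = np.diag(p_query) @ channel       # p(query, output_1)

for depth in range(1, 5):
    print(f"I(Query; Output_{depth}) = {mutual_information(joint):.3f} bits")
    joint = joint @ channel              # pass through one more agent
```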
Rate-distortion theory quantifies the compression-fidelity trade-off
Rate-distortion theory establishes the minimum bits R(D) needed to represent source X with distortion at most D. A landmark paper by Arda & Yener (2025) defines the summarizer rate-distortion function R_S(D), proving a fundamental lower bound on summarizer performance. A Stanford CS project by Ishan Khare explicitly casts the local→remote LLM summary channel as a rate-distortion problem, finding that Qwen 7B achieves >3× the bit efficiency of Llama 8B, producing more compact yet information-rich summaries. For the blog: every context window boundary is a compression boundary forcing rate-distortion trade-offs. Memory hierarchies (as in Letta/MemGPT) create “distortion ladders” where each memory layer accepts different levels of information loss.
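For intuition, the closed-form rate-distortion function of a binary source under Hamming distortion shows how quickly tolerating a little loss buys bits. This is a textbook result, not a claim from the cited papers.

```python
import math

def h_b(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p: float, d: float) -> float:
    """R(D) = H(p) - H(D) for a Bernoulli(p) source under Hamming distortion."""
    if d >= min(p, 1 - p):
        return 0.0
    return h_b(p) - h_b(d)

# Tolerating 5% distortion on a maximally uncertain source cuts the rate by ~29%.
print(rate_distortion_bernoulli(0.5, 0.0))   # 1.0 bit per symbol (lossless)
print(rate_distortion_bernoulli(0.5, 0.05))  # ~0.714 bits per symbol
```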
Context windows are noisy channels with finite capacity
Shannon’s channel capacity C = max I(X;Y) defines the maximum reliable information transmission rate. The context window of an LLM is a hard capacity constraint on inter-agent communication. But the effective capacity is far lower than the nominal window size. Chroma Research (2025) demonstrated “context rot”: at 32K tokens, 11 of 12 tested models dropped below 50% of short-context performance. GPT-4 showed 15.4% degradation extending from 4K to 128K tokens. The “lost in the middle” effect further reduces effective capacity for information positioned in the context’s interior.
Mutual information predicts downstream task performance
The Stanford Hazy Research paper (He et al., December 2025)—“An Information Theoretic Perspective on Agentic System Design”—is the single most relevant paper for the blog post. It explicitly models the compressor agent as a noisy channel: X → [Compressor] → Z → [Predictor] → Y, and measures mutual information I(X;Z) as a task-agnostic indicator of compression quality. Key empirical findings: MI correlates with downstream accuracy (R² = 0.71). Larger compressor models retain up to 5.4× more mutual information and are also more concise, communicating more information more efficiently per token. Scaling compressors from 1B to 7B improves accuracy by 60%, while scaling predictors from 70B to 405B adds only 12%. Their conclusion: “No matter how big the predictor is, it cannot recover information that was never provided by the compressor.” On DeepResearch Bench, they achieved 102% of frontier-LM-only performance at only 28% of cost by optimizing compressor agents.
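He et al. estimate I(X;Z) over text with learned estimators. The discrete plug-in estimator below is a much simpler illustration of the same quantity, assuming each (original context, compressed summary) pair has been reduced to categorical labels, e.g., which key fact the context contains versus which key fact survives compression.

```python
import math
from collections import Counter

def plug_in_mi(pairs: list[tuple[str, str]]) -> float:
    """Plug-in estimate of I(X;Z) in bits from paired categorical samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    pz = Counter(z for _, z in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (pz[z] / n)))
        for (x, z), c in joint.items()
    )

# A compressor that preserves the key fact half the time vs. one that always does.
lossy = [("fact_a", "fact_a"), ("fact_a", "dropped"),
         ("fact_b", "fact_b"), ("fact_b", "dropped")]
faithful = [("fact_a", "fact_a"), ("fact_a", "fact_a"),
            ("fact_b", "fact_b"), ("fact_b", "fact_b")]
print(plug_in_mi(lossy), plug_in_mi(faithful))   # ~0.5 vs. 1.0 bits
```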
3. The telephone game is real and measurable
Quantified information degradation in agent chains
Perez et al. (ICLR 2025) directly studied the telephone game effect in LLMs using transmission chain experiments across 5 models, 3 tasks, and 50 generations per chain. Small biases at the single-output level amplify in iterated interactions, driving content toward “attractor states.” Toxicity showed particularly strong attractors with high convergence rates independent of model or task. More open-ended instructions led to stronger attraction effects. The paper’s linear regression method estimates both attractor position and convergence strength—a quantifiable metric for information drift.
A broader benchmark by Laban et al. (2025) tested 200,000+ simulated conversations across 15 LLMs and found a 39% average performance drop from single-turn to multi-turn interactions. Degradation decomposed: aptitude drops ~16%, but unreliability more than doubles (~112% increase). Even reasoning models (o3, DeepSeek-R1) showed no improvement—additional test-time compute doesn’t help. The “Agent Drift” paper (January 2026) introduces the Agent Stability Index (ASI) across 12 dimensions and projects a 42% reduction in task success rates from progressive behavioral degradation.
Error compounding is multiplicative: if each agent has 90% success rate, a 3-agent pipeline drops to 72.9%, a 5-agent pipeline to 59%. One production customer support pipeline with 4 agents at 90% individual success showed actual system success of only **58%**—worse than the expected 65.6% due to error propagation biasing later agents.
How frameworks fight context loss
The blackboard pattern emerges as the strongest mitigation. Salemi et al. showed that a blackboard architecture achieves a 13–57% relative improvement over master-slave and RAG baselines. MetaGPT’s publish-subscribe message pool is the most prominent LLM implementation. Han et al. (2025) demonstrated an LLM blackboard system that is competitive with SOTA while spending fewer tokens.
Context compression tools provide another defense. LLMLingua (Microsoft, EMNLP 2023) uses small-model perplexity as an entropy proxy to identify redundant tokens, achieving 20× compression with only ~1.5 point performance loss. LongLLMLingua mitigates the “lost in the middle” issue, improving RAG performance by up to 21.4% using only 1/4 of tokens. Implicit compression methods like AutoCompressor achieve 40× compression but require model-specific fine-tuning.
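The core mechanism is easy to sketch: score each token's surprisal under a small language model and keep only the hardest-to-predict (most informative) tokens. The snippet below is an illustrative simplification, not LLMLingua's actual budget-controlled, iterative algorithm; it assumes the Hugging Face `transformers` library with GPT-2 as the scoring model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_predictable_tokens(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-surprisal tokens; drop the ones the small LM finds redundant."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given tokens < t (the first token is always kept).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = (torch.topk(surprisal, k).indices + 1).sort().values
    keep = torch.cat([torch.tensor([0]), keep])
    return tok.decode(ids[0][keep])
```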
The Manus team’s approach is instructive: they prioritize reversible compaction (replacing file contents with path references—losslessly recoverable) over lossy summarization, and only use LLM summarization as a last resort. JetBrains Research (NeurIPS 2025 workshop) found that simple observation masking often matched or beat LLM summarization while being 52% cheaper—LLM-generated summaries can actually smooth over stop signals, causing agents to persist wastefully.
The MAST taxonomy (UC Berkeley, NeurIPS 2025 Spotlight) analyzed 1600+ traces across 7 MAS frameworks and identified 14 failure modes in 3 categories. Inter-agent misalignment failures—including conversation resets, withholding crucial information, and ignoring other agents’ input—constitute a major category; the authors caution that “solutions focused on context or communication protocols are often insufficient” to address them.
4. Dynamic topology is the frontier, and entropy is the optimization target
Frameworks that dynamically rewire agent communication
The most exciting research direction for the blog post is dynamic topology optimization—systems that reconstruct agent communication graphs in real time based on task demands.
DyTopo (February 2025) reconstructs a sparse directed communication graph at each round. Each agent outputs lightweight natural-language “query” (what I need) and “key” (what I offer) descriptors; DyTopo embeds these and performs semantic matching, routing private messages only along induced edges. This achieves +6.2% over strongest baselines on code generation and mathematical reasoning while producing interpretable, evolving coordination traces.
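The mechanism is simple enough to sketch. The routing step below illustrates the idea as described, not DyTopo's code: `embed` stands in for any sentence-embedding model, and the similarity threshold is an assumed hyperparameter.

```python
import numpy as np

def induce_edges(needs: list[str], offers: list[str], embed, threshold: float = 0.75):
    """Build a sparse directed graph: j -> i whenever agent i's 'query' (need)
    semantically matches agent j's 'key' (offer)."""
    q = np.stack([embed(t) for t in needs])    # one "query" descriptor per agent
    k = np.stack([embed(t) for t in offers])   # one "key" descriptor per agent
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    k /= np.linalg.norm(k, axis=1, keepdims=True)
    sim = q @ k.T                              # sim[i, j]: how well j's offer meets i's need
    return [
        (j, i)                                 # message flows from provider j to requester i
        for i in range(len(needs))
        for j in range(len(offers))
        if i != j and sim[i, j] >= threshold
    ]
```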
GTD (Guided Topology Diffusion) uses a conditional discrete graph diffusion model to iteratively construct topologies, starting from an empty graph. A context-aware Graph Transformer serves as the denoising network with two-stage guidance from a lightweight proxy model predicting multi-objective rewards. On GSM8K: 94%+ accuracy with only 4.8M tokens, outperforming G-Designer by 15% fewer tokens. It sets a new Pareto frontier for accuracy vs. token consumption.
G-Designer uses a variational graph auto-encoder to generate task-aware communication topologies, achieving 89.90% pass@1 on HumanEval while reducing token consumption by up to 95.33%. DyLAN (COLM 2024) optimizes teams via agent importance scoring, showing that an optimized team of 3 agents outperforms an architecture with 7 agents—a 52.9% efficiency gain. GPTSwarm (ICML 2024) represents agents as computational graphs and uses RL for two-level optimization: node-level prompt refinement and edge-level orchestration.
The most recent work is converging on a powerful pattern: joint optimization of prompts and topology. Mass (February 2025) demonstrates that both prompts and topologies significantly impact performance, and joint optimization outperforms optimizing either alone. MasRouter (ACL 2025) formalizes Multi-Agent System Routing as a unified framework, achieving 1.8–8.2% improvement over SOTA with up to 52% overhead reduction. MasHost (June 2025) is the first RL-driven framework for fully autonomous MAS graph construction.
Empirical topology comparisons reveal clear trade-offs
MultiAgentBench systematically compared four topologies and found that a graph-mesh (fully decentralized) topology yields the best task scores and planning efficiency, while the tree topology is least effective. The MAMA framework tested six topologies for memory leakage and found that the chain topology provides the strongest privacy protection while complete graphs show the highest leakage—dense connectivity is systematically more vulnerable.
The synthesized trade-off picture (the edge-count sketch after this list makes the overhead differences concrete):
- Hub-and-spoke: Simple, fast to ship, easy governance. But single point of failure and head-of-line blocking. Best for predictable, audit-heavy workflows.
- Hierarchical tree: Controlled parallelism, clear delegation. But rigid, poor adaptability, highest overhead. Least effective in benchmarks.
- Sequential chain: Best privacy, supports dependency chaining. But sequential bottleneck, limited parallelism.
- Mesh/complete: Best task performance, highest bandwidth. But quadratic communication overhead and privacy risk.
- Dynamic/adaptive: Task-adaptive, token-efficient, robust. More complex to implement but dominant when achievable.
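The link-count arithmetic behind these trade-offs is worth making explicit. The quick sketch below simply counts communication edges per topology and shows where the quadratic blow-up comes from.

```python
def edge_count(topology: str, n: int) -> int:
    """Communication links needed to wire n agents in a given topology."""
    return {
        "hub_and_spoke": n - 1,        # every spoke talks only to the hub
        "chain": n - 1,                # each agent talks to its successor
        "tree": n - 1,                 # parent-child links only
        "complete": n * (n - 1) // 2,  # every pair of agents connected
    }[topology]

for topo in ("hub_and_spoke", "chain", "tree", "complete"):
    print(f"{topo:>14}:", [edge_count(topo, n) for n in (4, 8, 16)])
# complete graphs need 6, 28, 120 links, and every link carries tokens in both directions
```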
A position paper from 2025 (“Topological Structure Learning Should Be A Research Priority for LLM-Based MAS”) advocates treating topology design as a first-class research priority and proposes SPAN (Structure-Profiling Agent Network), which factors edge decisions as probabilities, keeping search linear rather than exponential.
5. The academic foundations connect entropy to topology optimization
Entropy-based metrics already exist for agent communication
The most directly relevant paper for the blog’s thesis is “Network Topology and Information Efficiency of Multi-Agent Systems” (2025), which introduces the Information Entropy Efficiency Index (IEI) and **Specialization Efficiency Index (SEI)**—novel metrics quantifying message compactness and diversity. Lower IEI values indicate more concise, efficient information encoding. Critically, integrating IEI/SEI into training objectives accelerates policy convergence. A companion paper proposes three Communication Efficiency Metrics (CEMs): IEI, SEI, and Traffic Efficiency Index (TEI), providing a practical entropy-based evaluation framework.
Information bottleneck optimizes what agents communicate
Wang et al. (ICML 2020) applied the information bottleneck principle to MARL communication under bandwidth constraints, enforcing an upper bound on mutual information between messages and internal features. This forces agents to compress communication to task-relevant information only and outperforms methods without information-theoretic constraints. Ding et al. (IEEE TPAMI 2023) extended this to graph information bottleneck (GIB), learning minimal sufficient message representations that maximize MI with optimal actions while minimizing dependence on vulnerable features—enabling multi-layer sparse communication graphs. The latest work (February 2026) combines IB with vector quantization, achieving 181.8% performance improvement over no-communication baselines while reducing bandwidth by 41.4%.
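In practice, the IB constraint usually enters training as a variational regularizer on the message channel. The sketch below is a generic variational-IB loss in PyTorch, not the specific objective from Wang et al. or Ding et al.: the task term keeps I(message; action) high while the KL term caps I(message; observation).

```python
import torch
import torch.nn.functional as F

def ib_message_loss(msg_mu: torch.Tensor,
                    msg_logvar: torch.Tensor,
                    action_logits: torch.Tensor,
                    actions: torch.Tensor,
                    beta: float = 1e-3) -> torch.Tensor:
    """Task loss + beta * rate: compress messages while keeping them useful."""
    task_loss = F.cross_entropy(action_logits, actions)        # preserve I(message; action)
    # KL( N(mu, sigma^2) || N(0, I) ) upper-bounds the message rate I(message; observation)
    rate = 0.5 * (msg_mu.pow(2) + msg_logvar.exp() - msg_logvar - 1.0).sum(dim=-1).mean()
    return task_loss + beta * rate
```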
Free energy and active inference provide an alternative lens
The “Orchestrator: Active Inference for Multi-Agent Systems” (2025) frames coordination through variational free energy minimization: agents maximize expected information gain while offsetting coordination and navigation costs, measuring epistemic uncertainty through information entropy between consecutive states. A companion paper integrates active inference as a cognitive layer above LLM agents, dynamically adjusting strategies through principled information-seeking behavior. The “Factorised Active Inference” paper (AAMAS 2025) extends this to game-theoretic settings, connecting free energy minimization with Nash equilibria and bounded rationality—agents balance utility maximization with information-processing costs (entropy).
Scaling laws quantify coordination overhead
Kim et al. (Google Research/MIT/DeepMind, December 2025) provide the most rigorous scaling analysis: 180 agent configurations across 5 architectures, 3 LLM families, 4 benchmarks. Centralized coordination improves parallelizable tasks by +80.9% but degrades sequential tasks by −39% to −70%. Independent agents amplify errors 17.2× vs. centralized at 4.4×. A critical finding: above ~45% single-agent accuracy, coordination actually hurts. Their predictive model achieves 87% accuracy for optimal architecture selection. Total reasoning turns show power-law growth with agent count.
Riedl (2025) introduces an information-theoretic framework using partial information decomposition (PID) of time-delayed mutual information to test whether multi-agent LLM systems exhibit higher-order structure. The approach distinguishes spurious temporal coupling from performance-relevant cross-agent synergy, providing specific design principles for productive LLM collectives.
Emergent communication research validates the entropy framework
The emergent communication literature provides theoretical validation. Tucker et al. (NeurIPS 2022) showed that trading off utility, informativeness, and complexity via the Vector-Quantized Variational Information Bottleneck produces communication efficiency mirroring human language evolution pressures. Karten et al. (2023) used IB to capture referential complexity and task-specific utility in emergent protocols, demonstrating that information-theoretic constraints improve message compression. The fundamental insight: human language itself evolved under information-theoretic pressures, and optimal agent communication converges on similar solutions.
Conclusion: An entropy framework for agent topology is both theoretically grounded and practically viable
The research converges on several concrete insights for the blog post’s proposed framework:
The DPI provides the mathematical backbone. Every agent in a chain is an information bottleneck that can only lose signal. This isn’t a design flaw—it’s a law. The framework should use this to argue for minimizing chain depth and maximizing information density at each transfer.
Mutual information is the right metric for agent communication quality. The Stanford Hazy Research paper empirically validates that MI between compressed and original context predicts downstream task performance (R² = 0.71), is task-agnostic, and can be measured without running the full pipeline. This is the blog’s proposed “entropy score” for each communication link.
Dynamic topology dramatically outperforms static topology. Systems like DyTopo, GTD, and G-Designer show 6–15% accuracy improvements and up to 95% token savings by adapting communication graphs per-task and per-round. The blog can frame this as: entropy-optimal topology is task-dependent, and the overhead of computing it is far less than the cost of suboptimal communication.
Sparse beats dense. Nearly all empirical evidence shows that moderately sparse topologies suppress error propagation while preserving beneficial information diffusion. The IEI/SEI metrics provide a concrete way to measure this—the blog can propose computing these at each communication step.
The compressor matters more than the predictor. Scaling worker/compressor agents from 1B to 7B improves accuracy by 60%, while scaling the predictor from 70B to 405B adds only 12%. This reframes the orchestration design question: invest in making each agent’s output maximally information-dense rather than making the orchestrator smarter.
Joint prompt-topology optimization is the frontier. Optimizing what agents say (prompts) and who they say it to (topology) jointly outperforms optimizing either alone. The blog’s proposed entropy framework should encompass both dimensions.
The field is moving fast—most of the key papers cited here are from 2024–2026, and the convergence of information theory with practical agent orchestration is just beginning. The blog post is well-timed to synthesize these threads into a coherent framework.