Probing Dec-POMDP Reasoning in Cooperative MARL

1University of Edinburgh    2Politecnico di Milano
  Oral, AAMAS 2026
Visual overview of probing multi-agent policy behaviour under trained rollouts
Motivation

As more multi-agent environments are developed, it is important to know whether they test the intended Dec-POMDP properties (under widely used training algorithms). High returns can mask a failure to learn the underlying coordination challenge.

Research Question

Do modern cooperative MARL environments truly test the Dec-POMDP properties that make these problems hard, or do they permit success via strategies that bypass them?

TL;DR

Under IPPO/MAPPO rollouts, reactive (memoryless) policies match memory-based agents in over half of the scenarios we tested. Our probes also disentangle hidden state from hidden teammate information, and show that synchronous and temporal coordination mechanisms often dissociate across benchmarks.

Contributions
  1. Diagnostic framework. We introduce information-theoretic probes – measuring history dependence, private information flow, synchronous action coupling, and directed temporal influence – that audit whether learned policies actually exhibit Dec-POMDP reasoning, beyond what raw returns reveal.
  2. Systematic benchmark audit. We evaluate 37 scenarios across seven benchmark suites, revealing that history dependence is ubiquitous but rarely performance-critical, coordination structures vary qualitatively across domains, and few environments jointly test both partial observability and coordination.
  3. Open-source tooling and implications. We release diagnostic tools for researchers to audit their own environments, and discuss implications for designing tasks where partial observability and coordination are non-optional.

Abstract

Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. The Dec-POMDP framing motivates agents that use history to update hidden states and coordinate based on local information. Yet it remains unclear whether commonly trained policies on popular benchmark suites exhibit this reasoning or succeed via simpler strategies.

We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Under these baseline policy distributions, the learned behaviours often show no evidence that success depends on genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence.

These findings suggest that, under current training paradigms and baseline policies, some benchmark evaluations may not fully exercise core Dec-POMDP assumptions, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous policy-behaviour analysis and evaluation in cooperative MARL.

Methodology

  1. Train policies. Train IPPO/MAPPO FF and RNN policies in each benchmark scenario.
  2. Collect rollouts. Sample behaviour under the converged joint policy distribution, recording each agent's observations and actions (Oₜ₋₁, Aₜ₋₁, Oₜ, Aₜ for agents i and j).
  3. Compute probes. Measure information dependence between trajectory variables and actions (OAR, HAR, PIF, AA, DAI).
  4. Audit behaviours.

The information-theoretic diagnostics below audit different facets of Dec-POMDP-style behaviour under the trained policy distribution. They are diagnostic readings of sampled IPPO/MAPPO behaviours, not causal claims or worst-case/best-case properties of the environment.

(a) Observation–Action Relevance
OAR ≜ 𝕀(Oₜⁱ ; Aₜⁱ)
Source: current observation. Target: current action. No conditioning variables.
This probe asks whether the action is predictable from the current observation.
If high, suggests: actions are largely predictable from the current observation under these rollouts. This is consistent with more reactive behaviour, especially when history-based probes are low.
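
To make the OAR definition concrete, here is a minimal sketch on toy discrete data. It uses a simple count-based (plug-in) mutual information, which is an assumption made for illustration and not the kNN/KSG estimators used for the reported results; `plugin_mi` and the two toy policies are hypothetical names.

```python
import numpy as np


def plugin_mi(x, y):
    """Plug-in estimate of I(X; Y) in nats for two 1-D discrete arrays."""
    # Re-index arbitrary labels to 0..K-1 so they can address a count table.
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)   # joint counts
    joint /= joint.sum()              # joint probabilities
    # Product of the marginals, same shape as the joint table.
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    mask = joint > 0                  # empty cells contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / outer[mask])))


# Toy observations: four values, each appearing equally often.
obs = np.tile(np.arange(4), 250)

# Reactive toy policy: the action is a function of the current observation.
reactive_actions = obs % 2
# Observation-blind toy policy: both actions are equally frequent under every observation.
blind_actions = np.repeat(np.arange(2), 500)

oar_reactive = plugin_mi(obs, reactive_actions)
oar_blind = plugin_mi(obs, blind_actions)
print(f"OAR (reactive): {oar_reactive:.3f} nats, OAR (blind): {oar_blind:.3f} nats")
```

On this balanced toy data the reactive policy attains the maximum possible value for a binary action (log 2 nats), while the observation-blind policy gives zero.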
(b) History–Action Relevance
HAR ≜ 𝕀(Hₜⁱ ; Aₜⁱ | Oₜⁱ)
If high, suggests: history helps predict actions beyond the current observation; ΔMem is needed to assess performance utility.
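
The conditional form of HAR can be illustrated with a small sketch, again on toy discrete data. The assumptions here are loud ones: the history Hₜⁱ is truncated to the single previous observation, and the conditional mutual information is computed from plug-in entropies rather than with the kNN/KSG estimators. `joint_entropy` and `plugin_cmi` are hypothetical helper names.

```python
import numpy as np


def joint_entropy(*cols):
    """Plug-in Shannon entropy (nats) of the joint distribution of discrete columns."""
    rows = np.column_stack(cols)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def plugin_cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(x, y, z) - joint_entropy(z))


# Every (previous observation, current observation) pair appears equally often.
prev_obs = np.tile([0, 0, 1, 1], 100)   # stands in for the history H_t
curr_obs = np.tile([0, 1, 0, 1], 100)   # current observation O_t

memory_actions = prev_obs     # toy policy that copies the previous observation
reactive_actions = curr_obs   # toy policy that copies the current observation

har_memory = plugin_cmi(prev_obs, memory_actions, curr_obs)
har_reactive = plugin_cmi(prev_obs, reactive_actions, curr_obs)
print(f"HAR (memory): {har_memory:.3f} nats, HAR (reactive): {har_reactive:.3f} nats")
```

Conditioning on the current observation is what separates the two cases: the memory-using policy keeps log 2 nats of history dependence, whereas the reactive policy drops to zero.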
(c) Private Information Flow
PIFᵢ→ⱼ ≜ 𝕀((τₜ₋₁ⁱ, Oₜⁱ) ; Aₜʲ | (τₜ₋₁ʲ, Oₜʲ))
If high, suggests: i's trajectory adds predictive information about j's action beyond j's own history.
(d) Action–Action Coupling
AA ≜ 𝕀(Aₜⁱ ; Aₜʲ | Oₜⁱ, Oₜʲ)
If high, suggests: residual same-timestep action coupling after conditioning on both current observations, consistent with instantaneous conventions or symmetry breaking.
(e) Directed Action Information
DAIᵢ→ⱼ ≜ (1/T) ∑ₜ 𝕀(τₜ₋₁ⁱ ; Aₜʲ | τₜ₋₁ʲ)
If high, suggests: i's past carries additional predictive information about j's current action beyond j's own past.
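
The directional nature of DAI can be seen in a deliberately simplified sketch. The assumptions are substantial: histories are truncated to the previous action only, samples are pooled over time instead of averaging a per-timestep quantity, and a plug-in estimator replaces the kNN/KSG estimators. `lag1_dai` and the other helpers are hypothetical names, so treat this as a lag-1 proxy rather than the probe itself.

```python
import numpy as np


def joint_entropy(*cols):
    """Plug-in Shannon entropy (nats) of the joint distribution of discrete columns."""
    rows = np.column_stack(cols)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def plugin_cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(x, y, z) - joint_entropy(z))


def lag1_dai(src_actions, dst_actions):
    """Lag-1 proxy for DAI(src -> dst): I(A_src[t-1]; A_dst[t] | A_dst[t-1])."""
    return plugin_cmi(src_actions[:-1], dst_actions[1:], dst_actions[:-1])


rng = np.random.default_rng(0)
a_i = rng.integers(0, 2, size=20_000)    # agent i acts at random
a_j = np.concatenate(([0], a_i[:-1]))    # agent j repeats i's previous action

dai_i_to_j = lag1_dai(a_i, a_j)
dai_j_to_i = lag1_dai(a_j, a_i)
print(f"i -> j: {dai_i_to_j:.3f} nats, j -> i: {dai_j_to_i:.3f} nats")
```

Because agent j follows agent i with a one-step delay, the i → j value is close to log 2 nats while the j → i value is close to zero, which is the asymmetry a directed measure is meant to expose.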
Reading the probes
Together, the probes help localise where behavioural dependence appears.

High OAR + low HAR → actions are mostly predictable from the current observations, consistent with reactive policies.

High AA + low DAI → synchronous coupling without much temporal cross-agent signal, consistent with instantaneous conventions or symmetry breaking.

High PIF + high DAI → private teammate information and temporal dependence are both present in the sampled behaviours, consistent with Dec-POMDP reasoning.

See Figure 3 in the paper for benchmark-by-benchmark readings.
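
The three reading rules above can be written as a small lookup, sketched here under the assumption that each probe has already been reduced to a boolean "high" flag. The thresholds behind those flags are not reproduced here, and `read_probes` is a hypothetical helper, not part of the released library.

```python
def read_probes(oar_high, har_high, pif_high, aa_high, dai_high):
    """Return the qualitative readings that match a set of boolean probe flags."""
    readings = []
    if oar_high and not har_high:
        readings.append("reactive: actions mostly predictable from current observations")
    if aa_high and not dai_high:
        readings.append("synchronous coupling: instantaneous conventions or symmetry breaking")
    if pif_high and dai_high:
        readings.append("consistent with Dec-POMDP reasoning: private information and temporal dependence")
    return readings


# Example profile: high OAR and AA, low HAR, PIF and DAI.
example = read_probes(oar_high=True, har_high=False,
                      pif_high=False, aa_high=True, dai_high=False)
for line in example:
    print(line)
```

A profile can match more than one rule, or none; an empty list simply means none of the three patterns applies.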

Figure 2. Information-theoretic diagnostics. Colours: source, target, conditioning (dashed). Oₜⁱ = observation, Aₜⁱ = action, Hₜⁱ = local history, τₜ₋₁ⁱ = action–observation history up to t−1.

Results

Percentage of scenarios whose tested IPPO/MAPPO rollout behaviours are flagged by each diagnostic decision rule.

43% show memory-relevant gains
62% show private information flow
MPE has the most consistent signal across criteria
  1. History dependence ≠ utility. All policies show some history dependence, but memory yields performance gains in only 43% of scenarios.
  2. Hidden state vs. teammate info. Probes disentangle hidden state and hidden teammate information as separate difficulty drivers (e.g., Overcooked V1→V2).
  3. Coordination structure varies. Synchronous and temporal mechanisms dissociate across benchmarks.
  4. Few benchmarks jointly test partial observability and coordination. MPE is the only suite in which every scenario satisfies all diagnostic criteria. Most scenarios do not require meaningful history use for strong performance despite being framed as Dec-POMDP challenges. Many detected dependencies are above null baselines but still modest, leaving plenty of room to design better environments.

Design benchmarks where partial observability and decentralised coordination are non-optional for success. We provide metrics and an open-source library that can help with this. See the paper appendix for detailed diagnostic values, thresholds, null baselines, and confidence intervals.

Diagnostic question (probe) | MPE | SMAX V1 | SMAX V2 | MaBrax | Hanabi | OC V1 | OC V2
Is partial observability relevant? (ΔMem + HAR) | 100% (3/3) | 100% (9/9) | 100% (3/3) | 20% (1/5) | 0% (0/1) | 0% (0/5) | 0% (0/11)
Does teammate information help predict actions? (PIF) | 100% (3/3) | 67% (6/9) | 67% (2/3) | 40% (2/5) | 0% (0/1) | 20% (1/5) | 82% (9/11)
Is synchronous coordination detected? (AA) | 100% (3/3) | 44% (4/9) | 0% (0/3) | 80% (4/5) | 0% (0/1) | 100% (5/5) | 82% (9/11)
Is temporal coordination detected? (DAI) | 100% (3/3) | 67% (6/9) | 67% (2/3) | 80% (4/5) | 100% (1/1) | 40% (2/5) | 100% (11/11)

OC = Overcooked. Under the tested policy behaviours, MPE is the only suite where every scenario is flagged by all four diagnostic criteria. Full diagnostic tables are in the paper appendix.
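
The headline percentages follow directly from the per-suite counts in the table. The short tally below copies those counts and recomputes the overall fraction of flagged scenarios per criterion; the variable names are illustrative only.

```python
suites = ["MPE", "SMAX V1", "SMAX V2", "MaBrax", "Hanabi", "OC V1", "OC V2"]
n_scenarios = [3, 9, 3, 5, 1, 5, 11]

# Number of flagged scenarios per suite, copied from the table rows.
flagged = {
    "ΔMem + HAR": [3, 9, 3, 1, 0, 0, 0],
    "PIF":        [3, 6, 2, 2, 0, 1, 9],
    "AA":         [3, 4, 0, 4, 0, 5, 9],
    "DAI":        [3, 6, 2, 4, 1, 2, 11],
}

total = sum(n_scenarios)

# Overall percentage of scenarios flagged by each criterion.
overall = {name: round(100 * sum(counts) / total) for name, counts in flagged.items()}

# Suites in which every scenario is flagged by every criterion.
all_flagged = [
    suite for k, suite in enumerate(suites)
    if all(counts[k] == n_scenarios[k] for counts in flagged.values())
]

print(total, overall, all_flagged)
```

The tally gives 37 scenarios in total, 43% for ΔMem + HAR (16/37) and 62% for PIF (23/37), matching the summary figures, and MPE as the only suite flagged on all four criteria.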

Quickstart

Install and run all five diagnostics on your own trajectories:

pip install dec-pomdp-diagnostics

import dec_pomdp_diagnostics as dpd

data = dpd.UserData(
    observations  = {"agent_0": obs0, "agent_1": obs1},   # (N, obs_dim)
    actions       = {"agent_0": act0, "agent_1": act1},   # (N,) int
    timesteps     = {"agent_0": ts0,  "agent_1": ts1},    # (N,) int
    episode_ids   = {"agent_0": eps0, "agent_1": eps1},   # (N,) int
    hidden_states = {"agent_0": h0,   "agent_1": h1},     # optional, RNN only
    env_name="my_env", alg_name="IPPO_RNN", seed=0,
)

result = dpd.compute_diagnostics(data, history_k=3, null_reps=5)
print(result.describe())
# ✓  History dependence > null
# ✓  Does teammate information help predict actions?
# ✗  Does synchronous coordination emerge?
# ✓  Does temporal coordination emerge?

Challenges and Limitations

Policy-dependent probes

All diagnostics are expectations under the converged joint policy p_π and therefore characterise learned behaviour under IPPO/MAPPO with FF/RNN architectures, not worst-case or best-case properties of the environment. This is deliberate, as we probe behaviours induced by widely used algorithms; however, stronger or weaker algorithms may yield different diagnostic profiles for the same scenario.

Estimation noise

Our MI/CMI/DI estimators (kNN and KSG) are biased in finite samples, especially with long histories or large action spaces. We mitigate this via permutation null baselines that account for estimator-specific bias, and report bootstrap confidence intervals throughout. Nonetheless, these probes are diagnostic tools, not hard pass/fail filters, and borderline cases should be interpreted with caution.
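
The two mitigation steps can be sketched for plain mutual information on toy discrete data. This is an assumption-laden illustration: it uses a plug-in estimator instead of kNN/KSG, an unconditional permutation of the actions, and a simple percentile bootstrap. `permutation_null` and `bootstrap_ci` are hypothetical names and do not describe the library's internals.

```python
import numpy as np


def plugin_mi(x, y):
    """Plug-in estimate of I(X; Y) in nats for two 1-D discrete arrays."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / outer[mask])))


def permutation_null(x, y, n_null, rng):
    """MI after shuffling y, which breaks dependence but keeps both marginals."""
    return np.array([plugin_mi(x, rng.permutation(y)) for _ in range(n_null)])


def bootstrap_ci(x, y, n_boot, rng, level=0.95):
    """Percentile bootstrap interval for the MI estimate, resampling (x, y) pairs."""
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        stats.append(plugin_mi(x[idx], y[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)


rng = np.random.default_rng(0)
obs = rng.integers(0, 4, size=2000)
act = obs % 2                               # toy reactive policy

observed = plugin_mi(obs, act)
null = permutation_null(obs, act, n_null=20, rng=rng)
excess = observed - null.mean()             # estimate relative to the null baseline
ci_low, ci_high = bootstrap_ci(obs, act, n_boot=200, rng=rng)
print(f"observed={observed:.3f}, null mean={null.mean():.4f}, CI=({ci_low:.3f}, {ci_high:.3f})")
```

The null values are small but not exactly zero, which is the finite-sample bias the baseline is meant to absorb; comparing the estimate and its interval against that baseline is more informative than comparing against zero.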

BibTeX

@inproceedings{tessera2026probing,
  title={Probing Dec-{POMDP} Reasoning in Cooperative {MARL}},
  author={{Kale-ab} Abebe Tessera and Leonard Hinckeldey and Riccardo Zamboni and David Abel and Amos Storkey},
  booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems},
  year={2026},
  url={https://openreview.net/forum?id=gSK8tR7du3}
}