Recovery-path assertion stall class
Source-traced incident class observed on monad testnet · open archive · 2 confirmed cases · last seen 2026-05-18
What it is
A class of stall on a monad full node where monad-execution aborts with a
MONAD_ASSERT on the recovery path — specifically when monad-bft
is restarted while monad-execution remains alive, and BFT replays proposals
whose block_id is still in monad-execution's in-memory
block_cache window.
Two cases have been confirmed by independent operators so far. The trigger has been seen on at least monad v0.14.1 and v0.14.3, so this is a stable fragility of the recovery path, not a regression in a particular release.
Operator-visible signature
- monad-bft stops advancing rounds locally; the pacemaker keeps firing local timeouts at the configured cadence; rounds_total in any minute-level aggregate drops to 0 while local_timeouts stays at the round-timer floor.
- The chain advances without you. A reference RPC keeps producing blocks. The wedge is local to the affected node.
- BFT restart triggers the abort. When the operator restarts only monad-bft and leaves monad-execution running, monad-execution SIGABRTs within tens of seconds with a MONAD_ASSERT on the recovery path.
- Recovery requires a fresh on-disk state. Both confirmed cases resolved by landing on a clean snapshot (statesync or operator-driven workspace reset) rather than by re-running the same state.
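The first two items of the signature lend themselves to a mechanical check. What follows is a minimal C++ sketch under stated assumptions, not the tooling used on the affected nodes: the MinuteAggregate struct, the reference_blocks field and the looks_wedged function are hypothetical names introduced for illustration. Only the counter names rounds_total and local_timeouts come from the signature.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-minute aggregate. Only the counter names rounds_total and
// local_timeouts come from the signature; reference_blocks is an assumed
// extra input (blocks seen on a reference RPC during the same minute).
struct MinuteAggregate {
    std::uint64_t rounds_total;
    std::uint64_t local_timeouts;
    std::uint64_t reference_blocks;
};

// A minute matches the signature when the local node advanced no rounds,
// the pacemaker still fired local timeouts, and the reference chain moved.
bool minute_matches(MinuteAggregate const &m)
{
    return m.rounds_total == 0 && m.local_timeouts > 0 &&
           m.reference_blocks > 0;
}

// True when the most recent `min_minutes` aggregates all match the signature.
bool looks_wedged(
    std::vector<MinuteAggregate> const &window, std::size_t min_minutes)
{
    if (min_minutes == 0 || window.size() < min_minutes) {
        return false;
    }
    for (std::size_t i = window.size() - min_minutes; i < window.size(); ++i) {
        if (!minute_matches(window[i])) {
            return false;
        }
    }
    return true;
}
```

Requiring several consecutive minutes keeps a single slow minute from being flagged, and the reference_blocks condition keeps a chain-wide halt from being reported as a local wedge.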
Source trace
The assertion is in cmd/monad/runloop_monad.cpp in the public
category-labs/monad
repository. The relevant block at line 216 (as of the time of writing):
    MONAD_ASSERT(
        block_cache
            .emplace(
                block_id,
                BlockCacheEntry{...})
            .second);
block_cache is an unordered map keyed by the 32-byte
block_id. emplace().second == false means the key was
already in the map — i.e., the same block_id was proposed for execution twice.
The cache is bootstrap-filled on startup, sliding-window-pruned after
finalize, and lives for the lifetime of the monad-execution process.
When monad-bft is restarted independently, BFT re-establishes IPC to the
still-live monad-execution and replays proposals from its WAL / blocksync
catch-up. Some of those replayed proposals carry block_id values
already in monad-execution's block_cache window. Re-emplacing
triggers the assertion. SIGABRT, core dump, dead service.
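The mechanism described above can be reproduced in isolation with a standard std::unordered_map. This is an illustrative sketch, not the monad source: BlockId as a std::string, the one-field BlockCacheEntry, insert_proposal and first_duplicate are all hypothetical stand-ins (the real block_id is a 32-byte value, and the real entry type is elided in the excerpt).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real types, which this page does not show.
using BlockId = std::string;
struct BlockCacheEntry {
    std::uint64_t block_number;
};
using BlockCache = std::unordered_map<BlockId, BlockCacheEntry>;
using Proposal = std::pair<BlockId, std::uint64_t>;

// Returns true if the proposal was newly inserted. emplace() leaves the map
// unchanged and returns .second == false when the key already exists, which
// is the condition the quoted MONAD_ASSERT turns into an abort.
bool insert_proposal(BlockCache &cache, BlockId const &id, std::uint64_t number)
{
    return cache.emplace(id, BlockCacheEntry{number}).second;
}

// Replays proposals in order and returns the index of the first one whose
// block_id is already cached, or std::nullopt if every insert succeeds.
std::optional<std::size_t>
first_duplicate(BlockCache &cache, std::vector<Proposal> const &replayed)
{
    for (std::size_t i = 0; i < replayed.size(); ++i) {
        if (!insert_proposal(cache, replayed[i].first, replayed[i].second)) {
            return i;
        }
    }
    return std::nullopt;
}
```

In this model, a cache that already holds a block_id from before the BFT restart rejects the replayed copy. A runloop that asserts on the insert result has no path other than abort at that point.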
A second operator observed the same class on monad v0.14.1 on a different
line of the same file; the framing "recovery-path assertion in
runloop_monad.cpp" therefore covers multiple sites in the same
file rather than a single line.
Mitigations (in place on this node)
- systemd Restart=on-failure on both monad-execution.service and monad-bft.service via a drop-in:

      # /etc/systemd/system/monad-execution.service.d/restart.conf
      # StartLimitIntervalSec= and StartLimitBurst= are documented in
      # systemd.unit(5) as [Unit] settings, so they are placed there.
      [Unit]
      StartLimitBurst=3
      StartLimitIntervalSec=600

      [Service]
      Restart=on-failure
      RestartSec=15s

  Upstream ships Restart=no; this catches a SIGABRT of the recovery-path class and bounces execution automatically after 15 s. The StartLimitBurst guard avoids a tight crash-loop masking a deeper problem.
- journald retention bumped so the BFT debug trace survives the next incident. Defaults (SystemMaxFiles=100) can evict system journals within hours on a node with heavy RUST_LOG=debug output:

      # /etc/systemd/journald.conf.d/retention.conf
      [Journal]
      Storage=persistent
      SystemMaxUse=120G
      SystemMaxFiles=2000
      SystemMaxFileSize=256M
What is still unknown
- The proximate cause of Layer 1 (the initial wedge). Both confirmed cases fell out of consensus at an epoch boundary on the affected node without a corresponding chain-wide event. Pinning down the exact BFT-side reason (validator-set view divergence at the boundary, leader-election race, lost RaptorCast chunks for the proposal) requires the BFT debug-level trace from the pre-restart journal, which one of the two cases lost to the systemd journal eviction described above.
- Whether the two confirmed cases share Layer 1. Same recovery-path file class is confirmed; same proximate cause for the wedge is not.
Contribute an observation
If you have observed a SIGABRT of monad-execution on the recovery path after a monad-bft restart, regardless of monad version, please file an issue at github.com/rustemar/monad-ops with:
- monad package version installed at the time
- UTC timestamp of the initial wedge and of the SIGABRT
- The exact runloop_monad.cpp line on which the assertion fired (if available)
- Whether the node had been restarted recently or had multi-day uptime before the wedge
- How recovery was achieved (statesync, workspace reset, manual data wipe)
Operator identities and node-specific details remain private unless you explicitly opt in — only the anonymised case-class counters are surfaced back into this page.
Scope and limitations
- Operator-grade write-up. Source citations point at public files in the public monad repository at HEAD at the time of writing. Line numbers drift; the file-class framing is what survives.
- Anonymised by default. Operator identities and node addresses are kept out of the public counter. The class is a property of the codebase, not of any specific operator.
- Not Foundation guidance. Mitigations listed here are operator-driven defaults applied on a single full node. Treat as a starting point, not a recommendation from the protocol team.