monad-ops · recovery-path assertion stall class

What it is

A class of stall on a monad full node where monad-execution aborts with a MONAD_ASSERT on the recovery path — specifically when monad-bft is restarted while monad-execution remains alive, and BFT replays proposals whose block_id is still in monad-execution's in-memory block_cache window.

Two cases have been confirmed by independent operators so far. The trigger has been observed on at least monad v0.14.1 and v0.14.3, which points to a persistent fragility of the recovery path rather than a regression in a particular release.

Operator-visible signature

monad-bft stops advancing rounds locally; the pacemaker keeps firing local timeouts at the configured cadence; rounds_total in any minute-level aggregate drops to 0 while local_timeouts stays at the round-timer floor.
The chain advances without you. A reference RPC keeps producing blocks. The wedge is local to the affected node.
BFT restart triggers the abort. When the operator restarts only monad-bft and leaves monad-execution running, monad-execution SIGABRTs within tens of seconds with a MONAD_ASSERT on the recovery path.
Recovery requires a fresh on-disk state. Both confirmed cases were resolved by landing on a clean snapshot (statesync or operator-driven workspace reset) rather than by restarting on the same on-disk state.
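The first two signature items can be combined into a single check. The following is a minimal sketch, assuming you already collect per-minute counters. The struct `MinuteSample`, the field `reference_height_delta`, and the function `looks_wedged` are hypothetical names for illustration and are not part of monad or monad-ops; only `rounds_total` and `local_timeouts` come from the text above.

```cpp
#include <cstdint>

// Hypothetical one-minute aggregate. rounds_total and local_timeouts mirror
// the counters named above; reference_height_delta is an assumed field
// holding how many blocks a reference RPC advanced in the same minute.
struct MinuteSample {
    std::uint64_t rounds_total;
    std::uint64_t local_timeouts;
    std::uint64_t reference_height_delta;
};

// True when the sample matches the signature: no local rounds, the
// pacemaker still timing out, and the chain advancing elsewhere.
bool looks_wedged(const MinuteSample& s) {
    return s.rounds_total == 0
        && s.local_timeouts > 0
        && s.reference_height_delta > 0;
}
```

Requiring the reference RPC to advance keeps a chain-wide halt from being misread as a local wedge. In practice you would probably require several consecutive matching minutes before alerting.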

Source trace

The assertion is in cmd/monad/runloop_monad.cpp in the public category-labs/monad repository. The relevant block, at line 216 as of the time of writing, is:

MONAD_ASSERT(
    block_cache
        .emplace(
            block_id,
            BlockCacheEntry{...})
        .second);

block_cache is an unordered map keyed by the 32-byte block_id. emplace().second == false means the key was already in the map — i.e., the same block_id was proposed for execution twice. The cache is bootstrap-filled on startup, sliding-window-pruned after finalize, and lives for the lifetime of the monad-execution process.
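The `emplace().second` behaviour is standard `std::unordered_map` semantics and can be reproduced in isolation. The sketch below uses stand-in types: `BlockId` as a string, `Entry`, and the helper `try_cache` are hypothetical simplifications, not the real 32-byte key, `BlockCacheEntry`, or any function in the monad source.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in types: the real key is a 32-byte block_id and the real value is
// BlockCacheEntry; both are simplified here for illustration.
using BlockId = std::string;
struct Entry { std::uint64_t block_number; };

// Returns true if block_id was newly inserted, false if it was already
// cached. The excerpt above wraps the equivalent result in MONAD_ASSERT.
bool try_cache(std::unordered_map<BlockId, Entry>& cache,
               const BlockId& block_id, std::uint64_t block_number) {
    return cache.emplace(block_id, Entry{block_number}).second;
}
```

Calling `try_cache` twice with the same id returns `true` and then `false`; the second call leaves the map unchanged. That `false` is the condition the assertion turns into an abort.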

When monad-bft is restarted independently, BFT re-establishes IPC to the still-live monad-execution and replays proposals from its WAL / blocksync catch-up. Some of those replayed proposals carry block_id values already in monad-execution's block_cache window. Re-emplacing triggers the assertion. SIGABRT, core dump, dead service.
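The replay scenario can be modelled with a toy function. This is a sketch under stated assumptions, not the real runloop: `replay_collisions` is a hypothetical helper, block ids are plain strings, and the cache window is reduced to a set of ids.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model: `cached` is the set of block_ids still inside the
// execution process's cache window; `replayed` is what a restarted BFT
// process re-sends, in order. Returns the ids that would fail a
// "must be newly inserted" check.
std::vector<std::string> replay_collisions(
    std::unordered_set<std::string> cached,  // copied: the model mutates it
    const std::vector<std::string>& replayed) {
    std::vector<std::string> collisions;
    for (const auto& id : replayed) {
        if (!cached.insert(id).second) {
            collisions.push_back(id);
        }
    }
    return collisions;
}
```

With a window of `b1, b2, b3` and a replay of `b2, b3, b4`, the model reports `b2` and `b3`. The real process would not get that far: per the trace above, it aborts on the first collision.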

A second operator observed the same class on monad v0.14.1 on a different line of the same file; the framing "recovery-path assertion in runloop_monad.cpp" therefore covers multiple sites in the same file rather than a single line.

Mitigations (in place on this node)

systemd Restart=on-failure on both monad-execution.service and monad-bft.service via a drop-in:
```
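# /etc/systemd/system/monad-execution.service.d/restart.conf
# Start-rate limiting is configured in [Unit] (see systemd.unit(5)),
# so the StartLimit* keys live there rather than under [Service].
[Unit]
StartLimitBurst=3
StartLimitIntervalSec=600

[Service]
Restart=on-failure
RestartSec=15s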
```
Upstream ships Restart=no; the drop-in catches a SIGABRT of the recovery-path class and restarts execution automatically after a 15 s delay. The start-limit guard (at most 3 starts per 600 s) keeps a tight crash loop from masking a deeper problem.
journald retention bumped so the BFT debug trace survives the next incident. Defaults (SystemMaxFiles=100) can evict system journals within hours on a node with heavy RUST_LOG=debug output:
```
# /etc/systemd/journald.conf.d/retention.conf
[Journal]
Storage=persistent
SystemMaxUse=120G
SystemMaxFiles=2000
SystemMaxFileSize=256M
```
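To see why the cap matters, a back-of-the-envelope retention estimate helps. The sketch below is plain arithmetic; `retention_hours` is a hypothetical helper, and the write rate is something you would have to measure on your own node.

```cpp
#include <cstdint>

// Rough retention estimate: hours of journal history that fit under a size
// cap at a steady write rate. Both inputs are operator-supplied; the figures
// in the note below are illustrative, not measurements.
double retention_hours(std::uint64_t cap_bytes, std::uint64_t bytes_per_hour) {
    if (bytes_per_hour == 0) {
        return 0.0;  // no rate given: report 0 rather than divide by zero
    }
    return static_cast<double>(cap_bytes) / static_cast<double>(bytes_per_hour);
}
```

At an assumed 1 GiB per hour of debug output, a 4 GiB cap (the ceiling journald.conf documents for the default SystemMaxUse) holds about 4 hours, while the 120G cap above holds about 120 hours. Substitute your measured rate before relying on either number.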

What is still unknown

The proximate cause of Layer 1, meaning the initial local wedge that precedes the BFT restart (the assertion abort is the second layer). Both confirmed cases fell out of consensus at an epoch boundary on the affected node without a corresponding chain-wide event. Identifying the exact BFT-side reason (validator-set view divergence at the boundary, leader-election race, lost RaptorCast chunks for the proposal) requires the BFT debug-level trace from the pre-restart journal, which one of the two cases lost to the systemd journal eviction described above.
Whether the two confirmed cases share Layer 1. Same recovery-path file class is confirmed; same proximate cause for the wedge is not.

Contribute an observation

If you have observed a SIGABRT of monad-execution on the recovery path after a monad-bft restart, regardless of monad version, please file an issue at github.com/rustemar/monad-ops with:

monad package version installed at the time
UTC timestamp of the initial wedge and of the SIGABRT
The exact runloop_monad.cpp line on which the assertion fired (if available)
Whether the node had been restarted recently or had multi-day uptime before the wedge
How recovery was achieved (statesync, workspace reset, manual data wipe)

Operator identities and node-specific details remain private unless you explicitly opt in — only the anonymised case-class counters are surfaced back into this page.

Scope and limitations

Operator-grade write-up. Source citations point at public files in the public monad repository at HEAD at the time of writing. Line numbers drift; the file-class framing is what survives.
Anonymised by default. Operator identities and node addresses are kept out of the public counter. The class is a property of the codebase, not of any specific operator.
Not Foundation guidance. Mitigations listed here are operator-driven defaults applied on a single full node. Treat as a starting point, not a recommendation from the protocol team.