Performance

mado’s performance story is a single trade: reads are lazy, so the first read of un-cached content pays a fetch, and everything after it is a local cache hit. This page explains the two mechanisms that make cold reads fast — locality packs and the disk cache — and gives honest cold-vs-warm expectations, citing the measured numbers by their source section and date.

All numbers below are copied verbatim from the “as measured” sections of docs/design/performance-testing.md and docs/design/locality-pack-activation.md. They are point measurements on specific hardware (largely an Intel N100) with injected network latency, not guarantees. Where a number does not exist, this page says so rather than estimating.

The cost model

Cold read: the first read of a file whose content is not in the local disk cache. It costs one or more object-store round trips (GETs). Over a remote backend with round-trip time (RTT) r, a naive per-file cold walk of N files costs roughly N × r.
Warm read: the content is already in the disk cache, so the read is local and touches no network.

The two levers below both attack the cold case: locality packs cut the GET count (many files in one ranged fetch), and the disk cache ensures each piece of content pays the cold cost at most once (by default it is never evicted).
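To make the N × r estimate concrete, here is a minimal sketch of a round-trip-only model. The function name `cold_walk_seconds` and the 1000-file, 20 ms inputs are illustrative assumptions, not mado code or measurements. The model ignores bandwidth, decoding, and explode time, so it gives a lower bound rather than a prediction of the measured numbers on this page.

```python
# Round-trip-only model of a cold walk: wall-clock is at least GET count x RTT.
# Bandwidth, decode, and explode costs are deliberately left out.

def cold_walk_seconds(gets, rtt_ms):
    """Seconds spent purely waiting on sequential round trips."""
    return gets * rtt_ms / 1000.0

n_files = 1000   # files in the walk (illustrative)
rtt_ms = 20.0    # round-trip time to the object store (illustrative)

per_file = cold_walk_seconds(n_files, rtt_ms)  # one GET per file
packed = cold_walk_seconds(1, rtt_ms)          # one ranged GET for a whole pack
warm = cold_walk_seconds(0, rtt_ms)            # disk-cache hits issue no GETs

print(f"per-file cold: {per_file:.2f} s")
print(f"packed cold:   {packed:.2f} s")
print(f"warm:          {warm:.2f} s")
```

The gap between the per-file and packed lines is the part of the cold cost that scales with RTT; everything the model leaves out is paid on both paths.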

Locality packs

A locality pack (MDLP on disk) bundles a subtree’s small inline files into one content-addressed blob. Faulting any file in that subtree fetches the whole pack in one ranged GET and explodes it into the disk cache, instead of one GET per file. This is the mechanism that makes a cold walk of a node_modules- or nixpkgs-shaped tree fast.
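The GET-count effect can be illustrated with a toy in-memory model. This is a sketch under stated assumptions: `CountingRemote`, `cold_walk`, and the list-of-members "pack" are hypothetical and do not reflect the real MDLP layout, and `blake2b` stands in for the content hash only because it ships with the Python standard library.

```python
import hashlib

def content_id(data):
    # Stand-in content hash for this sketch (not the hash mado uses).
    return hashlib.blake2b(data, digest_size=32).hexdigest()

class CountingRemote:
    """In-memory object store that counts GETs."""
    def __init__(self):
        self.objects = {}
        self.gets = 0

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        self.gets += 1
        return self.objects[key]

def cold_walk(remote, file_ids, pack_key=None):
    """Read every file once through an empty cache; return the GETs issued."""
    cache = {}
    start = remote.gets
    for fid in file_ids:
        if fid not in cache and pack_key is not None:
            # First fault: fetch the whole pack once and explode every
            # member into the cache under its own content id.
            for member in remote.get(pack_key):
                cache[content_id(member)] = member
            pack_key = None  # the pack is fetched at most once
        if fid not in cache:
            cache[fid] = remote.get(fid)  # per-file fallback path
    return remote.gets - start

files = [f"file body {i}".encode() for i in range(300)]
ids = [content_id(body) for body in files]

remote = CountingRemote()
for fid, body in zip(ids, files):
    remote.put(fid, body)
remote.put("toy-pack", files)  # hypothetical pack: one object holding all members

gets_unpacked = cold_walk(remote, ids)
gets_packed = cold_walk(remote, ids, pack_key="toy-pack")
print(gets_unpacked, gets_packed)
```

Because the fallback branch stays in place, a missing or swept pack only changes the GET count, never the bytes returned.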

Building packs

mado pack                     # per-directory (v1) packs for @'s tree
mado pack --recursive         # recursive-subtree (v2) packs — the better default
mado pack -r <revset>         # pack a specific revision's tree

mado pack (per-directory, v1) bundles each directory’s dense inline files. It wins on genuinely per-directory-dense shapes (vendor node_modules), but on thin, spread-out trees it can be worse than no packs (each dense subdir pays a pack GET on top of per-file faults — measured below).
mado pack --recursive (v2, the covering index) bundles a whole thin subtree’s inline files across its many small directories into one pack root. This is the measured-best default for real repository shapes. The daemon loads either index version transparently.

Packs are a reclaimable cache, never a GC root: sweeping them only makes reads fall back to the per-file path. Building is idempotent — an unchanged tree re-uploads nothing.

Zero-config discovery (import-time publishing)

You usually do not have to run mado pack by hand. mado import-git builds and publishes the recursive pack index for each imported tip by default (--no-packs opts out). It does this without any extra remote traffic: the import just wrote every tree and file object, so the pack builder reads that content warm-local — it never issues a remote GET the pack exists to eliminate (the “never build by re-fetching” invariant).

Discovery is content-addressed via a well-known derived pointer (MDPW, keyed by the tree id), so a fresh mount finds the packs with no local file and no manual step. A mado pack run publishes this pointer too (while still writing a local pointer for offline use); the daemon attach path falls back from the local pointer to the derived one.

As measured (locality-pack-activation.md §“Follow-ups” item 2, 2026-07-05): import-time recursive-index build measured 0.15 s for a 2000-file tree — the import just wrote the content, so the builder reads warm-local. A fresh client holding only the tree id discovers and warms the subtree in ≤ 1 GET.

Observing pack state

An unpacked large tree silently falls back to per-file cold reads. mado pack-status makes that observable instead of surfacing only as mysterious slowness:

mado pack-status

It queries the running mount daemon over the control socket and reports either that a prefetcher is attached (mode per-dir / recursive, index version, directory-entry count, and whether discovery came via local pointer or via derived pointer) or the distinct reason it is not (no index built, index for a different tree, index swept, undecodable, …). At attach time, a large tree served without packs also emits an advisory: “cold reads are per-file — run mado pack --recursive.” mado pack-status requires a live mado mount / serve-virtiofs in the workspace.

As measured

Recursive vs per-directory on nixpkgs (locality-pack-activation.md §“Recursive-subtree packs as built”, 2026-07-04, cap 64 MiB): 10 roots → 10 packs, 72.7 MiB (the whole pkgs subtree — 46 134 members — fits ONE 62.4 MiB pack). Modeled whole-tree cold walk 50 747 → 1 958 GETs (−96.1 %). Real FUSE cells at RTT 20 ms, off / per-dir / recursive:

Subtree	off	per-dir	recursive
`nixos/tests` (dense, 1000-file walk)	998 GETs / 24.9 s	176 GETs / 5.5 s	2 GETs / 1.4 s
`pkgs/development/python-modules` (thin)	1000 GETs / 25.4 s	1936 GETs / 47.8 s	2 GETs / 10.4 s

The thin-subtree row is the key honesty point: per-directory packing is worse than no packs on a thin tree (dense subdirs each pay a pack GET on top of the per-file faults), while recursive beats no packs by 2.4× (25.4 s → 10.4 s) even including the one-time 46k-member explode.

Real-tree census (performance-testing.md §“Real-tree study as measured”, 2026-07-04, full nixpkgs — 52 695 files / 37 086 dirs, avg 1.42 files/dir): only 76 dirs exceed the per-directory trigger, covering just 8.75 % of inline files — the hard ceiling on what per-directory packing can do on a thin tree, and the reason recursive packs exist.

Chunked-file coverage (locality-pack-activation.md §“Follow-ups” item 3, 2026-07-05): MDLP v2 packs a chunked file’s manifest and chunks as typed members. On nixpkgs, 1 945 of 1 948 chunked files now pack (3 giant files stay per-file), taking the whole-tree cold-walk model 1 958 → 787 GETs.

Streamed explode (locality-pack-activation.md §“Follow-ups” item 1, 2026-07-04): the first faulting read now waits only for the pack GET, not the whole explode. Time-to-first-file dropped 954 → 19 ms (RTT 0), 7195 → 57 ms (RTT 20), 16222 → 83 ms (RTT 50), at an unchanged 1-GET budget.

Live FUSE trigger path (performance-testing.md §“Cold-fill fsync + explode serialization levers”, 2026-07-04): a cold 300-file dense subtree walk through the live daemon (trigger-driven prefetch, pack RTT inside the timed walk), median of 5, after the fsync + explode-concurrency fixes:

RTT (ms)	prefetch off, cold	trigger, cold
0	703 ms	222 ms
20	7030 ms	262 ms
50	16068 ms	283 ms

The trigger path serves the whole cold walk in exactly 1 remote GET at every RTT (asserted); its wall-clock is roughly flat in RTT (one round trip) while the per-file path scales as N × RTT — ~60× faster than per-file at 50 ms.
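The ratios behind that comparison can be recomputed directly from the table above. The snippet below is plain arithmetic on the three measured rows; only the variable names are introduced here.

```python
# Measured medians from the table above: RTT (ms) -> (prefetch off, trigger), in ms.
measured = {0: (703, 222), 20: (7030, 262), 50: (16068, 283)}

# Speedup of the trigger path over the per-file path at each RTT.
speedups = {rtt: off / trig for rtt, (off, trig) in measured.items()}
for rtt, s in speedups.items():
    print(f"RTT {rtt:>2} ms: trigger path is {s:.1f}x faster")

# How much each path slows down going from RTT 0 to RTT 50.
off_growth = measured[50][0] / measured[0][0]
trigger_growth = measured[50][1] / measured[0][1]
print(f"per-file path grows {off_growth:.1f}x, trigger path grows {trigger_growth:.2f}x")
```

The growth figures make the "flat in RTT" claim checkable: the per-file path slows by more than an order of magnitude over the same RTT range in which the trigger path changes by well under 2×.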

The disk cache

Every workspace has a persistent, content-addressed on-disk blob cache at .jj/working_copy/blob-cache/. Every process of the workspace — the CLI, the mount daemon, across restarts — reads through it. It is:

content-addressed (keyed by BLAKE3 hash), so a cached blob is never stale;
never invalidated, and never evicted by default — it grows with everything materialized or read;
safe to delete at any time (an evicted key simply re-fetches).

Cached fills for value-hashed content are written without a per-blob fsync, protected instead by verify-on-read + self-heal: every local hit is checked (blake3(value) == key.hash); a torn or bit-rotted file is deleted and the read falls through to re-fetch a good copy. This is strictly stronger than fsync (fsync never caught bit rot). The one documented edge is that a local-only exploded pack member torn by a power crash reads as absent until the next mount’s trigger re-explodes the pack.
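A minimal sketch of the verify-on-read + self-heal idea follows. It is not mado's implementation: `read_verified`, the flat one-file-per-key cache layout, and the `fetch` callback are hypothetical, and `blake2b` stands in for BLAKE3 because BLAKE3 is not in the Python standard library.

```python
import hashlib
import os
import tempfile

def content_hash(data):
    # Stand-in for blake3(value) in this sketch.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

def read_verified(cache_dir, key, fetch):
    """Return the blob for `key`, verifying every local hit against its hash."""
    path = os.path.join(cache_dir, key)
    try:
        with open(path, "rb") as f:
            data = f.read()
        if content_hash(data) == key:
            return data           # verified local hit
        os.remove(path)           # torn or bit-rotted: drop it and re-fetch
    except FileNotFoundError:
        pass                      # plain cache miss
    data = fetch(key)
    if content_hash(data) != key:
        raise ValueError("fetched content does not match its key")
    with open(path, "wb") as f:
        f.write(data)             # note: no fsync; the next read re-verifies
    return data

with tempfile.TemporaryDirectory() as cache_dir:
    blob = b"hello, cache"
    key = content_hash(blob)
    fetches = []

    def fetch(k):
        fetches.append(k)         # record each simulated remote GET
        return blob

    read_verified(cache_dir, key, fetch)   # cold: one fetch
    read_verified(cache_dir, key, fetch)   # warm: verified local hit
    with open(os.path.join(cache_dir, key), "wb") as f:
        f.write(b"torn")                   # simulate a torn write
    healed = read_verified(cache_dir, key, fetch)  # detects damage, re-fetches
    print(len(fetches), healed == blob)
```

The design point is that correctness rests on the hash check at read time rather than on write ordering, which is why the write path can skip the per-blob fsync.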

Bounding the cache

By default the cache is unbounded. Set MADO_DISK_CACHE_MAX_BYTES=<bytes> to cap it; the tier then evicts least-recently-used blobs to stay at or below the cap. The cap is read once at client-stack construction, so it applies uniformly to the mount daemon, serve-virtiofs, and the CLI backends. 0, empty, or unparseable keeps the unbounded default.
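The parsing rule can be sketched as follows. The helper `disk_cache_cap` is hypothetical, and how the real parser treats whitespace, signs, or unit suffixes is not documented here; this sketch assumes all of those count as unparseable.

```python
import os

def disk_cache_cap(environ=os.environ):
    """Return the cache cap in bytes, or None for the unbounded default."""
    raw = environ.get("MADO_DISK_CACHE_MAX_BYTES", "")
    # Assumption: only plain ASCII decimal digits are accepted.
    if not (raw.isascii() and raw.isdigit()):
        return None               # unset, empty, or unparseable
    cap = int(raw)
    return cap if cap > 0 else None   # 0 keeps the unbounded default

print(disk_cache_cap({"MADO_DISK_CACHE_MAX_BYTES": "1073741824"}))
print(disk_cache_cap({"MADO_DISK_CACHE_MAX_BYTES": "0"}))
print(disk_cache_cap({"MADO_DISK_CACHE_MAX_BYTES": "10GB"}))
print(disk_cache_cap({}))
```

Reading the value once, as the text describes, means a change to the variable only takes effect for processes started afterwards.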

As measured (performance-testing.md §“E3 cache-size curves”, 2026-07-04): the cache-size curve is a cliff, not a slope. Re-reading a working set W with cap C, pass-2 remote GETs are exactly N at C = 0.5 W (classic LRU sequential-scan thrash) and exactly 0 at C ≥ W. A single oversized blob is stored regardless of the cap. Guidance: keep the default unbounded; when you do bound it, size the cap at or above the working set — a graceful sub-W default would need a scan-resistant eviction policy, which is not warranted today.
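The cliff is a general property of LRU under a sequential scan, and a small simulation reproduces it. This sketch assumes unit-sized blobs and a cap expressed as a blob count; `pass2_gets` is a hypothetical helper, not part of mado, and the oversized-blob exception mentioned above is not modeled.

```python
from collections import OrderedDict

def pass2_gets(n_blobs, cap_blobs):
    """Scan n_blobs unit-sized blobs twice through an LRU cache holding at most
    cap_blobs entries; return the remote GETs issued by the second pass."""
    cache = OrderedDict()

    def read(key):
        if key in cache:
            cache.move_to_end(key)       # hit: mark as most recently used
            return 0
        cache[key] = True                # miss: one remote GET, then fill
        while len(cache) > cap_blobs:
            cache.popitem(last=False)    # evict the least recently used entry
        return 1

    for key in range(n_blobs):           # pass 1 fills the cache
        read(key)
    return sum(read(key) for key in range(n_blobs))  # pass 2 is what we count

for cap in (50, 99, 100, 150):
    print(f"cap {cap:>3} of 100 blobs -> pass-2 GETs: {pass2_gets(100, cap)}")
```

Even a cap one blob short of the working set misses on every read, because each miss evicts exactly the entry the scan needs next. That is the behavior a scan-resistant policy would have to change.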

Prefetching before you go offline

mado materialize [PATH] walks @’s tree (default: the whole tree; a directory prefetches its whole subtree) and fetches every file’s blobs — manifests and chunks — into the disk cache with bounded concurrency. It is the hermetic alternative to faulting content in lazily: run it before going offline or before a latency-sensitive build so no read of committed content ever waits on the network.

mado materialize            # prefetch the whole working-copy tree
mado materialize src/lib     # prefetch just a subtree

mado materialize requires a remote-backed workspace (MADO_REMOTE_ADDR); on a local workspace every blob is already on disk and it is a no-op.

Resume readiness

A mado ws resume restores a workspace’s tree with the reads parallelized, so a resumed build is incremental rather than cold.

As measured (performance-testing.md §“F2 resume TTFB”, 2026-07-04): resuming a 2000×1 KiB scratch shape over injected RTT, after the concurrency fix (16 in-flight ordered prefetches), took 2.1 s (RTT 0), 5.3 s (RTT 20), 9.7 s (RTT 50) — down from 90+ seconds when the restore was serial. GET and byte counts are identical before and after: the fix moves time, not requests. Restored tree facts (content, exec bits, file mtimes) are byte-identical, asserted — so the resumed build is incremental. Note that reads on the resume target are per-file, not per-unique-content (a fresh target has no cache tier yet).
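The "moves time, not requests" point can be illustrated with a bounded, order-preserving prefetch. This is a sketch, not the resume code: `fetch`, `restore_serial`, and `restore_concurrent` are hypothetical, the 5 ms sleep is a simulated round trip, and "ordered" is assumed to mean results are consumed in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RTT_S = 0.005  # simulated round trip per GET (illustrative)

def fetch(key):
    time.sleep(RTT_S)          # stand-in for one remote GET
    return f"blob-{key}"

def restore_serial(keys):
    return [fetch(k) for k in keys]

def restore_concurrent(keys, in_flight=16):
    # At most `in_flight` fetches run at once; map() yields results
    # in input order regardless of completion order.
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        return list(pool.map(fetch, keys))

keys = list(range(64))

t0 = time.perf_counter()
serial = restore_serial(keys)
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
concurrent = restore_concurrent(keys)
t_concurrent = time.perf_counter() - t0

print(f"serial {t_serial:.3f} s, concurrent {t_concurrent:.3f} s, "
      f"same results: {serial == concurrent}")
```

Both variants call `fetch` exactly once per key, so the request count is unchanged; only the waiting overlaps.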

Concurrent multi-agent load

The single-actor numbers above measure one reader or one writer in isolation. The concurrent benchmark (performance-testing.md §“Concurrent multi-agent workload” / §“As measured”, 2026-07-03, concurrent_agents.rs, in-memory fast tier) is a nightly trend tool, not a PR gate, because latency percentiles and CAS retry counts depend on thread interleaving and are inherently non-repeatable:

Read fan-out: overlapping cold reads collapse to one key set at every concurrency (unique GETs identical at N ∈ {1,2,4,8}, asserted) — content addressing works under concurrent fan-out.
Write dedup: cross-workspace dedup collapses file content (~70 % saved vs independent copies) but not per-workspace manifests — each agent still contributes a ~131 KiB manifest that does not dedup (the sole varying field is a directory-node mtime). Documented as a known gap.
Checkpoint CAS: because the workspace lease is single-writer per scope, concurrent checkpoints on distinct scopes show CAS retries = 0 at every N by construction.
Interference (the #21 thesis): one foreground reader’s p50 is essentially flat as background write+checkpoint load rises 0→8 agents; its p99 inflates only ~1.17–1.19× at RTT 20 ms and ~1.30–1.39× at RTT 50 ms — so background durability stays subordinate to foreground reads.
