Performance
mado’s performance story is a single trade: reads are lazy, so the first read of un-cached content pays a fetch, and everything after it is a local cache hit. This page explains the two mechanisms that make cold reads fast — locality packs and the disk cache — and gives honest cold-vs-warm expectations, citing the measured numbers by their source section and date.
All numbers below are copied verbatim from the “as measured” sections of
docs/design/performance-testing.md and docs/design/locality-pack-activation.md. They are point measurements on specific hardware (largely an Intel N100) with injected network latency, not guarantees. Where a number does not exist, this page says so rather than estimating.
The cost model
- Cold read: the first read of a file whose content is not in the local
disk cache. It costs one (or more) object-store round trips (GETs). Over a
remote backend with RTT r, a naive per-file cold walk of N files costs roughly N × r.
- Warm read: the content is already in the disk cache. No network; a local read.
The two levers below both attack the cold case: locality packs cut the GET count (many files in one ranged fetch), and the disk cache turns every cold read into a permanent warm one.
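The cost model above can be written down as a few lines of arithmetic. This is a minimal sketch, not mado code: `cold_walk_seconds` and its `files_per_get` parameter are names invented for illustration, and the result is a network-only floor for a serial walk (the measured 998 GETs / 24.9 s cell at RTT 20 ms further down sits above the roughly 20 s this model gives, as expected once local work is included).

```python
import math

def cold_walk_seconds(n_files: int, rtt_s: float, files_per_get: int = 1) -> float:
    """Network-only lower bound for a serial cold walk.

    files_per_get=1 models the naive per-file path (N x r); a larger value
    models a pack that serves many files from one ranged GET.
    """
    gets = math.ceil(n_files / files_per_get)
    return gets * rtt_s

per_file = cold_walk_seconds(1000, 0.020)        # 1000 GETs
packed = cold_walk_seconds(1000, 0.020, 1000)    # 1 GET
print(f"per-file: {per_file:.2f} s, packed: {packed:.2f} s")
```

A warm read does not appear in the model at all: once the content is in the disk cache, the GET count is zero regardless of RTT.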
Locality packs
A locality pack (MDLP on disk) bundles a subtree’s small inline files into one
content-addressed blob. Faulting any file in that subtree fetches the whole
pack in one ranged GET and explodes it into the disk cache, instead of one GET
per file. This is the mechanism that makes a cold walk of a node_modules- or
nixpkgs-shaped tree fast.
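The fetch-and-explode behaviour described above can be illustrated with a toy model that only counts GETs. Everything here is an assumption made for the sketch: `CountingRemote`, `read_per_file`, and `read_with_pack` are hypothetical names, the cache is keyed by path rather than by content hash, and the pack is a plain dict rather than the MDLP format.

```python
class CountingRemote:
    """Toy object store that counts every GET."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.objects[key]

def read_per_file(remote, cache, path):
    # Naive path: one GET per file that is not cached yet.
    if path not in cache:
        cache[path] = remote.get(path)
    return cache[path]

def read_with_pack(remote, cache, path, pack_key):
    # Pack path: the first fault fetches the whole pack in one GET and
    # "explodes" every member into the cache; later reads are warm.
    if path not in cache:
        cache.update(remote.get(pack_key))
    return cache[path]

files = {f"pkg/f{i}": b"x" * i for i in range(100)}
objects = {**files, "pack:pkg": files}

remote_a, cache_a = CountingRemote(objects), {}
for p in files:
    read_per_file(remote_a, cache_a, p)

remote_b, cache_b = CountingRemote(objects), {}
for p in files:
    read_with_pack(remote_b, cache_b, p, "pack:pkg")

print(remote_a.gets, remote_b.gets)
```

Both walks end with the same cache contents; only the number of round trips differs. The fallback to per-file reads when a pack has been swept is not modelled here.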
Building packs
mado pack # per-directory (v1) packs for @'s tree
mado pack --recursive # recursive-subtree (v2) packs — the better default
mado pack -r <revset> # pack a specific revision's tree
- mado pack (per-directory, v1) bundles each directory’s dense inline files. It wins on genuinely per-directory-dense shapes (vendor node_modules), but on thin, spread-out trees it can be worse than no packs (each dense subdir pays a pack GET on top of per-file faults; measured below).
- mado pack --recursive (v2, the covering index) bundles a whole thin subtree’s inline files across its many small directories into one pack root. This is the measured-best default for real repository shapes.
The daemon loads either index version transparently.
Packs are a reclaimable cache, never a GC root: sweeping them only makes reads fall back to the per-file path. Building is idempotent — an unchanged tree re-uploads nothing.
Zero-config discovery (import-time publishing)
You usually do not have to run mado pack by hand. mado import-git builds
and publishes the recursive pack index for each imported tip by default
(--no-packs opts out). It does this without any extra remote traffic: the
import just wrote every tree and file object, so the pack builder reads that
content warm-local — it never issues a remote GET the pack exists to eliminate
(the “never build by re-fetching” invariant).
Discovery is content-addressed via a well-known derived pointer (MDPW, keyed
by the tree id), so a fresh mount finds the packs with no local file and no
manual step. A mado pack run publishes this pointer too (while still writing
a local pointer for offline use); the daemon attach path falls back from the
local pointer to the derived one.
As measured (locality-pack-activation.md §“Follow-ups” item 2, 2026-07-05): import-time recursive-index build measured 0.15 s for a 2000-file tree, because the import just wrote the content and the builder reads it warm-local. A fresh client holding only the tree id discovers and warms the subtree in ≤ 1 GET.
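The discovery order described above (local pointer first, then the derived pointer keyed by the tree id) can be sketched as follows. This is an illustration only: the real MDPW key derivation is not specified on this page, so `derived_pointer_key` uses a tagged BLAKE2b hash purely to show that the key is a pure function of the tree id, and `find_pack_index` with its dict arguments is a hypothetical stand-in for the daemon attach path.

```python
import hashlib

def derived_pointer_key(tree_id: bytes) -> str:
    # ASSUMPTION: placeholder derivation, not mado's actual MDPW scheme.
    return hashlib.blake2b(b"MDPW" + tree_id, digest_size=32).hexdigest()

def find_pack_index(tree_id, local_pointers, remote_objects):
    """Return (source, index_id), or (None, None) if no index is published."""
    if tree_id in local_pointers:
        return "local pointer", local_pointers[tree_id]
    key = derived_pointer_key(tree_id)
    if key in remote_objects:
        return "derived pointer", remote_objects[key]
    return None, None

tree_id = b"tree-123"
remote = {derived_pointer_key(tree_id): "index-abc"}

# A fresh mount has no local pointer file, yet still finds the index.
print(find_pack_index(tree_id, {}, remote))
```

Because the key depends only on the tree id, any client that knows the tree can compute it without a manual step or a local file.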
Observing pack state
An unpacked large tree silently falls back to per-file cold reads. mado pack-status makes that observable instead of surfacing only as mysterious
slowness:
mado pack-status
It queries the running mount daemon over the control socket and reports either
that a prefetcher is attached (mode per-dir / recursive, index version,
directory-entry count, and whether discovery came via local pointer or via derived pointer) or the distinct reason it is not (no index built, index for a
different tree, index swept, undecodable, …). At attach time, a large tree
served without packs also emits an advisory: “cold reads are per-file — run
mado pack --recursive.” mado pack-status requires a live
mado mount / serve-virtiofs in the workspace.
As measured
Recursive vs per-directory on nixpkgs (locality-pack-activation.md
§“Recursive-subtree packs as built”, 2026-07-04, cap 64 MiB): 10 roots → 10
packs, 72.7 MiB (the whole pkgs subtree — 46 134 members — fits ONE 62.4 MiB
pack). Modeled whole-tree cold walk 50 747 → 1 958 GETs (−96.1 %). Real FUSE
cells at RTT 20 ms, off / per-dir / recursive:
| Subtree | off | per-dir | recursive |
|---|---|---|---|
| nixos/tests (dense, 1000-file walk) | 998 GETs / 24.9 s | 176 GETs / 5.5 s | 2 GETs / 1.4 s |
| pkgs/development/python-modules (thin) | 1000 GETs / 25.4 s | 1936 GETs / 47.8 s | 2 GETs / 10.4 s |
The thin-subtree row is the key honesty point: per-directory packing is worse than no packs on a thin tree (dense subdirs each pay a pack GET on top of the per-file faults), while recursive is about 2.4× faster than no packs (25.4 s vs 10.4 s) even including the one-time 46k-member explode.
Real-tree census (performance-testing.md §“Real-tree study as measured”,
2026-07-04, full nixpkgs — 52 695 files / 37 086 dirs, avg 1.42 files/dir): only
76 dirs exceed the per-directory trigger, covering just 8.75 % of inline
files — the hard ceiling on what per-directory packing can do on a thin tree,
and the reason recursive packs exist.
Chunked-file coverage (locality-pack-activation.md §“Follow-ups” item 3,
2026-07-05): MDLP v2 packs a chunked file’s manifest and chunks as typed
members. On nixpkgs, 1 945 of 1 948 chunked files now pack (3 giant files stay
per-file), taking the whole-tree cold-walk model 1 958 → 787 GETs.
Streamed explode (locality-pack-activation.md §“Follow-ups” item 1,
2026-07-04): the first faulting read now waits only for the pack GET, not the
whole explode. Time-to-first-file dropped 954 → 19 ms (RTT 0), 7195 →
57 ms (RTT 20), 16222 → 83 ms (RTT 50), at an unchanged 1-GET budget.
Live FUSE trigger path (performance-testing.md §“Cold-fill fsync + explode
serialization levers”, 2026-07-04): a cold 300-file dense subtree walk through
the live daemon (trigger-driven prefetch, pack RTT inside the timed walk),
median of 5, after the fsync + explode-concurrency fixes:
| RTT (ms) | prefetch off, cold | trigger, cold |
|---|---|---|
| 0 | 703 ms | 222 ms |
| 20 | 7030 ms | 262 ms |
| 50 | 16068 ms | 283 ms |
The trigger path serves the whole cold walk in exactly 1 remote GET at every
RTT (asserted). Its wall-clock is roughly flat in RTT (one round trip) while the
per-file path scales as N × RTT, which makes it about 57× faster than per-file at 50 ms (16068 ms vs 283 ms).
The disk cache
Every workspace has a persistent, content-addressed on-disk blob cache at
.jj/working_copy/blob-cache/. Every process of the workspace — the CLI, the
mount daemon, across restarts — reads through it. It is:
- content-addressed (keyed by BLAKE3 hash), so a cached blob is never stale;
- never invalidated, and never evicted by default — it grows with everything materialized or read;
- safe to delete at any time (an evicted key simply re-fetches).
Cached fills for value-hashed content are written without a per-blob fsync,
protected instead by verify-on-read + self-heal: every local hit is checked
(blake3(value) == key.hash); a torn or bit-rotted file is deleted and the read
falls through to re-fetch a good copy. This is strictly stronger than fsync
(fsync never caught bit rot) — the one documented edge is that a local-only
exploded pack member torn by a power crash reads as absent until the next
mount’s trigger re-explodes the pack.
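The verify-on-read and self-heal loop above can be sketched in a few lines. This is a simplified illustration, not mado's cache tier: `content_key` and `cached_read` are hypothetical names, and the hash is BLAKE2b from Python's standard library, used only as a stand-in because `hashlib` does not provide BLAKE3.

```python
import hashlib
import os
import tempfile

def content_key(value: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library, NOT BLAKE3.
    return hashlib.blake2b(value, digest_size=32).hexdigest()

def cached_read(cache_dir: str, key: str, fetch) -> bytes:
    """Read `key` through the cache, verifying every local hit."""
    path = os.path.join(cache_dir, key)
    try:
        with open(path, "rb") as f:
            value = f.read()
        if content_key(value) == key:
            return value          # verified warm hit
        os.remove(path)           # torn or bit-rotted: drop it and self-heal
    except FileNotFoundError:
        pass                      # cold: nothing cached yet
    value = fetch(key)
    if content_key(value) != key:
        raise ValueError("fetched content does not match its key")
    with open(path, "wb") as f:
        f.write(value)            # deliberately no fsync; the next read verifies
    return value

# Demo: corrupt a cached blob and let the next read heal it.
blob = b"hello, cache"
key = content_key(blob)
fetches = []

def fetch(k):
    fetches.append(k)
    return blob

with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_read(cache_dir, key, fetch)     # cold: one fetch
    second = cached_read(cache_dir, key, fetch)    # warm: no fetch
    with open(os.path.join(cache_dir, key), "wb") as f:
        f.write(b"hello, ca")                      # simulate a torn write
    healed = cached_read(cache_dir, key, fetch)    # mismatch detected, refetched

print(len(fetches))
```

The sketch assumes a remote copy is always available to re-fetch. The documented edge case, a local-only exploded pack member, is exactly the situation where that assumption does not hold.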
Bounding the cache
By default the cache is unbounded. Set MADO_DISK_CACHE_MAX_BYTES=<bytes>
to cap it; the tier then evicts least-recently-used blobs to stay at or below
the cap. The cap is read once at client-stack construction, so it applies
uniformly to the mount daemon, serve-virtiofs, and the CLI backends. 0,
empty, or unparseable keeps the unbounded default.
As measured (performance-testing.md §“E3 cache-size curves”, 2026-07-04): the cache-size curve is a cliff, not a slope. Re-reading a working set W with cap C, pass-2 remote GETs are exactly N at C = 0.5 W (classic LRU sequential-scan thrash) and exactly 0 at C ≥ W. A single oversized blob is stored regardless of the cap. Guidance: keep the default unbounded; when you do bound it, size the cap at or above the working set. A graceful sub-W default would need a scan-resistant eviction policy, which is not warranted today.
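The cliff follows directly from how LRU behaves under a repeated sequential scan, and a small simulation reproduces it. This is a generic LRU model rather than mado's eviction code. It assumes uniformly sized blobs, expresses the cap as a blob count instead of bytes, and does not model the rule that a single oversized blob is stored regardless of the cap. `second_pass_misses` is a name made up for this sketch.

```python
from collections import OrderedDict

def second_pass_misses(n_blobs: int, cap_blobs: int) -> int:
    """Scan blobs 0..n-1 twice through an LRU cache; count pass-2 misses."""
    cache = OrderedDict()
    misses = 0
    for pass_no in range(2):
        for key in range(n_blobs):
            if key in cache:
                cache.move_to_end(key)          # hit: mark most recently used
                continue
            if pass_no == 1:
                misses += 1                     # a miss here is a remote GET
            cache[key] = True
            while len(cache) > cap_blobs:
                cache.popitem(last=False)       # evict least recently used

    return misses

n = 1000
for cap in (n // 2, n - 1, n):
    print(f"cap={cap:4d}  pass-2 GETs={second_pass_misses(n, cap)}")
```

Under these assumptions even a cap one blob short of the working set misses on every pass-2 read, because each insertion evicts exactly the blob the scan needs next.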
Prefetching before you go offline
mado materialize [PATH] walks @’s tree (default: the whole tree; a directory
prefetches its whole subtree) and fetches every file’s blobs — manifests and
chunks — into the disk cache with bounded concurrency. It is the hermetic
alternative to faulting content in lazily: run it before going offline or before
a latency-sensitive build so no read of committed content ever waits on the
network.
mado materialize # prefetch the whole working-copy tree
mado materialize src/lib # prefetch just a subtree
mado materialize requires a remote-backed workspace (MADO_REMOTE_ADDR); on a
local workspace every blob is already on disk and it is a no-op.
Resume readiness
A mado ws resume restores a workspace’s tree with the reads parallelized, so a
resumed build is incremental rather than cold.
As measured (performance-testing.md §“F2 resume TTFB”, 2026-07-04): resuming a 2000×1 KiB scratch shape over injected RTT, after the concurrency fix (16 in-flight ordered prefetches), took 2.1 s (RTT 0), 5.3 s (RTT 20), 9.7 s (RTT 50), down from 90+ seconds when the restore was serial. GET and byte counts are identical before and after: the fix moves time, not requests. Restored tree facts (content, exec bits, file mtimes) are asserted byte-identical, so the resumed build is incremental. Note that reads on the resume target are per-file, not per-unique-content (a fresh target has no cache tier yet).
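The point that the fix “moves time, not requests” can be illustrated with a bounded, order-preserving prefetch. This is a sketch under stated assumptions, not the resume implementation: `SlowRemote` and `restore` are hypothetical names, the remote is a sleep-based fake, and the window of 16 is borrowed from the measurement above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class SlowRemote:
    """Fake object store: every GET costs one simulated round trip."""

    def __init__(self, rtt_s: float):
        self.rtt_s = rtt_s
        self.gets = 0
        self._lock = threading.Lock()

    def get(self, key):
        time.sleep(self.rtt_s)
        with self._lock:
            self.gets += 1
        return f"blob-{key}"

def restore(remote, keys, in_flight: int):
    # At most `in_flight` GETs run at once; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        return list(pool.map(remote.get, keys))

keys = list(range(64))

serial_remote = SlowRemote(0.002)
t0 = time.perf_counter()
serial = restore(serial_remote, keys, in_flight=1)
serial_s = time.perf_counter() - t0

parallel_remote = SlowRemote(0.002)
t0 = time.perf_counter()
parallel = restore(parallel_remote, keys, in_flight=16)
parallel_s = time.perf_counter() - t0

print(f"serial:   {serial_remote.gets} GETs in {serial_s:.3f} s")
print(f"16-wide:  {parallel_remote.gets} GETs in {parallel_s:.3f} s")
```

Both runs issue the same 64 GETs and return the blobs in the same order; only the wall-clock differs, which is the shape of the measured before/after result.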
Concurrent multi-agent load
The single-actor numbers above measure one reader or one writer in isolation.
The concurrent story (performance-testing.md §“Concurrent multi-agent
workload” / §“As measured”, 2026-07-03, concurrent_agents.rs, in-memory fast
tier) is a nightly trend tool, not a PR gate (latency percentiles and CAS
retry counts depend on thread interleaving and are inherently non-repeatable):
- Read fan-out: overlapping cold reads collapse to one key set at every concurrency (unique GETs identical at N ∈ {1,2,4,8}, asserted) — content addressing works under concurrent fan-out.
- Write dedup: cross-workspace dedup collapses file content (~70 % saved vs independent copies) but not per-workspace manifests — each agent still contributes a ~131 KiB manifest that does not dedup (the sole varying field is a directory-node mtime). Documented as a known gap.
- Checkpoint CAS: because the workspace lease is single-writer per scope, concurrent checkpoints on distinct scopes show CAS retries = 0 at every N by construction.
- Interference (the #21 thesis): one foreground reader’s p50 is essentially flat as background write+checkpoint load rises 0→8 agents; its p99 inflates only ~1.17–1.19× at RTT 20 ms and ~1.30–1.39× at RTT 50 ms — so background durability stays subordinate to foreground reads.