Disk Fault Model
This is the design note for Marionette's disk simulation work. Production-shaped
storage code should usually use std.Io; mar.Disk is the lower-level
sector/file-lifecycle capability underneath that backend. mar.SimDisk currently
supports deterministic logical files, latency, read/write IO errors,
probabilistic corrupt reads, scripted sector corruption, and crash/restart
simulation for pending writes. Generic recoverability budgets are still being
built.
The goal is a deterministic, recoverability-aware disk authority that can test real storage code without pretending to model every filesystem or device quirk. Marionette should make disk failures replayable from a seed, visible in the trace, and constrained enough that failures teach users something useful.
Goals
- Route every disk decision through the owning World.
- Make disk latency, errors, corruption, and crash timing deterministic.
- Preserve a stable trace of disk operations and fault decisions.
- Support single-node durability testing first.
- Leave room for replicated systems with per-node disk authorities later.
- Avoid fault profiles that destroy all durable truth unless explicitly requested.
Non-Goals
- Full filesystem simulation.
- Modeling every OS-specific std.fs behavior.
- Arbitrary byte chaos as the default corruption model.
- Real block-device emulation.
- Transparent interception of direct filesystem calls.
- Unbounded double faults that make recovery impossible by construction.
VOPR Lessons
TigerBeetle's VOPR storage simulator is deliberately protocol-aware. It keeps simulated storage in memory, queues reads and writes with simulated latency, tracks faulty sectors, can misdirect writes, and can fault targets of pending writes during crash. More importantly, its cluster-level fault atlas decides which replicas and storage regions are eligible for faults so the simulator does not manufacture impossible worlds where every recoverable copy is destroyed at once.
Marionette should adapt that lesson, not copy the whole design. TigerBeetle can
name zones such as superblock, WAL headers, WAL prepares, and grid blocks
because VOPR is product-specific. Marionette's Phase 1 Disk should stay
generic: logical paths, byte ranges, sector size, pending writes, and explicit
fault profiles. Product-specific recoverability still belongs to examples and
checks until enough examples justify a generic fault-atlas API.
The immediate rule is: no unconstrained "random disk chaos" default. Every destructive disk fault needs a scope, a budget, and a trace event.
Authority Shape
The current mar.SimDisk simulator is constructed from World by the
simulation harness or scenario state. It produces two capabilities over the
same backing state:
- mar.Disk: lower-level sector read, write, and sync, plus file metadata and lifecycle operations.
- mar.DiskControl: harness-facing faults, crash/restart, and scripted corruption.
Ordinary storage code should prefer env.io() and std.Io.File. Simulation
application code that is intentionally testing the disk model can receive
env.disk instead of depending on World internals:
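fn store(disk: mar.Disk, entry: []const u8) !void {
    // The write is visible to later reads immediately, but it is still pending.
    try disk.write(.{
        .path = "wal.log",
        .offset = 0,
        .bytes = entry,
    });
    // sync is the durability boundary for the pending write.
    try disk.sync(.{ .path = "wal.log" });
}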
The test harness gets DiskControl from world.simulate(...).control.disk to
inspect disk state, inject scripted faults, or crash/restart the simulated
disk. Those operations must not leak into the app-visible disk API.
In later multi-node work, each simulated node should expose its own disk view:
try node.env().disk.write(.{ .path = "wal.log", .offset = offset, .bytes = entry });
The shared World remains the owner of the clock, PRNG, global event index,
and trace. The disk handle should not read wall-clock time, call host
randomness, or use host filesystem state as a simulator decision source.
The app-facing public type name should be Disk. Narrower terms like
BlockDevice do not fit a WAL/KV-store example, and broader terms like Storage
are too vague. Simulator-specific construction and control live on SimDisk and
DiskControl.
Operation Model
Every disk operation should receive a deterministic operation id. The simulator can then order pending work by:
- ready_at simulated timestamp.
- Operation id.
That ordering avoids pointer addresses, hash-map iteration order, and host scheduling as tie breakers.
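The ordering rule above can be sketched as a plain comparator. This is a minimal illustration, assuming a hypothetical PendingOp record with ready_at_ns and op_id fields; neither the type nor the helper names come from Marionette itself.

```zig
const std = @import("std");

// Hypothetical pending-operation record; these field names are illustrative
// and are not taken from Marionette's source.
const PendingOp = struct {
    ready_at_ns: u64,
    op_id: u64,
};

// Earlier ready_at first; equal timestamps fall back to the lower operation id.
fn readyBefore(_: void, a: PendingOp, b: PendingOp) bool {
    if (a.ready_at_ns != b.ready_at_ns) return a.ready_at_ns < b.ready_at_ns;
    return a.op_id < b.op_id;
}

// Sorts pending operations in place into completion order.
fn sortPending(ops: []PendingOp) void {
    std.mem.sort(PendingOp, ops, {}, readyBefore);
}

test "ties on ready_at are broken by op_id" {
    var ops = [_]PendingOp{
        .{ .ready_at_ns = 200, .op_id = 1 },
        .{ .ready_at_ns = 100, .op_id = 3 },
        .{ .ready_at_ns = 100, .op_id = 2 },
    };
    sortPending(&ops);
    try std.testing.expectEqual(@as(u64, 2), ops[0].op_id);
    try std.testing.expectEqual(@as(u64, 3), ops[1].op_id);
    try std.testing.expectEqual(@as(u64, 1), ops[2].op_id);
}
```

Because both keys are integers assigned by the simulator, the result does not depend on allocation addresses or container iteration order.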
Implemented operation concepts:
- File identity: logical path-like names ([]const u8) scoped to the simulated disk. These are not host paths and must not read host filesystem state. Trace output writes them through recordFields text escaping.
- Offset and length: integer byte ranges.
- Sector size: a configured simulation parameter, defaulting to 4096 bytes.
- Completed operation: result delivered to user code after deterministic synchronous latency.
- Runtime fault profile: DiskFaultOptions controls read errors, write errors, corrupt reads, lost pending writes, torn pending writes, and reordered pending writes with validated BuggifyRate values.
- Scripted sector corruption: harness code can mark one logical path/sector as corrupt with corruptSector.
- Pending writes: successful writes are visible to later reads immediately, but they are not durable until sync.
- Crash window: crash processes pending writes, then marks the disk down until restart.
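The pending-write and crash-window semantics can be illustrated with a toy single-file model. This is a sketch under stated assumptions, not the SimDisk implementation: TinyFile, its fields, and its error names are hypothetical, and crash here models only the simplest outcome, in which every pending write is lost.

```zig
const std = @import("std");

// Hypothetical single-file model. `visible` is what reads observe;
// `durable` is what survives a crash.
const TinyFile = struct {
    visible: [16]u8 = [_]u8{0} ** 16,
    durable: [16]u8 = [_]u8{0} ** 16,
    down: bool = false,

    // A successful write is visible immediately but stays pending.
    fn write(self: *TinyFile, offset: usize, bytes: []const u8) !void {
        if (self.down) return error.DiskDown;
        if (offset > self.visible.len or bytes.len > self.visible.len - offset) {
            return error.OutOfRange;
        }
        @memcpy(self.visible[offset..][0..bytes.len], bytes);
    }

    // sync promotes everything visible to durable state.
    fn sync(self: *TinyFile) !void {
        if (self.down) return error.DiskDown;
        self.durable = self.visible;
    }

    // Simplest crash outcome: all pending writes are lost, disk goes down.
    fn crash(self: *TinyFile) void {
        self.visible = self.durable;
        self.down = true;
    }

    fn restart(self: *TinyFile) void {
        self.down = false;
    }
};

test "unsynced write is visible, then lost on crash" {
    var f = TinyFile{};
    try f.write(0, "ab");
    try std.testing.expectEqualSlices(u8, "ab", f.visible[0..2]);
    f.crash();
    try std.testing.expectError(error.DiskDown, f.write(0, "x"));
    f.restart();
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0, 0 }, f.visible[0..2]);
}
```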
The Phase 1 implementation should start synchronous from the user's
perspective: a write or read may advance simulated time internally and then
return. The model can still assign operation ids and latency so traces match
the future scheduler shape. A later async scheduler can split submit/complete
without changing trace ordering.
The backing implementation should be an in-memory durable model. Production adapters may later route the same narrow API to host filesystem calls, but the simulator itself should not depend on the host filesystem for data, metadata, ordering, or failure behavior.
Faults
Initial faults are small and explicit:
- Latency: operation completes at a deterministic future timestamp.
- IO error: read/write returns a simulated disk error.
- Corruption: read returns bytes that differ from the durable model.
- Scripted corruption: a specific logical sector is marked corrupt by the harness.
- Torn write: a crash leaves only part of a write durable.
- Lost pending write: a crash drops an acknowledged-pending write before it becomes durable.
Later faults can include misdirected writes, stale reads, byte-level corruption, reordered flushes, and more specific media behavior, but only after the basic model is traceable and tested. Misdirected writes should be a named fault type rather than being collapsed into generic corruption, because they test whether user code validates record identity and location.
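A torn write can be modeled as landing only a sector-aligned prefix of the pending bytes. The sketch below assumes a tiny 4-byte sector for readability (the default in this note is 4096) and a hypothetical applyTorn helper; in a real simulator the number of landed sectors would be drawn from the world's PRNG rather than passed in by hand.

```zig
const std = @import("std");

// Tiny sector size for illustration only; the note's default is 4096 bytes.
const sector_size = 4;

// Hypothetical helper: copies only the first `landed_sectors` sectors of
// `bytes` into `durable` at `offset` and returns how many bytes landed.
// The caller must ensure the target range fits inside `durable`.
fn applyTorn(durable: []u8, offset: usize, bytes: []const u8, landed_sectors: usize) usize {
    const landed_len = @min(bytes.len, landed_sectors * sector_size);
    @memcpy(durable[offset..][0..landed_len], bytes[0..landed_len]);
    return landed_len;
}

test "torn write lands only the first sector" {
    var durable = [_]u8{'.'} ** 12;
    const landed = applyTorn(&durable, 0, "ABCDEFGH", 1);
    try std.testing.expectEqual(@as(usize, 4), landed);
    try std.testing.expectEqualStrings("ABCD........", &durable);
}
```

With zero landed sectors the same helper degenerates into a lost pending write, which is one way to keep the two crash outcomes on a single code path.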
Recoverability
Fault injection needs budgets. A simulator that can corrupt every copy of truth in one seed is not useful unless the test explicitly asked for a destructive profile.
The first profiles should be conservative:
- Single-node default: at most one destructive disk fault per recovery window.
- Single-node aggressive: allow repeated failures, but keep them traceable.
- Replicated default: allow per-replica faults only while at least one quorum path remains recoverable.
- Destructive: no recoverability guard, intended for negative tests.
The exact recovery-window API is undecided. Users may need to declare durable regions, replicas, checkpoints, or commit points before Marionette can enforce strong budgets.
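Since the recovery-window API is undecided, the following is only one possible shape for the single-node default budget. FaultBudget, tryConsume, and resetWindow are hypothetical names chosen for this sketch.

```zig
const std = @import("std");

// Hypothetical budget: at most `limit` destructive faults per recovery window.
const FaultBudget = struct {
    limit: u32,
    used: u32 = 0,

    // Returns true and consumes budget if a destructive fault may fire now.
    fn tryConsume(self: *FaultBudget) bool {
        if (self.used >= self.limit) return false;
        self.used += 1;
        return true;
    }

    // Called when the harness or checker declares a new recovery window.
    fn resetWindow(self: *FaultBudget) void {
        self.used = 0;
    }
};

test "single-node default allows one destructive fault per window" {
    var budget = FaultBudget{ .limit = 1 };
    try std.testing.expect(budget.tryConsume());
    try std.testing.expect(!budget.tryConsume());
    budget.resetWindow();
    try std.testing.expect(budget.tryConsume());
}
```

A destructive profile could be expressed in this sketch by skipping the budget check entirely, which keeps the guard explicit rather than implied by a very large limit.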
For Phase 1, the append-only WAL example should define its own recovery window in the checker: flushed records are durable truth; unflushed records may be lost, torn, or corrupted according to the profile. A later generic fault atlas can lift that pattern out of examples.
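The WAL recovery window can be reduced to a count-based invariant for illustration. This sketch assumes a hypothetical checkRecovery function and error names; a real checker would also validate record contents and checksums, which this note leaves to user code.

```zig
const std = @import("std");

// Hypothetical checker invariant for an append-only WAL.
//   flushed:   records covered by a successful sync before the crash
//   written:   all records appended before the crash (flushed + unflushed)
//   recovered: valid records found after restart
fn checkRecovery(flushed: usize, written: usize, recovered: usize) !void {
    // Flushed records are durable truth and must all come back.
    if (recovered < flushed) return error.LostDurableRecord;
    // Recovery must not invent records that were never appended.
    if (recovered > written) return error.PhantomRecord;
}

test "unflushed tail may be lost, flushed prefix may not" {
    try checkRecovery(3, 5, 3); // whole unflushed tail lost: acceptable
    try checkRecovery(3, 5, 4); // part of the tail survived: acceptable
    try std.testing.expectError(error.LostDurableRecord, checkRecovery(3, 5, 2));
    try std.testing.expectError(error.PhantomRecord, checkRecovery(3, 5, 6));
}
```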
Trace Events
Disk traces are stable text events until the trace format changes globally.
Disk code must use World.recordFields so logical paths and status strings
are escaped consistently. Current events:
- disk.read op=<u64> path=<escaped-text> offset=<u64> len=<u64> status=<literal> latency_ns=<u64>
- disk.write op=<u64> path=<escaped-text> offset=<u64> len=<u64> status=<literal> latency_ns=<u64>
- disk.sync op=<u64> path=<escaped-text> status=<literal> committed_writes=<u64> latency_ns=<u64>
- disk.fault op=<u64> path=<escaped-text> kind=<literal> rate=<literal> roll=<u64> fired=<bool>
- disk.fault path=<escaped-text> offset=<u64> kind=scripted_corruption
- disk.crash_write op=<u64> path=<escaped-text> offset=<u64> len=<u64> result=<literal>
- disk.crash pending_writes=<u64> landed=<u64> lost=<u64> torn=<u64> reordered=<u64>
- disk.restart status=ok
Use status values such as ok, io_error, corrupt, and torn. Use fault
kinds such as read_error, write_error, corrupt_read,
crash_lost_write, crash_torn_write, and crash_reordered_write.
Trace fields must be scalar, deterministic, and independent of pointer
identity. User bytes should not be dumped into the default trace unless a
caller explicitly requests that, because it can make traces huge and unstable.
Trace len, not byte contents. If a debugging mode later records payload
hashes, the hash algorithm must be named and stable.
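As an illustration of scalar, escaped trace fields, the sketch below formats a disk.write line into a caller-provided buffer. The escaping rule shown (hex-escape spaces, control bytes, non-ASCII bytes, backslash, and the equals sign) is an assumption made for this example; the actual rules live in World.recordFields and may differ. escapePath and formatWrite are hypothetical helpers.

```zig
const std = @import("std");

// Hypothetical escaping rule, not Marionette's actual one: printable ASCII
// passes through, anything that could break key=value parsing becomes \xNN.
fn escapePath(out: []u8, path: []const u8) ![]const u8 {
    var n: usize = 0;
    for (path) |c| {
        const plain = c > ' ' and c < 0x7f and c != '\\' and c != '=';
        if (plain) {
            if (n >= out.len) return error.NoSpaceLeft;
            out[n] = c;
            n += 1;
        } else {
            const piece = try std.fmt.bufPrint(out[n..], "\\x{x:0>2}", .{c});
            n += piece.len;
        }
    }
    return out[0..n];
}

// Formats one disk.write event. Only lengths and outcomes are traced,
// never the payload bytes.
fn formatWrite(
    out: []u8,
    op: u64,
    path: []const u8,
    offset: u64,
    len: u64,
    status: []const u8,
    latency_ns: u64,
) ![]const u8 {
    var path_buf: [256]u8 = undefined;
    const escaped = try escapePath(&path_buf, path);
    const line = try std.fmt.bufPrint(
        out,
        "disk.write op={d} path={s} offset={d} len={d} status={s} latency_ns={d}",
        .{ op, escaped, offset, len, status, latency_ns },
    );
    return line;
}

test "path with a space is escaped in the trace line" {
    var buf: [512]u8 = undefined;
    const line = try formatWrite(&buf, 7, "wal log", 0, 5, "ok", 1000);
    try std.testing.expectEqualStrings(
        "disk.write op=7 path=wal\\x20log offset=0 len=5 status=ok latency_ns=1000",
        line,
    );
}
```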
Determinism Rules
- All latency and fault choices draw from the world's PRNG.
- All time movement routes through the world's clock.
- Ready operations use a stable (ready_at, op_id) ordering.
- Host filesystem calls are not part of the simulator model.
- Disk APIs must not call std.crypto.random, /dev/urandom, or wall-clock time.
- Tests must compare same-seed disk traces byte-for-byte.
- Checksums and record validation belong to user code in Phase 1. Marionette may corrupt, tear, lose, or error operations, but it should not infer storage format semantics.
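The same-seed rule can be exercised with a small stand-in for the fault roll. This sketch assumes a recent Zig where the PRNG namespace is std.Random, and it uses a seeded local PRNG in place of the world's PRNG. The runRolls helper, the 25/100 rate literal, and the fixed path are hypothetical; only the field names follow the disk.fault event listed in this note.

```zig
const std = @import("std");

// Hypothetical stand-in for seeded fault decisions: draws four rolls and
// renders them as disk.fault trace lines into `out`.
fn runRolls(seed: u64, out: []u8) ![]const u8 {
    var prng = std.Random.DefaultPrng.init(seed);
    const random = prng.random();
    var n: usize = 0;
    var op: u64 = 0;
    while (op < 4) : (op += 1) {
        // Roll in [0, 100); the fault fires when the roll is below 25.
        const roll = random.uintLessThan(u64, 100);
        const fired = roll < 25;
        const line = try std.fmt.bufPrint(
            out[n..],
            "disk.fault op={d} path=wal.log kind=read_error rate=25/100 roll={d} fired={}\n",
            .{ op, roll, fired },
        );
        n += line.len;
    }
    return out[0..n];
}

test "same seed produces byte-identical traces" {
    var buf_a: [1024]u8 = undefined;
    var buf_b: [1024]u8 = undefined;
    const a = try runRolls(0x5eed, &buf_a);
    const b = try runRolls(0x5eed, &buf_b);
    try std.testing.expect(std.mem.eql(u8, a, b));
}
```

Recording the roll alongside the fired flag means a trace diff shows exactly which decision diverged, not just that an outcome changed.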
Phase 1 Decisions
- Low-level disk type: Disk.
- Simulator implementation type: SimDisk.
- Harness-control type: DiskControl.
- Low-level access: env.disk; preferred storage app access is env.io().
- File identity: logical path-like []const u8, escaped in traces and never resolved against the host filesystem by the simulator.
- Default sector size: 4096 bytes.
- Default minimum latency: omitted means "match the world's tick duration"; explicit values are preserved and validated against the tick size.
- Initial app operations: sector-oriented read, write, sync, and syncDir are implemented. Path-level stat, EOF-aware readSome, setLength, delete, and rename are also implemented as the first real-storage compatibility slice.
- Initial harness-control operations: setFaults, corruptSector, crash, and restart are implemented. Read/write IO errors, corrupt reads, scripted sector corruption, lost pending writes, torn pending writes, reordered pending writes, and lost unsynced directory metadata are implemented.
- Initial example: append-only WAL recovery.
- User data: store bytes in memory, trace lengths and outcomes by default.
- Checksums: user code owns them.
- Recoverability budgets: start with a conservative single-node default and explicit destructive mode; defer strong multi-replica budgets.
- Misdirected writes: document as a future named fault, not Phase 1 default.
- File lifecycle operations are intentionally narrow and operation-shaped, not a full std.fs.File clone. setLength, delete, and rename reject while crashed and commit pending writes for the affected path before mutating metadata. syncDir is the explicit durability boundary for creates, deletes, and renames; cross-directory renames require syncing both parent directories.
Open Questions
- How closely should the production Disk adapter align with future std.Io file APIs?
- Should sync be per-file only, or should there also be a whole-disk sync?
- Should syncDir stay on Disk, or should Marionette eventually expose a standard std.Io.Dir bridge for directory fsync when Zig's API supports it?
- What is the smallest explicit API for declaring recovery windows?
- What is the first reusable shape for a VOPR-style fault atlas once Marionette has more than one storage example?