How OMGDB made single inserts 63× faster

The engineering story of how OMGDB stopped rewriting its derived caches on every write and made single-document inserts 63 times faster.

OMGDB stores everything in one append-only, human-readable NDJSON operation log. That is the whole design: oplog.ndjson is the single source of truth, you can cat it, and everything else on disk — five binary cache sidecars covering the primary offset index, live documents, secondary index buckets, collection metadata, and the open checkpoint — is derived, rebuildable, and deletable at any time. Delete every cache, reopen the store, and you get identical query behavior. omgdb verify proves it at runtime.

Until this month, that honesty had a tax we should never have been paying. This is the story of finding it, and of the three smaller taxes hiding behind it.

The villain: ~26,000× write amplification

Every mutation used to rewrite every derived sidecar, in full, before the command returned.

The reasoning was defensible in the way most performance bugs are. Correctness came first, and “the sidecars on disk always exactly match the log” is a very easy invariant to reason about: nothing can ever be stale, so nothing ever needs to be checked. Each of the five caches was written completely, fsynced, and only then did the mutation report success.

Measured at 50,000 documents, the tax was not defensible. A single CLI insert took 4,689 ms — four and a half seconds to durably append one line of JSON. Twenty small inserts churned about 126 MB of sidecar writes. Divide it out and each few-hundred-byte log line dragged roughly 6 MB of cache rewrites behind it: on the order of 26,000× write amplification.
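
The division above can be checked in a few lines. This is only a restatement of the quoted figures, assuming decimal megabytes; the log-line size is the value implied by the quoted amplification, not a separate measurement.

```python
# Figures quoted in the paragraph above (decimal MB assumed).
sidecar_bytes_total = 126e6   # about 126 MB of sidecar writes
inserts = 20                  # spread over twenty small inserts
amplification = 26_000        # quoted order-of-magnitude amplification

sidecar_bytes_per_insert = sidecar_bytes_total / inserts

# Log-line size implied by the quoted ratio (derived, not measured).
implied_log_line_bytes = sidecar_bytes_per_insert / amplification

print(f"{sidecar_bytes_per_insert / 1e6:.1f} MB of cache rewrites per insert")
print(f"about {implied_log_line_bytes:.0f} bytes per log line implied")
```

The implied line size lands in the "few hundred bytes" range the text describes, so the three figures are consistent with one another.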

The bitter part was that the log itself was behaving beautifully the entire time. One canonical JSON line, one CRC, one fsync — an O(bytes appended) operation. The derived layer, which by design holds zero authoritative bits, was doing 99.99% of the physical writes.

The fix: checkpoints are allowed to lag

The observation that unlocked everything: a sidecar was never really “a mirror of the state.” It is a checkpoint of the log at some boundary. If a reader can replay the log tail past that boundary, a lagging checkpoint is exactly as correct as a fresh one — it just means a slightly longer tail.

So we made lag official policy:

A checkpoint may lag the durable log by up to 256 KiB (CACHE_REFRESH_TAIL_BYTES in the store) before any code path persists a refresh.
Mutations never rewrite sidecars per operation. An insert appends its record, fsyncs the log, and returns.
Readers replay the canonical tail from the checkpoint boundary and prove LSN continuity while doing it — every record’s log sequence number must equal its position, so a stale, torn, or mismatched sidecar is detected and rebuilt, never trusted.
Reads heal lazily. A read that needs a sidecar refreshes it only once its lag exceeds the threshold; a clean close persists the final primary snapshot; verify, compaction, and index/schema DDL persist eagerly.
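
The policy above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not OMGDB's code: the record shape (`lsn`, `op`, `collection`, `doc`), the in-memory checkpoint dict, 0-based LSNs, and the names `replay`, `open_state`, and `StaleCheckpoint` are all hypothetical. Only `CACHE_REFRESH_TAIL_BYTES` and its 256 KiB value come from the text. The sketch catches only mismatches that show up in the tail or in the file size.

```python
import json
import os
import tempfile

CACHE_REFRESH_TAIL_BYTES = 256 * 1024  # lag threshold named in the text


class StaleCheckpoint(Exception):
    """The checkpoint does not line up with the log and must be rebuilt."""


def apply(docs, record):
    # Hypothetical op set for the sketch: insert and delete only.
    key = (record["collection"], record["doc"]["_id"])
    if record["op"] == "insert":
        docs[key] = record["doc"]
    elif record["op"] == "delete":
        docs.pop(key, None)


def replay(log_path, docs, offset, next_lsn):
    """Apply records from byte `offset`, proving LSN continuity on the way."""
    if offset > os.path.getsize(log_path):
        raise StaleCheckpoint("checkpoint boundary is past the end of the log")
    with open(log_path, "rb") as log:
        log.seek(offset)
        for line in log:
            record = json.loads(line)
            if record["lsn"] != next_lsn:
                raise StaleCheckpoint(f"expected LSN {next_lsn}, got {record['lsn']}")
            apply(docs, record)
            offset += len(line)
            next_lsn += 1
    return docs, offset, next_lsn


def open_state(log_path, checkpoint):
    """Checkpoint + tail replay; fall back to full replay if the proof fails."""
    try:
        docs, offset, next_lsn = replay(
            log_path, dict(checkpoint["docs"]), checkpoint["offset"], checkpoint["next_lsn"]
        )
        # Persisting a refresh is only worthwhile once the tail is long enough.
        needs_refresh = offset - checkpoint["offset"] > CACHE_REFRESH_TAIL_BYTES
    except (StaleCheckpoint, ValueError):  # ValueError covers undecodable lines
        docs, offset, next_lsn = replay(log_path, {}, 0, 0)
        needs_refresh = True
    return {"docs": docs, "offset": offset, "next_lsn": next_lsn}, needs_refresh


def append(log_path, records):
    with open(log_path, "ab") as log:
        for record in records:
            log.write(json.dumps(record, sort_keys=True).encode("utf-8") + b"\n")


def demo():
    records = [
        {"lsn": i, "op": "insert", "collection": "users", "doc": {"_id": f"u{i}", "n": i}}
        for i in range(5)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "oplog.ndjson")
        append(log_path, records[:3])
        docs, offset, next_lsn = replay(log_path, {}, 0, 0)
        checkpoint = {"docs": docs, "offset": offset, "next_lsn": next_lsn}
        append(log_path, records[3:])  # later mutations never touch the checkpoint
        full, _, _ = replay(log_path, {}, 0, 0)
        lagging, _ = open_state(log_path, checkpoint)
        stale = dict(checkpoint, next_lsn=99)  # simulate a mismatched sidecar
        rebuilt, refreshed = open_state(log_path, stale)
        return full, lagging["docs"], rebuilt["docs"], refreshed
```

In `demo`, the lagging checkpoint and the deliberately mismatched one both end up with the same documents as a full replay; the mismatched one gets there by being discarded and rebuilt.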

The core invariant did not move an inch: checkpoint + tail replay must equal full replay, and verify still re-proves that equivalence against the real files. What changed is a mental reclassification — persistence is an optimization, never a correctness event. Once persistence stopped being load-bearing, it stopped belonging on the mutation hot path.

Three more O(N) squatters

With the amplification gone, the profile showed three more places where a cheap operation was quietly paying a dataset-sized cost:

Scan admission was O(N). Direct scans ran a pre-verification pass over the data before yielding their first document. Admission is now O(1): the continuity proof runs over the bounded tail, not the dataset, and scans yield lazily from the decoded checkpoint with a single log handle.
Point reads materialized state they never used. The primary cache maps (collection, _id) to a byte offset in the log, so get is now one seek and one line decode — the log is read directly at the answer.
Unique checks scanned candidates. A unique constraint is a question the index can answer about itself, so unique checks now read the index’s own bucket instead of walking documents.

Each fix follows the same rule as the big one: never do O(dataset) work to answer an O(1) question.
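
A minimal sketch of the last two fixes, assuming an insert-only store with an in-memory buffer standing in for oplog.ndjson. The class `TinyStore`, the `DuplicateKey` exception, the record shape, and the flat `(collection, field, value)` bucket key are hypothetical names chosen for illustration, not OMGDB's actual structures.

```python
import io
import json


class DuplicateKey(Exception):
    """A unique index already holds this value."""


class TinyStore:
    def __init__(self, unique_fields):
        self.log = io.BytesIO()             # stands in for oplog.ndjson
        self.offsets = {}                   # (collection, _id) -> byte offset in the log
        self.unique_fields = unique_fields  # {collection: [field, ...]}
        self.buckets = {}                   # (collection, field, value) -> _id

    def insert(self, collection, doc):
        fields = self.unique_fields.get(collection, [])
        # Unique check: one lookup per unique field in the index itself,
        # with no walk over existing documents.
        for field in fields:
            if (collection, field, doc[field]) in self.buckets:
                raise DuplicateKey(f"{collection}.{field}={doc[field]!r}")
        offset = self.log.seek(0, io.SEEK_END)
        line = json.dumps({"collection": collection, "doc": doc}, sort_keys=True)
        self.log.write(line.encode("utf-8") + b"\n")
        self.offsets[(collection, doc["_id"])] = offset
        for field in fields:
            self.buckets[(collection, field, doc[field])] = doc["_id"]

    def get(self, collection, _id):
        # Point read: one index lookup, one seek, one line decode.
        offset = self.offsets.get((collection, _id))
        if offset is None:
            return None
        self.log.seek(offset)
        return json.loads(self.log.readline())["doc"]


store = TinyStore(unique_fields={"users": ["email"]})
store.insert("users", {"_id": "u1", "email": "a@example.com"})
store.insert("users", {"_id": "u2", "email": "b@example.com"})
print(store.get("users", "u2"))
```

Neither `get` nor the unique check touches any document other than the one being asked about, which is the whole point of the rule above. Updates and deletes would also need to maintain the buckets; the sketch leaves them out.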

What the numbers look like now

The insert that used to take four and a half seconds is 63× faster (hence the title), which works out to roughly 74 ms. More interesting than the ratio is where the remaining time goes:

Durable single-document writes are now fsync-bound, the same physical wall every embedded database hits. In-process, that sustains more than 2,500 durable inserts per second, flat across store sizes, and batches that share an fsync exceed 100,000 documents per second.
Bulk loading is fast in absolute terms: import-jsonl loads 50,000 documents in about 2 seconds.
Indexed lookups answer in well under a millisecond in-process (0.22 ms measured).
verify re-proves the entire database — every record, every derived cache — in about a second at 50,000 documents.
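
The gap between single inserts and batches in the first figure comes down to how many records each fsync pays for. Here is a hedged sketch of that idea. The line layout (a `crc` field wrapping a canonical `rec` object) and the names `encode_record`, `decode_record`, and `Log` are assumptions made for illustration; the text only says each record is one canonical JSON line with one CRC.

```python
import json
import os
import tempfile
import zlib


def encode_record(lsn, doc):
    # Canonical form: sorted keys, no extra whitespace, CRC over those bytes.
    rec = json.dumps({"lsn": lsn, "doc": doc}, sort_keys=True, separators=(",", ":"))
    crc = zlib.crc32(rec.encode("utf-8"))
    return ('{"crc":%d,"rec":%s}\n' % (crc, rec)).encode("utf-8")


def decode_record(line):
    outer = json.loads(line)
    rec = json.dumps(outer["rec"], sort_keys=True, separators=(",", ":"))
    if zlib.crc32(rec.encode("utf-8")) != outer["crc"]:
        raise ValueError("CRC mismatch")
    return outer["rec"]


class Log:
    def __init__(self, path):
        self.path = path
        self.next_lsn = 0
        self.fsyncs = 0  # counted only to make the cost visible

    def append(self, docs):
        """Append one record per doc, then fsync once for the whole call."""
        with open(self.path, "ab") as f:
            for doc in docs:
                f.write(encode_record(self.next_lsn, doc))
                self.next_lsn += 1
            f.flush()
            os.fsync(f.fileno())
        self.fsyncs += 1


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = Log(os.path.join(tmp, "oplog.ndjson"))
        for i in range(3):
            log.append([{"_id": i}])  # three single inserts: three fsyncs
        log.append([{"_id": i} for i in range(3, 1003)])  # 1,000 docs: one fsync
        with open(log.path, "rb") as f:
            records = [decode_record(line) for line in f]
        return log.fsyncs, records
```

In `demo`, 1,003 records reach disk behind four fsyncs: three for the single inserts and one for the batch. The sketch does not fsync the containing directory and makes no throughput claim of its own.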

When your write path’s dominant cost is the fsync itself, the derived layer has stopped being the story. That was the goal.

Where the frontier is

We are not declaring performance solved. Every fresh CLI process still decodes monolithic sidecar caches before its first direct read, and that decode is the current latency floor for one-shot commands. Warm checkpointed open is not yet measurably faster than a full replay, which is a polite way of saying the checkpoint format needs to earn its keep on open, not just on mutation. There is no paged or memory-mapped store yet, so the write-capable engine still replays the log into RAM. And at raw scan scale, mature engines are still faster today.

Those are the next items on the roadmap. The part we like most about this round is what it did not cost: the log is still the database, the caches are still deletable on a whim, and verify still proves the whole thing from plain text you can read with cat.