Architecture — OMGDB Docs

How OMGDB is organized — a Cargo workspace of eight focused crates layered over a text-canonical storage spine.

OMGDB is a single Cargo workspace (resolver = "2", members = ["crates/*"]) split into eight focused crates. The shape of the codebase mirrors the central design idea: there is one storage spine that owns the durable, human-readable source of truth, and every other capability — querying, aggregation, introspection, vector search, agent-safe mutations — is a layer built on top of it that holds no authoritative state of its own.

This page describes the crate layout, how the crates depend on one another bottom-to-top, the “text is canonical, binary is a rebuildable cache” spine that the whole design rests on, and the engineering practices that keep the workspace honest. For the on-disk details see storage; for the transaction model see transactions.

The eight crates

The workspace is dual-licensed MIT OR Apache-2.0. Each crate has a single, narrow responsibility.

Crate	Responsibility
`omgdb-core`	The storage spine: value model + canonical (bit-exact, deterministic) codec, the op-log (framed NDJSON + per-record CRC32), replay/fold, the in-memory `Store`, ACID transactions, secondary equality/multikey/range indexes, validation, compaction, the integrity check (`verify`), and the single-process advisory store lock.
`omgdb-query`	Filter compilation and matching (MongoDB-style filters, dotted paths, array-contains semantics), the single total-order `compare`/value ordering, projection, explain, diagnose (why-not selectivity), and did-you-mean operator suggestions.
`omgdb-agg`	The aggregation pipeline over `Document` values: stage dispatch plus a recursive expression engine, including `$lookup`, `$facet`, and `$replaceRoot`/`$replaceWith`.
`omgdb-change`	Agent-safe mutations: dry-run `plan_update` (writes only a legible pending plan, no data mutation), `apply_change` (executes a plan in one transaction by token), and `rollback` (restores recorded before-documents).
`omgdb-introspect`	Self-describing introspection: per-field schema inference (types + presence), Markdown `describe` (a live manual), JSON describe, and deterministic canonical `dump`.
`omgdb-md`	Native Markdown support: parse frontmatter into typed fields and headings into a section tree, producing a queryable document.
`omgdb-vector`	Local vector search: a pluggable `Embedder` trait + a deterministic offline `HashingEmbedder`, flat exact cosine kNN, hybrid search (structured pre-filter + semantic ranking), persisted/synced vectors with provenance + staleness, and token-budgeted cited context packs.
`omgdb-cli`	The `omgdb` binary (26 subcommands) plus the stdio JSON-RPC MCP server (`src/mcp.rs`) with capability-scope enforcement.

Note: There is no omgdb-index, omgdb-api, or omgdb-mcp crate. Indexes live inside omgdb-core, and the MCP server lives inside omgdb-cli (src/mcp.rs). Older design sketches that list those crates as separate members are stale.

Layering, bottom to top

The crates form a clean acyclic dependency graph. omgdb-core sits at the bottom and depends on no other workspace crate — it is the storage spine and defines the Value/Document model, the codec, and the Store. Everything else builds upward from there:

                         omgdb-cli
                    (binary + MCP server)
                            │  depends on all seven below
   ┌──────────┬──────────┬──┴────────┬──────────┬──────────┐
omgdb-agg  omgdb-change  omgdb-     omgdb-     omgdb-md   omgdb-query
                         introspect vector        │           │
   │           │            │          │          │           │
   └───────────┴─── omgdb-query ───────┘          │           │
                            │                      │           │
                            └──── omgdb-core ──────┴───────────┘

The actual [dependencies] confirm the layers:

omgdb-core — no internal dependencies. It is the foundation.
omgdb-query and omgdb-md — depend on omgdb-core only.
omgdb-agg, omgdb-change, omgdb-introspect, and omgdb-vector — each depend on omgdb-core + omgdb-query (they all need to read and match documents through the same total-order semantics).
omgdb-cli — depends on all seven library crates and ties them together into the omgdb binary, and additionally hosts the MCP server.

Because querying, aggregation, change planning, introspection, Markdown parsing, and vector search all sit above core and never reach around it, they cannot smuggle authoritative state outside the op-log. That structural constraint is what makes the storage invariants enforceable rather than aspirational.
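The structural constraint above can be illustrated with a minimal sketch. It is not OMGDB's code: the `spine` and `query_layer` modules, the `Store` fields, and `find_containing` are hypothetical stand-ins for a core crate and an upper-layer crate, collapsed into one file so the example has no dependencies.

```rust
mod spine {
    /// Stand-in for omgdb-core: the only owner of the log and of the state folded from it.
    pub struct Store {
        log: Vec<String>,  // private: callers cannot edit history
        docs: Vec<String>, // private: state changes only through `append`
    }

    impl Store {
        pub fn new() -> Self {
            Store { log: Vec::new(), docs: Vec::new() }
        }

        /// The single write path: record the operation, then update in-memory state.
        pub fn append(&mut self, doc: &str) {
            self.log.push(format!("insert {doc}"));
            self.docs.push(doc.to_string());
        }

        pub fn docs(&self) -> &[String] {
            &self.docs
        }

        pub fn log_len(&self) -> usize {
            self.log.len()
        }
    }
}

mod query_layer {
    use super::spine::Store;

    /// Stand-in for an upper crate: borrows the store read-only and keeps nothing.
    pub fn find_containing<'a>(store: &'a Store, needle: &str) -> Vec<&'a str> {
        store
            .docs()
            .iter()
            .map(String::as_str)
            .filter(|doc| doc.contains(needle))
            .collect()
    }
}

fn main() {
    let mut store = spine::Store::new();
    store.append("alpha beta");
    store.append("gamma");
    let hits = query_layer::find_containing(&store, "alpha");
    assert_eq!(hits, vec!["alpha beta"]);
    assert_eq!(store.log_len(), 2);
    println!("{hits:?}");
}
```

Because the upper layer only ever sees `&Store`, the compiler rejects any attempt to mutate state from it; the real workspace gets the same effect from crate boundaries and private fields.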

The text-canonical spine

The defining idea of OMGDB is that the on-disk source of truth is an append-only, human/LLM-readable NDJSON operation log — <store>/oplog.ndjson, with a per-record CRC32 — that you can cat. Every index, vector record, and cache is a derived, rebuildable artifact that holds zero authoritative bits. This legibility is what lets the database explain (describe), verify (verify), and repair (repair) itself.
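A minimal sketch of "the log is the truth, everything else is derived" might look as follows. The `Op` variants, the single `city` field, and the function names are assumptions made for illustration; the real log carries more operation kinds and full documents.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy log operations (hypothetical shapes, a subset of the real op set).
#[derive(Clone, Debug)]
enum Op {
    Insert { id: u64, city: String },
    Replace { id: u64, city: String },
    Delete { id: u64 },
}

/// Logical state: id -> document (here a single field).
type State = BTreeMap<u64, String>;
/// Derived equality index: field value -> ids holding that value.
type Index = BTreeMap<String, BTreeSet<u64>>;

/// Fold the log into state. Nothing outside the log is consulted.
fn replay(log: &[Op]) -> State {
    let mut state = State::new();
    for op in log {
        match op {
            Op::Insert { id, city } | Op::Replace { id, city } => {
                state.insert(*id, city.clone());
            }
            Op::Delete { id } => {
                state.remove(id);
            }
        }
    }
    state
}

/// The index is a pure function of the state, so it can be dropped and rebuilt.
fn build_index(state: &State) -> Index {
    let mut index = Index::new();
    for (id, city) in state {
        index.entry(city.clone()).or_default().insert(*id);
    }
    index
}

/// Answer an equality query through the derived index.
fn find_by_city(index: &Index, city: &str) -> Vec<u64> {
    index
        .get(city)
        .map(|ids| ids.iter().copied().collect())
        .unwrap_or_default()
}

fn main() {
    let log = vec![
        Op::Insert { id: 1, city: "Oslo".into() },
        Op::Insert { id: 2, city: "Lima".into() },
        Op::Replace { id: 2, city: "Oslo".into() },
        Op::Insert { id: 3, city: "Lima".into() },
        Op::Delete { id: 3 },
    ];
    let index = build_index(&replay(&log));
    let before = find_by_city(&index, "Oslo");
    drop(index); // throw the cache away
    let rebuilt = build_index(&replay(&log));
    assert_eq!(before, find_by_city(&rebuilt, "Oslo"));
    println!("{before:?}");
}
```

Dropping the index loses nothing, because replaying the same log reproduces the same state and therefore the same query answers.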

A live store is a directory bundle (<store>/oplog.ndjson). The single-file .omgdb form is produced by pack/unpack as a transport/archive format — it is not the live storage engine. The physical log line format is:

<canonical-json>\t<crc32-hex>\n

with supported log ops insert, replace, delete, begin, commit, abort, define, and create_index.
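As a rough illustration of the line format, the sketch below frames a record and verifies it again. Several details are assumptions rather than documented facts: the page only says "CRC32", so the common IEEE variant (reflected polynomial 0xEDB88320) is used; the checksum is rendered as eight lowercase hex digits; and the JSON payload shape is invented for the example.

```rust
/// CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320), computed bit by bit.
/// Assumption: the docs do not name the CRC32 variant; this is the most common one.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= u32::from(b);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Frame one record as `<canonical-json>\t<crc32-hex>\n`.
/// Assumption: eight lowercase hex digits.
fn frame_record(canonical_json: &str) -> String {
    format!("{canonical_json}\t{:08x}\n", crc32(canonical_json.as_bytes()))
}

/// Split a framed line and check its checksum; returns the payload if it verifies.
fn verify_record(line: &str) -> Option<&str> {
    let body = line.strip_suffix('\n')?;
    let (json, hex) = body.rsplit_once('\t')?;
    let expected = u32::from_str_radix(hex, 16).ok()?;
    (crc32(json.as_bytes()) == expected).then_some(json)
}

fn main() {
    // Hypothetical payload; the real record schema is not shown on this page.
    let payload = r#"{"op":"insert","id":1}"#;
    let line = frame_record(payload);
    print!("{line}");
    assert_eq!(verify_record(&line), Some(payload));

    // Corrupt one byte of the payload: the checksum no longer matches.
    let corrupted = line.replacen("insert", "insurt", 1);
    assert_eq!(verify_record(&corrupted), None);
}
```

Splitting on the last tab keeps the parser simple: canonical JSON escapes tab characters inside strings, so the final raw tab on a line is always the separator.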

The three invariants

The whole architecture is held to three invariants. They are stated briefly here; see storage and transactions for the full treatment.

I1 — Text completeness. oplog.ndjson fully determines the logical state; caches and indexes hold zero authoritative bits.
I2 — Rebuild equivalence. Deleting and rebuilding derived (binary cache) state from the log yields identical query-visible behavior after rebuild — same results, same order under the engine’s defined total order. Note that byte-identical binary rebuild is an explicit non-goal; I2 is behavioral equivalence only.
I3 — Export stability. Canonical export is deterministic: dump -> load -> dump is byte-identical for the canonical representation. This is the headline, property-tested claim.
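The shape of the I3 round-trip check can be sketched with a toy codec. This is not OMGDB's canonical codec: the `Doc` type, the `key=value` text format, and the `dump`/`load` functions are hypothetical, and the toy format assumes keys contain no `=` or newline characters.

```rust
use std::collections::BTreeMap;

/// Toy document: field name -> integer (hypothetical stand-in for the real value model).
type Doc = BTreeMap<String, i64>;

/// Canonical dump: fields in sorted key order, one `key=value` per line.
/// Sorted iteration is what makes the output independent of insertion order.
fn dump(doc: &Doc) -> String {
    doc.iter().map(|(k, v)| format!("{k}={v}\n")).collect()
}

/// Load the toy format back; returns None on malformed input.
fn load(text: &str) -> Option<Doc> {
    text.lines()
        .map(|line| {
            let (k, v) = line.split_once('=')?;
            Some((k.to_string(), v.parse::<i64>().ok()?))
        })
        .collect()
}

fn main() {
    let mut doc = Doc::new();
    doc.insert("zeta".into(), 3);
    doc.insert("alpha".into(), -1);

    let first = dump(&doc);
    match load(&first) {
        Some(reloaded) => {
            // dump -> load -> dump must reproduce the same bytes.
            assert_eq!(first, dump(&reloaded));
            print!("{first}");
        }
        None => eprintln!("toy dump failed to parse"),
    }
}
```

A property test generalizes the single case in `main` by generating many random documents and asserting the same equality for each.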

Engineering practices

The workspace enforces a consistent set of practices across all crates.

Error handling

Library crates that define their own error types use thiserror (omgdb-core, -query, -agg, -change); omgdb-introspect and omgdb-vector reuse the lower-layer error types and pull in no extra error crate. Only the omgdb-cli binary pulls in anyhow. Library code paths avoid unwrap/expect — fallible operations return typed errors rather than panicking. This keeps the libraries embeddable and lets the binary be the single place that flattens errors for human-facing reporting.

Layer	Error crate	Panic and error policy
Library crates that define error types (`omgdb-core`, `-query`, `-agg`, `-change`)	`thiserror = "2"`	no `unwrap`/`expect` in library code paths
Binary (`omgdb-cli`)	`anyhow = "1"`	top-level error flattening only
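The split between typed library errors and a flattening binary can be sketched without external crates. Note the substitutions: the `Display` and `Error` impls are written by hand where the real crates would use a `thiserror` derive, and `Box<dyn Error>` plays the role that `anyhow` plays in the binary. `ParseError` and `split_record` are hypothetical names.

```rust
use std::error::Error;
use std::fmt;

/// Library-side typed error (hand-written stand-in for a `thiserror` derive).
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingTab,
    BadChecksum(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::MissingTab => write!(f, "record has no tab separator"),
            ParseError::BadChecksum(hex) => write!(f, "checksum {hex:?} is not valid hex"),
        }
    }
}

impl Error for ParseError {}

/// Library code path: no unwrap/expect, every failure is a typed error.
fn split_record(line: &str) -> Result<(&str, u32), ParseError> {
    let (json, hex) = line
        .trim_end_matches('\n')
        .rsplit_once('\t')
        .ok_or(ParseError::MissingTab)?;
    let crc = u32::from_str_radix(hex, 16)
        .map_err(|_| ParseError::BadChecksum(hex.to_string()))?;
    Ok((json, crc))
}

/// Binary side: any error type is flattened into one human-facing report.
fn main() -> Result<(), Box<dyn Error>> {
    let (json, crc) = split_record("{}\t0000002a\n")?;
    println!("payload={json} crc={crc}");

    if let Err(e) = split_record("no tab here") {
        eprintln!("error: {e}");
    }
    Ok(())
}
```

Callers that embed the library can match on `ParseError` variants, while the binary only needs the `Display` text.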

Tests and quality gates

The three storage invariants (I1/I2/I3) are backed by property tests rather than examples alone. The full set of quality gates runs on every commit; CI runs the same commands in this order:

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all
cargo doc --no-deps --all-features
cargo test --doc --all

Shared lints are applied to every crate via the workspace (unsafe_code, missing_docs, and rust_2018_idioms set to warn; clippy all = warn). The release profile uses lto = "thin" and codegen-units = 1.

Toolchain and CI

The toolchain is Rust stable, edition 2021, with an MSRV (rust-version) of 1.89 — required for File::try_lock, the advisory store lock that enforces single-process access. CI runs the latest stable toolchain on Linux, Windows, and macOS, plus a dedicated MSRV check.

Limitation: OMGDB is an early project (v0.0). The durability and codec layers are hardened (crash recovery, the repair tool, a bit-exact canonical codec proven by a property test, a crash-truncation matrix, and cross-platform + MSRV CI), but the database is not yet production-ready at scale. The entire dataset is held in RAM and rebuilt by replaying the whole log on open, so a store cannot grow beyond available memory; there is no paged/mmap binary store yet. A store is also single-process: it takes an exclusive advisory lock on open, and there is no multi-reader/multi-writer concurrency model.

Where each capability lives

If you are looking for the code (or docs) behind a feature, the crate boundary is the map:

You want…	Crate	Docs
The value model and on-disk format	`omgdb-core`	data model, storage
ACID semantics and the single-writer borrow	`omgdb-core`	transactions
Secondary indexes (equality/range/multikey)	`omgdb-core`	indexes
Filter operators and matching	`omgdb-query`	query operators
Pipelines, stages, and expressions	`omgdb-agg`	aggregation
`plan-update` / `apply` / `rollback`	`omgdb-change`	agent mutations
`describe`, schema inference, `dump`	`omgdb-introspect`	introspection
Markdown import	`omgdb-md`	markdown
Vector search and context packs	`omgdb-vector`	vector search, context packs
The `omgdb` binary and the MCP server	`omgdb-cli`	cli, mcp