Data Model

The OMGDB Value/document model, its ten variants, ObjectId generation, the total order used for sorting and ranges, and the deterministic canonical JSON codec.

Every value OMGDB stores is a Value — a tagged union of the ten types the database understands. Documents are ordered key/value maps of these values, and they round-trip through a deterministic canonical JSON codec: common types are plain JSON, while a few types that JSON cannot represent natively use MongoDB-style $-tagged extended forms.

The codec is the foundation of OMGDB’s durability story. The append-only NDJSON op-log is the canonical source of truth, and the canonical writer is a byte-identical fixpoint: dump → load → dump always produces the same bytes (invariant I3). Everything else — indexes, caches, projections — is a derived, rebuildable artifact. This page documents the value model, how each type is encoded, how identifiers are generated, the single total order behind sorting and range queries, and the determinism guarantees that make I3 hold.

The Value type

A Value is one of exactly ten variants. The string variant is named Str in the source (it holds a UTF-8 String); the others map one-to-one to the conceptual types below.

Variant	Holds	Canonical JSON form	`type_name()`
`Null`	—	`null`	`null`
`Bool(bool)`	a boolean	`true` / `false`	`bool`
`I64(i64)`	64-bit signed integer	bare integer, e.g. `3`	`long`
`F64(f64)`	64-bit IEEE-754 float	decimal with `.` or exponent, e.g. `2.0`	`double`
`Str(String)`	UTF-8 string	JSON string with escaping	`string`
`Bytes(Vec<u8>)`	opaque binary	`{"$binary":"<base64>"}`	`binData`
`Array(Vec<Value>)`	ordered list	`[...]`	`array`
`Object(Document)`	embedded document	`{...}`, field order preserved	`object`
`ObjectId(ObjectId)`	12-byte identifier	`{"$oid":"<24 hex>"}`	`objectId`
`DateTime(i64)`	ms since Unix epoch	`{"$date":<i64 ms>}`	`date`

A Document is an ordered set of fields (Vec<(String, Value)> under the hood). Field order is significant and preserved on round-trip, exactly as in MongoDB.
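The table above can be pictured as a Rust enum. The following is a minimal sketch and not the actual OMGDB source: the variant names and `type_name()` strings come from the table, while the `fields` member name and the newtype layout of `ObjectId` are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the 12-byte identifier type.
#[derive(Debug, Clone, PartialEq)]
pub struct ObjectId(pub [u8; 12]);

/// Ordered fields; insertion order is preserved (the member name is assumed).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Document {
    pub fields: Vec<(String, Value)>,
}

/// The ten variants listed in the table.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    I64(i64),
    F64(f64),
    Str(String),
    Bytes(Vec<u8>),
    Array(Vec<Value>),
    Object(Document),
    ObjectId(ObjectId),
    DateTime(i64), // milliseconds since the Unix epoch
}

impl Value {
    /// Type names as given in the last column of the table.
    pub fn type_name(&self) -> &'static str {
        match self {
            Value::Null => "null",
            Value::Bool(_) => "bool",
            Value::I64(_) => "long",
            Value::F64(_) => "double",
            Value::Str(_) => "string",
            Value::Bytes(_) => "binData",
            Value::Array(_) => "array",
            Value::Object(_) => "object",
            Value::ObjectId(_) => "objectId",
            Value::DateTime(_) => "date",
        }
    }
}
```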

Canonical forms by type

null
true
3
2.0
"héllo \"q\"\n\tend"
{"$binary":"AAEC/w=="}
[1,2,{"k":true}]
{"b":1,"a":2,"z":3}
{"$oid":"0123456789abcdef01234567"}
{"$date":1718000000000}

A few details worth calling out:

Bytes use standard base64 (the `+` / `/` alphabet with `=` padding, not the URL-safe variant) wrapped in a $binary envelope.
DateTime stores only milliseconds since the Unix epoch as a raw i64 — a bare integer payload, not a string. There is no separate sub-millisecond or timezone representation.
Object field order is the insertion order. The document {"b":1,"a":2,"z":3} stays in that order after a round-trip; it is not sorted.
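To make the two envelopes concrete, here is a small sketch that produces the $binary and $date forms shown above. The function names are hypothetical, and the base64 encoder is hand-written only to keep the example dependency-free; the real codec presumably uses a library.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard, padded base64 (illustrative encoder, not OMGDB's).
fn base64_standard(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // Missing input bytes become '=' padding.
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Hypothetical helper: the $binary envelope for a Bytes value.
fn binary_to_canonical(bytes: &[u8]) -> String {
    format!("{{\"$binary\":\"{}\"}}", base64_standard(bytes))
}

/// Hypothetical helper: the $date envelope with a bare integer payload.
fn datetime_to_canonical(ms: i64) -> String {
    format!("{{\"$date\":{}}}", ms)
}
```

With these definitions, the bytes `[0, 1, 2, 255]` encode to the $binary example listed earlier.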

Note: _id is documented in storage. Any Value can be an _id, which is one reason the integer/identity guarantees below matter.

The integer / double distinction

OMGDB never conflates integers and floats. An I64 always serializes as a bare integer (2), and an F64 always carries a . or exponent (2.0), so the type survives a reparse:

2
2.0

The first reparses to I64(2), the second to F64(2.0). There is no implicit widening or narrowing in the codec.
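One way to picture the rule is a classifier that looks only at the number token's spelling. This is a simplified sketch: `Number`, `parse_number`, and `write_number` are hypothetical names, the function is not a full JSON number validator, and Rust's `{:?}` float formatting stands in for the real writer.

```rust
#[derive(Debug, PartialEq)]
enum Number {
    I64(i64),
    F64(f64),
}

/// A token with '.', 'e', or 'E' is a double; anything else must be an i64.
fn parse_number(token: &str) -> Option<Number> {
    if token.contains(|c: char| matches!(c, '.' | 'e' | 'E')) {
        token.parse::<f64>().ok().map(Number::F64)
    } else {
        token.parse::<i64>().ok().map(Number::I64)
    }
}

/// Integers print bare; floats always carry a '.' or an exponent.
fn write_number(n: &Number) -> String {
    match n {
        Number::I64(i) => i.to_string(),
        // `{:?}` prints 2.0 as "2.0", so the type survives a reparse.
        Number::F64(f) => format!("{:?}", f),
    }
}
```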

Floats: shortest round-tripping decimals

Finite floats are emitted with ryu, which produces the shortest decimal string that round-trips bit-for-bit, paired with serde_json’s float_roundtrip parsing. The result is bit-exact and uses an exponent for extreme magnitudes:

1e21
1.7976931348623157e308
1e-7

Some golden forms that lock the representation: 1e16 stays 1e16, 100.0 stays 100.0, and 9007199254740993.0 (2^53 + 1) canonicalizes to 9007199254740992.0 because that integer is not exactly representable as an f64 — the codec records the value the float actually holds, not the literal you typed.
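The round-trip property itself is easy to check. The sketch below uses the standard library's `{:?}` formatting, which also emits a shortest round-tripping decimal, as a stand-in for ryu; it does not claim the two produce identical text in every case. The helper names are hypothetical.

```rust
/// True if printing and reparsing reproduces the exact same bit pattern.
fn round_trips(x: f64) -> bool {
    let text = format!("{:?}", x);
    text.parse::<f64>()
        .map(|y| y.to_bits() == x.to_bits())
        .unwrap_or(false)
}

/// The f64 that a decimal literal actually denotes after rounding.
fn held_value(literal: &str) -> Option<f64> {
    literal.parse::<f64>().ok()
}
```

For instance, `held_value("9007199254740993.0")` yields the same float as `9007199254740992.0`, which is why the canonical form records the latter.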

Non-finite floats cannot be written as JSON numbers, so they use the $f64 tag:

{"$f64":"NaN"}
{"$f64":"Infinity"}
{"$f64":"-Infinity"}

These decode back to f64::NAN, f64::INFINITY, and f64::NEG_INFINITY.
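A sketch of the tagged path, under stated assumptions: `write_f64` and `decode_f64_tag` are hypothetical names, finite values fall back to `{:?}` formatting rather than ryu, and the decoder here accepts only the three payloads listed above (the source does not say whether other payloads are accepted).

```rust
/// Finite floats print as JSON numbers; non-finite ones use the $f64 tag.
fn write_f64(x: f64) -> String {
    if x.is_nan() {
        r#"{"$f64":"NaN"}"#.to_string()
    } else if x == f64::INFINITY {
        r#"{"$f64":"Infinity"}"#.to_string()
    } else if x == f64::NEG_INFINITY {
        r#"{"$f64":"-Infinity"}"#.to_string()
    } else {
        format!("{:?}", x)
    }
}

/// Maps a $f64 payload back to the float it names (assumed: nothing else is valid).
fn decode_f64_tag(payload: &str) -> Option<f64> {
    match payload {
        "NaN" => Some(f64::NAN),
        "Infinity" => Some(f64::INFINITY),
        "-Infinity" => Some(f64::NEG_INFINITY),
        _ => None,
    }
}
```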

ObjectId

An ObjectId is a 12-byte identifier, MongoDB-compatible in layout. ObjectId::new() (and Default) generates one from three parts:

Bytes	Contents
`0..4`	4-byte big-endian seconds since the Unix epoch (0 if the clock predates the epoch)
`4..9`	5-byte per-process random value
`9..12`	3-byte big-endian counter, masked to `0x00ffffff`, incremented monotonically per process

The counter is an AtomicU32 advanced with fetch_add (Relaxed ordering), so two ids minted in the same second within one process still differ. The 5 random bytes are seeded once per process from the clock’s sub-second nanosecond field and the process id, mixed through a SplitMix64-style hash and cached for the process lifetime.

Note: The random bytes come from a SplitMix64-style hash, not a cryptographic RNG. Uniqueness rests on the per-process random seed plus the monotonic counter, so do not treat an ObjectId as unguessable.

ObjectIds render as 24 lowercase hex characters. to_hex() produces that string; from_hex() requires exactly 24 hex characters and otherwise returns ValueError::ObjectId. as_bytes() / from_bytes() expose the raw [u8; 12].
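The layout and hex rules above can be sketched as follows. This is an illustration under assumptions, not the OMGDB implementation: it uses free functions on a raw `[u8; 12]` instead of methods, starts the counter at 0, picks one plausible way of mixing the nanosecond field with the process id, and accepts uppercase hex on input (the source only requires 24 hex characters).

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::OnceLock;
use std::time::{SystemTime, UNIX_EPOCH};

static COUNTER: AtomicU32 = AtomicU32::new(0); // assumed start value
static PROCESS_RANDOM: OnceLock<[u8; 5]> = OnceLock::new();

/// SplitMix64 finalizer, used here as the mixing hash.
fn splitmix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Five bytes seeded once per process from sub-second nanos and the pid.
fn process_random() -> [u8; 5] {
    *PROCESS_RANDOM.get_or_init(|| {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.subsec_nanos())
            .unwrap_or(0) as u64;
        let seed = (nanos << 32) ^ std::process::id() as u64;
        let h = splitmix64(seed).to_be_bytes();
        [h[0], h[1], h[2], h[3], h[4]]
    })
}

/// Builds the 12 bytes: seconds, per-process random, 24-bit counter.
pub fn new_object_id() -> [u8; 12] {
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as u32)
        .unwrap_or(0); // 0 if the clock predates the epoch
    let count = COUNTER.fetch_add(1, Ordering::Relaxed) & 0x00ff_ffff;
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&secs.to_be_bytes());
    id[4..9].copy_from_slice(&process_random());
    id[9..12].copy_from_slice(&count.to_be_bytes()[1..4]);
    id
}

/// 24 lowercase hex characters.
pub fn to_hex(id: &[u8; 12]) -> String {
    id.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Requires exactly 24 hex characters; anything else is rejected.
pub fn from_hex(s: &str) -> Option<[u8; 12]> {
    if s.len() != 24 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut id = [0u8; 12];
    for (i, byte) in id.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(id)
}
```

Two ids minted back to back share bytes `4..9` and differ in the counter bytes, which is the property the paragraph above relies on.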

The total order

Value::order_key() produces an order-preserving String: the lexicographic byte order of two keys matches OMGDB’s single total value order. That order is what backs $sort and range queries, and it is the basis for range-capable secondary indexes (see indexes). The same order is used by the query layer’s comparison logic, so sorting and range matching agree.

The key is a fixed-width, two-hex-digit cross-type rank followed by a per-type body. The cross-type ranks are:

Rank	Types
0	`Null`
1	`I64` and `F64` (numbers share a rank)
2	`Str`
3	`Object`
4	`Array`
5	`Bytes`
6	`Bool`
7	`DateTime`
8	`ObjectId`

Within a type, the body is chosen so byte order matches value order:

Numbers (I64 and F64) share the number rank. An i64 is first cast to f64, and the float is then encoded as 16 hex digits whose byte order matches f64::total_cmp: the sign bit is flipped for non-negative values, and all bits are flipped for negative values. This is why I64(2) and F64(2.0) produce the same key: they compare equal.
Bool uses '0' / '1'.
Str uses the raw string bytes.
DateTime encodes (ms as u64) XOR (1 << 63) as 16 hex digits, giving correct signed ordering across negative and positive timestamps.
ObjectId and Bytes encode their bytes as hex.
Array and Object use their canonical JSON string as the body; the leading type rank still orders them correctly against values of other types.
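The number and DateTime bodies described in the list above can be sketched like this. The function names are hypothetical, and rendering ranks 1 and 7 as the prefixes `01` and `07` is an assumption about how the two-hex-digit rank is written.

```rust
/// 16 hex digits whose lexicographic order matches f64::total_cmp.
fn f64_body(x: f64) -> String {
    let bits = x.to_bits();
    let ordered = if bits >> 63 == 0 {
        bits ^ (1u64 << 63) // non-negative: flip only the sign bit
    } else {
        !bits // negative: flip every bit
    };
    format!("{:016x}", ordered)
}

/// Order key for an F64 (rank prefix assumed to be "01").
fn order_key_f64(x: f64) -> String {
    format!("01{}", f64_body(x))
}

/// Order key for an I64: cast to f64, then encode like any other number.
fn order_key_i64(i: i64) -> String {
    order_key_f64(i as f64)
}

/// Order key for a DateTime (rank prefix assumed to be "07").
fn order_key_datetime(ms: i64) -> String {
    format!("07{:016x}", (ms as u64) ^ (1u64 << 63))
}
```

Comparing the resulting strings byte by byte then gives the same answer as comparing the values, and `order_key_i64(2)` equals `order_key_f64(2.0)`.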

Limitation: order_key() is monotonic but not injective. Order-equal values (such as I64(2) and F64(2.0)) share a key, so an index lookup may return candidates that are merely order-equal. Range and index callers must re-filter for exactness rather than trust key equality. See query operators for comparison semantics.

Codec determinism and invariant I3

The canonical writer is deterministic and, crucially, a fixpoint: parsing OMGDB’s own canonical output and re-serializing yields byte-identical text. This is invariant I3 (dump → load → dump is byte-identical), property-tested over 20,000 randomly generated nested values plus seed cases. Several rules make the fixpoint hold and protect document identity.

`-0.0` canonicalization

-0.0 and 0.0 compare equal but have different bit patterns. The writer maps any value equal to 0.0 to positive 0.0, so both serialize to 0.0. This prevents the two from splitting a document’s identity through different _id keys and keeps the codec a fixpoint.

0.0
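The normalization step amounts to a one-line guard before formatting. A minimal sketch, with a hypothetical function name:

```rust
/// Collapses -0.0 onto +0.0; every other value (including NaN) passes through.
fn canonical_zero(x: f64) -> f64 {
    // `-0.0 == 0.0` is true under IEEE-754 comparison, so both zeros match.
    if x == 0.0 {
        0.0
    } else {
        x
    }
}
```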

Out-of-range integers are rejected

A JSON integer that fits in i64 becomes I64. An integer in the range (i64::MAX, u64::MAX] is rejected at parse time with ValueError::Extended (“integer N is outside the i64 range”) rather than being silently coerced to a lossy f64 that could collide with another document’s _id. There is no u64 or decimal128 type.

9223372036854775807   -> Ok(I64(...))   (i64::MAX)
9223372036854775808   -> Err            (i64::MAX + 1)
18446744073709551615  -> Err            (u64::MAX)

Non-integer JSON numbers become F64.
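The integer rule can be sketched as a strict parse that fails instead of widening. `parse_json_integer` is a hypothetical name and the error is a plain `String` rather than `ValueError::Extended`. Note one assumption beyond the text: this sketch rejects every integer token that does not fit in i64, whereas the source only specifies the behavior up to u64::MAX.

```rust
/// Parses an integer token into i64, or reports it as out of range.
fn parse_json_integer(token: &str) -> Result<i64, String> {
    let digits = token.strip_prefix('-').unwrap_or(token);
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return Err(format!("not an integer token: {token}"));
    }
    // No fallback to f64: a lossy float could collide with another _id.
    token
        .parse::<i64>()
        .map_err(|_| format!("integer {token} is outside the i64 range"))
}
```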

Duplicate keys collapse

Duplicate keys in parsed input collapse to one, with the later value winning. Document::insert replaces the value of an existing key while keeping the key’s original position, and appends genuinely new keys at the end. So {"a":1,"b":2,"a":3} becomes the two-field document {"a":3,"b":2}, which is a stable fixpoint thereafter.
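The replace-in-place behavior can be sketched with a small ordered map. To keep the example short, values are plain `i64` here rather than the full Value type, and the `fields` member name is an assumption.

```rust
#[derive(Debug, Default, PartialEq)]
struct Document {
    fields: Vec<(String, i64)>,
}

impl Document {
    /// Overwrites an existing key where it already sits; appends a new key.
    fn insert(&mut self, key: &str, value: i64) {
        if let Some(i) = self.fields.iter().position(|(k, _)| k == key) {
            self.fields[i].1 = value;
        } else {
            self.fields.push((key.to_string(), value));
        }
    }
}
```

Feeding the pairs `a=1`, `b=2`, `a=3` through `insert` leaves `a` in first position with the value 3, followed by `b`.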

Order-preserving object fields

Object fields are written in document order, never sorted. Combined with collapse-on-duplicate, the canonical form always has unique keys in a deterministic order.

Control-character escaping for safe framing

The op-log frames records as NDJSON, so a literal tab, newline, or carriage return inside a value could inject a frame boundary. The string writer escapes ", \, \n, \r, \t, backspace (U+0008), form feed (U+000C), and any other control character below 0x20 as \u00xx. Tabs, newlines, and CRs therefore never appear literally in canonical output.
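A string writer following these rules might look like the sketch below. `write_json_string` is a hypothetical name, and emitting backspace and form feed as the short escapes `\b` and `\f` (rather than `\u0008` and `\u000c`) is an assumption; the source only says they are escaped.

```rust
/// Writes a JSON string literal with no raw control characters in the output.
fn write_json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{0008}' => out.push_str("\\b"), // assumed short form
            '\u{000C}' => out.push_str("\\f"), // assumed short form
            // Remaining control characters below 0x20 become \u00xx.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else, including non-ASCII text, is written as-is.
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```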

Lenient extended-tag decoding

Only a single-key object whose key is $oid, $date, $binary, or $f64 and whose payload is valid decodes to the corresponding extended type. A tag-shaped key with an invalid payload — for example {"$oid":"not-a-valid-oid"} — does not error; it falls through to an ordinary object. This keeps the parser total over any stored document, so replaying a user field that merely looks like a tag can never brick the store. Operator-shaped keys such as $set, $gt, $group, and $inc are not extended tags and round-trip verbatim, which is why the same codec can parse query, update, and aggregation specs.

Limitation: The flip side is an inherent ambiguity, the same one MongoDB extended JSON has: a single-key object matching a tag with a valid payload (e.g. {"$oid":"0123456789abcdef01234567"}) is canonically that extended type and is indistinguishable from a user field of the same shape. A user field literally named $oid holding a valid 24-hex string will decode to an ObjectId.
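The fall-through rule can be sketched for two of the tags. Everything here is illustrative: `Payload`, `Decoded`, and the function names are hypothetical, only $oid and $date are modeled, and the caller is assumed to have already established that the object has exactly one key.

```rust
#[derive(Debug, PartialEq)]
enum Payload {
    Str(String),
    Int(i64),
}

#[derive(Debug, PartialEq)]
enum Decoded {
    ObjectId([u8; 12]),
    DateTime(i64),
    /// Not a valid tag: keep the value as an ordinary object.
    Plain,
}

/// Exactly 24 hex characters, or None.
fn parse_oid(hex: &str) -> Option<[u8; 12]> {
    if hex.len() != 24 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut id = [0u8; 12];
    for (i, byte) in id.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(id)
}

/// Decodes a single-key object; invalid payloads never produce an error.
fn decode_single_key_object(key: &str, payload: &Payload) -> Decoded {
    match (key, payload) {
        ("$oid", Payload::Str(s)) => parse_oid(s)
            .map(Decoded::ObjectId)
            .unwrap_or(Decoded::Plain),
        ("$date", Payload::Int(ms)) => Decoded::DateTime(*ms),
        // Operator-shaped keys such as $set or $gt are not tags at all.
        _ => Decoded::Plain,
    }
}
```

The design point is that the function is total: there is no input for which it fails, so replaying stored documents cannot be blocked by a field that merely resembles a tag.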

A full round-trip

{"name":"x","n":3,"f":2.5,"id":{"$oid":"0123456789abcdef01234567"},"arr":[1,2,{"k":true}]}

Parsing this document and re-serializing yields exactly the same bytes — integers stay integers, the float keeps its decimal point, the $oid decodes to an ObjectId and re-encodes identically, and field order is preserved. That stability is what lets the op-log act as the canonical source of truth. See storage for how these canonical records are framed and durably appended.