Merkle Tree Architecture

Developing Merkle Trees for ACE is driven by the following requirements:

  • Table-diff is efficient for one-off or infrequent checks, but rerunning it regularly recomputes block hashes across the entire table every time.
  • When most data is unchanged, re-hashing every block is wasteful—especially on very large tables (think ~1 TB or ~1B rows) where hash scans dominate runtime and resource use.
  • A Merkle tree lets us cache and reuse block hashes between runs. Unchanged subtrees are skipped; only branches covering modified rows are rehashed and compared, reducing I/O, CPU, and network load for recurring diffs.
  • The goal is fast, incremental validation for operators who need frequent consistency checks without paying the full-block hashing cost each run.

What Is a Merkle Tree?

  • A Merkle tree is a hash tree: leaves hold hashes of data blocks; every internal node combines its children’s hashes into a parent hash. The root represents the entire dataset—if any leaf changes, all ancestors (and the root) change.
  • ACE optimisation: parent nodes are computed as XORs of child hashes rather than hashing the concatenation. XOR is fast, associative, and easy to update incrementally when a child changes, making rebuilds cheaper for large trees (see the sketch after the diagram below).
flowchart TB
  ROOT["Root = (h1 XOR h2) XOR (h3 XOR h4)"]:::root

  N12["Parent = h1 XOR h2"]:::node
  N34["Parent = h3 XOR h4"]:::node

  L1["Leaf 1<br/>hash = h1"]:::leaf
  L2["Leaf 2<br/>hash = h2"]:::leaf
  L3["Leaf 3<br/>hash = h3"]:::leaf
  L4["Leaf 4<br/>hash = h4"]:::leaf

  ROOT --> N12
  ROOT --> N34
  N12 --> L1
  N12 --> L2
  N34 --> L3
  N34 --> L4

  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
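
To make the XOR optimisation concrete, here is a minimal Go sketch, assuming SHA-256 leaf digests over raw block bytes (illustrative only, not ACE's actual code). It builds a parent from two children, then patches it incrementally when one child changes: XOR is self-inverse, so the old leaf hash can be removed and the new one folded in without touching the sibling.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// xorHash combines two equal-length child hashes into a parent hash.
func xorHash(a, b []byte) []byte {
	out := make([]byte, len(a))
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

func main() {
	// Leaf hashes are cryptographic digests of PK-ordered data blocks.
	h1 := sha256.Sum256([]byte("block-1"))
	h2 := sha256.Sum256([]byte("block-2"))
	parent := xorHash(h1[:], h2[:])

	// Incremental update: XOR out the old leaf hash, XOR in the new
	// one. The sibling hash h2 never needs to be re-read.
	h1new := sha256.Sum256([]byte("block-1-modified"))
	parent = xorHash(xorHash(parent, h1[:]), h1new[:])
	fmt.Printf("updated parent: %x\n", parent)
}
```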

How Merkle Trees Work in ACE

  • Initial build (full scan once): ACE partitions the table into PK-ordered blocks (like table-diff) and computes leaf hashes for every block on each node—this is the only full-table scan. Parent hashes are built upward using XOR to form the root.
  • Change capture (pgoutput + replication slot): The table is added to a publication; pgoutput changes are streamed into a logical replication slot. ACE marks affected blocks as dirty based on PK ranges (no full re-scan).
  • Update step: UpdateMtree reads accumulated changes from the slot, splits or merges blocks if needed, recomputes hashes only for dirty/new leaves, rebuilds parent XORs, and clears dirty flags. Block size is recovered from metadata to stay consistent with the build.
  • Diff step: DiffMtree first runs an update (to fold recent changes into the tree), then traverses trees across nodes. Matching hashes at a node mean “skip subtree”; mismatches recurse until leaves are reached, where row-level diffs are fetched if required.
flowchart TD
  Build["Build: full table scan<br/>compute leaf hashes<br/>build XOR parents"]:::build --> Slot["Logical replication slot<br/>(pgoutput changes)"]:::cdc
  Slot --> Update["Update: mark dirty blocks<br/>split/merge if needed<br/>rehash dirty leaves<br/>rebuild parents"]:::update
  Update --> Diff["Diff: compare roots<br/>→ if mismatch, recurse<br/>→ fetch rows at leaves"]:::diff

  classDef build fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef update fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef diff fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
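
As a rough illustration of the one-time build step, the sketch below buckets rows into PK-ordered blocks and hashes each block in a single scan. The table name my_table, the integer PK id, and the md5/string_agg digest are assumptions for the example, not ACE's actual schema or queries.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const blockSize = 100000 // rows per leaf block

	// One full-table scan: assign each row to a PK-ordered block,
	// then digest each block. These become the tree's leaf hashes.
	rows, err := db.Query(`
		SELECT blk, min(id), max(id),
		       md5(string_agg(r::text, ',' ORDER BY id))
		FROM (SELECT t.*, (row_number() OVER (ORDER BY id) - 1) / $1::bigint AS blk
		      FROM my_table t) r
		GROUP BY blk
		ORDER BY blk`, blockSize)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var blk, start, end int64
		var leafHash string
		if err := rows.Scan(&blk, &start, &end, &leafHash); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("leaf %d [%d..%d] %s\n", blk, start, end, leafHash)
	}
}
```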

CDC Pipeline and Dirty Block Detection

  • Capture: Each node adds the table to a publication; pgoutput changes are streamed via a logical replication slot (see UpdateFromCDC). Inserts/deletes/updates are mapped to PK ranges; affected blocks are marked dirty—no rehash yet.
  • Persist: The slot retains WAL until ACE consumes it; long delays can bloat WAL. Ensure slot retention is acceptable and the service role has replication/publication privileges.
  • Apply: On UpdateMtree, ACE reads the slot, updates per-block counters/dirty flags, and resets range_end to NULL for the tail when new rows arrive past the last bound. Splits/merges are then decided from those dirty flags.
  • Options: --no-cdc skips slot consumption (not recommended before a diff); CDCProcessingTimeout bounds how long ACE waits while applying changes.
  • User reminders:
    • Keep the slot healthy: monitor lag to avoid WAL buildup; drop/recreate only if you rebuild trees.
    • Run update before diff (ACE does this automatically for diff) so trees include recent changes.
    • Ensure the publication still includes the table and that the slot isn’t removed by maintenance.
flowchart LR
  WAL["WAL (pgoutput)"]:::wal --> Slot["Logical slot<br/>(retains changes)"]:::cdc --> Apply["UpdateFromCDC:<br/>mark dirty blocks<br/>update tail bound"]:::apply --> Update["UpdateMtree:<br/>split/merge decisions<br/>rehash dirty leaves"]:::update

  classDef wal fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef apply fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef update fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
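
For orientation, a minimal capture-setup sketch follows. The names ace_pub and ace_slot are placeholders (ACE creates and manages its own objects), but the primitives — CREATE PUBLICATION and pg_create_logical_replication_slot with the pgoutput plugin — are standard PostgreSQL.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The publication scopes which tables stream changes.
	if _, err := db.Exec(`CREATE PUBLICATION ace_pub FOR TABLE my_table`); err != nil {
		log.Fatal(err)
	}
	// The slot retains WAL until its changes are consumed, so its lag
	// must be monitored (see the reminders above).
	if _, err := db.Exec(
		`SELECT pg_create_logical_replication_slot('ace_slot', 'pgoutput')`); err != nil {
		log.Fatal(err)
	}
}
```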

Updates: Why Splits and Merges Matter

  • Why split? Hot ranges can grow far beyond the intended block size due to concentrated inserts (e.g., sequential PKs). Oversized blocks slow hashing and make diff localisation coarse. Splitting restores a balanced fan-out and keeps leaf work bounded.
  • Why merge? Sparse/deleted ranges can shrink, leaving tiny blocks that add overhead with little data. Merging adjacent small blocks reduces hash tasks and tree depth.
  • Update flow: During UpdateMtree, ACE identifies dirty/new blocks from CDC, then (see the decision sketch after the diagram below):
    • Finds overfull blocks and splits them (with re-sequencing of node positions).
    • Optionally merges underfull blocks if Rebalance is enabled.
    • Rehashes only the affected leaves and rebuilds parents.
    • Clears dirty flags so future updates skip unchanged blocks.
flowchart LR
  CDC["CDC marks dirty blocks"]:::cdc --> SplitCheck{"Over block_size?"}:::split
  SplitCheck -->|yes| Split["Split block<br/>adjust ranges"]:::op
  SplitCheck -->|no| MergeCheck{"Under merge threshold?"}:::split
  MergeCheck -->|yes| Merge["Merge adjacent small blocks"]:::op
  MergeCheck -->|no| Skip["No shape change"]:::op

  Split --> Rehash["Rehash affected leaves"]:::rehash
  Merge --> Rehash
  Skip --> Rehash
  Rehash --> Parents["Rebuild parent XORs<br/>clear dirty flags"]:::rehash

  classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef split fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef op fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef rehash fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
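
The shape decisions reduce to a couple of comparisons. The Go sketch below uses the thresholds described under “Thresholds and rebalance guidance” (~2× block_size to split, ~1.5× combined to merge); the constants and function names are illustrative assumptions.

```go
package main

import "fmt"

const blockSize = 100000 // target rows per leaf (assumed default)

// shouldSplit reports whether a block has grown past ~2x the target.
func shouldSplit(rowCount int) bool {
	return rowCount > 2*blockSize
}

// shouldMerge reports whether a block and its immediate successor
// together still fit comfortably within ~1.5x the target.
func shouldMerge(rowCount, nextRowCount int) bool {
	return rowCount+nextRowCount < blockSize*3/2
}

func main() {
	fmt.Println(shouldSplit(250000))       // true: oversized hot block
	fmt.Println(shouldMerge(40000, 50000)) // true: two sparse neighbours
}
```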

Examples: Splitting and Merging Merkle Tree Nodes

  • Splitting
    • A split block is appended at the end (new node_position = max + 1) to avoid immediate parent-hash churn; afterward ACE re-sequences leaves by PK start so positions are contiguous for later maintenance. A sketch of the append step follows the diagrams below.
    • Parent nodes are dropped and rebuilt after splits so XOR hashes reflect the new leaf layout.
    • Rationale: inserting the new block adjacent to the original would change hashes for every ancestor along that branch, forcing deeper traversal even though only two leaves changed. Appending preserves most parent-child hashes until the controlled rebuild, limiting needless tree churn.

  • Why adjacency hurts
    Inserting the split block next to the original forces a cascade of parent hash changes, so diff traversal cannot prune early.

flowchart TB
  TitleBad["Bad: insert adjacent (parent churn)"]:::title
    AROOT["p6' (root changed)"]:::root
    AP4["p4'"]:::node
    AP5["p5'"]:::node
    AP1["p1'"]:::node
    AP2["p2'"]:::node
    AP3["p3'"]:::node
    AX1["x1"]:::leaf
    AM["m (new)"]:::leaf
    AN["n (new)"]:::leaf
    AX3["x3"]:::leaf
    AX4["x4"]:::leaf
    AX5["x5"]:::leaf

    AROOT --> AP4
    AROOT --> AP5
    AP4 --> AP1
    AP4 --> AP2
    AP5 --> AP3
    AP1 --> AX1
    AP1 --> AM
    AP2 --> AN
    AP2 --> AX3
    AP3 --> AX4
    AP3 --> AX5

  classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef title fill:transparent,stroke:none,color:#0f1c2d;
  • Why appending helps
    Appending the new block leaves most parents intact; only the split branch and ancestors that include it change. Diff traversal prunes more quickly.
flowchart TB
  TitleGood["Good: append split block (minimal churn)"]:::title
    RROOT["p6' (root changed)"]:::root
    RP4["p4'"]:::node
    RP5["p5'"]:::node
    RP1["p1'"]:::node
    RP2["p2 (unchanged)"]:::node
    RP3["p3'"]:::node
    RX1["x1"]:::leaf
    RM["m (split part)"]:::leaf
    RX3["x3"]:::leaf
    RX4["x4"]:::leaf
    RX5["x5"]:::leaf
    RN["n (appended)"]:::leaf

    RROOT --> RP4
    RROOT --> RP5
    RP4 --> RP1
    RP4 --> RP2
    RP5 --> RP3
    RP1 --> RX1
    RP1 --> RM
    RP2 --> RX3
    RP2 --> RX4
    RP3 --> RX5
    RP3 --> RN

  classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
  classDef title fill:transparent,stroke:none,color:#0f1c2d;
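
A hedged sketch of the append step: the upper half of an oversized leaf is written at max(node_position) + 1. The ace_mtree_public_my_table name follows the naming scheme described later on this page; columns other than node_position and range_start/range_end are guesses for illustration.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// appendSplitLeaf writes the upper half of an oversized leaf at the
// end of the leaf row set, deferring parent churn to the controlled
// rebuild. The "dirty" column is an assumed bookkeeping flag.
func appendSplitLeaf(db *sql.DB, splitPoint, oldRangeEnd int64) error {
	_, err := db.Exec(`
		INSERT INTO ace_mtree_public_my_table
		    (node_position, range_start, range_end, dirty)
		SELECT max(node_position) + 1, $1, $2, true
		FROM ace_mtree_public_my_table`, splitPoint, oldRangeEnd)
	return err
}

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := appendSplitLeaf(db, 101, 200); err != nil {
		log.Fatal(err)
	}
}
```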
  • Merging
    • ACE merges a block with its immediate successor when their combined row count is under ~1.5× block_size. The merged block keeps the lower position; the successor is deleted. After each merge pass, positions are re-sequenced (via the temp-offset trick sketched after the diagram below) to keep them contiguous for later splits/merges.
    • Parent nodes are dropped and rebuilt post-merge.
    • Re-sequencing uses a large temporary offset (1,000,000) to avoid position collisions during moves; this is safe up to ~100B rows, but it is disruptive and can change parent-child relationships, so heavy rebalancing is recommended only when absolutely necessary and during maintenance windows.
flowchart TB
  subgraph BeforeMerge["Before merge"]
    M0["pos=0  [1..80] (small)"]:::leaf --> M1["pos=1  [81..120] (small)"]:::leaf --> M2["pos=2  [121..300]"]:::leaf --> M3["pos=3  [301..nil]"]:::leaf
  end

  subgraph MergeStep["Merge step"]
    MM0["pos=0  [1..120] (merged)"]:::leaf
    MM2["pos=2  [121..300]"]:::leaf
    MM3["pos=3  [301..nil]"]:::leaf
  end

  subgraph AfterResequence["After resequence"]
    RM0["pos=0  [1..120]"]:::leaf --> RM1["pos=1  [121..300]"]:::leaf --> RM2["pos=2  [301..nil]"]:::leaf
  end

  M3 -.-> MM0
  MM3 -.-> RM0

  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
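
The temp-offset re-sequencing might look roughly like this sketch: shift every position by a large constant so renumbering never collides with a live position, then renumber contiguously by PK start. The SQL and schema details are assumptions, not ACE's implementation.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

const tempOffset = 1000000 // large enough that shifted positions never collide

func resequenceLeaves(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Step 1: move every leaf position out of the live range.
	if _, err := tx.Exec(`
		UPDATE ace_mtree_public_my_table
		SET node_position = node_position + $1`, tempOffset); err != nil {
		return err
	}
	// Step 2: renumber contiguously in PK-start order.
	if _, err := tx.Exec(`
		UPDATE ace_mtree_public_my_table m
		SET node_position = s.new_pos - 1
		FROM (SELECT node_position,
		             row_number() OVER (ORDER BY range_start) AS new_pos
		      FROM ace_mtree_public_my_table) s
		WHERE m.node_position = s.node_position`); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := resequenceLeaves(db); err != nil {
		log.Fatal(err)
	}
}
```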

Additional Split/Merge Considerations

  • Open-ended last block handling
    • The trailing block often has range_end = NULL so new inserts beyond the current max PK land there. When splitting an unbounded block, ACE temporarily discovers a max PK to create a split point, then ensures the final leaf is unbounded again so future inserts keep hitting a single tail block. This prevents new rows from “falling off” the tree after a split (a sketch follows the diagram below).
flowchart LR
  Before["pos=0 [1..100]"]:::leaf --> Tail["pos=1 [101..nil] (tail)"]:::leaf
  After["pos=0 [1..100]"]:::leaf --> Mid["pos=1 [101..500]"]:::leaf --> Tail2["pos=2 [500..nil] (tail reset)"]:::leaf
  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
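
A minimal sketch of the tail handling, with placeholder names and the midpoint split itself elided:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// splitTail bounds the previously unbounded trailing block and then
// re-opens the new last leaf so future inserts land in a single tail.
func splitTail(db *sql.DB) error {
	// Temporarily discover a max PK to derive a split point.
	var maxPK int64
	if err := db.QueryRow(`SELECT max(id) FROM my_table`).Scan(&maxPK); err != nil {
		return err
	}
	// ... split the tail block at a midpoint below maxPK (elided) ...

	// Ensure the final leaf is unbounded again (range_end = NULL).
	_, err := db.Exec(`
		UPDATE ace_mtree_public_my_table
		SET range_end = NULL
		WHERE node_position = (SELECT max(node_position)
		                       FROM ace_mtree_public_my_table)`)
	return err
}

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := splitTail(db); err != nil {
		log.Fatal(err)
	}
}
```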
  • Composite PK split points
    • For composite keys, split points are chosen using ordered tuples (ROW(col1, col2, ...)) at roughly the midpoint of the block’s row count. Range bounds are compared and applied component-wise. If a final “sliver” block would be too small, ACE drops the last split point to avoid creating a tiny trailing leaf.
    • Simple PKs pick a single midpoint value; composite PKs load and register the composite type so range updates and inserts stay consistent with the PK ordering. A hedged query sketch follows this list.
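
As a hedged illustration, composite-PK block bounds map naturally onto ordered ROW() comparisons, which follow the same lexicographic order as the PK index. The table, columns, and bound values below are invented for the example.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count the rows inside one block bounded by composite-PK tuples.
	var n int64
	err = db.QueryRow(`
		SELECT count(*) FROM my_table
		WHERE ROW(col1, col2) >= ROW($1, $2)
		  AND ROW(col1, col2) <  ROW($3, $4)`,
		1, "a", 500, "m").Scan(&n)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(n, "rows in block")
}
```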

  • Thresholds and rebalance guidance
    • Splits fire when a block grows beyond ~2× block_size; merges apply when neighbouring blocks together fall under ~1.5× block_size (the Python reference implementation uses ~25% thresholds). This keeps leaf sizes within a sensible band.
    • Re-sequencing and parent rebuilds are cheap compared to leaf rehashing, but they do perturb parent-child relationships. Use heavy rebalancing (--rebalance) sparingly—ideally in maintenance windows—to avoid churn on hot tables.

Diff Traversal Algorithm

  • ACE compares Merkle trees across node pairs top-down. If root hashes match, the entire table is considered in sync. When a parent hash mismatches, ACE descends to its children; matching subtrees are pruned, mismatching subtrees recurse until leaves are reached. If leaves don’t align (one side has extra splits/merges), the extra leaves are treated as mismatches.
  • Parent hashes are XORs of child hashes (fast to recompute). Leaf hashes are cryptographic hashes of table blocks; XOR collisions at parent levels are acceptable because leaf hashes still guard correctness.
  • Edge cases:
    • Unbalanced trees: If one node split/merged more, traversal still walks each subtree; missing peers are flagged as mismatches.
    • Early pruning: Large identical subtrees are skipped after a single hash compare.
    • Diff materialisation: Only at mismatching leaves; ACE then fetches block rows and aligns by PK.
flowchart TB
  %% Two trees with a mismatching branch and extra leaf on right
  subgraph TreeL["Node A"]
    LRoot["root_A"]:::root
    LA["A1"]:::node
    LB["A2"]:::node
    L1["L1"]:::leaf
    L2["L2"]:::leaf
    L3["L3"]:::leaf
    L4["L4"]:::leaf
    LRoot --> LA
    LRoot --> LB
    LA --> L1
    LA --> L2
    LB --> L3
    LB --> L4
  end

  subgraph TreeR["Node B"]
    RRoot["root_B"]:::root
    RA["B1"]:::node
    RB["B2"]:::node
    R1["R1"]:::leaf
    R2["R2"]:::leaf
    R3["R3"]:::leaf
    R4["R4"]:::leaf
    RExtra["R_extra"]:::leaf
    RRoot --> RA
    RRoot --> RB
    RA --> R1
    RA --> R2
    RB --> R3
    RB --> R4
    RB --> RExtra
  end

  %% Traversal annotations
  LRoot -. "hash mismatch" .- RRoot
  LA -. "hash match (prune)" .- RA
  LB -. "hash mismatch" .- RB
  L3 -. "leaf mismatch" .- R3
  L4 -. "leaf mismatch" .- R4
  RB -. "extra leaf" .-> RExtra

  classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
flowchart TB
  %% Balanced trees where only one leaf differs
  subgraph TreeL2["Node A"]
    LRoot2["root_A"]:::root
    LA2["A1"]:::node
    LB2["A2"]:::node
    L1a["L1"]:::leaf
    L2a["L2"]:::leaf
    L3a["L3"]:::leaf
    L4a["L4"]:::leaf
    LRoot2 --> LA2
    LRoot2 --> LB2
    LA2 --> L1a
    LA2 --> L2a
    LB2 --> L3a
    LB2 --> L4a
  end

  subgraph TreeR2["Node B"]
    RRoot2["root_B"]:::root
    RA2["B1"]:::node
    RB2["B2"]:::node
    R1a["R1"]:::leaf
    R2a["R2_diff"]:::leaf
    R3a["R3"]:::leaf
    R4a["R4"]:::leaf
    RRoot2 --> RA2
    RRoot2 --> RB2
    RA2 --> R1a
    RA2 --> R2a
    RB2 --> R3a
    RB2 --> R4a
  end

  LRoot2 -. "hash mismatch" .- RRoot2
  LA2 -. "hash mismatch" .- RA2
  LB2 -. "hash match (prune)" .- RB2
  L1a -. "match" .- R1a
  L2a -. "leaf mismatch" .- R2a

  classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
  classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
  classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;

Notes:

  • Traversal proceeds until leaves are aligned or exhausted; missing peers (extra leaves) are flagged.
  • Parent XOR hashes speed pruning; leaf cryptographic hashes guard correctness.
  • Diffs are materialised only at mismatching leaves, limiting database reads.
  • In the unbalanced example above, the right-side tree has an extra leaf (R_extra). ACE descends until it exhausts matching parents; when it sees that RB has a child with no peer, it records that entire extra leaf as a mismatch (representing rows present only on Node B). Similarly, if Node A had more leaves, those would be marked as present-only-on-A. This ensures asymmetric splits/merges still surface as drift.
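
The traversal itself is a short recursion. Below is a minimal in-memory Go sketch, assuming children align by position and hashes are byte slices; ACE's real traversal runs against per-node tree stores, but the pruning and extra-leaf handling follow the same shape.

```go
package main

import (
	"bytes"
	"fmt"
)

// node is an in-memory stand-in for a tree node: leaves carry a block
// position, internal nodes carry XOR-combined hashes.
type node struct {
	hash     []byte
	children []*node
	pos      int // leaf block position (leaves only)
}

func childAt(n *node, i int) *node {
	if i >= len(n.children) {
		return nil
	}
	return n.children[i]
}

// collectLeaves flags every leaf under a subtree with no peer.
func collectLeaves(n *node, out *[]int) {
	if len(n.children) == 0 {
		*out = append(*out, n.pos)
		return
	}
	for _, c := range n.children {
		collectLeaves(c, out)
	}
}

// diffTrees records block positions needing row-level comparison.
func diffTrees(a, b *node, out *[]int) {
	switch {
	case a == nil && b == nil:
		return
	case a == nil:
		collectLeaves(b, out) // extra leaves present only on B
	case b == nil:
		collectLeaves(a, out) // extra leaves present only on A
	case bytes.Equal(a.hash, b.hash):
		return // identical subtree: pruned after one compare
	case len(a.children) == 0 && len(b.children) == 0:
		*out = append(*out, a.pos) // mismatching leaf pair
	default:
		n := len(a.children)
		if len(b.children) > n {
			n = len(b.children)
		}
		for i := 0; i < n; i++ {
			diffTrees(childAt(a, i), childAt(b, i), out)
		}
	}
}

func main() {
	// Two tiny trees that differ only in the second leaf.
	l1 := &node{hash: []byte{1}, pos: 0}
	l2a := &node{hash: []byte{2}, pos: 1}
	l2b := &node{hash: []byte{9}, pos: 1}
	rootA := &node{hash: []byte{1 ^ 2}, children: []*node{l1, l2a}}
	rootB := &node{hash: []byte{1 ^ 9}, children: []*node{l1, l2b}}

	var mismatched []int
	diffTrees(rootA, rootB, &mismatched)
	fmt.Println(mismatched) // [1]
}
```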

Operational Details and Limits

  • Schema objects and lifecycle: Build creates ace_mtree_<schema>_<table> plus a composite PK type when needed, metadata rows (block size, counts, timestamps), and adds the table to a publication/slot. Teardown drops these. Updates/diffs reuse the metadata; changing block_size requires a rebuild.
  • Block size reuse: UpdateMtree reads block size from metadata; you cannot “resize” a tree via update. Rebuild if you need a different block size.
  • CDC drift and slot health: If the logical slot is dropped or lags, WAL can bloat and trees become stale. Recover by rebuilding (init + build) and recreating the slot/publication entry. --no-cdc means you are diffing stale trees unless you run a rebuild first.
  • Rebalance caveats: --rebalance re-sequences leaves with a large temp offset and rebuilds parents; this is disruptive to parent/child hashes and can slow subsequent diffs temporarily. Use it sparingly, ideally in maintenance windows.
  • XOR vs hash-of-hash: Parent XORs are chosen for speed and incremental updates. Leaf hashes remain cryptographic; XOR parents are for pruning, not integrity on their own. Collisions at parent levels are extremely unlikely to hide a leaf difference because leaves are still compared directly on mismatch paths.
  • Mtree-specific tuning knobs: max_cpu_ratio influences worker count and pool sizing; batch_size controls how many ranges are compared per DB round-trip in diffs. Higher values increase throughput but raise DB and network load.
  • Error handling: If splits/merges or parent rebuilds fail (e.g., type issues, missing metadata), update/diff aborts; the remedy is usually to fix the schema/state and rerun, or rebuild the tree when metadata and actual table diverge.
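
For reference, the knobs above could be grouped as in this hypothetical struct; the field names mirror the documented options, but the type itself is illustrative, not ACE's API.

```go
package main

import "fmt"

// MtreeOptions is a hypothetical grouping of the tuning knobs
// discussed above; it is not ACE's actual configuration type.
type MtreeOptions struct {
	MaxCPURatio float64 // caps worker count / pool sizing for hashing
	BatchSize   int     // ranges compared per DB round-trip during diff
	BlockSize   int     // rows per leaf; fixed at build, rebuild to change
	Rebalance   bool    // enable merge passes during UpdateMtree
	NoCDC       bool    // skip slot consumption (diffs may use stale trees)
}

func main() {
	opts := MtreeOptions{MaxCPURatio: 0.5, BatchSize: 16, BlockSize: 100000}
	fmt.Printf("%+v\n", opts)
}
```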

Observability and Teardown

  • Task records: Mtree runs are recorded in ace_tasks.db (table ace_tasks) with context about block size, nodes, and statuses. Check logs for counts of dirty blocks, splits/merges applied, parent rebuilds, and extra-leaf mismatches during diff traversal.
  • Slot health: Monitor logical slot lag/WAL retention; a bloated slot suggests update hasn’t consumed changes. If the slot is dropped or stale, rebuild (init + build) and recreate the slot/publication entry.
  • Teardown/reset: Prefer the built-in teardown/teardown-table commands to remove mtree artifacts (mtree table, composite type, metadata, publication entry). Use manual drops only if automation fails, then rebuild with the desired block size and resume CDC.
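
A hedged slot-health sketch using the standard pg_replication_slots view; the slot name is a placeholder for whatever ACE created.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// WAL retained behind the slot's restart point; a growing number
	// means UpdateMtree has not consumed changes recently.
	var lagBytes int64
	err = db.QueryRow(`
		SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)::bigint
		FROM pg_replication_slots
		WHERE slot_name = $1`, "ace_slot").Scan(&lagBytes)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("slot retains ~%d bytes of WAL\n", lagBytes)
}
```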