Merkle Tree Architecture
Developing Merkle Trees for ACE is driven by the following requirements:
- Table-diff is efficient for one-off or infrequent checks, but rerunning it regularly recomputes block hashes across the entire table every time.
- When most data is unchanged, re-hashing every block is wasteful—especially on very large tables (think ~1 TB or ~1B rows) where hash scans dominate runtime and resource use.
- A Merkle tree lets us cache and reuse block hashes between runs. Unchanged subtrees are skipped; only branches covering modified rows are rehashed and compared, reducing I/O, CPU, and network load for recurring diffs.
- The goal is fast, incremental validation for operators who need frequent consistency checks without paying the full-block hashing cost each run.
What Is a Merkle Tree?
- A Merkle tree is a hash tree: leaves hold hashes of data blocks; every internal node combines its children’s hashes into a parent hash. The root represents the entire dataset—if any leaf changes, all ancestors (and the root) change.
- ACE optimisation: parent nodes are computed as XORs of child hashes rather than hashing the concatenation. XOR is fast, associative, and easy to update incrementally when a child changes, making rebuilds cheaper for large trees.
flowchart TB
ROOT["Root = (h1 XOR h2) XOR (h3 XOR h4)"]:::root
N12["Parent = h1 XOR h2"]:::node
N34["Parent = h3 XOR h4"]:::node
L1["Leaf 1<br/>hash = h1"]:::leaf
L2["Leaf 2<br/>hash = h2"]:::leaf
L3["Leaf 3<br/>hash = h3"]:::leaf
L4["Leaf 4<br/>hash = h4"]:::leaf
ROOT --> N12
ROOT --> N34
N12 --> L1
N12 --> L2
N34 --> L3
N34 --> L4
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
How Merkle Trees Work in ACE
- Initial build (full scan once): ACE partitions the table into PK-ordered blocks (like table-diff) and computes leaf hashes for every block on each node—this is the only full-table scan. Parent hashes are built upward using XOR to form the root.
- Change capture (pgoutput + replication slot): The table is added to a publication; pgoutput changes are streamed into a logical replication slot. ACE marks affected blocks as dirty based on PK ranges (no full re-scan).
- Update step:
UpdateMtreereads accumulated changes from the slot, splits or merges blocks if needed, recomputes hashes only for dirty/new leaves, rebuilds parent XORs, and clears dirty flags. Block size is recovered from metadata to stay consistent with the build. - Diff step:
DiffMtreefirst runs an update (to fold recent changes into the tree), then traverses trees across nodes. Matching hashes at a node mean “skip subtree”; mismatches recurse until leaves are reached, where row-level diffs are fetched if required.
flowchart TD
Build["Build: full table scan<br/>compute leaf hashes<br/>build XOR parents"]:::build --> Slot["Logical replication slot<br/>(pgoutput changes)"]:::cdc
Slot --> Update["Update: mark dirty blocks<br/>split/merge if needed<br/>rehash dirty leaves<br/>rebuild parents"]:::update
Update --> Diff["Diff: compare roots<br>→ if mismatch, recurse<br>→ fetch rows at leaves"]:::diff
classDef build fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef update fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef diff fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
CDC Pipeline and Dirty Block Detection
- Capture: Each node adds the table to a publication; pgoutput changes are streamed via a logical replication slot (see
UpdateFromCDC). Inserts/deletes/updates are mapped to PK ranges; affected blocks are marked dirty—no rehash yet. - Persist: The slot retains WAL until ACE consumes it; long delays can bloat WAL. Ensure slot retention is acceptable and the service role has replication/publication privileges.
- Apply: On
UpdateMtree, ACE reads the slot, updates per-block counters/dirty flags, and resetsrange_endto NULL for the tail when new rows arrive past the last bound. Splits/merges are then decided from those dirty flags. - Options:
--no-cdcskips slot consumption (not recommended before a diff);CDCProcessingTimeoutbounds how long ACE waits while applying changes. - User reminders
- Keep the slot healthy: monitor lag to avoid WAL buildup; drop/recreate only if you rebuild trees.
- Run
updatebeforediff(ACE does this automatically fordiff) so trees include recent changes. - Ensure the publication still includes the table and that the slot isn’t removed by maintenance.
flowchart LR
WAL["WAL (pgoutput)"]:::wal --> Slot["Logical slot<br/>(retains changes)"]:::cdc --> Apply["UpdateFromCDC:<br/>mark dirty blocks<br/>update tail bound"]:::apply --> Update["UpdateMtree:<br/>split/merge decisions<br/>rehash dirty leaves"]:::update
classDef wal fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef apply fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef update fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
Updates: Why Splits and Merges Matter
- Why split? Hot ranges can grow far beyond the intended block size due to concentrated inserts (e.g., sequential PKs). Oversized blocks slow hashing and make diff localisation coarse. Splitting restores a balanced fan-out and keeps leaf work bounded.
- Why merge? Sparse/deleted ranges can shrink, leaving tiny blocks that add overhead with little data. Merging adjacent small blocks reduces hash tasks and tree depth.
- Update flow: During
UpdateMtree, ACE identifies dirty/new blocks from CDC, then: - Finds overfull blocks and splits them (with re-sequencing of node positions).
- Optionally merges underfull blocks if
Rebalanceis enabled. - Rehashes only the affected leaves and rebuilds parents.
- Clears dirty flags so future updates skip unchanged blocks.
flowchart LR
CDC["CDC marks dirty blocks"]:::cdc --> SplitCheck{"Over block_size?"}:::split
SplitCheck -->|yes| Split["Split block<br/>adjust ranges"]:::op
SplitCheck -->|no| MergeCheck{"Under merge threshold?"}:::split
MergeCheck -->|yes| Merge["Merge adjacent small blocks"]:::op
MergeCheck -->|no| Skip["No shape change"]:::op
Split --> Rehash["Rehash affected leaves"]:::rehash
Merge --> Rehash
Skip --> Rehash
Rehash --> Parents["Rebuild parent XORs<br/>clear dirty flags"]:::rehash
classDef cdc fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef split fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef op fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef rehash fill:#fdeaea,stroke:#d14343,color:#1f1a1a;
Examples: Splitting and Merging Merkle Tree Nodes
- Splitting
- A split block is appended at the end (new
node_position = max + 1) to avoid immediate parent-hash churn; afterward ACE re-sequences leaves by PK start so positions are contiguous for later maintenance. - Parent nodes are dropped and rebuilt after splits so XOR hashes reflect the new leaf layout.
-
Rationale: inserting the new block adjacent to the original would change hashes for every ancestor along that branch, forcing deeper traversal even though only two leaves changed. Appending preserves most parent-child hashes until the controlled rebuild, which limits needless tree churn.
-
Why adjacency hurts
Inserting the split block next to the original forces a cascade of parent hash changes, so diff traversal cannot prune early.
flowchart TB
TitleBad["Bad: insert adjacent (parent churn)"]:::title
AROOT["p6' (root changed)"]:::root
AP4["p4'"]:::node
AP5["p5'"]:::node
AP1["p1'"]:::node
AP2["p2'"]:::node
AP3["p3'"]:::node
AX1["x1"]:::leaf
AM["m (new)"]:::leaf
AN["n (new)"]:::leaf
AX3["x3"]:::leaf
AX4["x4"]:::leaf
AX5["x5"]:::leaf
AROOT --> AP4
AROOT --> AP5
AP4 --> AP1
AP4 --> AP2
AP5 --> AP3
AP1 --> AX1
AP1 --> AM
AP2 --> AN
AP2 --> AX3
AP3 --> AX4
AP3 --> AX5
classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef title fill:transparent,stroke:none,color:#0f1c2d;
- Why appending helps
Appending the new block leaves most parents intact; only the split branch and ancestors that include it change. Diff traversal prunes more quickly.
flowchart TB
TitleGood["Good: append split block (minimal churn)"]:::title
RROOT["p6' (root changed)"]:::root
RP4["p4'"]:::node
RP5["p5'"]:::node
RP1["p1'"]:::node
RP2["p2 (unchanged)"]:::node
RP3["p3'"]:::node
RX1["x1"]:::leaf
RM["m (split part)"]:::leaf
RX3["x3"]:::leaf
RX4["x4"]:::leaf
RX5["x5"]:::leaf
RN["n (appended)"]:::leaf
RROOT --> RP4
RROOT --> RP5
RP4 --> RP1
RP4 --> RP2
RP5 --> RP3
RP1 --> RX1
RP1 --> RM
RP2 --> RX3
RP2 --> RX4
RP3 --> RX5
RP3 --> RN
classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
classDef title fill:transparent,stroke:none,color:#0f1c2d;
- Merging
- ACE merges a block with its immediate successor when their combined row count is under ~1.5×
block_size. The merged block keeps the lower position; the successor is deleted. After each merge pass, positions are re-sequenced (same temp-offset trick) to keep positions contiguous for later splits/merges. - Parent nodes are dropped and rebuilt post-merge.
- Re-sequencing uses a large temporary offset (1,000,000) to avoid position collisions during moves; this is safe up to ~100B rows, but is disruptive and can change parent-child relationships, so heavy rebalancing is recommended only when absolutely necessary and during maintenance windows.
flowchart TB
subgraph BeforeMerge["Before merge"]
M0["pos=0 [1..80] (small)"]:::leaf --> M1["pos=1 [81..120] (small)"]:::leaf --> M2["pos=2 [121..300]"]:::leaf --> M3["pos=3 [301..nil]"]:::leaf
end
subgraph MergeStep["Merge step"]
MM0["pos=0 [1..120] (merged)"]:::leaf
MM2["pos=2 [121..300]"]:::leaf
MM3["pos=3 [301..nil]"]:::leaf
end
subgraph AfterResequence["After resequence"]
RM0["pos=0 [1..120]"]:::leaf --> RM1["pos=1 [121..300]"]:::leaf --> RM2["pos=2 [301..nil]"]:::leaf
end
M3 -.-> MM0
MM3 -.-> RM0
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
Additional Split/Merge Considerations
- Open-ended last block handling
- The trailing block often has
range_end = NULLso new inserts beyond the current max PK land there. When splitting an unbounded block, ACE temporarily discovers a max PK to create a split point, then ensures the final leaf is unbounded again so future inserts keep hitting a single tail block. This prevents new rows from “falling off” the tree after a split.
flowchart LR
Before["pos=0 [1..100]"]:::leaf --> Tail["pos=1 [101..nil] (tail)"]:::leaf
After["pos=0 [1..100]"]:::leaf --> Mid["pos=1 [101..500]"]:::leaf --> Tail2["pos=2 [500..nil] (tail reset)"]:::leaf
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
- Composite PK split points
- For composite keys, split points are chosen using ordered tuples (
ROW(col1, col2, ...)) at roughly the midpoint of the block’s row count. Bounds and inserts are bound component-wise. If a final “sliver” block would be too small, ACE drops the last split point to avoid creating a tiny trailing leaf. -
Simple PKs pick a single midpoint value; composite PKs load and register the composite type so range updates and inserts stay consistent with the PK ordering.
-
Thresholds and rebalance guidance
- Splits fire when a block grows beyond ~2×
block_size; merges apply when neighbouring blocks together are under ~1.5×block_size(or ~25% thresholds in Python reference). This keeps leaf sizes within a sensible band. - Re-sequencing and parent rebuilds are cheap compared to leaf rehashing, but they do perturb parent-child relationships. Use heavy rebalancing (
--rebalance) sparingly—ideally in maintenance windows—to avoid churn on hot tables.
Diff Traversal Algorithm
- ACE compares Merkle trees across node pairs top-down. If root hashes match, the entire table is considered in sync. When a parent hash mismatches, ACE descends to its children; matching subtrees are pruned, mismatching subtrees recurse until leaves are reached. If leaves don’t align (one side has extra splits/merges), the extra leaves are treated as mismatches.
- Parent hashes are XORs of child hashes (fast to recompute). Leaf hashes are cryptographic hashes of table blocks; XOR collisions at parent levels are acceptable because leaf hashes still guard correctness.
- Edge cases:
- Unbalanced trees: If one node split/merged more, traversal still walks each subtree; missing peers are flagged as mismatches.
- Early pruning: Large identical subtrees are skipped after a single hash compare.
- Diff materialisation: Only at mismatching leaves; ACE then fetches block rows and aligns by PK.
flowchart TB
%% Two trees with a mismatching branch and extra leaf on right
subgraph TreeL["Node A"]
LRoot["root_A"]:::root
LA["A1"]:::node
LB["A2"]:::node
L1["L1"]:::leaf
L2["L2"]:::leaf
L3["L3"]:::leaf
L4["L4"]:::leaf
LRoot --> LA
LRoot --> LB
LA --> L1
LA --> L2
LB --> L3
LB --> L4
end
subgraph TreeR["Node B"]
RRoot["root_B"]:::root
RA["B1"]:::node
RB["B2"]:::node
R1["R1"]:::leaf
R2["R2"]:::leaf
R3["R3"]:::leaf
R4["R4"]:::leaf
RExtra["R_extra"]:::leaf
RRoot --> RA
RRoot --> RB
RA --> R1
RA --> R2
RB --> R3
RB --> R4
RB --> RExtra
end
%% Traversal annotations
LRoot -. "hash mismatch" .- RRoot
LA -. "hash match (prune)" .- RA
LB -. "hash mismatch" .- RB
L3 -. "leaf mismatch" .- R3
L4 -. "leaf mismatch" .- R4
RB -. "extra leaf" .-> RExtra
classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
flowchart TB
%% Balanced trees where only one leaf differs
subgraph TreeL2["Node A"]
LRoot2["root_A"]:::root
LA2["A1"]:::node
LB2["A2"]:::node
L1a["L1"]:::leaf
L2a["L2"]:::leaf
L3a["L3"]:::leaf
L4a["L4"]:::leaf
LRoot2 --> LA2
LRoot2 --> LB2
LA2 --> L1a
LA2 --> L2a
LB2 --> L3a
LB2 --> L4a
end
subgraph TreeR2["Node B"]
RRoot2["root_B"]:::root
RA2["B1"]:::node
RB2["B2"]:::node
R1a["R1"]:::leaf
R2a["R2_diff"]:::leaf
R3a["R3"]:::leaf
R4a["R4"]:::leaf
RRoot2 --> RA2
RRoot2 --> RB2
RA2 --> R1a
RA2 --> R2a
RB2 --> R3a
RB2 --> R4a
end
LRoot2 -. "hash mismatch" .- RRoot2
LA2 -. "hash mismatch" .- RA2
LB2 -. "hash match (prune)" .- RB2
L1a -. "match" .- R1a
L2a -. "leaf mismatch" .- R2a
classDef root fill:#fff6e0,stroke:#d39b00,color:#1f2a44;
classDef node fill:#e8f0ff,stroke:#3b73b9,color:#0f1c2d;
classDef leaf fill:#eef2f7,stroke:#5b6f82,color:#0f1c2d;
Notes:
- Traversal proceeds until leaves are aligned or exhausted; missing peers (extra leaves) are flagged.
- Parent XOR hashes speed pruning; leaf cryptographic hashes guard correctness.
- Diffs are materialised only at mismatching leaves, limiting database reads.
- In the unbalanced example above, the right-side tree has an extra leaf (R_extra). ACE descends until it exhausts matching parents; when it sees that RB has a child with no peer, it records that entire extra leaf as a mismatch (representing rows present only on Node B). Similarly, if Node A had more leaves, those would be marked as present-only-on-A. This ensures asymmetric splits/merges still surface as drift.
Operational Details and Limits
- Schema objects and lifecycle: Build creates
ace_mtree_<schema>_<table>plus a composite PK type when needed, metadata rows (block size, counts, timestamps), and adds the table to a publication/slot. Teardown drops these. Updates/diffs reuse the metadata; changingblock_sizerequires a rebuild. - Block size reuse:
UpdateMtreereads block size from metadata; you cannot “resize” a tree via update. Rebuild if you need a different block size. - CDC drift and slot health: If the logical slot is dropped or lags, WAL can bloat and trees become stale. Recover by rebuilding (init + build) and recreating the slot/publication entry.
--no-cdcmeans you are diffing stale trees unless you run a rebuild first. - Rebalance caveats:
--rebalancere-sequences leaves with a large temp offset and rebuilds parents; this is disruptive to parent/child hashes and can slow subsequent diffs temporarily. Use it sparingly, ideally in maintenance windows. - XOR vs hash-of-hash: Parent XORs are chosen for speed and incremental updates. Leaf hashes remain cryptographic; XOR parents are for pruning, not integrity on their own. Collisions at parent levels are extremely unlikely to hide a leaf difference because leaves are still compared directly on mismatch paths.
- Mtree-specific tuning knobs:
max_cpu_ratioinfluences worker count and pool sizing;batch_sizecontrols how many ranges are compared per DB round-trip in diffs. Higher values increase throughput but raise DB and network load. - Error handling: If splits/merges or parent rebuilds fail (e.g., type issues, missing metadata),
update/diffaborts; the remedy is usually to fix the schema/state and rerun, or rebuild the tree when metadata and actual table diverge.
Observability and Teardown
- Task records: Mtree runs are recorded in
ace_tasks.db(tableace_tasks) with context about block size, nodes, and statuses. Check logs for counts of dirty blocks, splits/merges applied, parent rebuilds, and extra-leaf mismatches during diff traversal. - Slot health: Monitor logical slot lag/WAL retention; a bloated slot suggests
updatehasn’t consumed changes. If the slot is dropped or stale, rebuild (init + build) and recreate the slot/publication entry. - Teardown/reset: Prefer the built-in
teardown/teardown-tablecommands to remove mtree artifacts (mtree table, composite type, metadata, publication entry). Use manual drops only if automation fails, then rebuild with the desired block size and resume CDC.