Table Diff Architecture
The design of table diff is driven by the following requirements and constraints:
- Whether a table is kept consistent via logical replication, physical replication, backup/restore, or bulk copy, operators need a fast, low-impact way to verify that the replicas really match.
- Full-table comparisons that take hours are operationally risky (locks, lag, disk and network blow-ups). We need a method that scales down to quick “am I in sync?” checks and scales up to multi-billion-row tables without bringing the cluster to its knees.
- The outcome should be actionable: pinpoint what differs (adds/deletes/changes) rather than a binary “good/bad,” so remediation can be targeted.
Existing Approaches and Their Limitations
- Dump + file diff (e.g., `pg_dump` then `diff`)
  - Pros: simple tooling, easy to script.
  - Limitations: requires extracting the entire table; dumps must be sorted or canonicalised to be diffable; produces huge intermediates; easily runs for hours on large tables; false positives from ordering/format differences.
- Whole-table checksum
  - Pros: single pass, compact result.
  - Limitations: still scans the whole table; a single checksum gives no locality of differences; any drift forces another full scan to localise; sensitive to scan order; changes during the scan can invalidate results.
- Row-by-row comparison script
  - Pros: can respect primary-key ordering and return precise differences.
  - Limitations: typically single-threaded and network-heavy; must stream or materialise the entire table; slow on wide rows; prone to timeouts and snapshot drift if not carefully managed; bespoke scripts vary in correctness and performance.
Algorithm (Block-Hash Diff)
The table is partitioned into primary-key ranges (blocks), and identical block boundaries are applied on every node. Each range/node pair is hashed by parallel workers; only blocks whose hashes disagree are split further and, eventually, materialised for row-level diffing. A minimal SQL sketch of the hash-and-split steps follows the diagram.
```mermaid
flowchart LR
%% Styles
classDef blk fill:#eef,stroke:#6b7a99,color:#1f2a44;
classDef worker fill:#f5f9ff,stroke:#5b8def,stroke-dasharray:3 2,color:#0f2b5b;
classDef cmp fill:#fff7e6,stroke:#e0a800,color:#5a4300;
classDef split fill:#fdecea,stroke:#de4c4c,color:#5b0f0f;
classDef skip fill:#e9f7ef,stroke:#2ca24c,color:#1f5b32;
classDef fetch fill:#ffffff,stroke:#6b7a99,color:#1f2a44;
classDef diff fill:#ffffff,stroke:#6b7a99,color:#1f2a44;
classDef report fill:#ffffff,stroke:#6b7a99,color:#1f2a44;
%% Blocks on Node A
subgraph Blocks_NodeA["Node A: PK-ordered blocks"]
A1["Block 1<br>PK a..b"]:::blk --> A2["Block 2<br>PK b..c"]:::blk --> A3["Block 3<br>PK c..d"]:::blk
end
%% Blocks on Node B
subgraph Blocks_NodeB["Node B: same boundaries"]
B1["Block 1<br>PK a..b"]:::blk --> B2["Block 2<br>PK b..c"]:::blk --> B3["Block 3<br>PK c..d"]:::blk
end
%% Hashing in parallel
A1 -.-> WA1["Worker hashes<br>Block 1 @A"]:::worker
B1 -.-> WB1["Worker hashes<br>Block 1 @B"]:::worker
A2 -.-> WA2["Worker hashes<br>Block 2 @A"]:::worker
B2 -.-> WB2["Worker hashes<br>Block 2 @B"]:::worker
A3 -.-> WA3["Worker hashes<br>Block 3 @A"]:::worker
B3 -.-> WB3["Worker hashes<br>Block 3 @B"]:::worker
%% Hash compare per block
subgraph CompareQ["Hash compare per block"]
WA1 --> C1{"Block 1<br>hash match?"}:::cmp
WB1 --> C1
WA2 --> C2{"Block 2<br>hash match?"}:::cmp
WB2 --> C2
WA3 --> C3{"Block 3<br>hash match?"}:::cmp
WB3 --> C3
end
%% Block-level decisions
C1 -->|match| S1["Skip block<br>(no fetch)"]:::skip
C2 -->|mismatch| Split2["Split Block 2<br>into subranges"]:::split
C3 -->|match| S3["Skip block<br>(no fetch)"]:::skip
%% Recursive path for mismatched block
Split2 --> SA["Sub-block 2a"]:::blk
Split2 --> SB["Sub-block 2b"]:::blk
SA -.-> WSA["Worker hashes<br>2a"]:::worker
SB -.-> WSB["Worker hashes<br>2b"]:::worker
WSA --> RC2{"Sub-block hashes<br>match?"}:::cmp
WSB --> RC2
RC2 -->|match| RS2["Skip sub-block"]:::skip
RC2 -->|mismatch| Fetch2["Fetch ordered rows<br>for sub-block(s)"]:::fetch
Fetch2 --> Diff2["Row diff: add/del/mod<br>respect max_diff_rows"]:::diff
Diff2 --> Report["Report & summary<br>(JSON/HTML, taskstore)"]:::report
```
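The per-block work reduces to two database operations: an ordered hash over a PK range and, on mismatch, a split-point lookup. A minimal sketch of both, assuming a hypothetical table `public.customers_large` with a single `bigint` PK `id` (the SQL ACE actually generates may differ):

```sql
-- Hash one block (PK range [$1, $2)) in PK order; the same statement runs on every
-- node and the digests are compared. Matching blocks are skipped entirely.
SELECT md5(string_agg(t::text, '|' ORDER BY id)) AS block_hash
FROM public.customers_large AS t
WHERE id >= $1 AND id < $2;

-- On mismatch, find an approximate midpoint to split the block into two sub-ranges;
-- $3 is roughly half the block's estimated row count.
SELECT id AS split_point
FROM public.customers_large
WHERE id >= $1 AND id < $2
ORDER BY id
OFFSET $3
LIMIT 1;
```

Recursion repeats the same pair of steps on each sub-range until the mismatching rows are isolated or the configured limits are reached.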
Range Generation Without Full Scans
For very large tables we cannot afford `SELECT COUNT(*)` or `SELECT pk FROM table ORDER BY pk` to drive block boundaries; both imply full scans and can take hours.
- ACE uses planner estimates (`queries.GetRowCountEstimate`) to gauge table cardinality cheaply. This makes up-to-date statistics important: stale stats skew block sizing and sampling.
- For filtered diffs (materialized view), ACE performs a real `COUNT(*)` only on the filtered view, accepting the smaller scope to get accurate bounds.
- Based on the estimated row count, ACE chooses a sampling mode for `TABLESAMPLE` in `GeneratePkeyOffsetsQuery` (a sketch follows this list):
  - Very large tables (100+ billion rows): `SYSTEM` with ~0.01%
  - Large tables (1–100 billion rows): `SYSTEM` with ~0.1%
  - Medium tables (100k–1 million rows): `BERNOULLI` with ~1%
  - Smaller tables (10k–100k rows): `BERNOULLI` with a higher percentage (up to 100%)
- The sampled PKs are bucketed via `ntile` to yield start/end boundaries for each block. ACE also prepends a synthetic leading range (`Start = nil`) to ensure coverage from the absolute beginning, even if sampling misses tiny early values.
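A minimal sketch of the sampling-and-bucketing idea, again assuming a `bigint` PK `id` on the hypothetical `public.customers_large` table; the percentage and block count are illustrative, and the query `GeneratePkeyOffsetsQuery` builds may differ:

```sql
-- Sample a small fraction of PKs without a full scan, bucket the sample into
-- 100 blocks, and use each bucket's min/max as block boundaries.
WITH sampled AS (
    SELECT id
    FROM public.customers_large TABLESAMPLE BERNOULLI (1)  -- SYSTEM and a far lower % on huge tables
),
bucketed AS (
    SELECT id, ntile(100) OVER (ORDER BY id) AS bucket
    FROM sampled
)
SELECT bucket,
       min(id) AS range_start,
       max(id) AS range_end
FROM bucketed
GROUP BY bucket
ORDER BY bucket;
```

The synthetic leading range (`Start = nil` up to the first boundary) is then prepended so rows below the smallest sampled PK are still covered.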
Range Alignment Across Nodes
Challenge: block boundaries come from one anchor node (the one with the highest estimated row count). Other nodes may have rows that sort before the anchor’s first PK or after its last PK. If we only used closed intervals from the anchor, those edge rows would never be hashed.
- The offsets query (`GeneratePkeyOffsetsQuery`) always keeps the anchor's first and last PKs and leaves the final range open (`End = NULL`). That makes the top end unbounded on every node, so "extra tail" rows are still included and hashed.
- `table-diff` injects a synthetic leading range with `Start = NULL` and `End = <first anchor boundary>` to catch rows that exist only before the anchor's first PK on another node.
- Each range is applied identically to every node. In `hashRange`, `nil` means "no bound," so the same logical ranges become open at the edges, letting boundary skew surface as hash mismatches (see the sketch after the diagram below).
- When hashes differ because of leading/trailing skew, recursion splits the offending range until the discrepant rows are isolated and, if needed, materialised.
```mermaid
flowchart LR
subgraph Anchor["Anchor node (range generation)"]
A0["Start=nil<br>End=a"] --> A1["a..b"] --> A2["b..c"] --> A3["c..nil"]
end
subgraph Other["Other node (extra rows possible)"]
B0["Rows < a"] --> B1["a..b"] --> B2["b..c"] --> B3["Rows > c"]
end
A0 -.same bounds hashed.-> B0
A1 -.same bounds hashed.-> B1
A2 -.same bounds hashed.-> B2
A3 -.same bounds hashed.-> B3
classDef default fill:#eef,stroke:#6b7a99,color:#1f2a44;
```
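A sketch of how open bounds keep those edge rows in scope, with `NULL` standing in for the `nil` bound described above (the SQL `hashRange` actually issues may differ):

```sql
-- The same logical range is hashed on every node. A NULL bound means "unbounded",
-- so the synthetic leading range (Start = NULL) and the open final range (End = NULL)
-- still cover rows that exist only on a non-anchor node.
SELECT md5(string_agg(t::text, '|' ORDER BY id)) AS range_hash
FROM public.customers_large AS t
WHERE ($1::bigint IS NULL OR id >= $1)
  AND ($2::bigint IS NULL OR id <  $2);
```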
Comparison Notes: Simple vs Composite Primary Keys
- Ordering and hashing
  - Simple PK: ranges use scalar bounds; hashing binds one value for lower/upper when present.
  - Composite PK: ranges are tuples (`ROW(col1, col2, ...)`); hashing binds each component, and comparisons rely on the tuple sort order (a sketch follows the diagram below).
- Range splitting
  - Simple PK: median discovery and splits are straightforward using a single OFFSET/LIMIT on the ordered PK.
  - Composite PK: splits still pivot on PK order but must scan and bind every key component; more columns mean more bind parameters and slightly heavier queries.
- Row materialisation
  - `fetchRows` always orders by the PK columns (single or composite) so `CompareRowSets` can align rows deterministically.
  - Arrays/UDTs get cast to `TEXT` to avoid OID/scan issues regardless of PK shape.
- What users should watch for
  - Ensure the declared PK is the true business key; otherwise, composite drift can hide behind non-unique or misordered keys.
  - Keep statistics fresh so sampling and range sizing remain representative for multi-column distributions.
  - Avoid nullable or unstable key components (e.g., keys derived from timestamps that can change) to keep comparisons consistent over time.
```mermaid
flowchart LR
%% Styles
classDef note fill:#fff7e6,stroke:#e0a800,color:#5a4300;
classDef pk fill:#eef,stroke:#6b7a99,color:#1f2a44;
%% Composite PK
subgraph CompositePK["Composite PK (lexical ordering)"]
direction LR
C1["(dept=10, id=1)"]:::pk --> C2["(dept=10, id=2)"]:::pk --> C3["(dept=20, id=1)"]:::pk --> C4["(dept=20, id=3)"]:::pk
C2 -.-> CNOTE["Sorted by ROW(dept,id)<br>(dept then id)"]:::note
end
%% Simple PK
subgraph SimplePK["Simple PK (scalar ordering)"]
direction LR
S1["pk=1"]:::pk --> S2["pk=2"]:::pk --> S3["pk=3"]:::pk
S2 -.-> SNOTE["Compare with scalar <, >, ="]:::note
end
```
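A sketch of the composite-key mechanics, using hypothetical columns (`dept`, `id`, `status`, and an array column `tags`) on `public.orders`; the SQL that `hashRange` and `fetchRows` actually emit may differ:

```sql
-- Range predicate and ordering follow the tuple sort order ROW(dept, id), so the
-- same bounds select the same logical slice on every node; each key component is
-- bound separately.
SELECT md5(string_agg(t::text, '|' ORDER BY dept, id)) AS block_hash
FROM public.orders AS t
WHERE ROW(dept, id) >= ROW($1, $2)
  AND ROW(dept, id) <  ROW($3, $4);

-- Row materialisation for a mismatched sub-range: order by every PK column and cast
-- array/UDT columns to TEXT so rows can be aligned and compared deterministically.
SELECT dept, id, status, tags::text
FROM public.orders
WHERE ROW(dept, id) >= ROW($1, $2)
  AND ROW(dept, id) <  ROW($3, $4)
ORDER BY dept, id;
```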
Resource Utilisation and Tuning
- `block_size`: Larger blocks reduce hash tasks and recursion but increase memory/IO per hash and slow mismatch localisation; smaller blocks do the opposite (more queries, finer locality).
- `concurrency_factor`: Scales workers by `NumCPU`. Higher = faster hashing but more load on DB backends, network, and local CPU; can contend with other workloads and connection limits.
- `compare_unit_size`: Lower values push recursion deeper (more queries, smaller fetches); higher values stop earlier (fewer queries, larger fetches on mismatched ranges).
- `max_diff_rows`: Early-exit guardrail. Lower caps keep runs short and reports small on divergent tables; raising or removing the cap can grow memory and report size when drift is large.
- `table_filter`: Narrows scope and cost; enables an accurate `COUNT(*)` on the filtered view. Must be identical across nodes to avoid false positives.
- `override_block_size`: Skips the safety rails from `ace.yaml`. Oversized blocks can spike memory and slow hashes, especially on wide rows.
- `output` (json/html): HTML adds minor post-processing; DB load is unaffected.
Failure Modes and Safeguards
- Diff limit hit: `max_diff_rows` stops recursion early and marks the report; more differences may exist.
- Permission or schema/PK mismatch: Validation fails before work starts; nothing is executed against the DB.
- Bytea >1 MB: `CheckColumnSize` aborts to avoid runaway memory/IO (a pre-check sketch follows this list).
- Timeouts/slow ranges: Hashing is wrapped in timeouts; the first error is recorded and surfaced after workers finish.
- Stale stats: Skewed row-count estimates can mis-size blocks; rerun `ANALYZE` for better sampling.
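Oversized bytea values can be spotted before a run with a simple query; a sketch, assuming a hypothetical `payload` bytea column (this mirrors the intent of `CheckColumnSize`, not its exact implementation):

```sql
-- Report the largest bytea value in the column; anything over 1 MB (1,048,576 bytes)
-- would cause the table-diff run to abort for this table.
SELECT max(octet_length(payload)) AS max_bytea_bytes
FROM public.customers_large;
```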
Consistency Caveats
- No cross-node snapshot coordination. Concurrent writes during a run can appear as drift.
- Prefer quiescent windows or use `table_filter` to target stable partitions.
- PK stability matters: changing PK values mid-run can reshuffle ordering and produce noisy diffs.
Operational Tuning Playbook
- Start conservative on busy systems: smaller `concurrency_factor`, moderate `block_size`.
- For large but mostly consistent tables: increase `block_size` and `concurrency_factor` to hash faster; keep `compare_unit_size` reasonable to localise mismatches.
- For drift-heavy tables: lower `block_size`/`compare_unit_size` to localise quickly; keep `max_diff_rows` low to bound runtime and report size.
- After stats refreshes or schema changes, re-evaluate sampling and block sizing.
Limits and Edge Cases
- Requires a declared PK; up to three-way diffs only.
- `table_filter` creates per-node materialized views; filters must match exactly across nodes.
- Sampling can under-represent skewed PK distributions; synthetic leading/open ranges and recursive narrowing mitigate this, but extreme skew may need smaller blocks.
- Wide JSON/bytea/UDT columns increase hash and fetch cost; oversized bytea (>1 MB) blocks execution.
Observability
- Logs show range hashing progress, mismatches, recursion, and diff limits.
- Progress bars (mpb) reflect hash and mismatch-analysis stages.
- Task status and summary are persisted to SQLite (`ace_tasks.db` by default, or the `ACE_TASKS_DB` path) via `taskstore` in table `ace_tasks` with columns: `task_id`, `task_type`, `task_status`, `cluster_name`, `task_context` (JSON), `schema`, `table_name`, `repset_name`, `diff_file_path`, `started_at`, `finished_at`, `time_taken`. Example rows:
| task_id | task_status | cluster_name | schema | table_name | diff_file_path | started_at | finished_at | time_taken (s) | task_context (truncated) |
|---|---|---|---|---|---|---|---|---|---|
| 9f7f…e21b | COMPLETED | acctg | public | customers_large | public_customers_large_diffs-20250722120353.json | 2025-07-22T12:03:51Z | 2025-07-22T12:03:53Z | 2.1 | {"qualified_table":"public.customers_large","mode":"diff","nodes":"all","diff_summary":{...}} |
| 3b2c…9aa4 | FAILED | acctg | public | orders | | 2025-07-21T10:15:00Z | 2025-07-21T10:15:08Z | 8.0 | {"qualified_table":"public.orders","mode":"diff","nodes":"n1,n2","error":"user \"replicator\" lacks…"} |
| 1a4d…c7f2 | COMPLETED | acctg | public | invoices | public_invoices_diffs-20250720115900.json | 2025-07-20T11:58:55Z | 2025-07-20T11:59:00Z | 5.0 | {"qualified_table":"public.invoices","mode":"diff","nodes":"n1,n2","table_filter":"billing_cycle = …"} |
- Diff reports are written to timestamped JSON (and HTML if selected); paths are logged on completion.
- Sample analytics (SQLite):
  - Recent table-diff runs:

    ```sql
    SELECT task_id, task_status, started_at, finished_at, time_taken, diff_file_path
    FROM ace_tasks
    WHERE task_type = 'TABLE_DIFF'
    ORDER BY started_at DESC
    LIMIT 20;
    ```

  - Success/fail counts over the last 7 days:

    ```sql
    SELECT task_status, COUNT(*)
    FROM ace_tasks
    WHERE task_type = 'TABLE_DIFF'
      AND started_at >= datetime('now', '-7 day')
    GROUP BY task_status;
    ```

  - Drift summary by run (JSON fields from `task_context.diff_summary`):

    ```sql
    SELECT task_id,
           json_extract(task_context, '$.diff_summary.total_rows_checked') AS rows_checked,
           json_extract(task_context, '$.diff_summary.mismatched_ranges_count') AS mismatched_ranges,
           json_extract(task_context, '$.diff_summary.diff_row_limit_reached') AS limit_hit
    FROM ace_tasks
    WHERE task_type = 'TABLE_DIFF'
    ORDER BY started_at DESC
    LIMIT 20;
    ```