# Recovering from Catastrophic Node Failure
Suppose your cluster is running smoothly with five nodes — n1, n2, n3, n4, and n5 — all connected using Spock logical replication. Node n1 sends transactions to the other four nodes. Then something goes wrong: perhaps because of network delay, n2 has not yet received all of the transactions that n3, n4, and n5 have already applied. Before n2 can catch up, n1 crashes. You are left with n2 lagging behind the others, and no way for it to get the missing data from n1. This is a catastrophic node failure scenario.
Depending on how you run your cluster, more than one node may accept writes. Later in this document, we also show a multiple-node failure example where a second node had also originated transactions before failing.
In the following sections, we'll walk through how to recover a node using Spock for replication with the Active Consistency Engine (ACE). In each scenario, we use a fully synchronized node as our source of truth. ACE repairs the missing data and can preserve the origin ID and commit timestamp for each repaired row so that replication metadata stays correct and your cluster remains consistent and conflict-free.
This document covers two cases:
- single node failure (one node fails and another is lagging)
- multiple node failure (two or more nodes fail and one or more survivors are behind)

The same idea applies in both cases: you identify what is missing on the surviving node(s), then repair them from a node with complete data, preserving the origin ID and timestamp for every repaired row.
## Single Node Failure vs. Multiple Node Failure
The following diagrams show the two scenarios and how to recover.
### Scenario 1: One Node has Failed, and One Node is Lagging
In our first example, one node (n1) fails, while another node (n2) is lagging behind because it did not receive all of that node's transactions. The other survivors (n3, n4, n5) have complete data.
```mermaid
flowchart LR
    subgraph failed [Failed]
        N1[n1 failed]
    end
    subgraph behind [Behind]
        N2[n2 missing rows from n1]
    end
    subgraph source [Source of truth]
        N3[n3 complete]
        N4[n4 complete]
        N5[n5 complete]
    end
    N1 -.->|was replicating to| N2
    N3 -->|recover from| N2
```
#### Implementing a Recovery
- Node n1 has failed.
- Node n2 is lagging.
Our source of truth must be n3, n4, or n5.
Our first step is to clean up: use Spock commands to drop the subscriptions to n1, and then drop node n1 from the cluster.
Then, on our survivor (n2), run the following for each table:

```shell
table-diff --preserve-origin n1 --until <n1_failure_time>
```
Then, on our survivor (n2), for each table with differences, run the ACE table-repair command, specifying the `--recovery-mode`, `--source-of-truth n3`, and `--preserve-origin` options.
These steps ensure that n2 contains all of the rows that originated on n1.
**Note:** These are high-level steps. Detailed instructions for each step, including exact commands and options, are provided later in this document, starting at Phase 1: Assess the Damage. In some cases, tables may still show differences after an initial repair; see the Troubleshooting section for guidance.
### Scenario 2: Multiple Nodes have Failed, and One Node is Lagging
In our next example, we'll assume multiple nodes have failed (for our example, we'll use n1 and n4). If two (or more) nodes fail, leaving one or more survivors, the damaged survivor (node n2 is lagging, but not down) may be missing rows that originated from either n1 or n4. To recover, pick one fully synchronized survivor as the source of truth (we'll use n3) and recover n2, adding data from both n1 and n4.
In this example, we'll assume n4 was also accepting writes before it failed, so some rows in the cluster have an origin ID of n4.
```mermaid
flowchart LR
    subgraph failed [Failed]
        N1[n1 failed]
        N4[n4 failed]
    end
    subgraph behind [Behind]
        N2[n2 missing rows from n1 and n4]
    end
    subgraph source [Source of truth]
        N3[n3 complete]
        N5[n5 complete]
    end
    N1 -.->|was replicating to| N2
    N4 -.->|was replicating to| N2
    N3 -->|recover n2 for both n1 and n4| N2
```
#### Implementing a Recovery
- Nodes n1 and n4 have failed.
- Node n2 is lagging.
Our source of truth must be n3 or n5.
Our first step is to clean up: drop the subscriptions to n1 and n4, and then drop both nodes from the cluster.
Then, on each surviving node, and for each table, run the table-diff command against each failed node, including the `--preserve-origin` and `--until <failure_time>` options:
```shell
table-diff --preserve-origin n1 --until <n1_failure_time>
table-diff --preserve-origin n4 --until <n4_failure_time>
```
Each run will return one diff file per table/origin combination.
Then, on n2, for each table with differences, run table-repair
with --recovery-mode, --source-of-truth n3, and --preserve-origin.
These steps will ensure n2 contains all of the rows that originated on n1 and n4, with the origin ID and timestamp preserved for each.
**Note:** In the case of a multiple-node failure, you run diff and repair once per failed origin. For each table, that means one diff (and one repair) for n1 and one diff (and one repair) for n4. The source of truth (n3) is the same for all repairs.
### Implementing Repairs in Both Cases
```mermaid
flowchart TD
    subgraph single [Single node failure]
        S1[n1 fails, n2 behind]
        S2[Drop n1. Diff vs n1. Repair n2 from n3.]
        S1 --> S2
    end
    subgraph multi [Multiple node failure]
        M1[n1 and n4 fail, n2 behind]
        M2[Drop n1 and n4. Diff vs n1, then vs n4.]
        M3[Repair n2 from n3 for n1 rows. Repair n2 from n3 for n4 rows.]
        M1 --> M2 --> M3
    end
```
In both cases you:
1. Clean up the failed node(s) in Spock.
2. Run table-diff on all tables—once per failed origin in the multiple-node
case.
3. Run table-repair with `--preserve-origin` and a single source of truth
so that origin ID and commit timestamp are preserved for each repaired row.
## What Happens in a Catastrophic Node Failure of a Single Node
When one node fails mid-replication, some of its transactions may have been applied on some subscribers but not others. In our single-failure example:
- n1 has failed and is no longer available.
- n2 received fewer transactions from n1 (for example, due to network delay) and is behind.
- n3, n4, and n5 received all of n1's transactions and are fully synchronized.
You need to bring n2 back in line with n3, n4, and n5. ACE does this by
comparing tables across the surviving nodes, focusing only on data that
originated from n1 and was committed before the failure. It then repairs n2
using a chosen source of truth (in our example, n3). When you use the
--preserve-origin option, ACE preserves each repaired row's origin ID
(the node that originally wrote the row) and commit timestamp (to
microsecond precision) so that replication metadata remains correct.
The following diagram shows the state at the moment of failure of our single node, n1:
```mermaid
sequenceDiagram
    participant N1 as n1 (failed)
    participant N2 as n2 (behind)
    participant N3 as n3 (up to date)
    participant N4 as n4 (up to date)
    participant N5 as n5 (up to date)
    Note over N1: Sends transactions to all
    rect rgb(230, 240, 255)
        N1->>N2: Partial replication (e.g. 3 of 5)
        N1->>N3: Full replication (all 5)
        N1->>N4: Full replication (all 5)
        N1->>N5: Full replication (all 5)
    end
    N1--xN1: Crash (n1 offline)
    Note over N2: Missing rows from n1
    Note over N3,N5: Complete - use as source of truth
```
## Before You Begin
Before you start the recovery process, make sure you have the following in place:
- ACE installed and configured on a host that can reach your surviving nodes. ACE is used to compare and repair table data across nodes. For building, installing, and configuring ACE, see the ACE repository and its documentation (including configuration).
- Access to your surviving nodes (n2, n3, n4, and n5). You'll need to run ACE commands and, for Spock cleanup, connect to each node with a Postgres client.
- The approximate time when the failed node (or nodes) failed. You'll use this as a cutoff when running ACE so that only data committed before the failure is considered.
- Origin tracking enabled. Your Spock cluster should have `track_commit_timestamp = on` in Postgres so that ACE can use commit timestamps and origin IDs for recovery and preserve them when repairing rows (see the sketch below).
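To verify that setting before you begin, you can check it with a Postgres client. A minimal sketch, assuming placeholder host and database names:

```shell
# Confirm commit timestamp tracking is enabled.
# Host and database names are placeholders; adjust for your cluster.
psql -h n2.example.com -d mydb -c "SHOW track_commit_timestamp;"
```

If it returns `off`, enable it (for example, with `ALTER SYSTEM SET track_commit_timestamp = on;`) and restart PostgreSQL; the setting takes effect only after a restart, and only transactions committed afterward carry timestamps.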
**Note:** Recovery is a multi-step process. Take your time, and run the validation steps so you can confirm that all tables match before resuming normal operations. To preserve the origin ID and commit timestamp for repaired rows, include the `--preserve-origin` flag with every table-repair command in Phase 4. In some cases, tables may still show differences after an initial repair (for example, if writes occurred during the diff or repair). The Troubleshooting section at the end of this document covers the most common causes and how to resolve them.
## Overview of the Recovery Workflow
Recovery has five main phases; you should approach these phases in order:
1. **Assess the damage** – Check replication status and identify which
nodes are behind and which are fully synchronized.
2. **Spock cleanup** – Drop subscriptions and slots for the failed node
and remove the failed node from the cluster.
3. **Identify all missing data** – Run ACE table-diff on **every**
replicated table to see which tables have differences and what is
missing on n2.
4. **Repair all affected tables** – For each table that has differences,
run ACE table-repair using n3 (or your chosen node) as the source of
truth. Use `--preserve-origin` so that the origin ID and commit
timestamp are preserved for every repaired row.
5. **Validate** – Re-run table-diff on all repaired tables to confirm that
n2 matches the other survivors.
Because n2 might be missing data from more than one table, you must check all of your tables, not just one. The sections below walk you through each phase and show you the commands to run.
## Phase 1: Assess the Damage
Connect to your surviving nodes and check replication status. You need to confirm which nodes are behind and which have all of the data. You can use `spock.sub_show_status()` to see subscription status and any lag.
For example, connect to n2 and run:
```sql
SELECT * FROM spock.sub_show_status();
```
The output will list each subscription with its current status and lag. A subscription that shows a large lag or a status of `down` or `initializing` when it should be `replicating` indicates that the node is behind. For example:
```text
 sub_name  |   status    | provider_node |           replication_sets            |    lag
-----------+-------------+---------------+---------------------------------------+-----------
 sub_n2_n1 | down        | n1            | {default,default_insert_only,ddl_sql} | 00:05:12
 sub_n2_n3 | replicating | n3            | {default,default_insert_only,ddl_sql} | 00:00:00
```
In this output, sub_n2_n1 is down and lagging — that confirms n2 did not
receive all of n1's transactions before n1 failed. The subscriptions to n3,
n4, and n5 are still healthy.
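To check every surviving node in one pass, you can loop over them with a Postgres client. A minimal sketch; the hostnames and database name are placeholders for this example:

```shell
# Show subscription status and lag on each surviving node.
# Hostnames and database name are placeholders; adjust for your cluster.
for host in n2.example.com n3.example.com n4.example.com n5.example.com; do
  echo "=== $host ==="
  psql -h "$host" -d mydb -c "SELECT * FROM spock.sub_show_status();"
done
```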
Determine the approximate time when the failed node (or nodes) failed. You'll need this timestamp for the ACE commands in Phase 3 and 4. If you have logs or monitoring, use that; otherwise use the last known good time before the failure.
## Phase 2: Spock Cleanup
Once you've confirmed that the failed node (in our case, n1) is gone and won't be coming back in this recovery, you need to remove it from the cluster so that the remaining nodes no longer expect replication from or to n1.
On each surviving node (n2, n3, n4, n5):
1. **Drop subscriptions that involved the failed node.**
If a subscription was receiving from n1 or sending to n1, drop it using
[`spock.sub_drop()`](../spock_functions/functions/spock_sub_drop.md).
For example, on n2 you might drop the subscription that connected n2 to
n1.
2. **Remove the failed node from the cluster.**
From one of the surviving nodes, call
[`spock.node_drop()`](../spock_functions/functions/spock_node_drop.md)
to remove n1. This removes the node entry from the Spock catalog.
**Warning:** Dropping the node and subscriptions is irreversible; make sure you have identified the failed node correctly and that you do not need to bring n1 back before doing this.
After cleanup, your cluster has four nodes: n2, n3, n4, and n5. The next steps use ACE to fix the data on n2.
If two or more nodes failed (e.g. n1 and n4), you'll perform these steps for each node.
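As a concrete sketch of the cleanup, assuming the subscription names match this document's example output (check `spock.sub_show_status()` for the real names in your cluster, and adjust connection details):

```shell
# On each surviving node, drop the subscription that involved n1.
# The subscription name is an assumption based on the example output above.
psql -h n2.example.com -d mydb -c "SELECT spock.sub_drop('sub_n2_n1');"

# From one surviving node, remove n1 from the Spock catalog.
psql -h n2.example.com -d mydb -c "SELECT spock.node_drop('n1');"
```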
## Phase 3: Identify All of the Missing Data
To recover n2, you need to know which tables have differences and what is missing. ACE's table-diff command compares table data across nodes. When you run it with `--preserve-origin n1` and `--until <timestamp>`, it limits the comparison to rows whose origin ID is n1 and whose commit timestamp is at or before that time: exactly what you need after n1 has failed.
You must run table-diff for every table that is replicated in your
cluster. If you only run it for one table, you might repair that table but
leave others out of sync.
### Step 1: Get a List of All Tables to Check
First, get the list of tables that participate in replication. You can get this from Spock replication sets. Connect to any surviving node (for example, n3) and run:
```sql
SELECT * FROM spock.repset_list_tables('default');
```
If you use multiple replication sets, run this for each set. Alternatively, you can list all tables in the schema you replicate:
```sql
SELECT schemaname, tablename FROM pg_tables WHERE schemaname = 'public'
ORDER BY tablename;
```
Use this list for the next step. In the examples below, we'll use tables
like customers, orders, and products; replace these with your actual
schema and table names.
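If you plan to script the diff step (as shown later in this phase), you can capture this list once and reuse it. A minimal sketch with placeholder connection details:

```shell
# Save the replicated table list (schema.table, one per line) for scripting.
# Host and database names are placeholders; adjust for your cluster.
psql -h n3.example.com -d mydb -At \
  -c "SELECT schemaname || '.' || tablename FROM pg_tables
      WHERE schemaname = 'public' ORDER BY tablename;" > tables.txt
```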
### Step 2: Run table-diff on Every Table
For each table in your list, run ACE table-diff with the failed node as
the origin and your failure timestamp as the cutoff. Replace
2026-02-11T14:30:00Z with the time when n1 failed, and replace
mycluster with your ACE cluster name.
For a single table (for example, public.customers):
```shell
./ace table-diff \
  --nodes n2,n3,n4,n5 \
  --preserve-origin n1 \
  --until 2026-02-11T14:30:00Z \
  --output json \
  mycluster public.customers
```
What these options do:
- `--nodes n2,n3,n4,n5` – Compare only the surviving nodes (n2 and the nodes that have full data).
- `--preserve-origin n1` – Only consider rows whose origin ID is n1, so you don't mix in later local writes on the survivors.
- `--until 2026-02-11T14:30:00Z` – Only consider rows committed at or before this time (use RFC3339 format).
- `--output json` – Write a diff report to a JSON file that you'll use for repair.
ACE returns a diff file with a name that contains the schema and table name;
for example, public_customers_diffs-20260211143000.json. Repeat the diff
command for every other table; for example:
```shell
./ace table-diff \
  --nodes n2,n3,n4,n5 \
  --preserve-origin n1 \
  --until 2026-02-11T14:30:00Z \
  --output json \
  mycluster public.orders

./ace table-diff \
  --nodes n2,n3,n4,n5 \
  --preserve-origin n1 \
  --until 2026-02-11T14:30:00Z \
  --output json \
  mycluster public.products
```
### Step 3: Review the Diff Reports
After running table-diff on all tables, check the output. ACE will report whether each table has differences. For tables that do have differences, note the name of the diff file (it will include the table name and a timestamp). You'll use those files when you perform the repair steps. Tables that have no differences don't need repair.
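Because each report name embeds the schema, table, and a timestamp, a shell glob is a quick way to collect the files you'll need in Phase 4. This assumes ACE wrote the reports to your current working directory:

```shell
# List the diff reports produced in this phase, oldest first.
ls -1tr *_diffs-*.json
```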
If you have a multiple-node failure (e.g. n1 and n4 both failed), run table-diff once per failed origin per table. For example, if your survivors are n2, n3, and n5, then for each table you run two diffs (one per failed origin) and get two diff files:
```shell
# Rows that originated from n1
./ace table-diff --nodes n2,n3,n5 --preserve-origin n1 \
  --until 2026-02-11T14:30:00Z --output json mycluster public.customers

# Rows that originated from n4 (use n4's failure time if different)
./ace table-diff --nodes n2,n3,n5 --preserve-origin n4 \
  --until 2026-02-11T14:35:00Z --output json mycluster public.customers
```
Review the diff reports; then in Phase 4, run table-repair for each of these diff files using the same source of truth (e.g. n3).
If you have multiple tables, you can script the diff step. The following example loops through a list of table names (customize the list to match your schema):
```shell
FAILURE_UNTIL="2026-02-11T14:30:00Z"
CLUSTER="mycluster"
NODES="n2,n3,n4,n5"

for table in customers orders products invoices; do
  echo "Checking table: public.$table"
  ./ace table-diff \
    --nodes "$NODES" \
    --preserve-origin n1 \
    --until "$FAILURE_UNTIL" \
    --output json \
    "$CLUSTER" public."$table"
done
```
## Phase 4: Repair All Affected Tables
For each table that had differences in Phase 3, run ACE table-repair
using the diff file that was produced. You must run repair in recovery
mode and include --preserve-origin so that repaired rows keep their
original origin ID and commit timestamp. Without
--preserve-origin, repaired rows would get n2's origin ID and a new
commit time, which can cause replication conflicts and incorrect ordering.
With --preserve-origin, ACE writes each repaired row with the same origin
ID and commit timestamp it had on the source of truth, so n2's replication
metadata stays correct.
Choose one node as the source of truth for repair. In our scenario, n3 (and
n4 and n5) have the complete data, so we chose n3. You can let ACE
auto-select the source of truth based on which survivor has the highest
origin LSN for n1, or you can set it explicitly with --source-of-truth n3.
### Repair Command for Each Table
For each table that had differences, run the following command, replacing
the diff filename and table name with the ones from your table-diff run:
```shell
./ace table-repair \
  --diff-file=public_customers_diffs-20260211143000.json \
  --nodes n2,n3,n4,n5 \
  --recovery-mode \
  --source-of-truth n3 \
  --preserve-origin \
  mycluster public.customers
```
- `--diff-file` – The JSON file produced by table-diff for this table.
- `--nodes n2,n3,n4,n5` – The surviving nodes (n2 will be repaired; n3 is the source of truth).
- `--recovery-mode` – Required when the diff was created with `--preserve-origin`; tells ACE this is a catastrophic-failure recovery.
- `--source-of-truth n3` – Use n3's copy of the data to repair n2. You can omit this and let ACE choose the survivor with the highest LSN for n1, but if there are ties or missing LSNs, you must specify the source.
- `--preserve-origin` – Preserves the origin ID and commit timestamp for each repaired row (to microsecond precision). Repaired rows on n2 will show the same origin ID and commit time as on the source of truth, so replication metadata stays correct.
Repeat for every table that had differences, for example:
```shell
./ace table-repair \
  --diff-file=public_orders_diffs-20260211143000.json \
  --nodes n2,n3,n4,n5 \
  --recovery-mode \
  --source-of-truth n3 \
  --preserve-origin \
  mycluster public.orders

./ace table-repair \
  --diff-file=public_products_diffs-20260211143000.json \
  --nodes n2,n3,n4,n5 \
  --recovery-mode \
  --source-of-truth n3 \
  --preserve-origin \
  mycluster public.products
```
Multiple node failure (e.g. n1 and n4 both failed): You have one diff
file per (table, failed origin). Run table-repair for each of those
diff files, using the same source of truth (n3) and --preserve-origin
every time. ACE writes a new filename for each table-diff run; use the file
from your n1 diff run for the first repair and the file from your n4 diff
run for the second. Example for one table:
```shell
# Repair n2 for rows from n1
./ace table-repair \
  --diff-file=public_customers_diffs-20260211143000.json \
  --nodes n2,n3,n5 --recovery-mode --source-of-truth n3 \
  --preserve-origin mycluster public.customers

# Repair n2 for rows from n4 (different diff file from --preserve-origin n4)
./ace table-repair \
  --diff-file=public_customers_diffs-20260211143500.json \
  --nodes n2,n3,n5 --recovery-mode --source-of-truth n3 \
  --preserve-origin mycluster public.customers
```
In both commands the source of truth is n3; origin ID and commit timestamp are preserved for every repaired row.
**Warning:** Always use `--preserve-origin` when repairing after a catastrophic node failure. This option ensures that the origin ID and commit timestamp are preserved for every repaired row. If you omit it, repaired rows will have n2's origin ID and a new timestamp, which can cause replication conflicts and incorrect conflict resolution.
For more details on recovery mode and how origin ID and timestamp preservation works, see the ACE repository and the ACE docs: Using ACE for CNF recovery and the table-repair command reference.
## Phase 5: Validate That All Tables Match
After repairing every affected table, verify that n2 now matches n3, n4,
and n5. Re-run table-diff for each repaired table without
--preserve-origin and --until. That compares the full table content
across the survivors:
```shell
./ace table-diff --nodes n2,n3,n4,n5 --output json \
  mycluster public.customers

./ace table-diff --nodes n2,n3,n4,n5 --output json \
  mycluster public.orders

./ace table-diff --nodes n2,n3,n4,n5 --output json \
  mycluster public.products
```
If ACE reports that the tables match, n2 has been successfully recovered.
At this point you can resume normal operations: re-enable or recreate any
subscriptions that were dropped during cleanup (for example, subscriptions
from n2 to n3, n4, and n5 so that n2 receives new writes), and allow
application traffic to flow again. If writes were occurring on other nodes
during the repair, a final table-diff run (without --preserve-origin or
--until) will confirm that n2 has caught up completely.
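If you scripted the diff step in Phase 3, the same loop works for validation once you drop the origin and cutoff options (the table names below are this document's examples):

```shell
# Re-check full table content across the survivors after repair.
# Table names are the examples used in this document; substitute your own.
for table in customers orders products invoices; do
  echo "Validating table: public.$table"
  ./ace table-diff --nodes n2,n3,n4,n5 --output json \
    mycluster public."$table"
done
```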
If any table still shows differences, review the diff report and consider re-running repair for that table or checking for ongoing writes during the diff. See the Troubleshooting section for common causes and remedies.
```mermaid
flowchart LR
    subgraph before [Before Recovery]
        N2_before[n2 missing data]
        N3_before[n3 complete]
        N4_before[n4 complete]
        N5_before[n5 complete]
    end
    subgraph after [After Recovery]
        N2_after[n2 repaired]
        N3_after[n3 complete]
        N4_after[n4 complete]
        N5_after[n5 complete]
    end
    before --> after
```
## Why Preserve Origin ID and Timestamp?
The repairs that ACE writes to n2 originated on n1. Each row has an origin ID (that identifies which node wrote it) and a commit timestamp (that identifies when it was committed). Spock uses these for conflict resolution and to keep replication consistent.
If you repair without --preserve-origin, the repaired rows get n2's
origin ID and a new commit timestamp. That can cause:
- Replication conflicts – If n1 ever comes back or you add another node with n1's history, the cluster may see the same row with different origin IDs and timestamps and report conflicts.
- Incorrect ordering – Spock uses commit timestamps and origin ID for conflict resolution (for example, last-write-wins). Wrong timestamps or origin IDs can lead to wrong outcomes.
- Lost data lineage – You lose an accurate record of which node produced which data.
If you repair with --preserve-origin, ACE keeps the original origin
ID and commit timestamp (to microsecond precision) for each repaired row.
The repaired rows on n2 then match the replication metadata they had on the
source of truth, so conflict resolution and data lineage stay correct.
For catastrophic node failure recovery, always use --preserve-origin when
you run table-repair.
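With track_commit_timestamp enabled, you can spot-check the preserved timestamps yourself using standard PostgreSQL functions. A minimal sketch; the table, key value, and connection details are placeholders:

```shell
# Compare a row's commit timestamp on the repaired node and the source of
# truth; after a repair with --preserve-origin they should match.
QUERY="SELECT id, pg_xact_commit_timestamp(xmin) AS commit_ts
       FROM public.customers WHERE id = 42;"
psql -h n2.example.com -d mydb -c "$QUERY"   # repaired node (n2)
psql -h n3.example.com -d mydb -c "$QUERY"   # source of truth (n3)
```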
## Important Considerations When Using ACE
Review the following considerations before performing a recovery with ACE.
### Multi-Table Recovery
When performing a recovery between nodes with multiple tables:
- Check every table – Run table-diff on all replicated tables, not just one. Only then will you see the full picture of what is missing on n2.
- Prioritize if needed – If you have many tables, you can repair critical tables first, then the rest. Just ensure you eventually run diff and repair for every affected table.
- Same cutoff for all – Use the same `--until` timestamp for every table so that you're consistently fencing at the same failure time.
- Same source of truth – Using one node (e.g., n3) as the source of truth for all tables keeps the process simple. If you have a reason to use different sources per table, you can, but document it so you don't get confused later.
### Large Tables
For very large tables, ACE supports options such as `--table-filter` to restrict the comparison to a subset of rows. This allows you to run diff and repair in chunks. See the ACE table-diff command documentation for details.
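For example, you might diff a large table one primary-key range at a time. The filter expression below is illustrative; check the table-diff command reference for the exact syntax `--table-filter` expects:

```shell
# Diff one slice of a large table (the filter expression is illustrative).
./ace table-diff \
  --nodes n2,n3,n4,n5 \
  --preserve-origin n1 \
  --until 2026-02-11T14:30:00Z \
  --table-filter "id BETWEEN 1 AND 1000000" \
  --output json \
  mycluster public.orders
```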
### When to Specify `--source-of-truth`
If you omit --source-of-truth, ACE will try to pick the survivor with the
highest origin LSN for n1. If LSN information is missing on the survivors
or there's a tie, ACE will ask you to specify the source. In that case,
choose any node that you know has the most complete data set (in our
example, n3, n4, or n5).
### Catastrophic Failure vs. Network Partition
A catastrophic node failure means the node is gone (crashed, unrecoverable). A network partition might mean the node is still running but temporarily unreachable. If you're only dealing with a partition, you might prefer to wait for connectivity to return or to fail over in a planned way rather than dropping the node and doing ACE recovery. Use this procedure when you've decided that the node is permanently lost for this cluster.
## Troubleshooting
### ACE Says LSN Information is Missing
Specify --source-of-truth explicitly (e.g. --source-of-truth n3) so
ACE doesn't need to probe LSNs.
### More than One Node is Behind
Include all potentially lagging nodes in `--nodes` when you run the diff and repair commands. The source of truth should be a node that has full data; repair will fix the others.
### Auto Source-of-truth Selection Fails or Ties
Provide --source-of-truth <node_name> with the name of a node you know has
the complete data.
### Origin Metadata Missing for Some Rows
ACE may log a warning and repair those rows without preserving origin ID and
timestamp. Ensure track_commit_timestamp = on and that the diff was run
with --preserve-origin; check the diff file and ACE logs.
### Tables Still Differ After Repair
Re-run table-diff without --preserve-origin to see the current state. If
writes occurred during repair, run diff again and repair if needed. For large
tables, consider performing a chunked repair with --table-filter.
### Source Node Fails Mid-replay
If the node you chose as your source of truth (e.g., n3) fails or becomes unavailable while ACE is still running repairs, stop the repair immediately and switch to a different fully-synchronized survivor before continuing.
- Confirm which remaining nodes still have complete data (re-run `spock.sub_show_status()` on each one and check for lag).
- Pick a new source of truth (e.g., n4 or n5) and re-run `table-repair` with `--source-of-truth n4` for any tables that were not yet fully repaired, as sketched below.
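A minimal sketch of the re-run, switching the source of truth to n4; whether an existing diff file remains usable after switching sources is an assumption here, so re-run table-diff first if in doubt:

```shell
# Finish repairing from the new source of truth (n4) after n3 failed.
# Reusing the earlier diff file is an assumption; re-run table-diff if unsure.
./ace table-repair \
  --diff-file=public_orders_diffs-20260211143000.json \
  --nodes n2,n4,n5 \
  --recovery-mode \
  --source-of-truth n4 \
  --preserve-origin \
  mycluster public.orders
```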
**Warning:** If your cluster has only three nodes and the source-of-truth node also fails mid-replay, you may have no fully-synchronized survivor left. In that situation you cannot guarantee a complete, conflict-free repair and some data loss is possible. Minimize this risk by keeping at least one additional warm standby or by pausing application writes before starting the recovery.
### Recovery Fails Entirely (non-Spock Failure)
If the recovery cannot be completed — for example, because the lagging node itself has suffered hardware or OS failure during the repair process — you may need to rebuild it from scratch rather than repair it in place. In that case, remove the damaged node from the cluster and add a fresh replacement using the ZODAN node-management procedures:
- Remove the damaged node – Use `spock.node_drop()` (and drop related subscriptions) on the surviving nodes to clean up the cluster registry.
- Add a replacement node – Use ZODAN's `add_node` procedure to bring a new node into the cluster. ZODAN handles slot creation, initial sync, and subscription wiring automatically. See the ZODAN documentation and the ZODAN tutorial for step-by-step instructions.
## Post-Recovery Validation Checklist
After completing all repair steps, work through the following checklist before returning the cluster to normal production traffic.
### Replication Status
- [ ] Run `SELECT * FROM spock.sub_show_status();` on every node and confirm all subscriptions show `replicating` with zero (or near-zero) lag.
- [ ] Confirm that subscriptions dropped during Phase 2 cleanup have been re-created and are active.
- [ ] Verify that the recovered node (n2) appears in `SELECT * FROM spock.node;` on all surviving nodes and that its DSN is correct.
### Replication Slot Cleanup
- [ ] Run `SELECT slot_name, active, restart_lsn FROM pg_replication_slots;` on each node and confirm there are no stale or inactive slots left over from the failed node or from the recovery process (a sketch follows this list).
- [ ] If any orphaned slots remain (for example, a slot for n1 that was never dropped), remove them with `SELECT pg_drop_replication_slot('<slot_name>');` on the node that owns the slot.
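A minimal sketch of this slot check with a Postgres client; the hostnames, database name, and slot name are placeholders:

```shell
# List replication slots on a node and look for inactive leftovers.
psql -h n2.example.com -d mydb \
  -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Drop an orphaned slot by name (the slot name here is a placeholder).
psql -h n2.example.com -d mydb \
  -c "SELECT pg_drop_replication_slot('<slot_name>');"
```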
### Data Integrity Check
- [ ] Re-run `table-diff` on all repaired tables without `--preserve-origin` or `--until` to confirm the full table content matches across all nodes:

  ```shell
  ./ace table-diff --nodes n2,n3,n4,n5 --output json \
    mycluster public.<table_name>
  ```

- [ ] Confirm ACE reports no differences for every table.
- [ ] Spot-check row counts on the recovered node against a known-good survivor for critical tables:

  ```sql
  SELECT COUNT(*) FROM public.customers;
  ```

- [ ] If `track_commit_timestamp = on`, verify that origin IDs and commit timestamps on repaired rows match the source of truth (use `pg_xact_commit_timestamp()` and `spock.xact_commit_timestamp_origin()`).
### Final Sign-off
- [ ] Monitor replication lag for at least 15–30 minutes after re-enabling application traffic to confirm the recovered node is keeping up.
- [ ] Review PostgreSQL logs on the recovered node for any replication errors or warnings.
- [ ] Document the recovery: nodes affected, failure time used for
--until, source of truth used, and any tables that required re-repair.
## Summary
In summary, the basic steps required to recover a lagging or damaged node vary only based on how many nodes you need to repair or rebuild:
In the event of a single node failure:
1. **Assess the damage** – Identify which nodes are behind and determine
when the failure occurred.
2. **Clean up Spock** – Drop subscriptions to the failed node and remove it
from the cluster.
3. **Identify missing data** – Run ACE table-diff on all replicated tables
   with `--preserve-origin <node_name>` and `--until <failure_time>`.
4. **Repair affected tables** – Run ACE table-repair with the following
command options:
- `--recovery-mode`
- `--source-of-truth n3`
- `--preserve-origin` (critical for maintaining replication metadata)
5. **Validate the recovery** – Re-run table-diff to confirm all tables match
across surviving nodes.
The process is similar for a multi-node failure with these key differences:
1. **Assess the damage** – Identify which nodes are behind and when each
failure occurred.
2. **Clean up Spock** – Drop subscriptions to all failed nodes and remove
all of the failed nodes from the cluster.
3. **Identify missing data** – Run table-diff once for each failed origin,
for each table:
- `--preserve-origin n1` with n1's failure time
- `--preserve-origin n4` with n4's failure time
4. **Repair affected tables** – Run table-repair for **each** diff file
produced:
- Use the same `--source-of-truth` (e.g., n3) for all repairs
- Always include `--preserve-origin`
5. **Validate recovery** – Re-run table-diff to confirm all tables match
**Critical Reminder:** In both scenarios, always use `--preserve-origin` to ensure that the origin ID and commit timestamp are preserved for every repaired row. Check all replicated tables to ensure the lagging node is fully recovered.
## See also
- ACE (Active Consistency Engine) on GitHub – build, install, and configure ACE; command reference and architecture docs are in the repository.
- ACE documentation – configuration, CNF recovery, and command reference (e.g. `table-diff`, `table-repair`).