Managing Exceptions during Replication
The processes documented on this page apply to users of Spock extension version 4.0 and higher. You can use the pgEdge Platform UM module to upgrade your installed version of the Spock extension.
Spock extension version 4.0 (and later) includes functionality that allows you to write some exceptions to the exception log table. This functionality can prevent interruptions in replication, and allow you to gracefully handle replication errors at a time that best suits your schedule and the server traffic. The exception table also provides visibility into the history of replication errors, and provides enough information to later remediate the errors with tools like pgEdge ACE.
pgEdge Platform clusters running Spock extension 4.0 and later also include a mechanism that allows you to make changes on one node, while not replicating those changes to other nodes (a repair mode). You can use repair mode when fixing data anomalies on a single node, to ensure the modifications don’t propagate to other nodes in the cluster.
Protective Pause Mode If you do not configure your cluster to log errors to the exception log table and the cluster encounters an error in replication, the Spock extension will enter a protective pause mode. During a protective pause, replication stops and the extension continues to retry the failing operation until the issue is resolved. No other transactions are replicated to standby nodes while the conflict remains.
One example of a replication error that can induce a protective pause occurs if you attempt to UPDATE a row that exists only on one node with a unique primary key. The Spock extension will attempt to update the row on other nodes until it is successful, not replicating any operations that come after the update until the conflict is resolved. When the UPDATE is successful, and the protective pause is lifted, the Spock extension will be self-healing (whether the conflict is resolved automatically or manually with repair mode), and replication will continue.
Enabling Exception Logging
You can customize exception logging behavior related to a failed INSERT
, UPDATE
, or DELETE
statement for your pgEdge Platform cluster by setting the GUC values that follow. These GUCs allow you to specify behaviors on a granular level, applying the actions you choose for operations within a transaction, rather than to an entire transaction. Note that changes to these GUCs require a system restart to apply.
To set the values and reload the server, use the pgEdge CLI db guc-set
command:
./pgedge db guc-set guc_name guc_value
For example, to set spock.exception_behaviour
to discard
, navigate into the pgedge
directory and invoke the command:
./pgedge db guc-set spock.exception_behaviour discard
The default values for this GUC is transdiscard
; by default, if an error occurs, all operations within a transaction are rolled back if one operation returns an ERROR. If spock.exception_logging is set to all
(the default), the server will record the transaction in the exception log.
Exception logging behavior is controlled by the following GUCs:
spock.exception_logging
Use this GUC to specify which operations/transactions are written to the exception log table:
all
(the default) - Setspock.exception_logging
toall
to instruct the server to record all transactions that contain one or more failed operations in thespock.exception_log
table. Note that this setting will log all operations that are part of a transaction if one operation fails.discard
- Add a row to thespock.exception_log
table for any discarded operation; successful transactions are not logged.none
- Instructs the server to not log any operation or transactions to the exception log table.
spock.exception_behaviour
Use this GUC to specify the commit behavior of pgEdge platform when it encounters an ERROR within a transaction:
transdiscard
(the default) - Setspock.exception_behaviour
totransdiscard
to instruct the server to discard a transaction if any operation within that transaction returns anERROR
.discard
- Setspock.exception_behaviour
todiscard
to continue processing (and replicating) without interrupting server use. The server will commit any operation (any portion of a transaction) that does not return anERROR
statement.sub_disable
- Setspock.exception_behaviour
tosub_disable
to disable the subscription for the node on which the exception was reported. Transactions for the disabled node will be added to a queue that is written to the WAL log file; when the subscription is enabled, replication will resume with the transaction that caused the exception, followed by the other queued transactions. You can use thespock.alter_sub_skip_lsn
function to skip the transaction that caused the exception and resume processing with the next transaction in the queue.
Note that the value you choose for spock.exception_behaviour
could potentially result in a large WAL log if transactions are allowed to accumulate.
The Table Structure
You can use information stored in the exception log table to identify operations or transactions that need to be repaired or rolled back. The exception log table contains the following information:
Column | Type |
---|---|
remote_origin | oid NOT NULL |
remote_commit_ts | timestamptz NOT NULL |
command_counter | integer NOT NULL |
remote_xid bigint | NOT NULL |
local_origin | oid |
local_commit_ts | timestamptz |
table_schema | text |
table_name | text |
operation | text |
local_tup | jsonb |
remote_old_tup | jsonb |
remote_new_tup | jsonb |
ddl_statement | text |
ddl_user | text |
error_message | text NOT NULL |
retry_errored_at | timestamptz NOT NULL |
PRIMARY KEY | Compound: remote_origin, remote_commit_ts, command_counter |
Using spock.repair_mode() when Repairing Transactions
Spock extension versions 4.0 (and later) include a function that you can control at the PSQL command line that places your pgEdge Platform cluster into repair mode while you modify data. The function is useful when fixing errors recorded in the exception log table, but can also be used in other situations where you need to interrupt replication.
Before you start repairing a transaction, set spock.repair_mode
to true
to interrupt replication to other nodes:
SELECT spock.repair_mode(true);
When you COMMIT
or ROLLBACK
a transaction, spock.repair_mode
is returned to false
automatically.
In the following example, the row that contains (3
,Smith
,sales
) already exists on n2
, but does not exist on n1
. To remediate the issue, you can perform the following steps on n1
to add this row without it replicating to n2
:
demo=# BEGIN;
BEGIN
demo=*# SELECT spock.repair_mode('True');
-[ RECORD 1 ]----------
repair_mode | 0/1C5AFD0
demo=*# INSERT INTO employeeid VALUES (`3`,`Smith`,`sales`);
INSERT 0 1
demo=*# COMMIT;
Repairing Transactions when spock.exception_behaviour
is in sub_disable
Mode
If you use spock.exception_behaviour
in sub_disable
mode, a failed operation will disable the subscription for the node on which the exception was reported until you manually intervene. The transaction that caused the exception is the first transaction added to a queue that is then written to the WAL log file until that node is brought back into the subscription. To repair a node with sub_disable
on, use the following steps:
Based on the entries in the exception log, repair the disabled node. After repairing the node, you can:
- Perform the first transaction in the WAL log manually, and use
spock.alter_sub_skip_lsn
to skip the first transaction in the WAL log when you resume processing. - Resume transaction processing with the transaction that caused the node to become disabled, followed by the rest of the queue from the WAL log file.
If you opt to manually perform the transaction that took the node out of the cluster, you will need to skip that transaction in the WAL log before you enable the subscription. To skip the first transaction in the WAL log, use the spock.alter_sub_skip_lsn
function, specifying the transaction LSN from the exception log. To skip the transaction, connect with a Postgres client (like psql) and use the command:
SELECT spock.alter_sub_skip_lsn(LSN_of_transaction);
Then, exit the client application and return the node to the cluster with the command:
pgedge spock sub-enable subscription_name db_name
Note that if you return the node to a cluster without repairing the cause of the initial failure, replaying the transaction that caused the initial failure will cause the failure to recur.
Logged Exception Types
The following examples demonstrate the types of replication errors that might be logged to the exception table. This is not a comprehensive list of logged errors.
Missing Primary Key Exception
In the following example:
- Node
n1
has two records in our sample table (foo
) with primary key values of1
and2
. - Node
n2
has three records in our sample table (foo
) with primary key values of1
,2
, and3
.
Both tables are added to replication, and then the row with a primary key value of 3
is updated on n2
. This row is successfully updated on n2
, but when replicated to n1
a row with a primary key value of 3
is not found.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log
table:
demo=# SELECT * FROM spock.exception_log;
-[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------
remote_origin | 26863
remote_commit_ts | 2024-07-02 19:10:04.317346+00
command_counter | 1
remote_xid | 756
local_origin |
local_commit_ts |
table_schema | public
table_name | foo
operation | UPDATE
local_tup |
remote_old_tup |
remote_new_tup | [{"value": 3, "attname": "a", "atttype": "int4"}, {"value": 2, "attname": "b", "atttype": "int4"}]
ddl_statement |
ddl_user |
error_message | logical replication did not find row to be updated in replication target relation (public.foo)
retry_errored_at | 2024-07-02 19:10:05.058789+00
Duplicate Primary Key Exception
In the following example:
- Node
n1
has one record in our sample table (foo
) with a primary key value of1
. - Node
n2
has two records in our sample table (foo
) with primary key values of1
and2
.
After adding both tables to the replication set, we update the foo
table on n1
, adding a row with a primary key value of 2
. The row is successfully inserted on n1
, but when replicated to n2
, the addition results in a duplicate primary key error.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log
table:
remote_origin | 49708
remote_commit_ts | 2024-07-02 19:07:09.602248+00
command_counter | 1
remote_xid | 747
local_origin |
local_commit_ts |
table_schema | public
table_name | foo
operation | INSERT
local_tup |
remote_old_tup |
remote_new_tup | [{"value": 2, "attname": "a", "atttype": "int4"}, {"value": 3, "attname"
: "b", "atttype": "int4"}]
ddl_statement |
ddl_user |
error_message | duplicate key value violates unique constraint "foo_pkey"
retry_errored_at | 2024-07-02 19:07:10.298847+00
DDL Exception
In the following example:
- Node
n1
has one row in our sample table (foo
) with columnsa
,b
, andc
. - Node
n2
has one row in our sample table (foo
) with columnsa
andb
.
If we then enable Auto DDL (to replicate DDL statements through the default
replication set) and manually perform an update statement to drop column c
from n1
, the DDL statement is replicated to n2
. Since the table (foo
) on n2
does not have this column, the update statement returns an ERROR
: “column "c" of relation "foo" does not exist”
.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log
table:
demo=# select * from spock.exception_log;
-[ RECORD 1 ]----+-------------------------------------------------------------------------
remote_origin | 49708
remote_commit_ts | 2024-07-02 17:01:32.837093+00
command_counter | 1
remote_xid | 744
local_origin |
local_commit_ts |
table_schema | spock
table_name | queue
operation | INSERT
local_tup |
remote_old_tup |
remote_new_tup | [{"value": "2024-07-02T17:01:32.83675+00:00", "attname": "queued_at", "a
tttype": "timestamptz"}, {"value": "pgedge", "attname": "role", "atttype": "name"}, {"value
": ["default_insert_only"], "attname": "replication_sets", "atttype": "_text"}, {"value": "
D", "attname": "message_type", "atttype": "char"}, {"value": "SET search_path TO \"$user\",
public; alter table foo drop column c", "attname": "message", "atttype": "json"}]
ddl_statement |
ddl_user |
error_message |
retry_errored_at | 2024-07-02 17:01:33.609419+00
-[ RECORD 2 ]----+-------------------------------------------------------------------------
remote_commit_ts | 2024-07-02 17:01:32.837093+00
command_counter | 2
remote_xid | 744
local_origin |
local_commit_ts |
table_schema |
table_name |
operation | DDL
local_tup |
remote_old_tup |
remote_new_tup |
ddl_statement | SET search_path TO "$user", public; alter table foo drop column c
ddl_user | pgedge
error_message | column "c" of relation "foo" does not exist
retry_errored_at | 2024-07-02 17:01:33.610043+00
Exception Log Table Limitations
The following cases are not currently captured by the exception log table:
- A mismatched number of columns: If
n1
has 3 columns in a table, whilen2
has 2 columns in the same table, anINSERT
into the table onn1
will cause uncaptured exception onn2
. - Data anomalies resulting from columns with different data types.