Configuring Advanced Features
ACE (the Anti-Chaos Engine)

In an eventually consistent system such as pgEdge, nodes can go out of sync (diverge) due to replication lag, network partitions, or node failures. Database differences can also arise from planned or unplanned downtime or unexpected cloud provider failures. The Anti-Chaos Engine (ACE) reconciles such differences across databases by performing efficient table comparisons and reporting any differences it finds in output formats that are both machine and human friendly. You can also use ACE to repair the discrepancies by specifying the node that is considered your source of truth and instructing ACE to update the differences between the nodes.

A cluster using ACE

Use Cases

Planned maintenance: If you need to take a database node down for administrative purposes, its state may fall behind the state of other nodes. When you bring the node online again, you can use ACE to compare node states and bring the node up to speed.

Node failures: Hardware or network failure can cause unplanned node downtime; use ACE to repair the node's state and return it to the cluster.

Network Partition: In a typical deployment of pgEdge, a database cluster is often spread across multiple regions and data centers. Due to the volatility that comes with a large-scale deployment, network partitions are common. You can use ACE to return a cluster to a consistent state if the state is disrupted due to partition-related irregularities.

Installing ACE

ACE is installed by the pgEdge Platform installer. The following steps describe installing ACE on a management system (one that is not a member of a replication cluster):

  1. Navigate to the directory where ACE will be installed.

  2. Invoke the pgEdge installer in that location with the command:

python3 -c "$(curl -fsSL https://pgedge-download.s3.amazonaws.com/REPO/install.py)"

  3. Create a directory named cluster in the pgedge directory created by the pgEdge installer.

  4. Place a .json file named cluster_name.json in cluster/cluster_name/cluster_name.json on the ACE host. For example, if your cluster name is us_eu_backend, the cluster definition file should be placed in /pgedge/cluster/us_eu_backend/us_eu_backend.json. The .json file contains connection information for each node in your cluster; if you used a .json file to deploy your cluster and already have the file in place, you can skip this step.

After updating the .json file to describe your cluster, and moving it into place, you're ready to invoke ACE commands.
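The placement steps above can be sketched as shell commands. This is only an illustration: the cluster name us_eu_backend is reused from the example above, and the one-line file content is a placeholder for your real cluster definition.

```shell
# Create the cluster definition directory that ACE expects:
# cluster/<cluster_name>/ inside the pgedge directory.
mkdir -p pgedge/cluster/us_eu_backend

# Place the cluster definition file. In practice, copy in the .json file
# that describes your cluster; the content below is only a placeholder.
cat > pgedge/cluster/us_eu_backend/us_eu_backend.json <<'EOF'
{ "name": "us_eu_backend" }
EOF

ls pgedge/cluster/us_eu_backend
```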

Creating a .json File

You can use the pgedge cluster json-template command to create a template for the .json file required by ACE. For example, after creating the template for a three-node cluster (in our example, named demo) and updating the file contents, the resulting demo.json file will look similar to this:

{
  "name": "demo",
  "style": "remote",
  "create_date": "2024-03-07T16:14:30.864336",
  "remote": {
    "os_user": "susan",
    "ssh_key": "~/.ssh/smd_key_pair.pem"
  },
  "database": {
    "databases": [
      {
        "username": "lcusr",
        "password": "1safepassword",
        "name": "lcdb"
      }
    ],
    "pg_version": 16
  },
  "node_groups": {
    "remote": [
      {
        "region": "",
        "availability_zones": "",
        "instance_type": "",
        "nodes": [
          {
            "name": "n1",
            "is_active": true,
            "ip_address": "18.216.182.44",
            "ip_address_private": "10.1.0.44",
            "port": 6432,
            "path": "~/pgedge/demo"
          }
        ]
      },
      {
        "region": "",
        "availability_zones": "",
        "instance_type": "",
        "nodes": [
          {
            "name": "n2",
            "is_active": true,
            "ip_address": "54.241.206.173",
            "ip_address_private": "10.2.0.102",
            "port": 6432,
            "path": "~/pgedge/demo"
          }
        ]
      },
      {
        "region": "",
        "availability_zones": "",
        "instance_type": "",
 
        "nodes": [
          {
            "name": "n3",
            "is_active": true,
            "ip_address": "34.212.52.56",
            "ip_address_private": "10.3.0.46",
            "port": 6432,
            "path": "~/pgedge/demo"
          }
        ]
      }
    ]
  }
}

Where:

  • name is a string value that specifies the name of the cluster.

The next .json block starts with the keyword style and provides the connection properties for the host:

  • style specifies whether the cluster resides on a remote host or on localhost.
  • os_user is the name of a non-root operating system user on the host.
  • ssh_key is the complete path and name of the SSH key file used to connect to the host.

The databases block contains one set of authentication properties for each database in the cluster:

  • pg_version is the PostgreSQL version running on the database.
  • username is the name of the database superuser. ACE must be able to perform SELECT, UPDATE, DELETE, and INSERT operations as the specified user.
  • password is the password associated with the specified user.
  • name is the name of the database; note that each database name in a given .json file must be unique.

Each nodes clause contains the connection properties for a single node. Each set of properties contains:

  • name is the name of the node.
  • is_active specifies whether the node is active (true) or inactive (false).
  • ip_address is the IP address of the node.
  • ip_address_private is the (optional) private IP address of the node.
  • port is the port on which PostgreSQL listens for connections to the server.
  • path is the path to the directory in which pgedge resides.
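Before invoking ACE, it can help to confirm that the file parses as valid JSON and contains the top-level keys described above. The check below is a sketch, not part of ACE; it creates a minimal sample definition so the commands are self-contained (in practice, point FILE at your real cluster/<cluster_name>/<cluster_name>.json), and it uses python3, which the pgEdge installer already requires.

```shell
# Create a minimal sample definition for illustration only.
mkdir -p cluster/demo
cat > cluster/demo/demo.json <<'EOF'
{
  "name": "demo",
  "style": "remote",
  "database": { "databases": [], "pg_version": 16 },
  "node_groups": { "remote": [] }
}
EOF

FILE=cluster/demo/demo.json

# Fail fast if the file is not valid JSON.
python3 -m json.tool "$FILE" > /dev/null && echo "valid JSON"

# Confirm the top-level keys described above are present.
python3 - "$FILE" <<'PY'
import json, sys
cfg = json.load(open(sys.argv[1]))
for key in ("name", "style", "database", "node_groups"):
    assert key in cfg, f"missing key: {key}"
print("keys ok")
PY
```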

ACE Commands

The following commands are currently supported by the ACE module:

ACE table-diff

Use the table-diff command to compare the tables in a cluster and produce a csv or json report showing any differences. The syntax is:

$ ./pgedge ace table-diff cluster_name schema.table_name --block_rows=sample_size --max_cpu_ratio=%_cpus --output=json|csv

  • cluster_name is the name of the pgEdge cluster in which the table resides.
  • schema.table_name is the schema-qualified name of the table that you are comparing across cluster nodes.
  • --max_cpu_ratio specifies the fraction of available CPUs that ACE may use. A value of 1 instructs the server to use all available CPUs, while .5 means use half of the available CPUs. The default is .6.
  • --block_rows specifies the number of tuples to be used at a time during table comparisons. If block_rows is set to 1000, then a thousand tuples are compared per job. If the table contains n rows in total, then n/block_rows jobs are scheduled among max_cpu_ratio * cpus * 2 worker processes.
  • --output specifies the output format of the difference report; choose from json or csv. By default, the report is written in json to diffs/<timestamp>/diff.json.
  • --nodes specifies a comma-delimited list of nodes on which the command will be executed.
  • --diff_file specifies a file that contains mismatched blocks identified during a table comparison. You can specify the file name when re-invoking the table-diff command to recheck a table for mismatches. When you use a --diff_file in conjunction with the table-diff command, the entire block identified in the file is checked; in contrast, the table-rerun command targets a specific row within a block, and so the recheck is performed on fewer rows.
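To get a feel for how --block_rows and --max_cpu_ratio interact, the scheduling arithmetic above can be checked directly. The row count and CPU count below are assumptions for illustration, not values reported by ACE:

```shell
# Hypothetical sizing example: estimate how many comparison jobs and
# worker processes a table-diff run would use.
TOTAL_ROWS=1000000        # assumed table size (n)
BLOCK_ROWS=10000          # --block_rows
MAX_CPU_RATIO=0.6         # --max_cpu_ratio (the default)
CPUS=8                    # assumed CPU count on the ACE host

# n / block_rows jobs, scheduled among max_cpu_ratio * cpus * 2 processes.
JOBS=$(( TOTAL_ROWS / BLOCK_ROWS ))
PROCS=$(awk -v r="$MAX_CPU_RATIO" -v c="$CPUS" 'BEGIN { printf "%d", r * c * 2 }')

echo "jobs=$JOBS procs=$PROCS"
```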

ACE repset-diff

Use the repset-diff command to loop through the tables in a replication set and produce a csv or json report showing any differences. The syntax is:

$ ./pgedge ace repset-diff cluster_name repset_name

  • cluster_name is the name of the cluster in which the replication set is a member.
  • repset_name is the name of the replication set in which the tables being compared reside.
  • --max_cpu_ratio specifies the fraction of available CPUs that ACE may use. A value of 1 instructs the server to use all available CPUs, while .5 means use half of the available CPUs. The default is .6.
  • --block_rows specifies the number of tuples to be used at a time during table comparisons. If block_rows is set to 1000, then a thousand tuples are compared per job. If the table contains n rows in total, then n/block_rows jobs are scheduled among max_cpu_ratio * cpus * 2 worker processes.
  • --output specifies the output format of the difference report; choose from json or csv. By default, the report is written in json to diffs/<timestamp>/diff.json.
  • --nodes specifies a comma-delimited list of nodes on which the command will be executed.

ACE diff-schemas

Use the diff-schemas command to compare the schemas in a cluster and produce a csv or json report showing any differences. The syntax is:

$ ./pgedge ace diff-schemas cluster_name node_one node_two schema_name --output=json|csv

  • cluster_name is the name of the cluster in which the table resides.
  • node_one is the name of the node on which the schema you are comparing resides; you will be comparing the schema to the same schema on node_two.
  • schema_name is the name of the schema you will be comparing.
  • --output specifies the output format of the difference report; choose from json or csv. By default, the report is written in json to diffs/<timestamp>/diff.json.

ACE diff-spock

Use the diff-spock command to compare the Spock metadata on two cluster nodes and produce a csv or json report showing any differences. The syntax is:

$ ./pgedge ace diff-spock cluster_name node_one node_two --output=json|csv

  • cluster_name is the name of the cluster to which the nodes belong.
  • node_one is the name of the node you will be comparing to node_two.
  • --output specifies the output format of the difference report; choose from json or csv. By default, the report is written in json to diffs/<timestamp>/diff.json.

ACE table-repair

Use the table-repair command to correct any discrepancies or inconsistencies identified by ACE. The syntax is:

$ ./pgedge ace table-repair cluster_name diff_file source_of_truth schema.table_name --dry_run=false

  • cluster_name is the name of the cluster in which the table resides.
  • diff_file is the path and name of the file that contains the table differences.
  • source_of_truth is the name of the node that contains the table that has been manually confirmed to contain accurate rows. The table on other nodes will be updated to match the content of this table.
  • schema.table_name is the schema-qualified name of the table that you are repairing across cluster nodes.
  • --dry_run - Include --dry_run=true to perform a test application of the differences between the source_of_truth node and the other nodes in the cluster. The default is false.

ACE table-rerun

If you suspect that the differences reported by ACE are merely temporary (possibly caused by replication lag), you can use ACE to re-run a previous table-diff to check if the differences still exist. The syntax is:

$ ./pgedge ace table-rerun cluster_name schema.table_name --diff_file=/path/to/diff_file.json

  • cluster_name is the name of the cluster in which the table resides.
  • diff_file is the path and name of the file that contains the table differences.
  • schema.table_name is the schema-qualified name of the table that you are comparing across cluster nodes.

The table-rerun option reads the diffs from the provided diff file and rechecks the tables to see if those tuples still differ. If ACE identifies any differences, it writes them to a new diff file and reports the file path.
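Putting the commands together, a typical reconciliation pass is: run table-diff, optionally re-check with table-rerun, then apply table-repair against a manually verified source of truth. The sketch below only echoes the commands it would run rather than executing them; the cluster name demo, table public.orders, node n1, and the diff file path are placeholders, not output from a real run.

```shell
# Illustrative workflow only: echoes each ACE command instead of executing
# it, since it assumes a cluster named "demo" with a table public.orders.
CLUSTER=demo
TABLE=public.orders
SOURCE_OF_TRUTH=n1                         # node verified to hold correct rows
DIFF_FILE=diffs/20240307161430/diff.json   # placeholder; use the path table-diff reports

run() { echo "+ $*"; }   # swap 'echo' for real execution once verified

run ./pgedge ace table-diff "$CLUSTER" "$TABLE" --output=json
run ./pgedge ace table-rerun "$CLUSTER" "$TABLE" --diff_file="$DIFF_FILE"
run ./pgedge ace table-repair "$CLUSTER" "$DIFF_FILE" "$SOURCE_OF_TRUTH" "$TABLE" --dry_run=true
```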