Version:

Using coldfront

ColdFront offers two operating modes. Pick one per table; both can coexist in the same database. The two modes compare as follows:

	Tiered (hot + cold)	Decoupled (iceberg-only)
Where rows live	Hot in PG heap (recent), cold in Iceberg (archived)	Everything in Iceberg
Setup	Create a partitioned table; let the archiver convert it on first run	One SQL call: `coldfront.create_iceberg_table(...)`
Archiver	Required (cron, moves old partitions to cold)	Not used
Best when	Workload has a recent-row OLTP part that benefits from PG indexes + transactional ergonomics	Pure analytic / append-mostly; you want zero PG storage and stateless compute

Once the table exists, the SQL surface is identical: SELECT, INSERT, UPDATE, DELETE all work normally against the relation name (e.g. events).

Prerequisites (both modes)

The stack must already be running with PG + pg_duckdb + coldfront + Lakekeeper + S3-compatible storage: three services - PostgreSQL + pg_duckdb, Lakekeeper, and any S3-compatible object store (SeaweedFS, MinIO, GCS, etc.). The one-time setup below brings it up and bootstraps it.

One-time setup

Bring up the end-user stack (the example uses SeaweedFS, gated behind the local-store compose profile; host ports are published so the localhost commands below work directly). For the image build itself, see installation.md:

docker compose --profile local-store up -d --build

Then bootstrap Lakekeeper, create the warehouse, and pre-create the Iceberg namespace:

# 1. Bootstrap Lakekeeper
curl -X POST http://localhost:8181/management/v1/bootstrap \
  -H "Content-Type: application/json" -d '{"accept-terms-of-use":true}'

# 2. Create warehouse (adjust endpoint/credentials for your S3 store)
curl -X POST http://localhost:8181/management/v1/warehouse \
  -H "Content-Type: application/json" -d '{
    "warehouse-name": "wh",
    "storage-profile": {
      "type": "s3", "bucket": "iceberg", "region": "us-east-1",
      "endpoint": "http://seaweedfs:8333", "path-style-access": true,
      "flavor": "s3-compat", "sts-enabled": false,
      "remote-signing-enabled": false
    },
    "storage-credential": {
      "type": "s3", "credential-type": "access-key",
      "aws-access-key-id": "admin", "aws-secret-access-key": "adminsecret"
    }
  }'

# 3. Create the Iceberg namespace in the new warehouse.
#    REQUIRED for decoupled (iceberg-only) mode: the Iceberg CREATE SCHEMA is
#    deferred to transaction COMMIT but CREATE TABLE is POSTed eagerly, so
#    coldfront.create_iceberg_table — which runs both in one
#    transaction — would 404 against a cold warehouse. Pre-creating the
#    namespace here (its own committed REST call) makes the function's in-txn
#    CREATE SCHEMA IF NOT EXISTS a no-op so the table create succeeds. The
#    archiver (tiered mode) creates the namespace itself and does not need this.
WID=$(curl -s http://localhost:8181/management/v1/warehouse \
  | grep -oE '"warehouse-id":"[^"]+"' | head -1 | cut -d'"' -f4)
curl -X POST "http://localhost:8181/catalog/v1/$WID/namespaces" \
  -H "Content-Type: application/json" -d '{"namespace":["public"]}'

Then install the extensions and set the cold-tier credentials, once per database:

CREATE EXTENSION IF NOT EXISTS pg_duckdb;
CREATE EXTENSION IF NOT EXISTS coldfront;

SELECT coldfront.set_storage_secret('admin', 'adminsecret', 'seaweedfs:8333');

The secret is stored in the coldfront.storage_secret table (excluded from pg_dump; in a Spock mesh it replicates by value once added to the default repset - one-time mesh setup step 4) and materialized as a DuckDB PERSISTENT SECRET that loads at instance init. There is no per-session setup: the Iceberg catalog ice attaches lazily by the coldfront C hook on the first query that touches a tiered/decoupled view (read or write).

For a real cloud-S3 setup, see object_store.md.

Mode 1 - Tiered (hot + cold)

Create a partitioned table normally:

CREATE TABLE events (
    id     bigint GENERATED ALWAYS AS IDENTITY,
    ts     timestamptz NOT NULL,
    status text,
    data   jsonb,
    PRIMARY KEY (id, ts)
) PARTITION BY RANGE (ts);

CREATE TABLE p_2026_04 PARTITION OF events
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

Register it in coldfront.partition_config (the in-DB per-table config; the YAML holds only the connection + cold-store settings):

./bin/archiver register --config config.yaml --table events \
    --period monthly --hot-period "1 month"
# optional: --retention "5 years" DROPs cold data past that age
# (must exceed --hot-period; omit = keep forever)

Run the archiver (typically via cron):

./bin/archiver --config config.yaml

The first run renames events → _events, creates the unified view events, and registers it. From then on every cycle (1) tiers partitions older than hot_period from hot PG to cold Iceberg and advances the watermark, and (2) if retention_period is set, drops cold Iceberg rows older than it. The data lifecycle is hot → hot_period → cold → retention_period → gone; omit retention_period to keep cold data forever.

Inbound foreign keys

A tiered table cannot be the target of an enforced foreign key from another table over its archivable range. Archiving physically removes the rows from PostgreSQL (export to Iceberg, then DETACH + DROP the partition), and PostgreSQL cannot enforce a foreign key against the cold tier. So the DETACH is refused (SQLSTATE 23503) and the archiver fails fast, naming the blocking constraint.

This is inherent, not an edge case: referencing rows rarely have a hot-only lifecycle. They usually outlive the partition they point at, so the foreign key still pins those rows in PostgreSQL by the time the partition ages into the cold tier.

The foreign key is fundamentally incompatible with archiving the rows it references, so before archiving such a table, drop it:

ALTER TABLE event_logs DROP CONSTRAINT event_logs_events_fkey;

Better still, do not point an enforced foreign key at a tiered table over its archivable range in the first place.

2-level (LIST → RANGE) tiered tables

A table partitioned LIST (region) → RANGE (ts) can be tiered too - the same sub_partition block as the standalone partition manager (Mode 3), so a partition-manager-managed table can be "upgraded" to tiered by pointing the archiver at it:

    - source_table: regional
      partition_column: ts          # the RANGE (time) column — required for 2-level
      partition_period: monthly
      hot_period: 1 month
      # retention_period: 5 years   # optional cold expiry (region-agnostic, by ts)
      sub_partition:
        values_source: "SELECT region FROM regions"

One Iceberg table holds every region (region is just a column); the archiver tiers leaves a whole ts period at a time across all regions before advancing the shared hot/cold watermark, so a period only becomes cold once it is cold for every region. id mode is not supported in tiered mode (the cold tier is time-keyed).

Mode 2 - Decoupled (iceberg-only)

Single call:

SELECT coldfront.create_iceberg_table(
    p_schema  => 'public',
    p_table   => 'events',
    p_columns => '[
      {"name":"id",     "type":"bigint"},
      {"name":"ts",     "type":"timestamptz"},
      {"name":"status", "type":"text"},
      {"name":"data",   "type":"jsonb"}
    ]'::jsonb
);

That single statement provisions:

ice.public.events on the attached Iceberg catalog
a PG-side wrapper view public.events with proper PG-typed columns
a coldfront.tiered_views registry row - every INSERT, UPDATE, and DELETE on the view is intercepted by the coldfront C hook and rewritten to a single duckdb.raw_query(...) against ice.public.events

Spock's ddl_sql repset replicates the CREATE VIEW, but the registry row does not propagate with it: run create_iceberg_table() (idempotent, name-keyed) on each node to arm the write hook there. Each node also needs the bakery armed via coldfront._ensure_claims_replicated() - see the one-time mesh setup below.

Mode 3 - Standalone partition manager (no cold tier)

You don't need Iceberg at all. If automated PostgreSQL partition maintenance is all you want - declarative time- or id-based RANGE partitioning with a premade forward window and automatic age-out of old partitions - ColdFront's partitioner binary is the whole product: stock PostgreSQL (or a Spock mesh), no cold tier, no DuckDB, no Iceberg, nothing to preload. Each invocation makes one reconcile pass per managed table: premake the forward window, ensure the partition covering now exists, and detach-then-drop partitions past retention (DETACH ... CONCURRENTLY, never a bare DROP of attached data). Build it with make build and run it from cron:

go build -o bin/partitioner ./cmd/partitioner   # or: make build
./bin/partitioner --config config.yaml

A partition-only config omits the iceberg: and s3: sections entirely (supply either all of them or none - a half-filled cold config is rejected):

postgres:
  dsn: "host=localhost dbname=mydb"

Register each managed table in coldfront.partition_config:

./bin/partitioner register --config config.yaml --table events \
    --period monthly --retention "12 months" \
    --strategy drop   # drop (DETACH+DROP, destroy; default)
                      #   | detach (DETACH only, keep as a
                      #     standalone table - data preserved)

Operating it

Schedule one pass per period or more often - a cron line, or a systemd oneshot service plus timer (a failed pass then surfaces as a failed unit, so alerting is free):

17 * * * * postgres /usr/local/bin/partitioner --config /etc/coldfront/partitioner.yaml >> /var/log/coldfront-partitioner.log 2>&1

Keep the following operational behaviour in mind when scheduling it:

Exit codes. 0 = every table reconciled (a self-healed behind condition still exits 0); non-zero = at least one table failed (N table(s) failed), each logged with its [schema.table] prefix. Alert on non-zero.
Behind-detection. If the table already has a past partition but none covers now at the start of a pass (a lagging cron - live inserts had no home), the pass heals it (creates the current partition), logs a WARNING (self-healed) line, and still exits 0. Monitor for that warning in the log rather than via the exit code, and widen future_partitions or run more often. A fresh table (only just-premade future partitions) is not behind: its first reconcile succeeds cleanly.
Retention strategy. With the default expiration_strategy: drop, expiry is DETACH CONCURRENTLY + DROP TABLE - the data is gone, so back up before shrinking retention_period. Set expiration_strategy: detach to instead leave the expired partition as a standalone table (detached from the parent, data preserved) and reclaim it yourself. A partition is expired only once its entire range is older than now − retention_period, computed with calendar-accurate PostgreSQL interval arithmetic (a real month, leap years correct). detach is partition-only - the tiered archiver always drops after exporting to cold.

Primary keys on time-partitioned tables (id mode)

PostgreSQL requires a unique index to include the partition key, so a plain PRIMARY KEY (id) is impossible on a table partitioned by a separate ts column. Partition instead by RANGE on a time-ordered id and the partition key is the key - a single-column PRIMARY KEY (id):

    - source_table: events
      partition_column: id
      partition_period: monthly
      retention_period: 12 months
      part_mode: id
      id_scheme: snowflake             # snowflake | uuidv7

uuidv7 reads the RFC 9562 leading-millisecond timestamp; snowflake decodes the pgEdge snowflake extension's layout. Either way the manager computes the month/day partition bounds as id values, so id-order equals time-order and the forward/retention schedule is unchanged.

Two-level (LIST → RANGE) sub-partitioning

For a table partitioned by LIST (region) whose children are themselves RANGE-partitioned by time, add a sub_partition block. values_source is a query returning the current level-1 values; the manager provisions and maintains a RANGE sub-tree under each, creating sub-trees for newly appearing values automatically:

    - source_table: events
      partition_column: ts             # the level-2 RANGE column
      partition_period: monthly
      retention_period: 12 months
      sub_partition:
        values_source: "SELECT code FROM regions"

Managing partitioned tables (CLI)

ColdFront splits configuration into two kinds. Connection config - the Postgres DSN, and (tiered archiver only) the Iceberg/S3 connection - stays in a small per-node YAML and is never replicated. Per-table lifecycle lives in coldfront.partition_config, a name-keyed table that replicates by value across a Spock mesh (like tiered_views/archive_watermark), so every node reads identical config - no per-node file syncing. Manage it with the CLI below (both partitioner and archiver expose these subcommands; with no subcommand they do their normal reconcile/archive run).

register is the primary way to manage tables: one command adds or adopts one table, validated on the spot. import and export are bulk helpers for (re)configuring a machine - seed a fresh node from a YAML, or dump the live config to git and replay it on another node - not the day-to-day path. Every write (register, import, set) goes through the same validation, so no command can leave partition_config in a state another command would reject.

The data lifecycle is hot PG → hot_period → cold Iceberg → retention_period → dropped (tiered) or hot PG → retention_period → dropped (partition-only). Setting hot_period makes a table tiered; omitting it makes it partition-only.

hot_period and retention_period are native PostgreSQL intervals - use any interval syntax (1 month, 90 days, 1 year 2 mons, 5 years). Expiry boundaries are computed with calendar-accurate interval arithmetic (now() - period: real months, leap years), and retention_period must exceed hot_period (validated at register/set time).

The CLI exposes the following subcommands:

Command	Purpose
`register`	add/adopt a table - validates the PRIMARY KEY covers the partition key
`list`	show managed tables and their lifecycle
`set`	change fields, or `--disable`/`--enable` a table
`remove`	stop managing a table (the table itself is left intact)
`import`	bulk-load a machine's tables from a YAML's `archiver.tables` (provision/restore; each table validated as `register` does)
`export`	dump the active (enabled) config to YAML or SQL - a git-reviewable backup to replay on another node

# Partition-only: keep 3 future partitions, drop those older than 12 months.
partitioner register --config cf.yaml --table events --period monthly --retention "12 months"

# Partition-only, but DETACH (preserve) expired partitions instead of dropping them.
partitioner register --config cf.yaml --table events --period monthly \
    --retention "12 months" --strategy detach

# Tiered: tier to cold Iceberg after 1 month, then drop cold data after 5 years.
archiver register --config cf.yaml --table events --period monthly \
    --hot-period "1 month" --retention "5 years"

# id mode — a real single-column PRIMARY KEY (id) on a snowflake-keyed table.
partitioner register --config cf.yaml --table events --period monthly \
    --column id --part-mode id --id-scheme snowflake --retention "1 year"

# 2-level LIST(region) → RANGE(ts), tiered; region values come from a table.
archiver register --config cf.yaml --table regional --period monthly --column ts \
    --hot-period "1 month" --sub-values-source "SELECT region FROM regions"

partitioner list   --config cf.yaml                      # what's managed
partitioner set    --config cf.yaml --table events --retention "24 months"
partitioner set    --config cf.yaml --table events --disable   # pause (keeps the row)
partitioner remove --config cf.yaml --table events       # unregister, keep the table
partitioner import --config tables.yaml                  # register every table in a YAML at once
partitioner export --config cf.yaml > managed.yaml       # active config, git-reviewable (--format sql for INSERTs)

Run partitioner (or archiver) with no arguments, help, or --help for the command overview; every subcommand has detailed --help with worked examples. The write commands accept --print-sql (emit the SQL without running it - review/commit it); register and import also accept --dry-run. set --enable/--disable (mutually exclusive) pause/resume a table without removing it; a disabled table is skipped by reconcile and omitted from export. Per table, only the cadence and a lifecycle boundary (hot_period or retention_period) are required; partition_column is auto-detected from pg_catalog for flat tables (required for 2-level). register writes a row whose CHECK constraints enforce the lifecycle rules at write time.

Managing tables: the archiver and partitioner read the managed set only from coldfront.partition_config. register adds one table; import adds every table in a YAML archiver.tables list. Both write through the same validation (the PK must cover the partition key; retention must exceed hot_period), so an imported table is checked exactly as a registered one. export dumps the active rows back to YAML or SQL for git.

Storage backends

Configure exactly one cold-store backend:

S3 - any S3-compatible store (SeaweedFS, MinIO). Set endpoint, use_ssl: true for a TLS endpoint, and url_style: path (default) or vhost.
Virtual-hosted cloud S3 (AWS S3 is the canonical one) - omit endpoint (and the endpoint arg to set_storage_secret) so DuckDB uses the native per-Region virtual-hosted + HTTPS addressing; just set region to your bucket's Region. This is required for Regions launched after 2019-03-20 (e.g. ap-south-2), whose DNS does not route path-style requests and returns HTTP 400. The Lakekeeper warehouse profile must be a virtual-hosted s3 profile (flavor: aws, path-style-access: false, no custom endpoint); the full walkthrough is object_store.md.
Google Cloud Storage - not a separate backend: use s3: pointed at GCS's S3-interoperability endpoint with an HMAC key pair (endpoint: storage.googleapis.com, use_ssl: true, access_key/secret_key = the HMAC id/secret). Lakekeeper's warehouse uses an s3 profile (flavor: s3-compat, path-style) at the same endpoint. Verified end-to-end (iceberg read+write over interop). Lakekeeper's native gcs profile is service-account only and is not used.
Azure ADLS Gen2 - a supported cold-store backend; the access key rides inside connection_string.

For an Azure ADLS Gen2 cold tier, set the credential with set_storage_secret_azure() instead of set_storage_secret() - it takes a CONFIG-provider connection string. The storage-account access key rides inside AccountKey=…; the DuckDB azure secret has no separate account-key parameter, so shared-key auth lives entirely in the connection string:

SELECT coldfront.set_storage_secret_azure(
    'DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net');

It writes the same coldfront.storage_secret row (replicated, pg_dump-excluded) and materializes a TYPE azure PERSISTENT SECRET. The Azure cold tier is subject to the soft-delete / change-feed restriction in Gotchas.

Vended (minted) credentials

Vended credentials let a deployment run with no object-store credential stored in the database, in a DuckDB secret file, or in an archiver config; this suits compliance environments that forbid persisting long-term keys. Lakekeeper mints a short-lived, per-table credential (an S3 STS access key, secret, and session token) at read and write time, and ColdFront uses it directly. The long-term credential lives only in the Lakekeeper warehouse.

Enable it with a single call that stores no credential:

SELECT coldfront.set_storage_secret_vended();

This writes a coldfront.storage_secret row marked as vended, so nothing is materialized as a DuckDB secret; ensure_attached() then attaches the catalog with credential vending turned on. The archiver reads the same row and skips its own credential setup, so a vended deployment omits the s3:/azure: block from the archiver config entirely. The compactor likewise needs no credential in its config.

Vended mode targets the two clouds that mint scoped credentials: AWS S3 (STS) and Azure ADLS Gen2 (SAS). The set_storage_secret_vended() call above enables AWS S3; the same call with 'azure' enables Azure through the identical path:

SELECT coldfront.set_storage_secret_vended('azure');

Vending requires a Lakekeeper warehouse configured to mint credentials:

AWS S3: flavor: aws with sts-enabled: true, an assume-role-arn for a bucket-scoped IAM role, and an external-id on the warehouse credential that the role's trust policy also requires. The full warehouse and IAM-role setup is in object_store.md.
Azure ADLS Gen2: an adls warehouse with sas-enabled (on by default). Lakekeeper vends a per-container SAS token; the warehouse credential can be a shared-access-key, client-credentials, or azure-system-identity.

Google Cloud Storage over the S3-interoperability endpoint has no STS to mint against, so GCS stays on static HMAC credentials.

The compactor runs fully under vended credentials: compaction, snapshot expiry, and orphan-file reclaim all use the minted per-table credentials.

Switching a running deployment between static and vended credentials changes the attach mode, which is fixed per PostgreSQL backend at attach time; open a new session (or restart the archiver and compactor, which are short-lived processes) so the new mode takes effect.

Reading + writing (identical for both modes)

The same SQL works against the relation name in either mode, as the following examples show:

-- Reads — pg_duckdb handles the iceberg side; PG handles the heap side
SELECT count(*) FROM events;
SELECT id, status, data->>'k' FROM events WHERE ts >= '2026-04-01';

-- Inserts, updates, and deletes all go through the coldfront C hook,
-- which rewrites the query into one PG-set-based statement (hot side)
-- + one duckdb.raw_query (cold side). For iceberg-only mode every write
-- goes cold; for tiered mode the hook splits by ts vs the watermark.
INSERT INTO events (ts, status, data) VALUES (now(), 'ok', '{"k":1}');
UPDATE events SET status = 'fixed' WHERE id = 123;
DELETE FROM events WHERE ts < '2025-01-01';

-- Bulk INSERT shapes are all set-based — no per-row work:
INSERT INTO events (ts, status, data) VALUES (...), (...), (...);
INSERT INTO events (ts, status, data) SELECT ts, status, data FROM staging;
INSERT INTO events (ts, status, data) SELECT now() + i*'1s'::interval, 'ok', '{}'
                                       FROM generate_series(1, 1000) i;

-- Transactions work; ROLLBACK undoes Iceberg writes too
BEGIN;
  UPDATE events SET status = 'pending' WHERE id = 1;
  SELECT status FROM events WHERE id = 1;   -- sees 'pending' (read-your-own-write)
ROLLBACK;
SELECT status FROM events WHERE id = 1;     -- back to whatever it was

Supported column types

The following PostgreSQL column types are supported:

bigint · integer · smallint · real · double precision · boolean · timestamp with time zone · timestamp without time zone · date · time without time zone · uuid · text · varchar(N) · char(N) · bytea · numeric(P,S) (P ≤ 38) · jsonb / json · interval

Anything else (unbounded numeric, xml, tsvector, range/multirange types, custom enums, arrays, composite types) is rejected at table-creation time. We refuse silent fallback to varchar - losing precision/identity is worse than no support.

char(N) is stored and read as varchar. The data round-trips losslessly: values, comparisons, and length() match a hot PG table, where length() already ignores char(N) trailing padding. The only difference is cosmetic: a cold char(N) column reads back unpadded with pg_typeof varchar, because pg_duckdb has no fixed-length char type. Use text or varchar if blank-padded display matters.

json, jsonb and interval are stored as varchar in Iceberg (no native primitive). On read, interval is view-cast back to the rich PG type; json and jsonb surface as DuckDB's json (the equivalent of PG's jsonb), not the rich PG jsonb type, because Iceberg-backed reads run entirely in DuckDB. Queries like data->>'key' and data->'key' work, and ColdFront translates the ::jsonb cast and jsonb_array_length on read. The jsonb-only operators (?, @>, <@, #>, #>>) and most jsonb functions (jsonb_typeof, jsonb_extract_path, jsonb_extract_path_text, jsonb_set, jsonb_path_*, jsonb_each, jsonb_object_keys) are not supported on tiered or decoupled data: DuckDB either lacks them or its same-named function differs in signature or result. Reach into the document with ->/->>.

These limits apply only to reads that go through a tiered or iceberg-only view, which pg_duckdb plans entirely in DuckDB. Ordinary (non-tiered) PostgreSQL tables are untouched by ColdFront and keep full jsonb support, as does the hot partition table itself.

A hot-only read is the exception. When a SELECT reads a tiered view directly (no join, CTE, or sub-query) and its WHERE provably restricts to the hot tier, ColdFront rewrites it to read the hot partition table in plain PostgreSQL: the full jsonb operator and function set works, and the query skips DuckDB entirely. Such a read surfaces data as native jsonb rather than the view's json. Reads that span tiers, are cold-only, or reach the view through a join or sub-query stay in DuckDB, where the limits above apply.

inet/cidr/oid are not supported: pg_duckdb cannot process them (inet Oid 869, oid Oid 26) in any Iceberg-backed query, and every cross-tier read is planned by pg_duckdb - so no cast makes them readable. Store IP data as text and oid values as bigint (you can still index/compare them; cast on the hot side only if needed).

Gotchas

Keep the following caveats in mind when running either mode:

jsonb reads: surface as json. The ->/->> operators work, and ColdFront translates the ::jsonb cast and jsonb_array_length; other jsonb operators and functions are unsupported on cold or cross-tier reads. A hot-only read of the view runs in PostgreSQL with full jsonb (see Supported column types).
Cross-tier isolation: a long-running SELECT that touches the Iceberg side multiple times within one transaction may see writes from other sessions interleaved between scans. PG's repeatable-read does not extend across the pg_duckdb boundary. Read-your-own-write within one tx works (verified) - it's only cross-statement consistency vs. concurrent writers that's weaker.
Crash-mid-commit (decoupled mode): if a backend crashes between Iceberg snapshot commit and PG commit, S3 objects can be orphaned. Iceberg housekeeping reclaims them - not corrupting, but a real failure mode.
Concurrent writes from multiple PG nodes (decoupled mode): serialized PG-side by the bakery protocol - every iceberg-only INSERT goes through coldfront._exec_iceberg_with_claim, which holds a globally-ordered snowflake ticket and waits for its turn before committing to Lakekeeper. No 409 conflicts, no app-level retry. The protocol is Lamport-1978 mutex with the Ricart-Agrawala (1981) deferred-reply optimisation; claims and acks replicate as Spock rows and it stays safe under Spock's asymmetric apply (modelled in docs/formal/Bakery_v2.tla). The bakery requires the dblink + snowflake extensions, the coldfront.dblink_self GUC, and a one-time SELECT coldfront._ensure_claims_replicated() call on every node after spock mesh setup; see architecture_decoupled.md. Sync-rep is not required. The throughput ceiling is Lakekeeper's commit rate, not the writer count.
Direct table access: _events is the hot heap (tiered mode only). ice.public.<name> is the Iceberg table - only addressable via iceberg_scan(...) or duckdb.raw_query('… ice.… …'), never via PG-native 3-part names.
Tiered INSERT with omitted IDENTITY column (e.g. INSERT INTO events (ts, status, data) VALUES … where id is GENERATED ALWAYS AS IDENTITY): the cold side falls back to a plpgsql cursor loop that calls nextval() per row so cold ids share the hot side's sequence. Correctness is full; throughput is lower than the set-based fast path. Either supply id explicitly in the INSERT, or use a partition-column predicate that proves the rows are all hot, to stay on the fast path. For very large historical seeds (mostly-cold), prefer iceberg-only mode where ids come from your source data.

Distributed setup (3-node mesh, decoupled mode)

For multi-writer iceberg workloads, run coldfront on N PG nodes in a Spock mesh against the same Lakekeeper + S3. The bakery serializes commits PG-side so writers never collide at the catalog.

Per-node `postgresql.conf`

The configuration below applies to each node in the mesh:

wal_level = logical
shared_preload_libraries = 'snowflake,spock,pg_duckdb,coldfront'

# Spock prereq + DDL replication. The ddl_sql repset (and with it the
# one-node provisioning promise for wrapper views) only works with DDL
# replication on; allow_ddl_from_functions covers DDL issued inside
# coldfront's plpgsql helpers.
track_commit_timestamp = on
spock.enable_ddl_replication = on
spock.allow_ddl_from_functions = on
spock.include_ddl_repset = on

# Keep pg_stat_replication.reply_time fresh on every walsender.  PG
# default is 10 s; with the bakery's 5 s liveness window that would
# false-positive every idle peer as "dead" on the first claim after a
# quiet period.  1 s leaves comfortable margin.
wal_receiver_status_interval = 1s

# Per-node — any distinct integer 1..1023 (must be unique per node; the value
# is otherwise arbitrary — the bakery matches acks by spock node name, not by id).
snowflake.node = 1

# DSN used by the bakery's autonomous-tx claim INSERT/DELETE via dblink.
# Unix socket avoids TCP overhead. The claim session only touches the
# coldfront.claims/claim_acks heap tables, so it never attaches the
# Iceberg catalog (the lazy catalog-attach hook fires only on a tiered
# view). application_name=coldfront_dblink marks the session as bakery
# traffic.
coldfront.dblink_self = 'host=/tmp dbname=coldfront user=coldfront application_name=coldfront_dblink'

coldfront.warehouse = 'wh'
coldfront.lakekeeper_endpoint = 'http://lakekeeper:8181/catalog'

The bakery has no peer-ack timeout. R-A's only failure mode is a dead peer (would wait forever), closed by a liveness check inside the wait-loop: a peer whose pg_stat_replication.reply_time is older than coldfront.peer_alive_window_ms (default 5000) is implicitly treated as already-acked. Raise this on slow/lossy WAN links if false-positive dead-peer rulings become a problem. An alive peer that hasn't acked is either deferring legitimately (R-A's defer rule) or about to ack - either way, waiting is correct, not a failure.

Sync-rep (synchronous_standby_names) is not required by the bakery - the R-A ack barrier is what serialises iceberg commits. You can still enable it cluster-wide if you want stronger durability for non-bakery writes, but it plays no part in iceberg-commit serialisation.

One-time mesh setup - must be done in this order on every node, because coldfront._ensure_claims_replicated() calls spock.repset_add_table and so requires the local spock node to already exist:

-- 1. Extensions, in dependency order. dblink + snowflake are bakery
-- prereqs (R-A's autonomous-tx claim/ack INSERTs go through dblink;
-- claim tickets come from snowflake.nextval).
CREATE EXTENSION IF NOT EXISTS dblink;
CREATE EXTENSION IF NOT EXISTS snowflake;
CREATE EXTENSION IF NOT EXISTS spock;
CREATE EXTENSION IF NOT EXISTS pg_duckdb;
CREATE EXTENSION IF NOT EXISTS coldfront;

-- 2. Spock node + full-mesh subscriptions (each node has N-1 subs).
SELECT spock.node_create('n<i>', 'host=<this_node_priv_ip> user=coldfront dbname=coldfront port=5432');

-- on n1 (n2 and n3 symmetric):
SELECT spock.sub_create('sub_n1_from_n2', 'host=<n2> user=coldfront dbname=coldfront port=5432',
                        ARRAY['default','default_insert_only','ddl_sql'],
                        false, false, '{}', '0', false);
SELECT spock.sub_create('sub_n1_from_n3', 'host=<n3> user=coldfront dbname=coldfront port=5432',
                        ARRAY['default','default_insert_only','ddl_sql'],
                        false, false, '{}', '0', false);

-- 3. **Required** on every node, after spock setup: register the R-A
-- bakery tables (coldfront.claims + coldfront.claim_acks) in the local
-- node's default repset.  Without this on a peer, the peer's ack INSERTs
-- never replicate back to the originating writer and every claim on the
-- originator waits forever at the ack barrier.
SELECT coldfront._ensure_claims_replicated();

-- 4. On every node: replicate the cold-store secret by value, so a
-- set_storage_secret() call on one node reaches all peers.
-- (coldfront.partition_config self-registers the same way the first time
-- the archiver/partitioner touches it, so it needs no manual step.)
SELECT spock.repset_add_table('default', 'coldfront.storage_secret'::regclass, false);

-- 5. Tiered mesh only, on every node: replicate the registry + watermark
-- so a tiered table provisioned on one node is fully usable on peers
-- (both tables are name-keyed, so the rows are node-independent).
SELECT spock.repset_add_table('default', 'coldfront.tiered_views'::regclass, false);
SELECT spock.repset_add_table('default', 'coldfront.archive_watermark'::regclass, false);

synchronize_structure := false, synchronize_data := false - tables already exist on every node from the coldfront extension; no initial copy needed.

Verify before benching - insert a sentinel claim on each node and read it back from every other node. All N×(N-1) directions must show the row before traffic starts. ci/journey.sh story_mesh_substrate is a copyable reference.

Tuning knobs

The following GUCs adjust write behaviour and execution; tune them as needed:

coldfront.allow_mixed_writes (bool, default on) - controls what happens for tiered-mode UPDATE/DELETE whose WHERE can't be proven to target one tier. on emits a dual-tier CTE; off rejects with an error and a hint. Not relevant in decoupled mode (every write is single-tier by definition).
duckdb.force_execution - bench it before flipping: on a mixed workload it helps count(distinct) and similar but regresses index lookups, top-K with PK ordering, and JSON access. Default off.

Going deeper

For deeper detail on each mode's internals, see the architecture references:

Tiered architecture, watermark, archiver, transparent UPDATE/DELETE, concurrency: architecture.md.
Decoupled mode internals, ACID model, distributed scaling: architecture_decoupled.md.