Probes System
This document provides a detailed explanation of how the Collector probe system works internally.
What is a Probe?
A probe is a data collection unit that performs the following tasks:
- The probe queries a specific PostgreSQL system view or function.
- The probe collects metrics at a configured interval.
- The probe stores the results in a partitioned metrics table.
- The probe manages data retention through automated cleanup.
The Collector includes 34 built-in probes covering the most important PostgreSQL statistics views.
Probe Types
Probes are categorized by their scope.
Server-Scoped Probes
Server-scoped probes collect server-wide statistics and execute once per monitored connection. Examples include the following probes:
- `pg_stat_activity` monitors current database activity.
- `pg_stat_replication` monitors replication status, lag, and WAL receiver statistics.
- `pg_replication_slots` monitors replication slot usage and statistics.
- `pg_stat_recovery_prefetch` monitors recovery prefetch statistics.
- `pg_stat_subscription` monitors logical replication subscriptions and statistics.
- `pg_stat_connection_security` monitors SSL and GSSAPI connection security.
- `pg_stat_io` monitors I/O and SLRU cache statistics.
- `pg_stat_checkpointer` monitors checkpointer and background writer statistics.
- `pg_stat_wal` monitors WAL generation and archiver statistics.
- `pg_settings` monitors PostgreSQL configuration settings with change tracking.
- `pg_hba_file_rules` monitors `pg_hba.conf` authentication rules with change tracking.
- `pg_ident_file_mappings` monitors `pg_ident.conf` user mappings with change tracking.
- `pg_server_info` monitors server identification and configuration with change tracking.
- `pg_node_role` detects node roles for cluster topologies.
- `pg_database` monitors the database catalog with XID wraparound indicators.
Database-Scoped Probes
Database-scoped probes collect per-database statistics and execute once for each database on a monitored server. Examples include the following probes:
- `pg_stat_database` monitors database-wide statistics.
- `pg_stat_database_conflicts` monitors recovery conflict statistics.
- `pg_stat_all_tables` monitors table access and I/O statistics.
- `pg_stat_all_indexes` monitors index usage and I/O statistics.
- `pg_statio_all_sequences` monitors sequence I/O statistics.
- `pg_stat_user_functions` monitors user function statistics.
- `pg_stat_statements` monitors query performance statistics.
- `pg_extension` monitors installed extensions with change tracking.
Probe Lifecycle
This section describes the lifecycle of a probe from initialization through cleanup.
1. Initialization
At startup, the Collector performs the following steps:
- The Collector loads probe configurations from the `probes` table.
- The Collector creates a probe instance for each enabled probe.
- The Collector logs initialized probes with their intervals and retention settings.
2. Scheduling
Each probe runs on an independent schedule:
- The probe executes immediately on startup.
- A timer triggers based on `collection_interval_seconds`.
- The probe executes against all monitored connections.
- The process repeats until shutdown.
3. Execution
The execution flow differs based on probe scope.
For server-scoped probes, the scheduler performs these steps:
- Get all monitored connections from the datastore.
- For each connection in parallel:
  - Acquire a connection from the monitored pool.
  - Execute the SQL query.
  - Release the monitored connection.
- Acquire a datastore connection.
- Ensure a partition exists for the current week.
- Store metrics using the COPY protocol.
- Release the datastore connection.
For database-scoped probes, the scheduler performs these steps:
- Get all monitored connections from the datastore.
- For each connection in parallel:
  - Acquire a connection from the monitored pool.
  - Query `pg_database` for the database list.
  - For each database, acquire a connection and execute the SQL query.
  - Release the monitored connection.
- Acquire a datastore connection.
- Ensure a partition exists.
- Store all collected metrics using the COPY protocol.
- Release the datastore connection.
4. Storage
The Collector stores metrics using the PostgreSQL COPY protocol by following these steps:
- Build the column list.
- Build the values array from collected metrics.
- Create a `CopyFrom` source.
- Execute the COPY command.
- Commit the transaction.
The COPY protocol is much faster than individual INSERT statements.
5. Partition Management
Before storing metrics, the system manages partitions by following these steps:
- Calculate the current week start (Monday).
- Check whether a partition exists for that week.
- If the partition does not exist, create a partition with a range from Monday 00:00:00 to the next Monday 00:00:00.
- Store metrics in the partition.
6. Cleanup
The garbage collector manages data retention:
- The garbage collector runs daily, with the first run 5 minutes after startup.
- For each probe, the garbage collector calculates the cutoff as `NOW() - retention_days`, finds partitions entirely before the cutoff, and drops those partitions.
- For change-tracked probes such as `pg_settings`, `pg_hba_file_rules`, and `pg_ident_file_mappings`, the garbage collector identifies the most recent snapshot for each connection and protects those partitions from deletion.
Probe Interface
All probes implement the MetricsProbe interface.
The following code shows the interface definition:
```go
type MetricsProbe interface {
	// GetName returns the probe name
	GetName() string

	// GetTableName returns the metrics table name
	GetTableName() string

	// GetQuery returns the SQL query to execute
	GetQuery() string

	// Execute runs the probe and returns metrics
	Execute(
		ctx context.Context,
		conn *pgxpool.Conn,
	) ([]map[string]interface{}, error)

	// Store saves metrics to the datastore
	Store(
		ctx context.Context,
		conn *pgxpool.Conn,
		connectionID int,
		timestamp time.Time,
		metrics []map[string]interface{},
	) error

	// EnsurePartition creates partition if needed
	EnsurePartition(
		ctx context.Context,
		conn *pgxpool.Conn,
		timestamp time.Time,
	) error

	// GetConfig returns probe configuration
	GetConfig() *ProbeConfig

	// IsDatabaseScoped returns true for per-database probes
	IsDatabaseScoped() bool
}
```
Probe Implementation Example
The following code shows a simplified example of a probe implementation:
```go
type PgStatActivityProbe struct {
	BaseMetricsProbe
}

func (p *PgStatActivityProbe) GetName() string {
	return "pg_stat_activity"
}

func (p *PgStatActivityProbe) GetTableName() string {
	return "pg_stat_activity"
}

func (p *PgStatActivityProbe) IsDatabaseScoped() bool {
	return false // Server-scoped
}

func (p *PgStatActivityProbe) GetQuery() string {
	return `
		SELECT
			datid, datname, pid, usename,
			application_name, client_addr,
			backend_start, state, query
		FROM pg_stat_activity
		WHERE pid <> pg_backend_pid()
	`
}

func (p *PgStatActivityProbe) Execute(
	ctx context.Context,
	conn *pgxpool.Conn,
) ([]map[string]interface{}, error) {
	rows, err := conn.Query(ctx, p.GetQuery())
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	return utils.ScanRowsToMaps(rows)
}

func (p *PgStatActivityProbe) Store(
	ctx context.Context,
	datastoreConn *pgxpool.Conn,
	connectionID int,
	timestamp time.Time,
	metrics []map[string]interface{},
) error {
	if err := p.EnsurePartition(
		ctx, datastoreConn, timestamp); err != nil {
		return err
	}
	columns := []string{
		"connection_id", "collected_at",
		"datid", "datname",
	}
	var values [][]interface{}
	for _, metric := range metrics {
		// Each row must match the columns list above, in order.
		row := []interface{}{
			connectionID, timestamp,
			metric["datid"], metric["datname"],
		}
		values = append(values, row)
	}
	return StoreMetricsWithCopy(
		ctx, datastoreConn,
		p.GetTableName(), columns, values)
}
```
Error Handling
The probe system handles errors at several levels.
Probe Execution Errors
If a probe fails to execute:
- The system logs the error with context including the probe name, connection name, and error message.
- The system does not store metrics for that interval.
- Other probes continue without interruption.
- The probe retries on the next scheduled interval.
Connection Errors
If a connection cannot be acquired:
- The system logs the error with timeout information.
- The system skips probe execution for that connection.
- The probe retries on the next interval.
Storage Errors
If metrics cannot be stored:
- The system logs the error.
- Metrics are lost for that interval.
- The probe continues on the next interval.
Performance Considerations
This section covers performance factors to consider when working with probes.
Query Optimization
Probe queries should follow these guidelines:
- Use appropriate WHERE clauses to filter data.
- Avoid expensive operations such as sorts and aggregates when possible.
- Return only the columns that are needed.
- Use indexes when available.
Collection Frequency
Balance data freshness against system load using these guidelines:
- Fast-changing data such as replication lag should use 30-60 second intervals.
- Moderate data such as table statistics should use 300-second (5-minute) intervals.
- Slow-changing data such as archiver statistics should use 600+ second (10+ minute) intervals.
Concurrent Execution
Probes execute in parallel with the following constraints:
- Each probe has its own goroutine.
- Connection pools limit concurrent connections per server.
- The system prevents overwhelming monitored servers.
Storage Efficiency
The COPY protocol is much faster than row-by-row INSERT statements because it provides bulk loading, minimal per-row protocol overhead, and efficient server-side processing.
See Also
The following resources provide additional details.
- Adding Probes covers how to create custom probes.
- Probe Reference provides the complete probe list.
- Scheduler explains how probes are scheduled.
- Architecture describes the overall system design.