Alert Rules
Alert rules define the conditions that trigger threshold-based alerts. Each rule specifies a metric to monitor, a comparison operator, and a threshold value. The alerter includes 24 built-in rules and supports custom rules.
Rule Structure
Each alert rule contains the following fields:
| Field | Description |
|---|---|
| `name` | A human-readable name for the rule. |
| `description` | A detailed explanation of what the rule detects. |
| `category` | The category grouping for the rule. |
| `metric_name` | The metric identifier to evaluate. |
| `default_operator` | The comparison operator. |
| `default_threshold` | The threshold value for comparison. |
| `default_severity` | The alert severity (`critical`, `warning`, or `info`). |
| `default_enabled` | Whether the rule is enabled by default. |
| `required_extension` | An optional PostgreSQL extension that the rule requires. |
| `is_built_in` | Indicates whether the rule is built-in. |
Comparison Operators
The alerter supports six comparison operators:
- `>` triggers when the metric value is greater than the threshold.
- `>=` triggers when the metric value is at least the threshold.
- `<` triggers when the metric value is less than the threshold.
- `<=` triggers when the metric value is at most the threshold.
- `==` triggers when the metric value equals the threshold.
- `!=` triggers when the metric value does not equal the threshold.
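As a minimal sketch of how the six operators could be applied, the following Python maps each operator string to a comparison function from the standard `operator` module. The names `OPERATORS` and `violates` are illustrative assumptions and are not part of the alerter.

```python
import operator

# Hypothetical lookup table: operator string -> comparison function.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def violates(metric_value: float, op: str, threshold: float) -> bool:
    """Return True when `metric_value <op> threshold` holds."""
    try:
        compare = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return compare(metric_value, threshold)
```

With this sketch, `violates(85.0, ">", 80.0)` is true, while `violates(80.0, ">", 80.0)` is false because `>` is a strict comparison.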
Severity Levels
Alert rules use three severity levels:
- `critical` indicates a severe issue requiring immediate attention.
- `warning` indicates a potential problem that should be investigated.
- `info` indicates an informational condition for awareness.
Rule Categories
Built-in rules are organized into the following categories:
- Connection rules monitor database connections and session state.
- Replication rules monitor replication lag and slot status.
- Performance rules monitor query performance and locking.
- Storage rules monitor disk usage and table maintenance.
- System rules monitor CPU, memory, and system resources.
Hierarchical Overrides
Alert thresholds can be customized at multiple levels of the server hierarchy. The alerter resolves the effective threshold using the following precedence order, from highest to lowest:
- Server overrides apply to a specific connection.
- Cluster overrides apply to all servers in a cluster.
- Group overrides apply to all clusters in a group.
- Global defaults apply when no override exists.
An override specifies the following fields:
| Field | Description |
|---|---|
| `rule_id` | The alert rule to override. |
| `scope` | The override level: `server`, `cluster`, or `group`. |
| `scope_id` | The identifier for the connection, cluster, or group. |
| `database_name` | An optional database within the connection. |
| `operator` | The comparison operator for this override. |
| `threshold` | The threshold value for this override. |
| `severity` | The severity level for this override. |
| `enabled` | Whether the rule is enabled at this scope. |
When evaluating a rule for a server, the alerter checks for a server-level override first. If none exists, the alerter checks the cluster that contains the server, then the group that contains the cluster. If no override exists at any level, the alerter uses the global default values.
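The lookup described above can be sketched in Python as a walk from the most specific scope to the broadest. This is an illustration under stated assumptions, not the alerter's implementation: the `Settings` type, the `resolve_settings` function, and the dictionary keyed by `(rule_id, scope, scope_id)` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values an override can change."""
    operator: str
    threshold: float
    severity: str
    enabled: bool


# Assumed storage shape: (rule_id, scope, scope_id) -> Settings.
OverrideMap = Dict[Tuple[int, str, int], Settings]


def resolve_settings(
    rule_id: int,
    server_id: int,
    cluster_id: Optional[int],
    group_id: Optional[int],
    overrides: OverrideMap,
    defaults: Settings,
) -> Settings:
    """Check server, then cluster, then group; fall back to global defaults."""
    for scope, scope_id in (
        ("server", server_id),
        ("cluster", cluster_id),
        ("group", group_id),
    ):
        if scope_id is None:
            continue  # the server has no parent at this level
        found = overrides.get((rule_id, scope, scope_id))
        if found is not None:
            return found  # the first, most specific match wins
    return defaults
```

Returning on the first match mirrors the precedence order: a broader override is consulted only when nothing more specific exists.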
Override Evaluation Order
The alerter evaluates thresholds using a strict precedence order for each server and database combination:
- The alerter checks for a server-level override first.
- The alerter checks the cluster override if no server override exists.
- The alerter checks the group override if no cluster override exists.
- The alerter applies global defaults when no overrides exist at any level.
A NULL database name in an override acts as a
wildcard. The wildcard override matches any database on
the server. A database-specific override takes
precedence over a wildcard override at the same scope
level.
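A minimal sketch of the wildcard rule, assuming the overrides at a single scope level are available as `(database_name, threshold)` pairs and that Python's `None` stands in for a NULL database name. The `pick_override` helper is hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (database_name, threshold); database_name=None models the NULL wildcard.
Override = Tuple[Optional[str], float]


def pick_override(
    overrides: Sequence[Override], database: str
) -> Optional[Override]:
    """Prefer a database-specific override, then fall back to the wildcard."""
    wildcard = None
    for candidate in overrides:
        name = candidate[0]
        if name == database:
            return candidate  # database-specific override wins
        if name is None and wildcard is None:
            wildcard = candidate  # remember the wildcard as a fallback
    return wildcard
```

For the list `[(None, 90.0), ("orders", 70.0)]`, the database `orders` resolves to the 70.0 override and any other database resolves to the wildcard at 90.0.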
Auto-Detected Clusters
The system automatically detects cluster membership by analyzing replication topology. The alerter identifies clusters through Spock replication, binary replication, and logical replication connections between servers.
When the alerter detects that servers participate in the same replication topology, the system groups the servers into an auto-detected cluster. The auto-detected cluster and the parent group appear in the scope dropdown when a user edits overrides for any member server.
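The source does not describe the detection algorithm itself. One plausible approach, sketched below as an assumption, treats each replication connection as an undirected edge between two servers and groups servers by connected component. The `detect_clusters` function and the server names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def detect_clusters(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group servers connected by any chain of replication links."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for source, target in links:
        # Direction is ignored: either end of a link joins the same cluster.
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for server in neighbours:
        if server in seen:
            continue
        # Depth-first walk collects one connected component.
        component: Set[str] = set()
        stack = [server]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Given the links `pg1 -> pg2`, `pg2 -> pg3`, and `pg4 -> pg5`, this sketch yields two clusters: one containing `pg1`, `pg2`, and `pg3`, and one containing `pg4` and `pg5`.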
Editing Overrides from Alerts
Users can edit an override directly from an alert instance. The edit button on an alert instance opens the Edit Override dialog for the associated rule and scope.
The Edit Override dialog allows the user to adjust the threshold, operator, severity, and enabled state. The scope dropdown displays the available override levels: server, cluster, and group. The dialog pre-selects the scope that matches the alert's originating context.
Scope Disabling Logic
The Edit Override dialog disables scope levels that would have no practical effect. Scopes broader than the most specific existing override are disabled in the dropdown. For example, if a server-level override exists for a rule, the cluster and group scope options are disabled. Editing the cluster or group override would have no effect on that server because the more specific server-level override takes precedence.
This behavior prevents users from creating overrides that the alerter would never apply. The dialog displays a tooltip on disabled options to explain why the scope is unavailable.
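The disabling rule can be sketched as a small function that, given the scopes where an override already exists, returns the scopes the dropdown would disable. The `SCOPES` list and the `disabled_scopes` helper are illustrative assumptions about how a dialog might compute this.

```python
from typing import List, Set

# Ordered from most specific to broadest, matching the precedence order.
SCOPES = ["server", "cluster", "group"]


def disabled_scopes(existing: Set[str]) -> List[str]:
    """Return scopes shadowed by a more specific existing override."""
    for index, scope in enumerate(SCOPES):
        if scope in existing:
            # Everything broader than the most specific override is shadowed.
            return SCOPES[index + 1:]
    return []  # no overrides exist, so every scope stays selectable
```

For example, `disabled_scopes({"server"})` returns `["cluster", "group"]`, and `disabled_scopes({"cluster"})` returns `["group"]`.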
Managing Overrides
Overrides are managed through the Alert Overrides tab in the server, cluster, or group edit dialogs. The override panel shows all alert rules with their current effective settings. Rules without an override at the current level appear dimmed to indicate that the setting is inherited from a higher level or the global default.
Enabling and Disabling Rules
Rules can be enabled or disabled globally or at any level of the hierarchy. A disabled rule is not evaluated during threshold checks. You can disable built-in rules that do not apply to your environment or enable rules that require specific PostgreSQL extensions.
To disable a rule globally, set default_enabled to
false in the rule definition. To disable a rule for
a specific scope, create an override with enabled set
to false.
Creating Custom Rules
Custom rules extend the built-in rule set with
organization-specific monitoring requirements. Custom
rules follow the same structure as built-in rules but
have is_built_in set to false.
When creating custom rules, consider the following guidelines:
- The metric must be collected by the collector.
- The metric name must match the collector's metric naming convention.
- The threshold should reflect your organization's operational requirements.
- The severity should match the impact of the condition.
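Some of these guidelines can be checked mechanically before a rule is saved. The following sketch assumes a custom rule is represented as a dictionary using the field names from the Rule Structure table; the `validate_custom_rule` function and the set of collected metric names passed to it are hypothetical.

```python
from typing import Dict, List, Set

VALID_OPERATORS = {">", ">=", "<", "<=", "==", "!="}
VALID_SEVERITIES = {"critical", "warning", "info"}


def validate_custom_rule(
    rule: Dict[str, object], collected_metrics: Set[str]
) -> List[str]:
    """Return a list of problems; an empty list means the rule looks valid."""
    problems: List[str] = []
    if rule.get("metric_name") not in collected_metrics:
        problems.append("metric_name is not collected by the collector")
    if rule.get("default_operator") not in VALID_OPERATORS:
        problems.append("default_operator is not a supported operator")
    if rule.get("default_severity") not in VALID_SEVERITIES:
        problems.append("default_severity must be critical, warning, or info")
    if not isinstance(rule.get("default_threshold"), (int, float)):
        problems.append("default_threshold must be numeric")
    if rule.get("is_built_in", False):
        problems.append("custom rules must have is_built_in set to false")
    return problems
```

Whether the threshold and severity suit your environment remains a judgment call that a check like this cannot make.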
Alert Lifecycle
When a threshold is violated, the alerter creates an
alert with status active. The alert remains active
until one of the following occurs:
- The condition resolves and the alerter clears the alert automatically.
- An operator acknowledges the alert manually.
- An operator marks the alert as a false positive.
The alerter updates the metric_value field of active
alerts on each evaluation cycle. This update reflects
the current value even if the threshold remains
violated.
Automatic Alert Clearing
The alerter automatically clears threshold alerts when
the triggering condition returns to normal. The alert
cleaner worker runs every 30 seconds and re-evaluates
active alerts. When a metric value no longer violates
the threshold, the alerter marks the alert as cleared
and records the cleared_at timestamp.
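One re-evaluation pass might look like the following sketch. The `Alert` dataclass, the `reevaluate` function, and the `current_values` mapping are assumptions for illustration; only the `active` and cleared states and the `metric_value` and `cleared_at` fields come from the description above.

```python
import operator
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional

COMPARE = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


@dataclass
class Alert:
    metric_name: str
    op: str
    threshold: float
    metric_value: float
    status: str = "active"
    cleared_at: Optional[datetime] = None


def reevaluate(
    alerts: Iterable[Alert],
    current_values: Dict[str, float],
    now: Optional[datetime] = None,
) -> None:
    """Refresh active alerts and clear those whose condition has resolved."""
    now = now or datetime.now(timezone.utc)
    for alert in alerts:
        if alert.status != "active":
            continue
        value = current_values.get(alert.metric_name)
        if value is None:
            continue  # assumption: no fresh sample leaves the alert untouched
        # Assumption: the latest sample is stored even on the clearing pass.
        alert.metric_value = value
        if not COMPARE[alert.op](value, alert.threshold):
            alert.status = "cleared"
            alert.cleared_at = now
```

In a worker, a function like this would be invoked on the 30-second interval; the scheduling itself is omitted here.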
Blackout Interaction
During an active blackout period, the alerter suppresses new alerts for the affected connection or database. Existing active alerts are not cleared during a blackout; the blackout only prevents new alerts from being created.
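The suppression check could be sketched as a guard that runs before a new alert is created. The `Blackout` type and `is_suppressed` function are hypothetical, and the sketch assumes that a blackout without a database name covers the whole connection.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Blackout:
    connection_id: int
    database_name: Optional[str]  # None = whole connection (assumption)
    start: datetime
    end: datetime


def is_suppressed(
    blackouts: Iterable[Blackout],
    connection_id: int,
    database_name: str,
    now: datetime,
) -> bool:
    """Return True when a new alert should not be created right now."""
    for blackout in blackouts:
        if blackout.connection_id != connection_id:
            continue
        if blackout.database_name not in (None, database_name):
            continue  # blackout targets a different database
        if blackout.start <= now < blackout.end:
            return True
    return False
```

The guard applies only to alert creation. Existing active alerts are not passed through it, which matches the behavior that a blackout does not clear them.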
Example Rule Configuration
In the following example, a rule monitors connection utilization:
name: High Connection Utilization
description: >
  Alerts when database connections exceed 80%
  of max_connections
category: connection
metric_name: connection_utilization_percent
default_operator: ">"
default_threshold: 80.0
default_severity: warning
default_enabled: true
In the following example, a group-level override adjusts the threshold for all servers in a development group:
rule_id: 1
scope: group
scope_id: 3
operator: ">"
threshold: 95.0
severity: info
enabled: true
In the following example, a server-level override further customizes the threshold for one production server with higher connection requirements:
rule_id: 1
scope: server
scope_id: 5
operator: ">"
threshold: 90.0
severity: warning
enabled: true
Related Documentation
- Notification Channels covers alert delivery configuration.
- Probes describes the data collection that feeds alert rules.