Alerter Architecture
The pgEdge AI DBA Workbench Alerter is a standalone background service that monitors collected metrics and generates alerts. The alerter evaluates threshold-based rules and uses AI-powered anomaly detection to identify potential issues in PostgreSQL databases.
Purpose
The alerter serves as the monitoring brain of the AI DBA Workbench. The collector gathers metrics from monitored PostgreSQL instances and stores the metrics in the datastore. The alerter periodically evaluates these metrics against configured rules and baselines to detect problems.
The alerter provides the following capabilities:
- The threshold engine evaluates metrics against configurable limits.
- The anomaly detection system identifies unusual metric patterns.
- The baseline calculator maintains statistical profiles for normal behavior.
- The blackout scheduler suppresses alerts during maintenance windows.
- The alert lifecycle manager tracks alert states and automatic resolution.
- The notification system sends alerts through multiple channels.
Key Concepts
Threshold Alerts
Threshold alerts trigger when a metric value crosses a configured boundary. Each alert rule specifies a metric name, comparison operator, and threshold value. The alerter includes 24 built-in rules covering common PostgreSQL monitoring scenarios such as connection utilization, replication lag, and disk usage.
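A minimal sketch of this comparison in Python (the rule fields and operator set here are illustrative, not the alerter's actual schema):

```python
from dataclasses import dataclass
import operator

# Hypothetical rule shape; the real alerter's rule schema may differ.
@dataclass
class AlertRule:
    metric: str
    op: str           # one of ">", ">=", "<", "<=", "=="
    threshold: float

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq}

def violates(rule: AlertRule, value: float) -> bool:
    """Return True when the metric value crosses the rule's boundary."""
    return _OPS[rule.op](value, rule.threshold)

rule = AlertRule(metric="connection_utilization_pct", op=">", threshold=90.0)
violates(rule, 95.2)  # True: 95.2% exceeds the 90% limit
```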
Anomaly Detection
The anomaly detection system uses a tiered approach to identify unusual metric values. Tier 1 performs statistical analysis using z-score calculations. Tier 2 searches for similar past anomalies using vector embeddings. Tier 3 uses LLM classification to determine if an anomaly is a real issue or a false positive. See Anomaly Detection for full details.
Baselines
The alerter calculates metric baselines from historical data. Baselines include statistical measures such as mean, standard deviation, minimum, and maximum values. The alerter generates three types of baselines:
- Global baselines aggregate all historical data for a metric.
- Hourly baselines capture patterns by hour of day.
- Daily baselines capture patterns by day of week.
Blackout Periods
Blackout periods suppress alert generation during scheduled maintenance windows. The blackout system supports both manual and scheduled blackouts across four hierarchical scope levels.
Scope Levels
Blackouts apply at four levels; each level cascades downward:
- An estate blackout suppresses alerts for all infrastructure.
- A group blackout suppresses alerts for every cluster in the group.
- A cluster blackout suppresses alerts for all servers in the cluster.
- A server blackout suppresses alerts for a single server only.
A blackout at a higher scope automatically applies to all children. For example, a group-level blackout silences alerts for every cluster and server within the group.
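The cascade can be sketched as a simple containment check: each server carries the ids of its ancestors, and a blackout covers a server when its scope id matches the corresponding level of that ancestry (the dictionary shapes here are assumptions for illustration):

```python
# Hierarchy: estate > group > cluster > server. A server record carries
# the id of every ancestor, so a match at any level covers the server.
def blackout_covers(blackout: dict, target: dict) -> bool:
    """True if the blackout silences alerts for the target server."""
    return target.get(blackout["scope"]) == blackout["id"]

server = {"estate": "e1", "group": "g1", "cluster": "c1", "server": "s1"}
blackout_covers({"scope": "group", "id": "g1"}, server)    # True
blackout_covers({"scope": "cluster", "id": "c2"}, server)  # False
```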
Manual Blackouts
A manual blackout defines a fixed time range with explicit start and end timestamps. Administrators create manual blackouts for one-time maintenance events such as upgrades or migrations.
Scheduled Blackouts
A scheduled blackout uses a cron expression to define recurring maintenance windows. The blackout scheduler activates these windows automatically at the specified times. See the Cron Expressions documentation for expression syntax details.
REST API Endpoints
The server exposes the following endpoints for blackout management:
- GET /api/v1/blackouts retrieves all active blackouts.
- POST /api/v1/blackouts creates a new manual blackout.
- DELETE /api/v1/blackouts/:id removes an existing blackout.
- GET /api/v1/blackout-schedules retrieves all schedules.
- POST /api/v1/blackout-schedules creates a recurring schedule.
- DELETE /api/v1/blackout-schedules/:id removes a schedule.
RBAC Requirements
The manage_blackouts permission controls access to blackout operations. Users without this permission can view blackout status but cannot create or delete blackouts.
Alert Lifecycle
Alerts progress through several states during their lifecycle:
- Active alerts indicate an ongoing condition requiring attention.
- Acknowledged alerts have been reviewed by an operator.
- Cleared alerts indicate the condition has resolved.
The alerter automatically clears threshold alerts when the triggering condition returns to normal.
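These transitions can be sketched as a small state table (the event names are illustrative):

```python
# Hypothetical transition table for the three documented alert states.
TRANSITIONS = {
    "active": {"acknowledge": "acknowledged", "clear": "cleared"},
    "acknowledged": {"clear": "cleared"},
    "cleared": {},  # terminal until retention cleanup removes the row
}

def transition(state: str, event: str) -> str:
    """Apply an event; invalid transitions raise to surface logic errors."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"cannot {event} an alert in state {state!r}") from None

transition("active", "acknowledge")  # "acknowledged"
```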
High-Level Architecture
The alerter consists of a main engine that coordinates multiple background workers. Each worker handles a specific responsibility:
Alert Engine (Coordinator)
|
+-----------+---------+---------+-----------+
| | | | |
v v v v v
Threshold Baseline Anomaly Blackout Alert
Evaluator Calculator Detector Scheduler Cleaner
| | | | |
+-----------+---------+---------+-----------+
|
Datastore
(PostgreSQL)
Engine Components
Alert Engine
The alert engine serves as the central coordinator. The engine initializes all workers, manages configuration reloading, and handles graceful shutdown. The engine creates a cancellable context that workers use to detect shutdown requests.
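The coordinator pattern can be sketched in Python with a shared stop event standing in for the cancellable context (the worker names and intervals are illustrative):

```python
import threading
import time

def worker(name: str, interval: float, stop: threading.Event, log: list):
    # stop.wait(interval) returns False on timeout (a normal tick) and
    # True once the engine sets the event, ending the loop -- analogous
    # to a worker watching a cancellable context.
    while not stop.wait(interval):
        log.append(f"{name}: evaluation cycle")

stop = threading.Event()
log = []
threads = [threading.Thread(target=worker, args=(n, 0.01, stop, log))
           for n in ("threshold-evaluator", "alert-cleaner")]
for t in threads:
    t.start()
time.sleep(0.05)
stop.set()            # the shutdown request reaches every worker
for t in threads:
    t.join()
```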
Threshold Evaluator
The threshold evaluator runs at a configurable interval, defaulting to 60 seconds. During each evaluation cycle, the evaluator performs these steps:
- Retrieve all enabled alert rules from the datastore.
- For each rule, fetch the latest metric values.
- Check for active blackouts that would suppress alerts.
- Retrieve effective thresholds including per-connection overrides.
- Compare metric values against thresholds.
- Create or update alerts for threshold violations.
Baseline Calculator
The baseline calculator refreshes metric baselines at a configurable interval, defaulting to one hour. The calculator generates three types of baselines:
- Global baselines aggregate all historical data.
- Hourly baselines capture patterns for each hour of the day.
- Daily baselines capture patterns for each day of the week.
The calculator uses a configurable lookback period, defaulting to 7 days, to gather historical data for baseline calculations.
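Hourly baseline aggregation might look like the following sketch (the `(hour, value)` sample shape is an assumption):

```python
import statistics
from collections import defaultdict

def hourly_baselines(samples):
    """samples: iterable of (hour_of_day, value). Returns per-hour stats."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {
        hour: {
            "mean": statistics.fmean(vals),
            "stddev": statistics.pstdev(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for hour, vals in buckets.items()
    }

baselines = hourly_baselines([(9, 100), (9, 110), (14, 300), (14, 320)])
baselines[9]["mean"]  # 105.0
```

Global and daily baselines follow the same pattern with a different bucketing key (a constant key, or day of week).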
Anomaly Detector
The anomaly detector implements a tiered detection system:
- Tier 1 uses z-score calculations to identify statistical anomalies.
- Tier 2 uses vector embeddings to find similar past anomalies.
- Tier 3 uses LLM classification to decide whether to alert or suppress.
The detector creates anomaly candidates that progress through each tier. The final decision determines whether to create an alert or suppress the anomaly as a false positive.
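Tier 1 can be sketched as a plain z-score check against a stored baseline (the 3-sigma cutoff here is illustrative, not the alerter's configured value):

```python
def z_score(value: float, mean: float, stddev: float) -> float:
    if stddev == 0:
        return 0.0  # a flat baseline cannot produce a meaningful z-score
    return (value - mean) / stddev

def is_tier1_anomaly(value: float, baseline: dict, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(z_score(value, baseline["mean"], baseline["stddev"])) > threshold

baseline = {"mean": 100.0, "stddev": 10.0}
is_tier1_anomaly(145.0, baseline)  # True: z = 4.5
is_tier1_anomaly(108.0, baseline)  # False: z = 0.8
```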
Blackout Scheduler
The blackout scheduler runs every minute to check for scheduled blackouts. The scheduler evaluates cron expressions against the current time in the configured timezone. When a schedule matches, the scheduler creates a manual blackout entry with the configured duration.
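A minimal matcher for plain five-field cron expressions illustrates the per-minute check (real cron implementations also support ranges, steps, and lists):

```python
from datetime import datetime

def cron_matches(expr: str, now: datetime) -> bool:
    """Match a five-field cron expression where each field is '*' or a number."""
    fields = expr.split()
    # minute, hour, day of month, month, day of week (0 = Sunday)
    actual = [now.minute, now.hour, now.day, now.month, now.isoweekday() % 7]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))

# "0 2 * * 0" = 02:00 every Sunday
cron_matches("0 2 * * 0", datetime(2024, 1, 7, 2, 0))  # True: Jan 7 2024 is a Sunday
```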
Alert Cleaner
The alert cleaner runs every 30 seconds to check for resolved conditions. The cleaner retrieves active threshold alerts and re-evaluates the triggering conditions. When a condition no longer violates the threshold, the cleaner marks the alert as cleared.
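One cleaner pass can be sketched as follows (the alert and rule dictionary shapes are assumptions):

```python
def violates(rule: dict, value: float) -> bool:
    # Simplified to two operators for the sketch.
    return value > rule["threshold"] if rule["op"] == ">" else value < rule["threshold"]

def clean_pass(active_alerts: list, latest: dict) -> list:
    """Re-evaluate each active alert against the latest metric values and
    mark the ones that no longer violate their rule as cleared."""
    cleared = []
    for alert in active_alerts:
        rule = alert["rule"]
        if not violates(rule, latest[rule["metric"]]):
            alert["status"] = "cleared"
            cleared.append(alert["id"])
    return cleared

alerts = [{"id": 1, "status": "active",
           "rule": {"metric": "replication_lag_s", "op": ">", "threshold": 30}}]
clean_pass(alerts, {"replication_lag_s": 5})  # [1]: lag is back under 30s
```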
Retention Manager
The retention manager runs daily to clean up old data. The manager deletes cleared and acknowledged alerts older than the configured retention period. The manager also removes processed anomaly candidates past retention.
Notification Workers
The notification system includes two workers:
- The notification worker processes pending and retry notifications.
- The reminder worker sends periodic reminders for active alerts.
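The retry policy is not specified here; a common choice is exponential backoff with a cap, sketched below (the base delay and cap are assumptions, not the alerter's configured values):

```python
def next_retry_delay(attempt: int, base: float = 30.0, cap: float = 3600.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based):
    the delay doubles per attempt until it hits the cap."""
    return min(base * (2 ** (attempt - 1)), cap)

[next_retry_delay(a) for a in (1, 2, 3)]  # [30.0, 60.0, 120.0]
```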
Data Flow
Metric Evaluation Flow
Collector --writes--> Datastore <--reads-- Alerter --writes--> Alerts
The collector writes metrics to the datastore. The alerter reads those metrics and evaluates them against its configured rules. When a threshold is violated, the alerter writes an alert to the alerts table.
Anomaly Detection Flow
Tier 1 --> Candidate --> Tier 2 --> Tier 3 --> Alert or Suppress
(z-score) (store) (embedding) (LLM)
Tier 1 creates anomaly candidates for values exceeding the z-score threshold. These candidates are stored and processed by Tier 2, which generates embeddings and searches for similar past anomalies. Tier 3 uses LLM classification to make the final decision.
Database Schema
The alerter uses several tables in the datastore.
Alert Tables
- alerts stores all triggered alerts with their current status.
- alert_rules defines threshold-based alert rules.
- alert_thresholds stores per-connection threshold overrides.
- alert_acknowledgments records user acknowledgments.
Anomaly Tables
- anomaly_candidates stores candidates progressing through tiers.
- anomaly_embeddings stores vector embeddings for similarity search.
- metric_baselines stores calculated baseline statistics.
Blackout Tables
- blackouts stores active and historical manual blackouts.
- blackout_schedules stores recurring blackout schedules.
Notification Tables
- notification_channels defines notification destinations.
- notification_channel_connections links channels to connections.
- notification_history tracks notification delivery status.
- notification_reminder_state tracks reminder progress.
LLM Integration
The alerter integrates with LLM providers for Tier 2 and Tier 3 processing.
Embedding Providers
Embedding providers generate vector representations of anomaly context. The alerter supports the following providers:
- Ollama with models like nomic-embed-text.
- OpenAI with text-embedding-3-small.
- Voyage with voyage-3-lite.
Reasoning Providers
Reasoning providers classify anomalies as real issues or false positives. The alerter supports the following providers:
- Ollama with models like qwen2.5:7b-instruct.
- OpenAI with gpt-4o-mini.
- Anthropic with claude-3-5-haiku.
Configuration Reloading
The alerter supports configuration reloading without a restart. Sending a SIGHUP signal triggers the engine to reload the configuration file and apply reloadable settings to all workers.
Graceful Shutdown
The alerter handles SIGINT and SIGTERM signals for graceful shutdown. When the alerter receives a shutdown signal, the engine cancels the shared context and waits for all workers to complete their current operations before exiting.
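On a Unix system, the signal wiring for both reload and shutdown can be sketched as follows (the event objects stand in for the engine's real reload and context-cancellation paths):

```python
import signal
import threading

stop = threading.Event()
reload_requested = threading.Event()

# SIGHUP asks for a configuration reload; SIGINT/SIGTERM both funnel
# into the same stop event, mirroring the engine cancelling its
# shared worker context.
signal.signal(signal.SIGHUP, lambda signum, frame: reload_requested.set())
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, lambda signum, frame: stop.set())

signal.raise_signal(signal.SIGHUP)  # operator requests a config reload
reload_requested.is_set()           # True
```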
Integration Points
The alerter integrates with other AI DBA Workbench components:
- The collector provides the metric data that the alerter evaluates.
- The server exposes APIs for managing alert rules and viewing alerts.
- The client displays alerts and provides acknowledgment interfaces.
- LLM providers power the Tier 3 anomaly classification.