Anomaly Detection
The alerter includes an AI-powered anomaly detection system that identifies unusual patterns in metric data. The system uses a tiered approach to balance detection accuracy with computational efficiency.
Overview
Anomaly detection complements threshold-based alerting by identifying conditions that deviate from normal behavior without requiring explicit thresholds. This approach is valuable when normal values vary over time or when the expected range is not well understood.
The anomaly detection system provides the following capabilities:
- Statistical analysis identifies values that deviate from baselines.
- Embedding similarity finds patterns matching known anomalies.
- LLM classification determines if anomalies require attention.
- Historical learning improves accuracy over time.
Tiered Architecture
The anomaly detection system uses three tiers:
Tier 1: Statistical Analysis (z-score)
- Fast, runs on every evaluation cycle
- Creates candidates for values exceeding threshold
|
v
Tier 2: Embedding Similarity (pgvector)
- Generates vector embeddings for anomaly context
- Searches for similar past anomalies
- May suppress based on similarity to false positives
|
v
Tier 3: LLM Classification
- Analyzes anomaly context with reasoning model
- Determines alert or suppress decision
- Provides reasoning for the decision
Tier 1: Statistical Analysis
Tier 1 performs z-score analysis to identify statistical outliers. The z-score measures how many standard deviations a value is from the mean.
Z-Score Calculation
The z-score formula is:
z-score = (current_value - baseline_mean) / baseline_stddev
A high absolute z-score indicates the value is unusual relative to the baseline. The default sensitivity threshold is 3.0, meaning values more than 3 standard deviations from the mean are flagged as candidates.
Configuration
Tier 1 settings are configured in the anomaly.tier1 section:
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable Tier 1 detection |
default_sensitivity |
3.0 |
Z-score threshold |
evaluation_interval_seconds |
60 |
Evaluation interval |
Baseline Selection
The alerter selects the most appropriate baseline for each evaluation:
- If an hourly baseline exists for the current hour, the alerter uses the hourly baseline.
- If a daily baseline exists for the current day, the alerter uses the daily baseline.
- Otherwise, the alerter uses the global baseline.
This approach accounts for time-based patterns in metric values.
Variance Floor and Warmup Gate
Tier 1 applies two pre-checks before the z-score comparison.
The first protects the divisor from collapsing below a sensible
floor; the second suppresses detection on baselines that have
not yet observed enough data to be trustworthy. Both checks
live in alerter/src/internal/engine/anomalies.go;
the effectiveStdDev helper implements the floor, and the
isBaselineWarm helper implements the warmup gate.
Motivating Failure Mode
On a young datastore with roughly 26 hours of data, the alerter
produced 5,765 anomaly candidates in 24 hours;
pg_stat_activity.max_query_duration_seconds averaged a |z| of
609 with a peak of 5,862. The peak value is not anomaly
detection; it is diagnostic of a baseline whose stored standard
deviation had collapsed several orders of magnitude below the
metric's natural variation. The existing stddev == 0 guard
correctly skipped baselines with an exactly zero divisor, but
it could not catch the near-zero case that produced the runaway
scores.
Hybrid Variance Floor
The effectiveStdDev helper raises the divisor to a hybrid
floor before the z-score is computed. The floor combines a
relative term, scaled by the absolute baseline mean, and an
absolute term that acts as a safety net when the mean
approaches zero.
relative_floor = |baseline.mean| * variance_floor.relative_pct
floor = max(relative_floor, variance_floor.absolute_floor)
effective_stddev = max(baseline.stddev, floor)
The defaults are relative_pct = 0.05 and absolute_floor =
0.001. The relative term dominates for most non-zero metrics,
while the absolute term keeps small-mean metrics from
collapsing the floor to zero. When both knobs are zero, the
floor collapses and the existing stddev == 0 guard handles
the degenerate case.
Warmup Gate Semantics
The isBaselineWarm helper gates detection on two conditions
per period_type: a minimum sample count and a minimum
wall-clock span between the earliest recorded sample and now.
Both must hold for the baseline to be considered warm. Each of
the three period_type values (all, hourly, and daily)
carries its own threshold pair, configured under
anomaly.tier1.warmup in the alerter YAML.
The gate fails closed in two distinct ways. When
SampleCount falls below MinSamples, the gate reports the
baseline cold and detection is skipped. When MinSpanHours is
greater than zero and EarliestSampleAt is the zero value,
the gate also reports cold; this covers rows written before
the metric_baselines.earliest_sample_at column was added and
fallback baselines that lack raw sample timestamps. Setting
both min_samples: 0 and min_span_hours: 0 for a
period_type disables warmup suppression for that type; the
zero value on MinSpanHours also skips the
EarliestSampleAt.IsZero() check, so a stale row does not
spuriously fail closed when the operator has explicitly opted
out.
An unrecognised period_type falls back to the daily
thresholds, which are the strictest of the three defaults.
This is defensive only; the column is enum-constrained at
write time.
Z-Score Cap
After the floored divisor produces a z-score, Tier 1 clamps
the signed value to ±max_z_score when the cap is positive. A
cap of zero disables the clamp. The cap applies symmetrically
to both extreme positive and extreme negative scores, so a
runaway divisor cannot drive either tail beyond the configured
bound.
Tier 2: Embedding Similarity
Tier 2 uses vector embeddings to find similar past anomalies. This tier helps the system learn from historical decisions.
Embedding Generation
The alerter builds a text representation of the anomaly context including:
- The metric name and current value.
- The z-score and deviation direction.
- The connection and database identifiers.
- The baseline statistics.
An embedding provider converts this text into a high-dimensional
vector. The alerter stores these embeddings in the
anomaly_embeddings table.
Similarity Search
The alerter searches for similar past anomalies using vector similarity. The search uses cosine distance through the pgvector extension. Results include past anomalies above the similarity threshold along with their final decisions.
Suppression Logic
Tier 2 may suppress an anomaly based on similar past anomalies:
- If the most similar anomaly was suppressed and similarity exceeds the suppression threshold, the current anomaly is also suppressed.
- If the most similar anomaly was alerted and similarity exceeds the threshold, the current anomaly is passed to Tier 3.
- If similarity is below the threshold, the anomaly proceeds to Tier 3.
Configuration
Tier 2 settings are configured in the anomaly.tier2 section:
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable Tier 2 detection |
suppression_threshold |
0.85 |
Similarity threshold for suppression |
similarity_threshold |
0.3 |
Minimum similarity to consider a match |
Tier 3: LLM Classification
Tier 3 uses LLM reasoning to classify uncertain anomalies. The LLM receives context about the anomaly and similar past anomalies, then determines whether to alert or suppress.
Classification Prompt
The alerter builds a classification prompt containing:
- The current anomaly details including metric, value, and z-score.
- Baseline statistics for context.
- Similar past anomalies with their decisions.
- Instructions for the classification response.
Response Parsing
The alerter expects a JSON response with:
decision: eitheralertorsuppress.confidence: a value from 0 to 1.reasoning: an explanation of the decision.
If the response cannot be parsed as JSON, the alerter falls back to keyword matching in the response text.
Fail-Safe Behavior
When the LLM call fails, the alerter defaults to alerting. This fail-safe ensures that potential issues are not missed due to LLM unavailability.
Configuration
Tier 3 settings are configured in the anomaly.tier3 section:
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable Tier 3 detection |
timeout_seconds |
30 |
Timeout for LLM API calls |
LLM Provider Configuration
The alerter supports multiple LLM providers for embeddings and reasoning.
Embedding Providers
Configure the embedding provider in the llm section:
| Provider | Model | Dimensions |
|---|---|---|
| Ollama | nomic-embed-text |
768 (resized to 1536) |
| OpenAI | text-embedding-3-small |
1536 |
| Voyage | voyage-3-lite |
1024 (resized to 1536) |
Reasoning Providers
Configure the reasoning provider in the llm section:
| Provider | Model |
|---|---|
| Ollama | qwen2.5:7b-instruct |
| OpenAI | gpt-4o-mini |
| Anthropic | claude-3-5-haiku-20241022 |
| Gemini | gemini-2.0-flash |
Example Configuration
In the following example, the configuration uses Ollama for local LLM processing:
llm:
embedding_provider: ollama
reasoning_provider: ollama
ollama:
base_url: http://localhost:11434
embedding_model: nomic-embed-text
reasoning_model: qwen2.5:7b-instruct
In the following example, the configuration uses OpenAI for cloud-based processing:
llm:
embedding_provider: openai
reasoning_provider: openai
openai:
api_key_file: /etc/ai-workbench/openai-api-key.txt
embedding_model: text-embedding-3-small
reasoning_model: gpt-4o-mini
Baseline Calculation
The anomaly detection system depends on accurate baselines. The baseline calculator runs periodically to refresh baselines from historical data.
Baseline Types
The alerter calculates three baseline types:
allbaselines aggregate all historical values.hourlybaselines group values by hour of day (0-23).dailybaselines group values by day of week (0-6).
Baseline Statistics
Each baseline stores the following values:
mean: the average value.stddev: the standard deviation.min: the minimum observed value.max: the maximum observed value.sample_count: the number of samples.
Lookback Period
The baseline calculator uses a configurable lookback period to gather historical data. The default is 7 days. A longer lookback period provides more stable baselines but may not reflect recent changes in workload.
Enabling and Disabling
Anomaly detection can be enabled or disabled at multiple levels:
- Globally through the
anomaly.enabledconfiguration option. - Per-tier through each tier's
enabledoption. - Per-metric through the metric definition's
anomaly_enabledflag.
Disabling Tier 2 causes candidates to pass directly to Tier 3. Disabling Tier 3 causes all Tier 1 candidates that pass Tier 2 to generate alerts.
Monitoring Anomaly Detection
The alerter logs anomaly detection activity at debug level. Enable debug logging to see:
- Tier 1 candidates created for z-score violations.
- Tier 2 similarity search results.
- Tier 3 LLM classification decisions.
- Final decisions and alert creation.
The anomaly_candidates table stores all candidates with their
tier results and final decisions. You can query this table to
analyze anomaly detection effectiveness.