Changelog
All notable changes to the pgEdge AI DBA Workbench are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[1.0.0] - 2026-06-08
This release is the first general-availability release of the pgEdge AI DBA Workbench, graduating the project from beta to a stable, supported 1.0 line.
Added
- Add a password_file option to the server's database: YAML block, allowing the server to read the datastore password from a file. The server uses this file only when no inline password value and no CLI password flag are set, bringing the server in line with the collector and alerter. (#267)
Changed
- Unify how the server, collector, and alerter read secrets from files onto a single hardened helper. The change covers database and user passwords, service tokens, LLM and embedding API keys, server and notification secrets, proxy header values, and binary encryption keys. Several behaviours below are backward-incompatible, so operators upgrading the collector or alerter should review them before rolling out the new binaries. (#267)
- Trim only the trailing newline from a secret file, rather than stripping all surrounding whitespace. This applies to passwords, tokens, API keys, server and notification secrets, and proxy header values. Secrets that legitimately contain leading, trailing, or interior spaces are now preserved verbatim. Operational: operators who relied on the old whitespace-stripping behaviour may need to re-create affected secret files so the stored value matches the intended secret. (#267)
- Treat a configured-but-empty secret or password file as a hard startup error, instead of silently treating the file as "no value". Previously an empty file fell through to .pgpass or a passwordless connection, which could mask a misconfigured deployment. Operators must ensure every configured secret file contains a value before starting the new binaries. (#267)
- Accept any owner-only permission mode, such as 0400 or 0600, for the binary key loaders that read the server encryption key and the alerter notification secret key, instead of requiring exactly 0600. Group- or world-accessible modes such as 0640 and 0644 are still rejected. (#267)
- Expand a leading ~ in a secret or server-secret file path to the user's home directory consistently across all components. (#267)
- Allow the server's -db-password-file flag to accept paths containing "..", so legitimate relative paths now work. The flag previously rejected any path containing "..". (#267)
- Raise the web client's minimum Node.js version to 20.19 and enforce it through an engines field in the client package.json (^20.19.0 || >=22.12.0), with a matching .nvmrc. The client's build tooling, including Vite 8 and the @csstools packages, requires Node.js 20.19 or later, so older releases such as Node.js 18 failed the build with a cryptic error. The documented prerequisite now states the supported range (Node.js 20.19 or later on the 20.x line, or 22.12 or later) everywhere it appears. (#272)
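The secret-file rules in the #267 entries above (expand a leading ~, trim only the trailing newline, fail on an empty file) can be illustrated with a minimal Go sketch. The function names expandHome, trimSecret, and readSecretFile are hypothetical and are not taken from the project's helper; the sketch also assumes that "trailing newline" means a single LF.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading "~" with the current user's home directory.
// Hypothetical helper name; other paths are returned unchanged.
func expandHome(path string) (string, error) {
	if path != "~" && !strings.HasPrefix(path, "~/") {
		return path, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, strings.TrimPrefix(path, "~")), nil
}

// trimSecret removes one trailing LF and nothing else, so leading, trailing,
// and interior spaces survive. An empty result is an error (assumed wording).
func trimSecret(raw []byte) (string, error) {
	s := strings.TrimSuffix(string(raw), "\n")
	if s == "" {
		return "", errors.New("secret file is empty")
	}
	return s, nil
}

// readSecretFile combines the two steps above for a configured path.
func readSecretFile(path string) (string, error) {
	full, err := expandHome(path)
	if err != nil {
		return "", err
	}
	raw, err := os.ReadFile(full)
	if err != nil {
		return "", fmt.Errorf("read secret file %s: %w", full, err)
	}
	return trimSecret(raw)
}

func main() {
	secret, err := trimSecret([]byte("  pa ss  \n"))
	fmt.Printf("%q %v\n", secret, err)

	_, err = readSecretFile("~/does-not-exist.secret")
	fmt.Println("missing file is an error:", err != nil)
}
```

Returning an error for the empty case, rather than an empty string, is what lets a caller turn a configured-but-empty file into a startup failure instead of falling through to another credential source.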
Fixed
- Fix the AI assistant "Ellie" and the client-side analysis prompts generating SQL against the old combined pg_stat_bgwriter columns such as checkpoints_timed, checkpoints_req, and buffers_checkpoint, which fail on PostgreSQL 17 and later. Those checkpoint statistics moved into the new pg_stat_checkpointer view in PostgreSQL 17. The system prompts now instruct the model to query pg_stat_checkpointer on PostgreSQL 17 and later and the combined pg_stat_bgwriter view on PostgreSQL 16 and earlier, choosing the view based on the target server's version. (#286)
- Fix the get_metric_baselines MCP tool returning no baselines when the metric_name filter used an unqualified shorthand such as cache_hit_ratio, because the tool matched the stored fully-qualified name such as pg_stat_database.cache_hit_ratio only on an exact match. The filter now applies a parameterised, case-insensitive containment match, with literal % and _ matched literally, and a filter that matches nothing returns the available fully-qualified metric names within the caller's RBAC scope so the model can retry. (#287)
- Fix the alerter failing to load its server secret when secret_file was unset, with no fallback search. The alerter now resolves a default secret file the way the collector and server already do; when secret_file is not set, it searches the per-user config directory first and then /etc/pgedge. This brings the three services to parity and makes the search order documented in examples/ai-dba-alerter.yaml accurate. (#291)
- Fix the alerter's Gemini provider lacking a configurable embedding model. The llm.gemini section now exposes an embedding_model option that defaults to gemini-embedding-001; the alerter previously offered no embedding model field for Gemini. When Gemini is the embedding provider, the configured model must match the model that the KB Builder used to produce the knowledgebase. (#284)
- Fix the documented Gemini embedding model list, which referenced the text-embedding-004 and embedding-001 models that Google has retired. The supported models are now gemini-embedding-001 (the default), gemini-embedding-2, and gemini-embedding-2-preview, all of which output 3072 dimensions rather than the previously documented 768. The server's knowledgebase status log now reports the Gemini API key as loaded. (#285)
- Fix the Admin Panel's Create Group, Edit Group, and Create Token dialogs accepting unsupported special characters and over-long Name values. RBAC group and token names now reuse the app's shared name validator on both the client and the server: after trimming, a name must be non-empty, at most 255 characters, and may contain only letters, numbers, spaces, and the five punctuation characters . _ - ( ). Creating a group with a name that already exists now returns a clear "A group with this name already exists" conflict (HTTP 409) instead of a generic server error. (#273)
- Fix the Admin Panel's Create User and Edit User dialogs accepting invalid usernames and email addresses and surfacing unhelpful errors. A username with special characters, an invalid email such as notanemail or test@, or a duplicate username previously returned a generic server error or was stored silently. The client and server now apply matching rules: the server validates the username, validates the email format, and returns HTTP 400 with the specific rule, or HTTP 409 "Username already taken" for a duplicate. The dialogs show inline, field-level errors and block the save until the input is valid. (#271)
- Resolve two moderate-severity npm vulnerabilities reported in the web client's dependency tree by updating the affected transitive dependencies. (#272)
- Fix the server silently swallowing read errors for configured LLM, embedding, and knowledgebase API-key files. The server previously proceeded with no key when such a file was unreadable or empty; an unreadable or empty configured key file now fails loudly at startup. Leaving a key-file path unset remains valid and is not an error. (#267)
- Fix name fields for server groups, clusters, and servers accepting special characters such as <>!@#$%. Validation previously rejected only empty input, so disallowed characters were stored unchanged. The server and client now apply the same rule: after trimming, a name must be non-empty, at most 255 characters, and may contain only letters, numbers, spaces, and the five punctuation characters . _ - ( ). The server returns HTTP 400 for invalid names across all cluster-group, cluster, and server-connection create and update endpoints, and the client shows an inline validation error and blocks the save. (#269)
- Fix the alerter intermittently and silently suppressing anomaly alerts that should have fired. The Tier 3 LLM-based classifier parsed model responses non-deterministically, so the same response could resolve to either "alert" or "suppress" from one run to the next. The parser now strips markdown code fences and extracts an embedded JSON object before falling back to keyword matching, and the keyword fallback applies a deterministic, fail-safe precedence (alert before suppress, keep before clear) instead of iterating a randomly ordered map. An over-broad suppress keyword that matched phrases such as "deviation from normal behavior" was also tightened. The same anomaly condition now produces the same decision every time, so genuine alerts fire reliably. (#264)
- Fix the Add Cluster Group, Add Cluster, and Add Server dialogs accepting Name, Host, Maintenance Database, and Username values longer than 255 characters, which the database rejected through its VARCHAR(255) columns and surfaced as a generic, unhelpful server error. The web client now caps those fields at 255 characters so over-length input cannot be submitted, and the server validates their length and returns a clear 400 response (for example "Name must be 255 characters or less") instead of a generic 500. The server counts characters rather than bytes to match the VARCHAR(255) limit. (#270)
- Fix Ellie, the AI chat assistant, hanging in an effectively infinite tool-validation loop for regular, non-superuser users. When a regular user selected a connection shared with them and asked Ellie to describe it, the query-validation tool kept succeeding while the tools that execute the query failed with an access error. The assistant retried slightly different queries indefinitely, repeatedly showing "Validating query" and never responding. The chat loop now applies a repeated-failure circuit breaker: when the same tool fails with the same error three times, the assistant stops retrying and returns a clear message. The message names the failing tool, shows the underlying error, and suggests a likely permissions or connection-access problem to raise with an administrator. (#268)
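The repeated-failure circuit breaker described for #268 can be sketched as a counter keyed on the tool name and error text. This is an illustration only: the breaker type, its method names, and the tool name query_database are hypothetical, and the real chat loop may track failures differently.

```go
package main

import "fmt"

// failureKey identifies one (tool, error) combination.
type failureKey struct {
	tool   string
	errMsg string
}

// breaker counts identical tool failures within a single chat exchange.
type breaker struct {
	limit  int
	counts map[failureKey]int
}

func newBreaker(limit int) *breaker {
	return &breaker{limit: limit, counts: make(map[failureKey]int)}
}

// recordFailure returns true once the same tool has failed with the same
// error at least limit times, signalling that the loop should stop retrying.
func (b *breaker) recordFailure(tool, errMsg string) bool {
	k := failureKey{tool: tool, errMsg: errMsg}
	b.counts[k]++
	return b.counts[k] >= b.limit
}

func main() {
	b := newBreaker(3) // three identical failures, as stated in the entry above
	for attempt := 1; attempt <= 5; attempt++ {
		if b.recordFailure("query_database", "access denied") {
			fmt.Printf("stopping after %d identical failures\n", attempt)
			break
		}
	}
}
```

Keying on both the tool and the error text means a tool that fails for different reasons on each attempt does not trip the breaker prematurely.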
Security
- Log a warning recommending chmod 600 when reading any secret file from a group- or world-readable path. The read still succeeds, so existing deployments continue to work while operators tighten the affected file permissions. (#267)
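The permission rules in this release (a warning for ordinary secret files, a hard rejection for the binary key loaders) both reduce to testing the group and world bits of the file mode. The sketch below assumes the check masks all group and world bits (0o077); the function names and the strict parameter are hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
)

// ownerOnly reports whether no group or world permission bits are set,
// which accepts modes such as 0400 and 0600 and excludes 0640 and 0644.
func ownerOnly(mode fs.FileMode) bool {
	return mode.Perm()&0o077 == 0
}

// checkSecretMode returns a warning for a permissive mode, or an error when
// strict is true (assumed here to model the binary key loaders).
func checkSecretMode(path string, mode fs.FileMode, strict bool) (string, error) {
	if ownerOnly(mode) {
		return "", nil
	}
	if strict {
		return "", fmt.Errorf("%s has mode %04o; group or world access is not allowed",
			path, uint32(mode.Perm()))
	}
	return fmt.Sprintf("%s has mode %04o; consider chmod 600", path, uint32(mode.Perm())), nil
}

func main() {
	for _, m := range []fs.FileMode{0o400, 0o600, 0o640, 0o644} {
		fmt.Printf("%04o owner-only=%v\n", uint32(m), ownerOnly(m))
	}
	warning, err := checkSecretMode("db.password", 0o644, false)
	fmt.Println(warning, err)
}
```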
[1.0.0-beta3] - 2026-05-26
Added
- Add a hybrid variance floor and a per-period_type warmup gate to the alerter's Tier 1 anomaly detector, plus a symmetric z-score cap as defence in depth. Three new YAML blocks under anomaly.tier1 expose the controls: max_z_score (default 100.0, set to 0 to disable the clamp), variance_floor with relative_pct (default 0.05) and absolute_floor (default 0.001), and warmup with min_samples and min_span_hours pairs for all, hourly, and daily baselines. A new metric_baselines.earliest_sample_at column is added by an idempotent ALTER TABLE in the consolidated collector schema migration; the column is populated automatically by the alerter's baseline builder at the next refresh, and the warmup gate fails closed on NULL values until the column is populated. The anomaly_candidates and alerts tables are deliberately left untouched to preserve audit history. Operational: operators upgrading an existing deployment must run TRUNCATE TABLE metric_baselines after stopping the alerter and before starting the new binary, so the detector rebuilds baselines under the new logic from a clean slate.
- Add a startup datastore schema health check to the MCP server. The server now reads the collector-owned schema_version table and probes a small set of critical dashboard tables before wiring any handlers; a missing or pre-v4 collector schema, or a partial drop of any probed relation, surfaces an operator-actionable error that names the affected datastore and the recommended action, then exits non-zero. This replaces the previous silent failure mode where the server would come up against an empty database and return HTTP 500 on every dashboard endpoint.
- Add the get_timeline_events MCP tool, which exposes the unified incident-investigation timeline that was previously only reachable through the /api/v1/timeline/events REST endpoint. The tool returns configuration changes, HBA and ident edits, server restarts, extension changes, alert fired/cleared/acknowledged events, and blackout start/end markers in a single TSV result ordered most recent first. Ellie and other MCP clients can now correlate alerts with the underlying changes that may have caused them without falling back to direct REST calls. The tool gates results through the same RBAC checks as get_alert_history. (#250)
Security
- Fix the server creating the SQLite auth.db file with world-readable mode 0644 on first start, exposing bcrypt password hashes and token hashes to other local users on the host. The previous os.Chmod(0600) call ran before the modernc.org/sqlite driver had created the file on disk, so the chmod silently failed with ENOENT and the file was left at the operating system's umask default. The fix moves the chmod after the PRAGMA foreign_keys query, which forces the driver to round-trip and create the file, and adds a post-chmod os.Stat verification that logs a warning if the resulting mode is not 0600. The server applies the chmod unconditionally on every startup, so an existing auth.db left at 0644 by an earlier release is re-narrowed automatically and no manual operator action is required. On filesystems that silently ignore chmod (for example FAT or some FUSE mounts), the server now refuses to start and operators should consult the WARNING log line that reports the observed mode. (#249)
Added
- Add Gemini as a supported embedding provider in the embedding and knowledgebase server configurations; the provider supports text-embedding-004 (default) and embedding-001, both at 768 dimensions. (#246)
Changed
- Update the default Gemini chat model in the alerter and server configurations from gemini-2.0-flash, which is no longer available to new API users, to gemini-2.5-flash. (#246)
Fixed
- Fix the get_timeline_events MCP tool emitting a flat "Alert condition no longer active" summary on alert_cleared rows. A reviewer scanning the timeline was left to read the vivid alert_fired summary (carrying the alert's frozen description text) and risked misreading an already resolved alert as still active. Cleared rows now carry a self-contained summary of the form "Resolved after <duration>. Fired: <original alert description>", where <duration> renders as Ns, Nm Ms, or Nh Mm; the underlying alerts.description column is preserved unchanged, so this is a presentation change in the timeline tool rather than a rewrite of historical alert text. The tool description visible to MCP clients now also states that alert_fired.summary describes the firing condition and not the current state, and instructs clients to pair each alert_fired row with its corresponding alert_cleared row (matching title prefix 'Alert Cleared: ') before treating an alert as ongoing.
- Fix the collector entering a restart loop when its consolidated schema migration ran against a partially populated datastore. PostgreSQL has no ALTER TABLE ... ADD CONSTRAINT IF NOT EXISTS, so a re-run that found a pre-existing foreign key (such as fk_anomaly_candidates_embedding on anomaly_candidates) raised duplicate_object, aborted the surrounding pgx.Tx, and the next statement returned SQLSTATE 25P02 ("current transaction is aborted, commands ignored until end of transaction block") with a misleading error pointing at an unrelated table. The migration now guards every ADD CONSTRAINT with a catalog check against pg_constraint, runs the optional pgvector blocks inside SAVEPOINTs so a missing extension or schema drift cannot poison the parent transaction, and surfaces real failures with a return instead of swallowing them through logger.Infof. Fresh installs are unaffected; the change matters only for recovery cases.
- Fix the web client crashing into the "Something went wrong" error boundary after the user deleted an empty cluster. The server marshaled an empty clusters list (and the nested servers list inside each cluster) as JSON null rather than [], and MainLayout's selection useMemo then called .some on the null value without a guard. The server's GetClusterTopology now normalises both lists to a non-nil empty array, and the client's selection logic guards every read of group.clusters through a new buildSelection helper. The ClusterGroup.clusters TypeScript type is also tightened to ClusterEntry[] | null so this regression class is caught by the type system. (#242)
[1.0.0-beta2] - 2026-05-14
Added
- Document the TLS and reverse-proxy requirements for any network-accessible deployment in a new Administrator's Guide page at docs/admin-guide/tls-and-reverse-proxy.md; the page states explicitly that TLS termination, HTTP to HTTPS redirection, and HSTS are operator responsibilities at the reverse proxy layer, calls out the Vite dev server on port 5173 as localhost-only and unsupported for any network-accessible use, notes that the server's built-in TLS support remains available for operators who choose to terminate at the application, and enumerates the credential-exposure risks of running the workbench over plain HTTP. Cross-reference callouts now appear in the installation, quick-start, Docker, and web-client configuration pages. (#234)
- Add a Playwright-based end-to-end smoke-test suite that drives the production client bundle in a real browser against a real server and Postgres on every pull request; the suite runs across a Chromium/Firefox/WebKit matrix and is invoked locally with make test-e2e. See docs/developer-guide/e2e/index.md for details. (#236)
- Capture Go integration coverage from the running server during the end-to-end suite and merge it into the existing Codacy partial-upload pipeline, so unit and integration coverage combine into a single reported figure. (#236)
- Add the spock_exception_log and spock_resolutions collector probes; both probes capture a rolling 15-minute window of the Spock extension's exception and conflict-resolution catalogs and no-op cleanly on databases without Spock installed. (#200)
- Add six built-in alert rules in the replication category: spock_recent_exceptions_present, spock_recent_exceptions_high, spock_recent_resolutions_present, spock_recent_resolutions_high, replication_slot_retention_warn, and replication_slot_retention_high; the Spock rules require the spock extension and the slot retention rules apply to every PostgreSQL deployment. (#200)
Changed
- Breaking change: the web client container now runs as a non-root user and listens on port 8080 instead of port 80. The base image in client/Dockerfile switched from nginx:stable-alpine to nginxinc/nginx-unprivileged:stable-alpine, with an explicit USER nginx directive. Host-side port mappings in docker-compose.yml, docker-compose.prod.yml, and the walkthrough compose file are unchanged, so http://localhost:3000 continues to work with the default CLIENT_PORT. Operators running custom reverse-proxy configurations, Kubernetes manifests, or external proxy_pass upstreams that target container port 80 must update those references to 8080; this includes Service targetPort values, health probes, and any direct container-to-container references.
- Breaking change: the collector, alerter, and server no longer auto-discover configuration or secret files in the binary directory or the current working directory; review the migration steps below before upgrading. (#195)
  - The new lookup order is the --config flag, the per-user config directory, and /etc/pgedge/; the first match wins, and missing files fall through to compiled-in defaults.
  - The per-user path resolves to ~/.config/pgedge/<binary>.yaml on Linux (honouring $XDG_CONFIG_HOME), ~/Library/Application Support/pgedge/<binary>.yaml on macOS, and %AppData%\pgedge\<binary>.yaml on Windows.
  - The same precedence applies to the collector and server secret files (ai-dba-collector.secret and ai-dba-server.secret); the alerter does not use a secret file.
  - Production deployments that already use /etc/pgedge/ are unaffected.
  - Development setups that drop a YAML file next to the binary or in the current working directory will silently fall through to compiled-in defaults; move the file to /etc/pgedge/ or the per-user directory, or pass --config with an explicit path.
  - The alerter's SIGHUP handler re-runs discovery on each reload, so installing a config at a default location after startup is picked up on the next signal.
- Replace the composition-rule password validator with a policy aligned to NIST SP 800-63B; the server now requires a minimum of 12 characters, enforces the 72-byte bcrypt upper bound, drops uppercase, lowercase, digit, and special-character requirements, and rejects passwords found in a built-in dictionary of approximately 10,000 common and breached entries. The web client shows live password-strength feedback as the user types, and the server remains the authoritative validator. (#177)
- Document installation paths for each deployment method (GitHub release, Docker, RPM/DEB) in the installation guide with a reference table. Add cross-reference notes to the quick start, Docker, and sub-project README files. Align manual-install systemd service names to pgedge-ai-dba-* to match RPM/DEB package conventions. (#173)
- Reject cors_origin: "*" at server startup when authentication is enabled. Browsers discard credentialed responses that combine Access-Control-Allow-Origin: * with Access-Control-Allow-Credentials: true per the Fetch spec. Operators should configure an explicit origin or leave the option empty for same-origin deployments. (#81)
- Migrate the collector, alerter, and server .golangci.yml configurations to the golangci-lint v2 format, and update the CI workflows to install golangci-lint/v2; make test-all now works again on developer machines that have golangci-lint v2 installed locally. (#66)
- Apply a Biome and ESLint auto-fix pass across client/src/, clearing roughly 600 Codacy findings across 294 files; the change is a mechanical refactor with no behavior changes, and existing lint and test baselines remain unchanged.
- Clear all @typescript-eslint/no-confusing-void-expression findings in client/src/ across 80 files; ESLint's auto-fixer resolved 279 sites and 19 remaining cases were rewritten manually by expanding () => cond && voidFn() into explicit if blocks. No behavior changes, and all 2,604 Vitest tests pass.
- Raise line coverage of server/internal/crypto from 86.8% to 100%. New tests cover four previously uncovered error branches: random source failure, ReadFile failure, WriteFile failure, and the GCM encrypt failure path. (#78)
- Add integration tests for server/internal/memory.Store against PostgreSQL. The tests cover all nine public methods: NewStore, Store, Search, GetPinned, ListByUser, GetByID, Delete, DeleteByID, and UpdatePinned. They also exercise pgvector similarity ordering, scope visibility, and ownership checks. Package line coverage for the memory store now reaches 92.5%. (#78)
- Add the -race flag to the test and coverage Makefile targets in the server, collector, and alerter sub-projects. The race detector now runs in CI and on developer machines. (#78)
- Auto-collapse the Server Dashboard's "System Resources" section when its data is unavailable, typically because the system_stats PostgreSQL extension is not installed on the connected server; the section previously stayed expanded and rendered five empty CPU, Memory, Disk, Load, and Network IO panels that pushed the "PostgreSQL Overview" section far down the page. The collapsed header now shows the italic message "No data available. Is the system_stats extension installed?" next to the title, and the user can still expand the section manually to inspect the empty panels. The manual override is intentionally not persisted to localStorage, so the section returns to the user's previous expand or collapse preference once the extension is installed. The shared CollapsibleSection component gained two new props, forceCollapsed and forceCollapsedMessage, which temporarily override the persisted state without mutating storage and render the italic header message; an anti-flicker guard delays the force-collapsed state until the initial KPI fetch completes, so the section does not briefly collapse during loading.
- Bump the Go toolchain from 1.26.1 to 1.26.2 across the server, collector, alerter, and pkg modules and the dev-container image; the upgrade picks up upstream fixes for seven Go security advisories listed in the Security section.
- Bump github.com/jackc/pgx/v5 from 5.7.6 to 5.9.2 in the server, collector, and alerter; the upgrade picks up the memory-safety and dollar-quoted-string fixes listed in the Security section.
- Add a .codacy.yaml configuration that suppresses confirmed false-positive findings from Codacy's Semgrep and ESLint8 engines; suppressions are scoped to specific files or to __tests__/** globs, were independently reviewed by the security-auditor agent, and mask no real vulnerabilities.
- Reduce the web client's default body1 and body2 typography sizes to the MUI standards of 16px and 14px, and remove the roughly 80 inline fontSize overrides that previously compensated for the oversized defaults; body text across the application now renders smaller and more consistently, while headings and subtitles are unchanged.
- Add a CHART_AXIS_LABEL_FONTSIZE token in client/src/theme/tokens.ts so chart axis labels render at a consistent 14px, and add a MONO_CAPTION_SX token that deduplicates the monospace caption styling shared by alert thresholds, Spock node info, and cron schedules.
- Consolidate four duplicated patterns in server/src as part of the codebase cleanup tracked in #77. The copy-pasted getClient() helper in internal/tools/context_aware_provider.go and internal/resources/context_aware_registry.go now delegates to a new (*database.ClientResolver).ResolveOrError method that returns the canonical "no database connection configured" error. The 14-field chat.NewClientFromConfig invocation repeated in internal/llmproxy/proxy.go (HandleModels and HandleChat) and internal/overview/generator.go (createLLMClient) collapses into a new chat.NewClientFromLLMConfig factory that takes an LLMOptions parameter for per-call overrides such as Model, MaxTokens, Temperature, Debug, and Headers, removing roughly 40 lines of boilerplate per call site. The two auth.ConnectionVisibilityLister adapters in internal/api/helpers.go and internal/database/visibility_lister.go share a single projection; the slice-based adapter moves into the database package as database.NewSliceVisibilityLister and the projection is exported as database.ConnectionsToVisibilityInfo. The five-line closure that wired (*database.Datastore).GetConnectionSharingInfo into an auth.RBACChecker from internal/tools/context_aware_provider.go, internal/resources/context_aware_registry.go, and cmd/mcp-server/handlers.go now flows through a new auth.NewRBACCheckerForDatastore constructor that accepts the datastore through a small DatastoreSharingLookup interface and handles nil and typed-nil cases internally. The change is internal-only and behavior-preserving; no public HTTP API, MCP tool, or configuration surface changes. (#77)
Security
- Pick up upstream fixes for seven Go security advisories by bumping the toolchain to 1.26.2; the advisories are CVE-2026-32280 (certificate chain validation denial of service), CVE-2026-33810 (DNS-constraint certificate validation bypass), CVE-2026-32281 (certificate chain validation denial of service), CVE-2026-32283 (TLS 1.3 key-update denial of service), CVE-2026-32289 (html/template cross-site scripting), CVE-2026-32288 (archive/tar denial of service), and CVE-2026-32282 (Root.Chmod symlink escape).
- Pick up upstream fixes in github.com/jackc/pgx/v5 by bumping to 5.9.2; the advisories are CVE-2026-33816 (Critical, memory safety) and GHSA-j88v-2chj-qfwx (Low, SQL injection through dollar-quoted-string and $N placeholder confusion). A code audit confirmed that no query in this project mixes $$...$$ literals with $N placeholders, so the second advisory was theoretical for our code base; the bump is still warranted as a defence-in-depth measure.
- Bump the web client container's base images to pick up upstream fixes for high-severity CVEs flagged by Docker Scout. The builder stage moves from node:22-slim to node:22-trixie-slim (Debian 13), which closes CVE-2026-33845 and CVE-2026-33846 in gnutls28. The runtime stage moves from nginxinc/nginx-unprivileged:stable-alpine to nginxinc/nginx-unprivileged:stable-alpine-slim, which closes CVE-2026-3805 in curl on Alpine 3.23; the slim variant omits curl, which the runtime does not need. One residual high finding, CVE-2026-33671 in the picomatch package bundled inside npm, persists across all Node 22 and 24 tags pending an npm release; the residual lives only in the builder stage and never reaches the shipped image. The non-root UID 101 nginx user, port 8080, and other hardening from the earlier base-image change are preserved, and docker scout cves reports no vulnerable packages in the final image.
- Redact notification channel secrets from API responses; GET /api/v1/notification-channels and GET /api/v1/notification-channels/{id} no longer return smtp_username, smtp_password, webhook_url, or auth_credentials, all of which were previously emitted in plaintext after server-side decryption. Each response now includes the boolean indicators smtp_username_set, smtp_password_set, webhook_url_set, and auth_credentials_set so clients can show whether a secret is configured without ever reading the value. The PUT /api/v1/notification-channels/{id} endpoint applies a three-way merge to the four secret fields: omit a field to preserve the stored value, send an empty string to clear it, or send a new value to replace it. The web admin UI for the Email, Slack and Mattermost, and Webhook channel editors now leaves secret form fields blank when editing an existing channel and preserves the stored value unless the user types a replacement. (#187)
- Fix log, SQL, and SMTP injection findings surfaced by the golangci-lint v2 upgrade; the knowledgebase search now binds its filter values through ? placeholders instead of string concatenation, the email test sender sanitizes envelope and header fields before writing them to the SMTP connection, and user-tainted values are escaped through a new logging.SanitizeForLog helper at log sites across the api, auth, and config packages. (#66)
- Hoist RBAC access-control checks above all datastore calls in the alert-counts, alert acknowledgement, alert unacknowledgement, alert analysis, and cluster-topology handlers; zero-grant callers now short-circuit to an empty response without touching the database, and HTTP-level regression tests cover every affected handler. (#67)
- Extend the manage_connections gate to the remaining connection paths missed by the earlier sweep under #207; the server's createConnection handler now requires the permission for every new connection rather than only for shared connections, the handleUpdateConnectionCluster handler behind PUT /api/v1/connections/{id}/cluster now also requires the permission so that a user with only read-visibility on a connection can no longer re-home it between clusters, and the web client's Add menu hides the "Add Server" entry from users who lack the permission. The new server-side gate sits after the existing visibility check and before any body decode or datastore mutation. Unauthorized callers receive a 403 Forbidden response with a clear authorization error, and the OpenAPI specification and the static docs/admin-guide/api/openapi.json document the new response on the affected paths; the previously undocumented /connections/{id}/cluster path is now present in the specification with both its GET and PUT methods. (#233)
- Require the manage_connections permission on all cluster and cluster-group mutating endpoints; users without the permission previously could create cluster groups and clusters through the REST API, and the server silently committed the rows and returned a success status even though the resulting records were invisible to the creator while remaining visible to administrators. A follow-up audit found the same gap on additional mutating routes, so the fix now gates eleven endpoints in total: creating and deleting cluster groups, adding clusters to a group, creating, updating, and deleting clusters, adding and removing cluster servers, and creating, updating, and deleting cluster relationships. Unauthorized callers now receive a 403 Forbidden response with a clear authorization error instead of a misleading success, and the group owner can still delete a group they own even without the permission. The web client's Add menu now hides the "Add Cluster Group" and "Add Cluster" entries from users who lack manage_connections, and the OpenAPI specification and static docs/admin-guide/api/openapi.json document the 403 Forbidden response on the affected paths. Administrators see no functional change. (#207)
Fixed
- Fix the Replication Type select in the Cluster Settings dialog rendering blank for auto-detected clusters whose replication_type column is NULL; the dialog now derives the value from the cluster's auto_cluster_key prefix when no explicit replication type is set, mirroring the existing behaviour of the Topology panel. (#235)
- Fix auto-detected clusters lingering in the cluster autocomplete dropdown after their last member connection was deleted; DeleteConnection now runs in a single transaction that marks the parent cluster dismissed=TRUE when the cluster was auto-detected (auto_cluster_key IS NOT NULL) and no connections still reference it. The delete itself uses DELETE ... RETURNING cluster_id to close a TOCTOU window between the lookup and the delete, and the rollback path uses context.Background() to avoid a pgx v5 close-of-closed-channel panic when the request context is canceled mid-transaction. User-created clusters and connections without a cluster assignment are unaffected. (#245)
- Fix anomaly alerts displaying raw dotted metric names such as pg_replication_slots.max_retained_bytes in the Status Panel, and six alert rules seeded by collector migration v3 rendering as auto-title-cased fallbacks in the Admin Alert Rules panel; the FRIENDLY_ALERT_TITLES map now includes curated labels for 15 dotted anomaly metric names and the 6 v3 rule names, with full test coverage of the map. (#244)
- Fix long, non-wrapping SQL queries flowing underneath the copy and run icons in the Remediation Steps panel of the alert AI analysis view; the shared markdown styles now reserve right-side padding on each code block sized to the number of overlaid action buttons, so the SQL text never collides with the icons. (#221)
- Fix the Active Alerts Restore button returning HTTP 500 "Failed to unacknowledge alert" for alerts that were already non-acknowledged (for example after the alerter reactivated them following a severity change); the server now maps a missing alert to 404, an alert that is not currently acknowledged to 409 Conflict, and wraps every failure path with the alert ID for actionable logs. The alerter's auto-reactivation path is also hardened against panicking on alerts with a NULL metric_value column and captures the previous severity before the database write so the in-memory comparison cannot drift from the acknowledged state. (#227)
- Fix the alerter's replication_slot_inactive critical alert never firing because its metric query selected from a non-existent metrics.pg_stat_replication_slots table; the pg_replication_slots.inactive metric now reads directly from metrics.pg_replication_slots (the table the collector probe writes to) and derives the inactive state from the active column. New integration tests cover the happy path, the no-row case when every slot is active, slot deduplication per connection, and the 5-minute freshness cutoff. (#224)
- Fix Ask Ellie incorrectly reporting missing Spock replication slots on healthy Spock 6.x clusters; the assistant previously generated WHERE plugin = 'spock' against pg_replication_slots, but current Spock releases name the output plugin spock_output. The chat system prompt in server/src/internal/chat/llm.go now instructs Ellie to use plugin LIKE 'spock%' for cross-version compatibility. (#220)
- Fix three datastore schema and probe inefficiencies identified during a production performance review. Collector migration v4 adds a partial index on anomaly_candidates(detected_at) filtered to processed_at IS NULL AND tier1_pass = TRUE so the alerter sweeper stops sequential-scanning the full table on every poll, and drops the redundant idx_pg_stat_all_indexes_conn_db_time and idx_pg_stat_statements_conn_db_time indexes (along with their attached child indexes on every existing weekly partition) which were duplicated by the more specific _object indexes. The change-detection probes (pg_settings, pg_extension, pg_hba_file_rules, pg_ident_file_mappings) now strip the ai_dba_wb_probe marker column injected by WrapQuery before hashing, so the live snapshot hash matches the stored snapshot hash; previously the marker caused every hourly collection to look "changed" and write a fresh snapshot, inflating the pg_settings partitions by roughly an order of magnitude. (#219)
- Fix the Admin panels showing a success toast alongside a page-level refresh error when a save succeeded but the follow-on reload failed; the shared useCrudPanel hook now suppresses the success toast when the post-mutation refresh fails, so the user sees only the actionable refresh error. (#215)
- Fix the AdminMessagingChannels panel flipping the deleteLoading busy flag when a user toggled a channel's enabled state; the shared useCrudPanel.runMutation helper gained an independent busyTarget option ('save' | 'delete' | 'inline') that decouples the busy-state flag from the errorTarget, and a new 'inline' error target routes errors only to the caller through the returned result rather than to crud.error or crud.dialogError. Defaults preserve the previous behavior so existing call sites are unchanged, and handleToggleEnabled now uses busyTarget: 'inline'. (#216)
- Fix the divergent error fallback wording shown by the Admin panels when a thrown value is not an Error instance. The AdminUsers, AdminMemories, AdminTokenScopes, AdminPermissions, AdminProbes, AdminEmailChannels, AdminWebhookChannels, AdminMessagingChannels, and AdminGroups panels, along with the useChannelCRUD hook, now route non-Error throws through the shared extractErrorMessage helper in client/src/components/AdminPanel/_shared/errors.ts. The helper returns the generic 'An unexpected error occurred' message instead of leaking output such as [object Object] produced by String(err). Panels that pass a contextual fallback (for example, "Failed to add recipient") to the helper's second argument retain that context-specific wording. (#212)
- Fix the Chart.test.tsx regression introduced in commit aa28aa8 that has been failing the CI - Client workflow on every commit since; the vitest mock specifier echarts-for-react/lib/core was not updated to echarts-for-react/esm/core when production code switched to the ESM path, leaving the real ReactEChartsCore running in tests and tripping the deliberately-narrow echarts/core mock. The change updates only the test mock specifier; no production source code was modified.
- Fix the npm-install branch in start_dev_web_client.sh never firing because an intervening echo clobbered $? before the if [ $? -eq 0 ] check; the script now uses a direct && / || pattern that tests the previous command's exit status without an intermediate command.
- Fix Ask Ellie entering a long retry loop ("Joining the relations..." / "Validating query") when the signed-in user has no MCP privileges; the chat now surfaces a clear permission-denied message immediately instead of cycling through planning steps. (#188)
- Fix wide markdown tables overflowing and clipping the right-side columns inside the Ask Ellie chat panel; tables returned by MCP tools such as get_alert_history, get_alert_rules, and query_datastore now sit inside a horizontally scrollable wrapper, so narrow tables still fill the bubble while wide tables scroll within it instead of spilling outside. The shared MarkdownContent helper used by other dialogs received the same treatment. (#185)
- Fix the web client rendering a blank screen on every navigation when the LLM proxy was enabled with a reasoning model that returns a structured summary object; AIOverview now coerces non-string summaries to text before rendering, removing the React error that triggered the blank screen. The top-level <ErrorBoundary> has also been rewritten to always show the error message and component stack in a collapsible details block and to expose a "Reload" button, so users can recover from a crash and file actionable bug reports without rebuilding the container. (#182)
- Fix stale auto-detected edges remaining in cluster_node_relationships after the cluster topology changed through failover, subscriber removal, or a new parent in a binary chain; SyncAutoDetectedRelationships now replaces the auto-detected set transactionally, deleting all existing is_auto_detected = TRUE rows for the cluster before inserting the freshly detected set, and the syncRelationshipsFromTopology caller no longer short-circuits on an empty detected slice. Manual relationships and auto-detected rows for other clusters are preserved, and a failure during the delete or insert rolls the transaction back to the prior state. (#152)
- Fix the cluster Topology tab dropping cascading standbys and marking empty auto-detected nodes as expandable; persisted and manual chains such as primary -> standby -> cascading standby now render every level regardless of input order, and nodes whose children are filtered out no longer display a disclosure arrow. (#153)
- Fix the collector probe config loader ignoring scope and silently re-enabling disabled parent overrides; LoadProbeConfigs now restricts its query to scope IN ('global', 'server') so cluster- and group-scoped rows no longer collapse into the connection_id = 0 bucket and get misapplied as global defaults, and EnsureProbeConfig now inherits the parent config's is_enabled value when materializing a server-level row instead of hard-coding it to true. The SQL is extracted into a loadProbeConfigsQuery constant and the value resolution moves into a pure resolveProbeConfigDefaults helper, both covered by new unit tests. (#151)
- Fix MCP and admin scope privileges granted through a wildcard group grant ("*") being silently dropped during token scope intersection; the intersection logic now recognises wildcard grants and preserves the scoped privileges. The fix also introduces an explicit AccessLevelNone constant to replace raw empty strings for "no access" semantics, improving code clarity and reducing error-prone comparisons. (#96)
- Fix the ClusterNavigator group-editing flow round-tripping string group ids ("group-{x}") through numeric parses; the string id now travels unchanged through handleConfigureGroup, handleSaveGroup, and the cluster actions context, and the one strconv.Atoi-compatible conversion happens at the GroupDialog override-panel boundary. Auto-detected groups without a numeric backing row now display an info alert explaining that alert, probe, and channel overrides are unavailable instead of silently passing NaN to the override panels. This removes the root cause patched symptomatically in #59. (#63)
[1.0.0-beta1] - 2026-04-21
Added
- Add the llm.timeout_seconds server configuration option to control the HTTP client timeout for requests to the configured LLM provider; the default remains 120 seconds. (#60)
- Add a guided walkthrough example with pre-seeded demo data and an in-browser Driver.js tour covering the workbench's major features.
Changed
- Default the knowledgebase database_path to the pgEdge package install path at /usr/share/pgedge/postgres-mcp-kb/kb.db. (#52)
Fixed
- Remove misleading raw API key options from the example server configuration files; the server only accepts API keys through the corresponding *_file variants. (#54)
- Fix spurious "partition would overlap" errors from the collector on non-UTC hosts when the weekly partition rolled over. (#55)
- Fix foreign key violations during alerter baseline calculation when metric rows outlive their connection; historical metric queries now filter through the connections table. (#56)
- Fix MCP tool invocations failing with TLS certificate verification errors against servers that require a custom sslrootcert, sslcert, or sslkey; the server now forwards these fields on the database connection string. (#57)
- Fix reactivated alerts continuing to appear as acknowledged in the GUI by clearing stale alert_acknowledgments rows when an alert leaves the acknowledged state; the alerts API now also exposes a nullable last_updated RFC3339 timestamp and the StatusPanel surfaces it alongside "Triggered" when the two differ. (#64)
- Fix the cluster Topology tab "Add server" dropdown silently excluding servers that had been re-claimed by an auto-detected Spock cluster; the connections API now returns the membership_source field the client filter requires. (#25, #46)
- Fix dismissed auto-detected clusters reappearing after the collector's next auto-detection run; UpsertAutoDetectedCluster no longer clears the dismissed flag on rediscovery, and GetCluster now filters dismissed rows from single-cluster lookups. (#36)
- Fix dismissed auto-detected clusters reappearing in the Server creation dialog's cluster dropdown after alert or connection context was fetched; the connection hierarchy resolver now skips dismissed rows and no longer resurrects them through its upsert fallback. (#36)
- Fix partitions not being dropped at the appropriate time by the collector. (#62)
- Fix the copy-to-clipboard button in the Admin Tokens "Token created" dialog; the button now shows a check mark and "Copied!" tooltip on success and surfaces clipboard failures through the error alert. (#71)
- Fix the StatusPanel "Restore to active" action silently failing on error; the alert now leaves the acknowledged list optimistically, rolls back and surfaces a Snackbar error on API failure, and guards against double-click submissions. (#72)
- Fix servers assigned to a manually created cluster continuing to appear under a re-created auto-detected cluster after the next topology refresh; auto-detected Spock, binary-replication, and logical-replication grouping now skip connections with membership_source = 'manual'. (#74)
- Fix GET /api/v1/connections returning an empty array for scoped API tokens when the token owner's read access came from a wildcard group grant; the scoped connections are now returned as expected, and token scopes continue to restrict but not elevate the owner's privileges. (#83)
Security
- Fix several REST endpoints and MCP tools leaking unshared connections owned by other users; the connection detail, database listing, connection context, cluster topology, alerts, alert acknowledgement, alert analysis, timeline, and current-connection endpoints now apply the same ownership, sharing, group, and token-scope checks as GET /api/v1/connections, returning 403 Forbidden for single-resource requests and filtering list responses to the caller's visible connections. The get_alert_history, get_metric_baselines, and get_blackouts MCP tools applied no connection filter for callers with zero explicit grants; all three now restrict results to connections the caller is permitted to see. The OpenAPI specification and the static docs/admin-guide/api/openapi.json now document the 403 Forbidden response on the affected single-resource endpoints. (#35)
- Fix server and cluster visibility leaks through the cluster list, cluster group, and cluster-membership REST endpoints; GET /api/v1/clusters/list, GET /api/v1/clusters/{id}, GET /api/v1/clusters/{id}/servers, GET /api/v1/cluster-groups, and GET /api/v1/cluster-groups/{id} now filter clusters, groups, and member servers to the caller's visible connections and return 404 Not Found for clusters or groups that contain no visible members. (#35)
- Fix additional RBAC leaks surfaced by the follow-up security audit; the overview REST and SSE endpoints (GET /api/v1/overview and GET /api/v1/overview/stream) now verify scope visibility for scoped requests and filter connection_ids lists to the caller's visible set; the blackout list and get endpoints (GET /api/v1/blackouts and GET /api/v1/blackouts/{id}) hide blackouts whose referenced connection, cluster, or group is not visible to the caller; the alert, probe, and channel override list endpoints (GET /api/v1/alert-overrides/{scope}/{scopeId}, GET /api/v1/probe-overrides/{scope}/{scopeId}, and GET /api/v1/channel-overrides/{scope}/{scopeId}) return 404 Not Found when the caller cannot see the requested scope; and the MCP connection resolver now checks RBAC before loading credentials and returns a generic "connection not found or not accessible" message for both missing and denied cases. (#35)
- Fix remaining RBAC leaks flagged by the third-round security audit; the query_metrics and get_alert_rules MCP tools now check CanAccessConnection before any datastore read and return a generic "connection not found or not accessible" error for both missing and denied connections, closing the ID/name enumeration that the previous error path exposed. The overview scoped-snapshot path (GET /api/v1/overview?scope_type=cluster|group) now intersects the scope's member connection IDs with the caller's visible set and generates the summary from the intersection through the existing connections-summary path, so two callers with different visibility never share a cache entry; scope denial now returns 404 Not Found to match sibling handlers. The blackout schedule list and get endpoints (GET /api/v1/blackout-schedules and GET /api/v1/blackout-schedules/{id}) apply the same visibility filter as the blackout endpoints, and the alert override context endpoint (GET /api/v1/alert-overrides/context/{connectionId}/{ruleId}) now gates on scopeVisibleToCaller before reading override hierarchy. The MCP connection resolver logs RBAC denials so operators can correlate without widening the caller-visible surface. (#35)
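The enumeration fix for the MCP connection resolver (one generic error for both missing and denied connections, an RBAC check before any credential load, and denials logged server-side only) can be sketched in Go as below. The resolver type, its fields, and the owner-only access rule are hypothetical simplifications, not the server's actual RBAC model.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotAccessible is returned for both missing and denied connections,
// so a caller cannot probe which connection IDs exist.
var errNotAccessible = errors.New("connection not found or not accessible")

// resolver is a hypothetical in-memory stand-in for the MCP resolver.
type resolver struct {
	owners      map[int]int    // connection ID -> owning user ID (simplified RBAC)
	credentials map[int]string // connection ID -> secret, read only after the check
	denials     []string       // server-side log lines, never returned to the caller
}

// resolve checks access first and touches credentials only on success.
func (r *resolver) resolve(userID, connID int) (string, error) {
	owner, exists := r.owners[connID]
	if !exists {
		return "", errNotAccessible
	}
	if owner != userID {
		// Record the denial for operators without exposing it to the caller.
		r.denials = append(r.denials,
			fmt.Sprintf("rbac: user %d denied connection %d", userID, connID))
		return "", errNotAccessible
	}
	return r.credentials[connID], nil
}

func main() {
	r := &resolver{
		owners:      map[int]int{7: 1},
		credentials: map[int]string{7: "s3cret"},
	}
	_, missing := r.resolve(2, 99) // no such connection
	_, denied := r.resolve(2, 7)   // exists, but owned by user 1
	fmt.Println(missing.Error() == denied.Error()) // true
	fmt.Println(len(r.denials))                    // 1
}
```

Because the two failure paths return the same error value, the only place the distinction survives is the server-side denial log.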
[1.0.0-alpha3] - 2026-04-08
Added
- Add a Docker publish workflow that builds and pushes multi-platform images to GitHub Container Registry on version tags and main branch pushes.
- Add a production Docker Compose configuration using pre-built images with resource limits and log rotation.
- Add a Docker deployment guide to the documentation.
- Add a favicon to the web client.
- Replace the SQLite driver with a pure-Go implementation to support CGO-free builds.
Changed
- Improve blackout status indicators and require confirmation before deleting a blackout. (#34)
- Limit blackout scope options to relevant entries only. (#33)
- Allow servers from auto-detected clusters to be reassigned to manual clusters. (#46)
- Hide alert threshold links for users who lack the required permission. (#40)
- Display errors to users when fetching unassigned servers fails.
Fixed
- Fix the blackout dialog refreshing unnecessarily when underlying components update. (#47)
- Fix missing browser refresh after certain navigation actions and prevent title wrapping. (#45)
- Fix the replication type not carrying through to the edit dialog. (#44)
- Fix multiple potential crashes after a network failure and recovery. (#43)
- Fix a serialization error when updating server details. (#42)
- Fix a crash when clicking an empty cluster. (#39)
- Fix inconsistent alert operator values. (#38)
- Fix auto-detected clusters reappearing after deletion. (#36)
- Fix the is_shared flag for servers not being respected in all cases. (#35)
- Fix connection error alerts ignoring active blackouts. (#32)
- Fix MCP write access to databases. (#29)
- Fix SSL settings being silently dropped when creating or updating a server.
- Fix various issues with the database summary popup.
- Fix various TypeScript type safety issues in the web client.
Security
- Fix MCP memory tools bypassing RBAC checks; all authenticated users could access the datastore without proper authorization.
[1.0.0-alpha2] - 2026-03-04
Fixed
- Fix a crash in the Add Server dialog when no clusters exist in the database.
[1.0.0-alpha1] - 2026-03-02
Initial release.