Changelog
All notable changes to the pgEdge AI DBA Workbench are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Added
- Add the spock_exception_log and spock_resolutions collector probes; both probes capture a rolling 15-minute window of the Spock extension's exception and conflict-resolution catalogs and no-op cleanly on databases without Spock installed. (#200)
- Add six built-in alert rules in the replication category: spock_recent_exceptions_present, spock_recent_exceptions_high, spock_recent_resolutions_present, spock_recent_resolutions_high, replication_slot_retention_warn, and replication_slot_retention_high; the Spock rules require the spock extension and the slot retention rules apply to every PostgreSQL deployment. (#200)
Changed
- Breaking change: the web client container now runs as a non-root user and listens on port 8080 instead of port 80. The base image in client/Dockerfile switched from nginx:stable-alpine to nginxinc/nginx-unprivileged:stable-alpine, with an explicit USER nginx directive. Host-side port mappings in docker-compose.yml, docker-compose.prod.yml, and the walkthrough compose file are unchanged, so http://localhost:3000 continues to work with the default CLIENT_PORT. Operators running custom reverse-proxy configurations, Kubernetes manifests, or external proxy_pass upstreams that target container port 80 must update those references to 8080; this includes Service targetPort values, health probes, and any direct container-to-container references.
- Breaking change: the collector, alerter, and server no longer auto-discover configuration or secret files in the binary directory or the current working directory; review the migration steps below before upgrading. (#195)
  - The new lookup order is the --config flag, the per-user config directory, and /etc/pgedge/; the first match wins, and missing files fall through to compiled-in defaults.
  - The per-user path resolves to ~/.config/pgedge/<binary>.yaml on Linux (honouring $XDG_CONFIG_HOME), ~/Library/Application Support/pgedge/<binary>.yaml on macOS, and %AppData%\pgedge\<binary>.yaml on Windows.
  - The same precedence applies to the collector and server secret files (ai-dba-collector.secret and ai-dba-server.secret); the alerter does not use a secret file.
  - Production deployments that already use /etc/pgedge/ are unaffected.
  - Development setups that drop a YAML file next to the binary or in the current working directory will silently fall through to compiled-in defaults; move the file to /etc/pgedge/ or the per-user directory, or pass --config with an explicit path.
  - The alerter's SIGHUP handler re-runs discovery on each reload, so installing a config at a default location after startup is picked up on the next signal.
- Replace the composition-rule password validator with a policy aligned to NIST SP 800-63B; the server now requires a minimum of 12 characters, enforces the 72-byte bcrypt upper bound, drops uppercase, lowercase, digit, and special-character requirements, and rejects passwords found in a built-in dictionary of approximately 10,000 common and breached entries. The web client shows live password-strength feedback as the user types, and the server remains the authoritative validator. (#177)
- Document installation paths for each deployment method (GitHub release, Docker, RPM/DEB) in the installation guide with a reference table. Add cross-reference notes to the quick start, Docker, and sub-project README files. Align manual-install systemd service names to pgedge-ai-dba-* to match RPM/DEB package conventions. (#173)
- Reject cors_origin: "*" at server startup when authentication is enabled. Browsers discard credentialed responses that combine Access-Control-Allow-Origin: * with Access-Control-Allow-Credentials: true per the Fetch spec. Operators should configure an explicit origin or leave the option empty for same-origin deployments. (#81)
- Migrate the collector, alerter, and server .golangci.yml configurations to the golangci-lint v2 format, and update the CI workflows to install golangci-lint/v2; make test-all now works again on developer machines that have golangci-lint v2 installed locally. (#66)
- Apply a Biome and ESLint auto-fix pass across client/src/, clearing roughly 600 Codacy findings across 294 files; the change is a mechanical refactor with no behavior changes, and existing lint and test baselines remain unchanged.
- Clear all @typescript-eslint/no-confusing-void-expression findings in client/src/ across 80 files; ESLint's auto-fixer resolved 279 sites and 19 remaining cases were rewritten manually by expanding () => cond && voidFn() into explicit if blocks. No behavior changes, and all 2,604 Vitest tests pass.
- Raise line coverage of
server/internal/crypto from 86.8% to 100%. New tests cover four previously uncovered error branches. The branches are random source failure, ReadFile failure, WriteFile failure, and the GCM encrypt failure path. (#78)
- Add integration tests for server/internal/memory.Store against PostgreSQL. The tests cover all nine public methods: NewStore, Store, Search, GetPinned, ListByUser, GetByID, Delete, DeleteByID, and UpdatePinned. They also exercise pgvector similarity ordering, scope visibility, and ownership checks. Package line coverage for the memory store now reaches 92.5%. (#78)
- Add the -race flag to the test and coverage Makefile targets in the server, collector, and alerter sub-projects. The race detector now runs in CI and on developer machines. (#78)
- Auto-collapse the Server Dashboard's "System Resources" section when its data is unavailable, typically because the system_stats PostgreSQL extension is not installed on the connected server; the section previously stayed expanded and rendered five empty CPU, Memory, Disk, Load, and Network IO panels that pushed the "PostgreSQL Overview" section far down the page. The collapsed header now shows the italic message "No data available. Is the system_stats extension installed?" next to the title, and the user can still expand the section manually to inspect the empty panels. The manual override is intentionally not persisted to localStorage, so the section returns to the user's previous expand or collapse preference once the extension is installed. The shared CollapsibleSection component gained two new props, forceCollapsed and forceCollapsedMessage, which temporarily override the persisted state without mutating storage and render the italic header message; an anti-flicker guard delays the force-collapsed state until the initial KPI fetch completes, so the section does not briefly collapse during loading.
- Bump the Go toolchain from 1.26.1 to 1.26.2 across the server, collector, alerter, and pkg modules and the dev-container image; the upgrade picks up upstream fixes for seven Go security advisories listed in the Security section.
- Bump github.com/jackc/pgx/v5 from 5.7.6 to 5.9.2 in the server, collector, and alerter; the upgrade picks up the memory-safety and dollar-quoted-string fixes listed in the Security section.
- Add a .codacy.yaml configuration that suppresses confirmed false-positive findings from Codacy's Semgrep and ESLint8 engines; suppressions are scoped to specific files or to __tests__/** globs, were independently reviewed by the security-auditor agent, and mask no real vulnerabilities.
- Consolidate four duplicated patterns in
server/src as part of the codebase cleanup tracked in #77. The copy-pasted getClient() helper in internal/tools/context_aware_provider.go and internal/resources/context_aware_registry.go now delegates to a new (*database.ClientResolver).ResolveOrError method that returns the canonical "no database connection configured" error. The 14-field chat.NewClientFromConfig invocation repeated in internal/llmproxy/proxy.go (HandleModels and HandleChat) and internal/overview/generator.go (createLLMClient) collapses into a new chat.NewClientFromLLMConfig factory that takes an LLMOptions parameter for per-call overrides such as Model, MaxTokens, Temperature, Debug, and Headers, removing roughly 40 lines of boilerplate per call site. The two auth.ConnectionVisibilityLister adapters in internal/api/helpers.go and internal/database/visibility_lister.go share a single projection; the slice-based adapter moves into the database package as database.NewSliceVisibilityLister and the projection is exported as database.ConnectionsToVisibilityInfo. The five-line closure that wired (*database.Datastore).GetConnectionSharingInfo into an auth.RBACChecker from internal/tools/context_aware_provider.go, internal/resources/context_aware_registry.go, and cmd/mcp-server/handlers.go now flows through a new auth.NewRBACCheckerForDatastore constructor that accepts the datastore through a small DatastoreSharingLookup interface and handles nil and typed-nil cases internally. The change is internal-only and behavior-preserving; no public HTTP API, MCP tool, or configuration surface changes. (#77)
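The configuration lookup order introduced in #195 can be pictured with a short Go sketch. This is an illustration under stated assumptions, not the project's code: candidatePaths, firstExisting, and fileExists are hypothetical helpers, the binary name ai-dba-server is inferred from the secret file names, and an unreadable --config path is simply treated like any other missing file. The per-user directory comes from the standard library's os.UserConfigDir, which resolves to the same platform-specific locations listed in the entry.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths lists config locations in precedence order: the --config
// flag value (if set), the per-user config directory, then /etc/pgedge/.
// userConfigDir is a parameter so the function stays easy to test; callers
// would typically obtain it from os.UserConfigDir().
func candidatePaths(flagPath, userConfigDir, binary string) []string {
	var paths []string
	if flagPath != "" {
		paths = append(paths, flagPath)
	}
	if userConfigDir != "" {
		paths = append(paths, filepath.Join(userConfigDir, "pgedge", binary+".yaml"))
	}
	paths = append(paths, filepath.Join("/etc/pgedge", binary+".yaml"))
	return paths
}

// firstExisting returns the first path for which exists reports true, or ""
// when none match, which signals that compiled-in defaults apply.
func firstExisting(paths []string, exists func(string) bool) string {
	for _, p := range paths {
		if exists(p) {
			return p
		}
	}
	return ""
}

// fileExists reports whether path names an existing regular file or link.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	dir, err := os.UserConfigDir()
	if err != nil {
		dir = "" // no per-user directory available; skip that step
	}
	// "ai-dba-server" is an assumed binary name used only for illustration.
	paths := candidatePaths("", dir, "ai-dba-server")
	if p := firstExisting(paths, fileExists); p != "" {
		fmt.Println("using config:", p)
	} else {
		fmt.Println("no config file found; using compiled-in defaults")
	}
}
```

Note that neither the binary directory nor the current working directory appears in the candidate list, which is the behaviour change the entry warns about.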
Security
- Pick up upstream fixes for seven Go security advisories by bumping the toolchain to 1.26.2; the advisories are CVE-2026-32280 (certificate chain validation denial of service), CVE-2026-33810 (DNS-constraint certificate validation bypass), CVE-2026-32281 (certificate chain validation denial of service), CVE-2026-32283 (TLS 1.3 key-update denial of service), CVE-2026-32289 (html/template cross-site scripting), CVE-2026-32288 (archive/tar denial of service), and CVE-2026-32282 (Root.Chmod symlink escape).
- Pick up upstream fixes in github.com/jackc/pgx/v5 by bumping to 5.9.2; the advisories are CVE-2026-33816 (Critical, memory safety) and GHSA-j88v-2chj-qfwx (Low, SQL injection through dollar-quoted-string and $N placeholder confusion). A code audit confirmed that no query in this project mixes $$...$$ literals with $N placeholders, so the second advisory was theoretical for our code base; the bump is still warranted as a defence-in-depth measure.
- Bump the web client container's base images to
pick up upstream fixes for high-severity CVEs flagged by Docker Scout. The builder stage moves from node:22-slim to node:22-trixie-slim (Debian 13), which closes CVE-2026-33845 and CVE-2026-33846 in gnutls28. The runtime stage moves from nginxinc/nginx-unprivileged:stable-alpine to nginxinc/nginx-unprivileged:stable-alpine-slim, which closes CVE-2026-3805 in curl on Alpine 3.23; the slim variant omits curl, which the runtime does not need. One residual high finding, CVE-2026-33671 in the picomatch package bundled inside npm, persists across all Node 22 and 24 tags pending an npm release; the residual lives only in the builder stage and never reaches the shipped image. The non-root UID 101 nginx user, port 8080, and other hardening from the earlier base-image change are preserved, and docker scout cves reports no vulnerable packages in the final image.
- Redact notification channel secrets from API responses; GET /api/v1/notification-channels and GET /api/v1/notification-channels/{id} no longer return smtp_username, smtp_password, webhook_url, or auth_credentials, all of which were previously emitted in plaintext after server-side decryption. Each response now includes the boolean indicators smtp_username_set, smtp_password_set, webhook_url_set, and auth_credentials_set so clients can show whether a secret is configured without ever reading the value. The PUT /api/v1/notification-channels/{id} endpoint applies a three-way merge to the four secret fields; omit a field to preserve the stored value, send an empty string to clear it, or send a new value to replace it. The web admin UI for the Email, Slack and Mattermost, and Webhook channel editors now leaves secret form fields blank when editing an existing channel and preserves the stored value unless the user types a replacement. (#187)
- Fix log, SQL, and SMTP injection findings surfaced by the golangci-lint v2 upgrade; the knowledgebase search now binds its filter values through ? placeholders instead of string concatenation, the email test sender sanitizes envelope and header fields before writing them to the SMTP connection, and user-tainted values are escaped through a new logging.SanitizeForLog helper at log sites across the api, auth, and config packages. (#66)
- Hoist RBAC access-control checks above all datastore calls in the alert-counts, alert acknowledgement, alert unacknowledgement, alert analysis, and cluster-topology handlers; zero-grant callers now short-circuit to an empty response without touching the database, and HTTP-level regression tests cover every affected handler. (#67)
- Require the manage_connections permission on all cluster and cluster-group mutating endpoints; users without the permission previously could create cluster groups and clusters through the REST API, and the server silently committed the rows and returned a success status even though the resulting records were invisible to the creator while remaining visible to administrators. A follow-up audit found the same gap on additional mutating routes, so the fix now gates eleven endpoints in total: creating and deleting cluster groups, adding clusters to a group, creating, updating, and deleting clusters, adding and removing cluster servers, and creating, updating, and deleting cluster relationships. Unauthorized callers now receive a 403 Forbidden response with a clear authorization error instead of a misleading success, and the group owner can still delete a group they own even without the permission. The web client's Add menu now hides the "Add Cluster Group" and "Add Cluster" entries from users who lack manage_connections, and the OpenAPI specification and static docs/admin-guide/api/openapi.json document the 403 Forbidden response on the affected paths. Administrators see no functional change. (#207)
Fixed
- Fix the Chart.test.tsx regression introduced in commit aa28aa8 that has been failing the CI - Client workflow on every commit since; the vitest mock specifier echarts-for-react/lib/core was not updated to echarts-for-react/esm/core when production code switched to the ESM path, leaving the real ReactEChartsCore running in tests and tripping the deliberately-narrow echarts/core mock. The change updates only the test mock specifier; no production source code was modified.
- Fix the npm-install branch in start_dev_web_client.sh never firing because an intervening echo clobbered $? before the if [ $? -eq 0 ] check; the script now uses a direct &&/|| pattern that tests the previous command's exit status without an intermediate command.
- Fix Ask Ellie entering a long retry loop ("Joining the relations..." / "Validating query") when the signed-in user has no MCP privileges; the chat now surfaces a clear permission-denied message immediately instead of cycling through planning steps. (#188)
- Fix wide markdown tables overflowing and clipping the right-side columns inside the Ask Ellie chat panel; tables returned by MCP tools such as get_alert_history, get_alert_rules, and query_datastore now sit inside a horizontally scrollable wrapper, so narrow tables still fill the bubble while wide tables scroll within it instead of spilling outside. The shared MarkdownContent helper used by other dialogs received the same treatment. (#185)
- Fix the web client rendering a blank screen on every navigation when the LLM proxy was enabled with a reasoning model that returns a structured summary object; AIOverview now coerces non-string summaries to text before rendering, removing the React error that triggered the blank screen. The top-level <ErrorBoundary> has also been rewritten to always show the error message and component stack in a collapsible details block and to expose a "Reload" button, so users can recover from a crash and file actionable bug reports without rebuilding the container. (#182)
- Fix stale auto-detected edges remaining in cluster_node_relationships after the cluster topology changed through failover, subscriber removal, or a new parent in a binary chain; SyncAutoDetectedRelationships now replaces the auto-detected set transactionally, deleting all existing is_auto_detected = TRUE rows for the cluster before inserting the freshly detected set, and the syncRelationshipsFromTopology caller no longer short-circuits on an empty detected slice. Manual relationships and auto-detected rows for other clusters are preserved, and a failure during the delete or insert rolls the transaction back to the prior state. (#152)
- Fix the cluster Topology tab dropping cascading standbys and marking empty auto-detected nodes as expandable; persisted and manual chains such as primary -> standby -> cascading standby now render every level regardless of input order, and nodes whose children are filtered out no longer display a disclosure arrow. (#153)
- Fix the collector probe config loader ignoring scope and silently re-enabling disabled parent overrides; LoadProbeConfigs now restricts its query to scope IN ('global', 'server') so cluster- and group-scoped rows no longer collapse into the connection_id = 0 bucket and get misapplied as global defaults, and EnsureProbeConfig now inherits the parent config's is_enabled value when materializing a server-level row instead of hard-coding it to true. The SQL is extracted into a loadProbeConfigsQuery constant and the value resolution moves into a pure resolveProbeConfigDefaults helper, both covered by new unit tests. (#151)
- Fix MCP and admin scope privileges granted through a wildcard group grant ("*") being silently dropped during token scope intersection; the intersection logic now recognises wildcard grants and preserves the scoped privileges. The fix also introduces an explicit AccessLevelNone constant to replace raw empty strings for "no access" semantics, improving code clarity and reducing error-prone comparisons. (#96)
- Fix the ClusterNavigator group-editing flow round-tripping string group ids ("group-{x}") through numeric parses; the string id now travels unchanged through handleConfigureGroup, handleSaveGroup, and the cluster actions context, and the one strconv.Atoi-compatible conversion happens at the GroupDialog override-panel boundary. Auto-detected groups without a numeric backing row now display an info alert explaining that alert, probe, and channel overrides are unavailable instead of silently passing NaN to the override panels. This removes the root cause patched symptomatically in #59. (#63)
[1.0.0-beta1] - 2026-04-21
Added
- Add the llm.timeout_seconds server configuration option to control the HTTP client timeout for requests to the configured LLM provider; the default remains 120 seconds. (#60)
- Add a guided walkthrough example with pre-seeded demo data and an in-browser Driver.js tour covering the workbench's major features.
Changed
- Default the knowledgebase database_path to the pgEdge package install path at /usr/share/pgedge/postgres-mcp-kb/kb.db. (#52)
Fixed
- Remove misleading raw API key options from the example server configuration files; the server only accepts API keys through the corresponding *_file variants. (#54)
- Fix spurious "partition would overlap" errors from the collector on non-UTC hosts when the weekly partition rolled over. (#55)
- Fix foreign key violations during alerter baseline calculation when metric rows outlive their connection; historical metric queries now filter through the connections table. (#56)
- Fix MCP tool invocations failing with TLS certificate verification errors against servers that require a custom sslrootcert, sslcert, or sslkey; the server now forwards these fields on the database connection string. (#57)
- Fix reactivated alerts continuing to appear as acknowledged in the GUI by clearing stale alert_acknowledgments rows when an alert leaves the acknowledged state; the alerts API now also exposes a nullable last_updated RFC3339 timestamp and the StatusPanel surfaces it alongside "Triggered" when the two differ. (#64)
- Fix the cluster Topology tab "Add server" dropdown silently excluding servers that had been re-claimed by an auto-detected Spock cluster; the connections API now returns the membership_source field the client filter requires. (#25, #46)
- Fix dismissed auto-detected clusters reappearing after the collector's next auto-detection run; UpsertAutoDetectedCluster no longer clears the dismissed flag on rediscovery, and GetCluster now filters dismissed rows from single-cluster lookups. (#36)
- Fix dismissed auto-detected clusters reappearing in the Server creation dialog's cluster dropdown after alert or connection context was fetched; the connection hierarchy resolver now skips dismissed rows and no longer resurrects them through its upsert fallback. (#36)
- Fix partitions not being dropped at the appropriate time by the collector. (#62)
- Fix the copy-to-clipboard button in the Admin Tokens "Token created" dialog; the button now shows a check mark and "Copied!" tooltip on success and surfaces clipboard failures through the error alert. (#71)
- Fix the StatusPanel "Restore to active" action silently failing on error; the alert now leaves the acknowledged list optimistically, rolls back and surfaces a Snackbar error on API failure, and guards against double-click submissions. (#72)
- Fix servers assigned to a manually created cluster continuing to appear under a re-created auto-detected cluster after the next topology refresh; auto-detected Spock, binary-replication, and logical-replication grouping now skip connections with membership_source = 'manual'. (#74)
- Fix GET /api/v1/connections returning an empty array for scoped API tokens when the token owner's read access came from a wildcard group grant; the scoped connections are now returned as expected, and token scopes continue to restrict but not elevate the owner's privileges. (#83)
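The wildcard handling described in #83 can be pictured as a set intersection in which a "*" grant matches everything. The sketch below is illustrative only: it assumes grants and token scopes are plain string slices, and effectivePrivileges is a hypothetical function that does not mirror the server's actual types. It shows both properties the entry mentions: a wildcard grant no longer empties the result, and a token scope can restrict but never widen what the owner holds.

```go
package main

import "fmt"

const wildcard = "*"

// effectivePrivileges returns the entries of tokenScope that the owner also
// holds. A "*" in ownerGrants means the owner holds everything, so scoped
// entries pass through instead of being compared literally against "*".
func effectivePrivileges(ownerGrants, tokenScope []string) []string {
	held := make(map[string]bool, len(ownerGrants))
	for _, g := range ownerGrants {
		held[g] = true
	}
	var out []string
	for _, p := range tokenScope {
		if held[wildcard] || held[p] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	scope := []string{"query_metrics", "get_alert_rules"}

	// Owner access comes from a wildcard grant: the scope is preserved.
	fmt.Println(effectivePrivileges([]string{"*"}, scope))

	// Owner holds one item only: the token cannot elevate beyond it.
	fmt.Println(effectivePrivileges([]string{"query_metrics"}, scope))
}
```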
Security
- Fix several REST endpoints and MCP tools leaking unshared connections owned by other users; the connection detail, database listing, connection context, cluster topology, alerts, alert acknowledgement, alert analysis, timeline, and current-connection endpoints now apply the same ownership, sharing, group, and token-scope checks as GET /api/v1/connections, returning 403 Forbidden for single-resource requests and filtering list responses to the caller's visible connections. The get_alert_history, get_metric_baselines, and get_blackouts MCP tools applied no connection filter for callers with zero explicit grants; all three now restrict results to connections the caller is permitted to see. The OpenAPI specification and the static docs/admin-guide/api/openapi.json now document the 403 Forbidden response on the affected single-resource endpoints. (#35)
- Fix server and cluster visibility leaks through the cluster list, cluster group, and cluster-membership REST endpoints; GET /api/v1/clusters/list, GET /api/v1/clusters/{id}, GET /api/v1/clusters/{id}/servers, GET /api/v1/cluster-groups, and GET /api/v1/cluster-groups/{id} now filter clusters, groups, and member servers to the caller's visible connections and return 404 Not Found for clusters or groups that contain no visible members. (#35)
- Fix additional RBAC leaks surfaced by the follow-up
security audit; the overview REST and SSE endpoints (GET /api/v1/overview and GET /api/v1/overview/stream) now verify scope visibility for scoped requests and filter connection_ids lists to the caller's visible set; the blackout list and get endpoints (GET /api/v1/blackouts and GET /api/v1/blackouts/{id}) hide blackouts whose referenced connection, cluster, or group is not visible to the caller; the alert, probe, and channel override list endpoints (GET /api/v1/alert-overrides/{scope}/{scopeId}, GET /api/v1/probe-overrides/{scope}/{scopeId}, and GET /api/v1/channel-overrides/{scope}/{scopeId}) return 404 Not Found when the caller cannot see the requested scope; and the MCP connection resolver now checks RBAC before loading credentials and returns a generic "connection not found or not accessible" message for both missing and denied cases. (#35)
- Fix remaining RBAC leaks flagged by the third-round security audit; the query_metrics and get_alert_rules MCP tools now check CanAccessConnection before any datastore read and return a generic "connection not found or not accessible" error for both missing and denied connections, closing the ID/name enumeration that the previous error path exposed. The overview scoped-snapshot path (GET /api/v1/overview?scope_type=cluster|group) now intersects the scope's member connection IDs with the caller's visible set and generates the summary from the intersection through the existing connections-summary path, so two callers with different visibility never share a cache entry; scope denial now returns 404 Not Found to match sibling handlers. The blackout schedule list and get endpoints (GET /api/v1/blackout-schedules and GET /api/v1/blackout-schedules/{id}) apply the same visibility filter as the blackout endpoints, and the alert override context endpoint (GET /api/v1/alert-overrides/context/{connectionId}/{ruleId}) now gates on scopeVisibleToCaller before reading override hierarchy. The MCP connection resolver logs RBAC denials so operators can correlate without widening the caller-visible surface. (#35)
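The enumeration fix in the entries above combines two ideas: check access before any datastore read, and return one generic error for both missing and denied connections. The Go sketch below illustrates that pattern under stated assumptions; resolve, connection, and the callback signatures are hypothetical and are not the server's actual API. Only the error text is taken from the entries.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotAccessible is returned for both missing and denied connections so a
// caller cannot tell the two cases apart.
var errNotAccessible = errors.New("connection not found or not accessible")

// connection is a stand-in for the real connection record.
type connection struct {
	ID   int
	Name string
}

// resolve runs the access check first and only then touches the datastore.
// A denied caller never triggers a load, and a missing row produces the same
// error as a denial.
func resolve(id int, canAccess func(int) bool, load func(int) (connection, bool)) (connection, error) {
	if !canAccess(id) {
		return connection{}, errNotAccessible
	}
	c, ok := load(id)
	if !ok {
		return connection{}, errNotAccessible
	}
	return c, nil
}

func main() {
	store := map[int]connection{1: {ID: 1, Name: "primary"}}
	load := func(id int) (connection, bool) {
		c, ok := store[id]
		return c, ok
	}
	allowOnlyOne := func(id int) bool { return id == 1 }
	allowAll := func(int) bool { return true }

	_, errDenied := resolve(2, allowOnlyOne, load) // caller lacks access
	_, errMissing := resolve(3, allowAll, load)    // row does not exist
	fmt.Println(errDenied)
	fmt.Println(errMissing)
	fmt.Println("indistinguishable:", errors.Is(errDenied, errMissing))
}
```

Denials can still be logged on the server side, as the last entry notes, because logging does not change what the caller sees.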
[1.0.0-alpha3] - 2026-04-08
Added
- Add a Docker publish workflow that builds and pushes multi-platform images to GitHub Container Registry on version tags and main branch pushes.
- Add a production Docker Compose configuration using pre-built images with resource limits and log rotation.
- Add a Docker deployment guide to the documentation.
- Add a favicon to the web client.
- Replace the SQLite driver with a pure-Go implementation to support CGO-free builds.
Changed
- Improve blackout status indicators and require confirmation before deleting a blackout. (#34)
- Limit blackout scope options to relevant entries only. (#33)
- Allow servers from auto-detected clusters to be reassigned to manual clusters. (#46)
- Hide alert threshold links for users who lack the required permission. (#40)
- Display errors to users when fetching unassigned servers fails.
Fixed
- Fix the blackout dialog refreshing unnecessarily when underlying components update. (#47)
- Fix missing browser refresh after certain navigation actions and prevent title wrapping. (#45)
- Fix the replication type not carrying through to the edit dialog. (#44)
- Fix multiple potential crashes after a network failure and recovery. (#43)
- Fix a serialization error when updating server details. (#42)
- Fix a crash when clicking an empty cluster. (#39)
- Fix inconsistent alert operator values. (#38)
- Fix auto-detected clusters reappearing after deletion. (#36)
- Fix the is_shared flag for servers not being respected in all cases. (#35)
- Fix connection error alerts ignoring active blackouts. (#32)
- Fix MCP write access to databases. (#29)
- Fix SSL settings being silently dropped when creating or updating a server.
- Fix various issues with the database summary popup.
- Fix various TypeScript type safety issues in the web client.
Security
- Fix MCP memory tools bypassing RBAC checks; all authenticated users could access the datastore without proper authorization.
[1.0.0-alpha2] - 2026-03-04
Fixed
- Fix a crash in the Add Server dialog when no clusters exist in the database.
[1.0.0-alpha1] - 2026-03-02
Initial release.