Best Practices for Managing Connections

This article describes strategies you can use when connecting your applications to pgEdge Cloud databases to improve application availability in the following situations:

  • Planned maintenance occurring on your pgEdge Clusters / Databases
  • Unexpected downtime for a Database node due to Cloud Provider outages or temporary network issues

Leveraging DNS failover

pgEdge Cloud provides the ability to connect to databases using Latency-based routing, which automatically routes your application to the pgEdge node in your database with the lowest latency. This feature makes it easier for end users to configure their applications to connect to pgEdge Cloud databases without needing to manually manage connections to individual nodes.

For the following configurations, Latency-based routing also factors in the availability of your database:

  • Private DNS in pgEdge Cloud Enterprise AWS Clusters where the Database has the "Cloudwatch Metrics" option enabled
  • Public DNS in pgEdge Cloud Developer Edition

In the event that health checks determine that a Database node is unavailable, Latency-based routing will automatically route your application to the next nearest node that is available.

Maintenance process

pgEdge performs maintenance in a rolling manner on all databases (when possible), applying updates to each node individually. As part of this process, health checks for the node undergoing maintenance are set to fail, and pgEdge waits for three minutes so that applications can reconnect to other available nodes and any in-flight transactions can complete. Once this waiting period is over, pgEdge performs the required upgrades and maintenance, returns the node to the pool for Latency-based routing, and moves on to the next node.

Configuring Connection Timeouts for DNS Renegotiation

When integrating your application, it's important to ensure that connections periodically renegotiate DNS. We recommend setting a connection timeout of 30 seconds so that clients do not hold connections indefinitely and instead re-resolve DNS at regular intervals.

General Best Practices

  • Configure a reasonable connection timeout (typically 30 seconds) to force reconnection and DNS resolution.
  • Ensure your application handles reconnections gracefully.
  • Use built-in client connection pool settings to enforce timeouts.

Examples in Popular Frameworks

Python (SQLAlchemy)
from sqlalchemy import create_engine
 
db_url = "postgresql://user:password@db-host/dbname"
engine = create_engine(db_url, pool_recycle=30)  # recycle pooled connections after 30 seconds

Docs: SQLAlchemy Connection Pooling

Go (database/sql)
import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq"
)
 
db, err := sql.Open("postgres", "postgres://user:password@db-host/dbname")
if err != nil {
    log.Fatal(err)
}
db.SetConnMaxLifetime(30 * time.Second) // close and re-establish connections after 30 seconds

Docs: database/sql Connection Settings

Node.js (pg)
const { Pool } = require("pg");
const pool = new Pool({
  connectionString: "postgres://user:password@db-host/dbname",
  idleTimeoutMillis: 30000, // close idle clients after 30 seconds
});

Docs: pg Pooling Documentation

Java (HikariCP)
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db-host/dbname");
config.setUsername("user");
config.setPassword("password");
config.setMaximumPoolSize(10);
config.setIdleTimeout(30000); // close idle connections after 30 seconds
config.setMaxLifetime(30000); // force reconnection (and DNS re-resolution) every 30 seconds
HikariDataSource ds = new HikariDataSource(config);

Docs: HikariCP Configuration

Additional Considerations

  • Some frameworks or drivers may cache DNS results internally; ensure they support periodic resolution.
  • If using Kubernetes, consider setting TTLs on service discovery mechanisms (e.g., CoreDNS) to influence how often DNS queries are re-evaluated.
  • In environments with dynamic IP addresses, consider using shorter TTLs at the DNS level.

By correctly configuring connection timeouts, your application can ensure it always connects to the most up-to-date endpoint, reducing downtime and connection failures.

Configuring retries

Retries help handle transient failures and improve application resilience when connecting to a database. Configuring retries with exponential backoff ensures that retry attempts do not overwhelm the database.

General Best Practices

  • Use exponential backoff to space out retry attempts.
  • Limit the maximum number of retries to avoid excessive delay.
  • Log retry attempts for debugging and monitoring.
  • Consider adding jitter to avoid synchronized retry spikes.
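
The sketch below ties these practices together in Python, assuming the psycopg2 driver (any driver with a comparable error model works similarly). The connect_with_retries helper is hypothetical: it caps an exponential backoff, adds random jitter, limits the number of attempts, and logs each retry.

import logging
import random
import time

import psycopg2

logger = logging.getLogger(__name__)

def connect_with_retries(dsn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Open a connection, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_retries + 1):
        try:
            return psycopg2.connect(dsn, connect_timeout=10)
        except psycopg2.OperationalError as exc:
            if attempt == max_retries:
                raise  # give up after the final attempt
            # Exponential backoff capped at max_delay, plus random jitter.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            delay += random.uniform(0, delay / 2)
            logger.warning("Connection attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)

conn = connect_with_retries("postgresql://user:password@db-host/dbname")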

Specifying hostnames

In the event that your use case cannot utilize DNS failover, you must configure your application to be resilient to the described maintenance processes. This should be tested and verified prior to moving any application into Production.

As an alternative to Latency-based routing / DNS failover, you can configure your application with multiple hosts to connect to in the event that one host is unavailable. Any libraries or frameworks based on libpq should be able to leverage multiple hosts specified in a Connection URI. If your application uses another framework or library, you may need to implement custom logic to achieve similar behavior.

With this approach, you will need to manually manage the order of the hosts configured for each application instance, based on where it is geographically deployed. Order the hosts differently on each instance so that each application instance connects to its nearest node by default.
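
As a minimal sketch, assuming a libpq-based driver such as psycopg2, a multi-host connection URI might look like the following (the host names and ports are placeholders). libpq attempts the hosts in the order listed and falls back to the next host if a connection cannot be established, so list the nearest node first on each application instance.

import psycopg2

# Placeholder host names; libpq tries each host in order and falls back to
# the next one if the current host is unreachable. connect_timeout bounds
# how long each individual host attempt may take.
dsn = (
    "postgresql://user:password@"
    "node1.example.com:5432,node2.example.com:5432,node3.example.com:5432"
    "/dbname?connect_timeout=10"
)

conn = psycopg2.connect(dsn)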

Long Running Processes

For scheduled jobs, background workers, or batch tasks, you should take steps to ensure they handle database interruptions gracefully.

Best Practices

  • Retry with backoff: Implement automatic reconnection with exponential backoff.
  • Use checkpointing: Periodically save progress to resume after failures.
  • Handle connection loss: Detect failures and attempt to reconnect before failing a job.
  • Break up long transactions: Where possible, split work into smaller, retryable steps.
  • Account for replication lag: If failing over to another node, ensure your job logic can handle potential data inconsistencies or conflicts.

By designing processes with these safeguards, long-running tasks can recover smoothly from database failovers or maintenance events.
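
As an illustration, the hypothetical Python sketch below combines several of these safeguards (assuming psycopg2; the work_items table, checkpoint file, and batch logic are placeholders): it processes work in small transactions, saves a checkpoint after each batch, and reconnects and resumes from the checkpoint when the connection is lost. In production, the reconnect step would typically use exponential backoff as shown earlier.

import time

import psycopg2

DSN = "postgresql://user:password@db-host/dbname"
CHECKPOINT_FILE = "job.checkpoint"  # placeholder: any durable store works

def load_checkpoint():
    try:
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def save_checkpoint(last_id):
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(last_id))

def run_job(batch_size=1000):
    last_id = load_checkpoint()
    while True:
        conn = None
        try:
            conn = psycopg2.connect(DSN, connect_timeout=10)
            while True:
                with conn, conn.cursor() as cur:
                    # Process one small, retryable batch per transaction.
                    cur.execute(
                        "SELECT id FROM work_items WHERE id > %s ORDER BY id LIMIT %s",
                        (last_id, batch_size),
                    )
                    rows = cur.fetchall()
                    if not rows:
                        return  # all work is complete
                    # ... do the actual work for this batch here ...
                    last_id = rows[-1][0]
                save_checkpoint(last_id)  # resume point survives restarts and failovers
        except psycopg2.OperationalError:
            # Connection lost (for example, during node maintenance):
            # back off briefly, then reconnect and resume from the checkpoint.
            time.sleep(5)
        finally:
            if conn is not None:
                conn.close()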