Architecture
This document describes the internal architecture of the pgEdge RAG Server.
Overview
The RAG server implements a Retrieval-Augmented Generation pipeline that:
- Receives a user query
- Generates an embedding for the query
- Searches the database using hybrid search (vector + BM25)
- Builds context from the most relevant documents
- Generates an answer using an LLM with the context
flowchart LR
subgraph pipeline[RAG Pipeline]
direction LR
Q[Query] --> E[Embedding<br/>Provider]
E --> H[Hybrid<br/>Search]
H --> C[Context Builder<br/>Token Budget]
C --> CP[Completion<br/>Provider]
CP --> R[Response]
end
Components
HTTP Server
The server uses Go's standard net/http package with the following
endpoints (all under the /v1 API version prefix):
- GET /v1/openapi.json - OpenAPI v3 specification
- GET /v1/health - Health check
- GET /v1/pipelines - List available pipelines
- POST /v1/pipelines/{name} - Execute a RAG query
All JSON responses include an RFC 8631 Link header pointing to the OpenAPI
specification for API discovery by tools like restish.
Streaming responses use Server-Sent Events (SSE) for real-time output.
Pipeline Manager
The pipeline manager (internal/pipeline) creates and manages pipeline
instances from the configuration. Each pipeline contains:
- Database connection pool
- Embedding provider
- Completion provider
- Orchestrator
Orchestrator
The orchestrator (internal/pipeline/orchestrator.go) coordinates the RAG
pipeline execution:
- Query Embedding - Converts the query to a vector using the embedding provider
- Hybrid Search - For each configured column pair:
  - Vector search using pgvector similarity
  - BM25 text search for keyword matching
  - Results merged using Reciprocal Rank Fusion (RRF)
- Deduplication - Removes duplicate results across column pairs
- Context Building - Selects documents within the token budget, truncating the last document if needed to fit
- Completion - Sends the context and query to the completion provider to generate an answer
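The deduplication step can be pictured with a small sketch. It assumes that results carry a document ID and a fused score, and that when the same ID is returned for several column pairs the highest-scoring copy is kept. The Result type, its fields, and the dedupe function are hypothetical names for this example, and the real orchestrator may identify duplicates differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a hypothetical search hit; the field names are assumptions.
type Result struct {
	ID      string
	Content string
	Score   float64
}

// dedupe merges the result lists of several column pairs, keeps the
// highest-scoring hit for each ID, and returns hits by descending score.
func dedupe(groups ...[]Result) []Result {
	best := make(map[string]Result)
	for _, group := range groups {
		for _, r := range group {
			if cur, seen := best[r.ID]; !seen || r.Score > cur.Score {
				best[r.ID] = r
			}
		}
	}
	out := make([]Result, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID // stable order for equal scores
	})
	return out
}

func main() {
	titleHits := []Result{{ID: "1", Content: "x", Score: 0.9}, {ID: "2", Content: "y", Score: 0.4}}
	bodyHits := []Result{{ID: "2", Content: "y", Score: 0.7}}
	for _, r := range dedupe(titleHits, bodyHits) {
		fmt.Println(r.ID, r.Score)
	}
}
```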
Hybrid Search
The server combines two search methods:
Vector Search
Uses PostgreSQL's pgvector extension for semantic similarity search:
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT $2
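To bind the query embedding as $1 from Go, one option is to send it in pgvector's text input format, a bracketed comma-separated list of numbers. The helper below is a minimal sketch of that conversion. The name vectorLiteral is hypothetical, and whether the parameter needs an explicit $1::vector cast depends on the driver in use, which this document does not specify.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vectorLiteral renders an embedding in pgvector's text form, e.g. "[0.5,1,-2.25]".
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, x := range v {
		// bitSize 32 yields the shortest string that round-trips a float32.
		parts[i] = strconv.FormatFloat(float64(x), 'g', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

func main() {
	fmt.Println(vectorLiteral([]float32{0.5, 1, -2.25}))
}
```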
BM25 Search
Implements the Okapi BM25 algorithm for keyword matching:
- Tokenization with stop word removal
- IDF (Inverse Document Frequency) scoring
- Term frequency with length normalization
The BM25 implementation uses the Lucene-style IDF formula:
IDF = log(1 + (N - n + 0.5) / (n + 0.5))
Where:
- N = total number of documents
- n = number of documents containing the term
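The IDF formula combines with the length-normalized term frequency as in the sketch below. The constants k1 = 1.2 and b = 0.75 are the conventional Okapi BM25 defaults and are assumed here; the values the server actually uses are not stated in this document. The function names are likewise hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Conventional BM25 parameters, assumed for this sketch.
const (
	k1 = 1.2  // term-frequency saturation
	b  = 0.75 // strength of length normalization
)

// idf is the Lucene-style inverse document frequency. It stays positive
// even when a term occurs in every document.
func idf(totalDocs, docsWithTerm int) float64 {
	N, n := float64(totalDocs), float64(docsWithTerm)
	return math.Log(1 + (N-n+0.5)/(n+0.5))
}

// termScore is one term's contribution to a document's BM25 score.
// tf is the term's count in the document, docLen the document length in
// tokens, and avgLen the average document length in the collection.
func termScore(tf, docLen int, avgLen float64, totalDocs, docsWithTerm int) float64 {
	f := float64(tf)
	norm := f + k1*(1-b+b*float64(docLen)/avgLen)
	return idf(totalDocs, docsWithTerm) * f * (k1 + 1) / norm
}

func main() {
	// A rare term scores higher than a common one, all else being equal.
	fmt.Printf("rare:   %.4f\n", termScore(2, 100, 120, 1000, 5))
	fmt.Printf("common: %.4f\n", termScore(2, 100, 120, 1000, 600))
}
```

A document's full score for a query is the sum of termScore over the query terms that survive tokenization and stop word removal.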
Reciprocal Rank Fusion
Results from both methods are combined using RRF:
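RRF(d) = Σ_i 1 / (k + rank_i(d))
Where the sum runs over the result lists (vector and BM25), rank_i(d) is the position of document d in list i, and k=60 (the standard RRF constant). A document that is absent from a list contributes nothing for that list, so documents appearing in both result sets receive higher combined scores.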
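A minimal sketch of the fusion step follows, assuming each search method returns document IDs ordered best first and that ranks start at 1. The fuse function and the rrfK constant name are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

const rrfK = 60.0 // the k constant from the formula

// fuse combines ranked ID lists (best first) into one ranking and also
// returns the RRF score of every document.
func fuse(lists ...[]string) ([]string, map[string]float64) {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (rrfK + float64(i+1)) // i+1: ranks are 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // deterministic order for ties
	})
	return ids, scores
}

func main() {
	vectorHits := []string{"a", "b", "c"}
	bm25Hits := []string{"b", "d"}
	ids, scores := fuse(vectorHits, bm25Hits)
	for _, id := range ids {
		fmt.Printf("%s %.5f\n", id, scores[id])
	}
}
```

In this example document b is ranked second by vector search and first by BM25, so its two contributions (1/62 + 1/61) place it ahead of a, which only the vector search returned (1/61).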
LLM Providers
The server supports multiple LLM providers through a common interface:
type EmbeddingProvider interface {
Embed(ctx context.Context, text string) ([]float32, error)
EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
Dimensions() int
ModelName() string
}
type CompletionProvider interface {
Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
CompleteStream(ctx context.Context, req CompletionRequest) (<-chan StreamChunk, <-chan error)
ModelName() string
}
Supported providers:
| Provider | Package | Embedding | Completion |
|---|---|---|---|
| OpenAI | internal/llm/openai | Yes | Yes |
| Anthropic | internal/llm/anthropic | No | Yes |
| Voyage | internal/llm/voyage | Yes | No |
| Ollama | internal/llm/ollama | Yes | Yes |
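Because the orchestrator depends only on these interfaces, a provider can be swapped for a test double. The sketch below shows a deterministic fake that satisfies EmbeddingProvider. The fakeEmbedder type, the newFake constructor, and the way the vector is derived are all invented for illustration and are not part of the server.

```go
package main

import (
	"context"
	"fmt"
)

// EmbeddingProvider repeats the interface from this section so the example compiles on its own.
type EmbeddingProvider interface {
	Embed(ctx context.Context, text string) ([]float32, error)
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
	Dimensions() int
	ModelName() string
}

// fakeEmbedder is a hypothetical test double, not a real provider.
type fakeEmbedder struct{ dims int }

// Compile-time check that fakeEmbedder satisfies the interface.
var _ EmbeddingProvider = (*fakeEmbedder)(nil)

func newFake(dims int) EmbeddingProvider {
	if dims < 1 {
		dims = 1
	}
	return &fakeEmbedder{dims: dims}
}

// Embed folds the text's bytes into a fixed-size vector. The result has no
// semantic meaning; it is only stable for a given input.
func (f *fakeEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation like a network-backed provider would
	}
	vec := make([]float32, f.dims)
	for i := 0; i < len(text); i++ {
		vec[i%f.dims] += float32(text[i])
	}
	return vec, nil
}

func (f *fakeEmbedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
	out := make([][]float32, 0, len(texts))
	for _, t := range texts {
		v, err := f.Embed(ctx, t)
		if err != nil {
			return nil, err
		}
		out = append(out, v)
	}
	return out, nil
}

func (f *fakeEmbedder) Dimensions() int   { return f.dims }
func (f *fakeEmbedder) ModelName() string { return "fake-embedder" }

func main() {
	p := newFake(4)
	vecs, err := p.EmbedBatch(context.Background(), []string{"hello", "world"})
	fmt.Println(len(vecs), len(vecs[0]), err, p.ModelName())
}
```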
Token Budget
The token budget prevents sending too much context to the LLM. The orchestrator:
- Estimates tokens for each document (approximately 4 characters per token)
- Includes documents until the budget is reached
- Truncates the final document at a sentence boundary if it exceeds the remaining budget
This ensures predictable LLM costs while maximizing relevant context.
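The budgeting steps above can be sketched as follows. The example assumes the estimate rounds up, that a sentence ends at ".", "!" or "?", and that truncation falls back to a hard cut when no sentence boundary fits. All function names are hypothetical, and the server's own rules may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const charsPerToken = 4 // rough estimate from the text above

// estimateTokens approximates the token count from the byte length, rounding up.
func estimateTokens(s string) int {
	return (len(s) + charsPerToken - 1) / charsPerToken
}

// truncateAtSentence shortens s to fit maxTokens, cutting after the last
// sentence terminator that fits, or at the size limit if there is none.
func truncateAtSentence(s string, maxTokens int) string {
	limit := maxTokens * charsPerToken
	if len(s) <= limit {
		return s
	}
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit-- // do not split a multi-byte character
	}
	cut := s[:limit]
	if i := strings.LastIndexAny(cut, ".!?"); i >= 0 {
		return cut[:i+1]
	}
	return cut
}

// buildContext adds documents in rank order until the budget is used up.
func buildContext(docs []string, budget int) []string {
	var out []string
	remaining := budget
	for _, d := range docs {
		if remaining <= 0 {
			break
		}
		if t := estimateTokens(d); t <= remaining {
			out = append(out, d)
			remaining -= t
			continue
		}
		// The document does not fit: keep what fits, then stop.
		if part := truncateAtSentence(d, remaining); part != "" {
			out = append(out, part)
		}
		break
	}
	return out
}

func main() {
	docs := []string{
		"First doc. It is short.",
		"Second doc is long. It will be cut. Really.",
	}
	for _, d := range buildContext(docs, 12) {
		fmt.Printf("%q\n", d)
	}
}
```

With a budget of 12 tokens, the first document (23 characters, about 6 tokens) fits whole, and the second is cut back to its first sentence.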
Database Schema Requirements
Each table used in a pipeline must have:
- A text column containing the document content
- A vector column containing the embedding (using pgvector)
Example schema:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536) -- Adjust dimension for your model
);
-- Create index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Error Handling
The server uses structured error responses:
{
"error": {
"code": "ERROR_CODE",
"message": "Human-readable message"
}
}
Error codes:
- INVALID_REQUEST - Bad request format or missing fields
- PIPELINE_NOT_FOUND - Requested pipeline doesn't exist
- EXECUTION_ERROR - Pipeline execution failed
- STREAMING_ERROR - SSE streaming failed
- INTERNAL_ERROR - Unexpected server error
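The error envelope maps naturally onto a pair of Go structs. The sketch below shows one way to produce it. The type names ErrorResponse and ErrorBody and the encodeError helper are assumptions; only the JSON shape and the error codes come from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ErrorBody and ErrorResponse are hypothetical types matching the JSON shape.
type ErrorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type ErrorResponse struct {
	Error ErrorBody `json:"error"`
}

// encodeError renders the structured error envelope as a JSON string.
func encodeError(code, message string) (string, error) {
	b, err := json.Marshal(ErrorResponse{Error: ErrorBody{Code: code, Message: message}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeError("PIPELINE_NOT_FOUND", "pipeline not found")
	fmt.Println(s, err)
}
```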
Logging
The server uses Go's structured logging (log/slog) with JSON output.
Log levels:
- DEBUG - Detailed execution information
- INFO - Normal operations
- WARN - Non-fatal issues (e.g., search failures on one column pair)
- ERROR - Failures requiring attention
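A logger of this kind can be built with log/slog's JSON handler (Go 1.21 or later). The sketch below is illustrative only: the newLogger and logSample helpers and the attribute names are assumptions, and the server's actual configuration is not shown in this document.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// newLogger returns a JSON logger that drops records below the given level.
func newLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

// logSample writes one DEBUG and one WARN record at INFO level and returns
// the captured output together with the number of lines emitted.
func logSample() (string, int) {
	var buf bytes.Buffer
	logger := newLogger(&buf, slog.LevelInfo)
	logger.Debug("embedding generated", "dimensions", 1536) // below INFO, dropped
	logger.Warn("search failed on one column pair", "pipeline", "docs")
	out := buf.String()
	return out, strings.Count(out, "\n")
}

func main() {
	out, lines := logSample()
	fmt.Print(out)
	fmt.Println("lines:", lines)
}
```

Each record is written as a single JSON object per line with time, level, and msg keys followed by the supplied attributes.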
Concurrency
The server handles concurrent requests safely:
- Each request gets its own context
- Database connections are pooled
- BM25 index is cleared and rebuilt per-request (stateless)
- Streaming responses handle client disconnection via context cancellation
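The last point can be illustrated with a streaming loop that stops as soon as the request context is cancelled. This is a sketch under assumptions: the streamSSE and demoStream functions are hypothetical, the event format is a plain SSE data field, and each chunk is assumed to contain no newlines.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamSSE writes chunks as Server-Sent Events until the channel closes or
// the client disconnects, and returns the number of events written.
func streamSSE(w http.ResponseWriter, r *http.Request, chunks <-chan string) int {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, _ := w.(http.Flusher)
	sent := 0
	for {
		if r.Context().Err() != nil {
			return sent // client already gone, stop before writing more
		}
		select {
		case <-r.Context().Done():
			return sent
		case c, ok := <-chunks:
			if !ok {
				return sent // producer finished
			}
			fmt.Fprintf(w, "data: %s\n\n", c)
			if flusher != nil {
				flusher.Flush() // push the event to the client immediately
			}
			sent++
		}
	}
}

// demoStream runs streamSSE in memory, optionally with a cancelled context.
func demoStream(cancelFirst bool) (int, string) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel() // simulate a client that disconnected
	}
	chunks := make(chan string, 2)
	chunks <- "a"
	chunks <- "b"
	close(chunks)
	req := httptest.NewRequest(http.MethodPost, "/v1/pipelines/docs", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	n := streamSSE(rec, req, chunks)
	return n, rec.Body.String()
}

func main() {
	n, body := demoStream(false)
	fmt.Printf("%d %q\n", n, body)
	n, body = demoStream(true)
	fmt.Printf("%d %q\n", n, body)
}
```

In a real server net/http cancels the request context when the client connection closes, so the same loop ends without any extra bookkeeping.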