Database Setup
This guide shows how to set up your PostgreSQL database for use with the pgEdge Document Loader.
Table Requirements
The tool can work with any table structure, as long as you map the appropriate columns. However, you must create the table before running the tool.
Example Table Schemas
Minimal Schema
A minimal table with just content and filename:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
filename TEXT UNIQUE
);
Use with:
pgedge-docloader \
--source ./docs \
--db-table documents \
--col-doc-content content \
--col-file-name filename \
... other connection options ...
Recommended Schema
A recommended schema with full metadata:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT NOT NULL,
source BYTEA,
filename TEXT UNIQUE NOT NULL,
file_modified TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for faster lookups
CREATE INDEX idx_documents_filename ON documents(filename);
-- Index for full-text search (optional)
CREATE INDEX idx_documents_content_fts ON documents
USING gin(to_tsvector('english', content));
Use with:
pgedge-docloader \
--source ./docs \
--db-table documents \
--col-doc-title title \
--col-doc-content content \
--col-source-content source \
--col-file-name filename \
--col-file-modified file_modified \
--col-row-created created_at \
--col-row-updated updated_at \
... other connection options ...
Schema for Vector Search
For use with pgvector (semantic search):
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT NOT NULL,
source BYTEA,
filename TEXT UNIQUE NOT NULL,
file_modified TIMESTAMP,
embedding vector(1536), -- For OpenAI embeddings
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for vector similarity search
CREATE INDEX idx_documents_embedding ON documents
USING ivfflat (embedding vector_cosine_ops);
-- Index for filename lookups
CREATE INDEX idx_documents_filename ON documents(filename);
Note: The embedding column must be populated separately using an embedding model.
Column Data Types
The tool expects the following data types for each column type:
- doc_title:
TEXTorVARCHAR - doc_content:
TEXTorVARCHAR - source_content:
BYTEA(binary data for storing original source) - file_name:
TEXTorVARCHAR(recommend UNIQUE constraint for update mode) - file_created:
TIMESTAMPorTIMESTAMPTZ - file_modified:
TIMESTAMPorTIMESTAMPTZ - row_created:
TIMESTAMPorTIMESTAMPTZ(recommend DEFAULT CURRENT_TIMESTAMP) - row_updated:
TIMESTAMPorTIMESTAMPTZ(recommend DEFAULT CURRENT_TIMESTAMP)
Update Mode Considerations
When using --update mode:
- The
filenamecolumn should have a UNIQUE constraint - The tool matches existing rows by filename
- If a match is found, the row is updated
- If no match is found, a new row is inserted
Example schema for update mode:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
filename TEXT UNIQUE NOT NULL, -- UNIQUE constraint required
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Permissions
The database user must have the following permissions:
-- Grant INSERT permission
GRANT INSERT ON documents TO myuser;
-- Grant UPDATE permission (for --update mode)
GRANT UPDATE ON documents TO myuser;
-- Grant SELECT permission (for checking existing rows)
GRANT SELECT ON documents TO myuser;
-- Grant USAGE on sequence (for SERIAL columns)
GRANT USAGE, SELECT ON SEQUENCE documents_id_seq TO myuser;
Creating the Database
Complete example of setting up a new database:
-- Connect as superuser
psql -U postgres
-- Create database
CREATE DATABASE docdb;
-- Create user
CREATE USER docloader WITH PASSWORD 'secure_password';
-- Connect to the new database
\c docdb
-- Create table
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT NOT NULL,
source BYTEA,
filename TEXT UNIQUE NOT NULL,
file_modified TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Grant permissions
GRANT SELECT, INSERT, UPDATE ON documents TO docloader;
GRANT USAGE, SELECT ON SEQUENCE documents_id_seq TO docloader;
-- Create indexes
CREATE INDEX idx_documents_filename ON documents(filename);
CREATE INDEX idx_documents_content_fts ON documents
USING gin(to_tsvector('english', content));
Verification
Verify your setup:
-- Check table structure
\d documents
-- Check permissions
SELECT grantee, privilege_type
FROM information_schema.role_table_grants
WHERE table_name = 'documents';
-- Test insert (as the docloader user)
INSERT INTO documents (title, content, filename)
VALUES ('Test', 'Test content', 'test.md');
-- Verify
SELECT * FROM documents WHERE filename = 'test.md';
-- Clean up test data
DELETE FROM documents WHERE filename = 'test.md';
Common Table Patterns
Simple Documentation Store
CREATE TABLE docs (
id SERIAL PRIMARY KEY,
title TEXT,
body TEXT,
path TEXT UNIQUE,
updated TIMESTAMP
);
Knowledge Base with Categories
CREATE TABLE knowledge_base (
id SERIAL PRIMARY KEY,
category TEXT,
title TEXT,
content_md TEXT,
content_original BYTEA,
source_file TEXT UNIQUE,
modified_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_kb_category ON knowledge_base(category);
Multi-language Documentation
CREATE TABLE documentation (
id SERIAL PRIMARY KEY,
language TEXT DEFAULT 'en',
title TEXT,
content TEXT,
filepath TEXT NOT NULL,
modified TIMESTAMP,
UNIQUE(language, filepath)
);
CREATE INDEX idx_docs_language ON documentation(language);
Next Steps
- Usage - Learn how to run the tool
- Configuration - Set up your configuration file