Supported Document Formats
The pgEdge Document Loader supports multiple document formats. All formats are automatically detected and converted to Markdown.
HTML Documents
Extensions: .html, .htm
Processing
- Converted to Markdown using a robust HTML-to-Markdown converter
- Title extracted from
<title>tag and prepended as#heading - HTML entities automatically decoded (e.g.,
—becomes—) - All HTML headings shifted down one level:
<h1>→##(level 2)<h2>→###(level 3)<h3>→####(level 4)- And so on...
- Preserves paragraphs, lists, links, and basic formatting
- Strips scripts, styles, and other non-content elements
Example
Input HTML:
<!DOCTYPE html>
<html>
<head>
<title>My Document — Getting Started</title>
</head>
<body>
<h1>Introduction</h1>
<p>This is a <strong>sample</strong> document.</p>
<h2>Overview</h2>
<p>More content here.</p>
</body>
</html>
Extracted title: My Document — Getting Started
Converted Markdown:
# My Document — Getting Started
## Introduction
This is a **sample** document.
### Overview
More content here.
Note how the title from <title> becomes #, <h1> becomes ##, and
<h2> becomes ###.
Markdown Documents
Extensions: .md
Processing
- Already in target format - passed through unchanged
- Title extracted from first level-1 heading (
# Title) - Skips YAML frontmatter when extracting title
- Preserves all Markdown formatting
Example
Input Markdown:
---
author: John Doe
date: 2024-01-15
---
# My Document
This is the content.
Extracted title: My Document (YAML frontmatter ignored)
reStructuredText Documents
Extensions: .rst
Processing
- Converted to Markdown format
- Title extracted from underlined headings (first heading found)
- Heading underlines converted to Markdown
#style - RST anchors and directives are stripped from both titles and headings
- Both underline-only and overline+underline headings are supported
Example with directives:
RST heading with directive:
`Add Named Restore Point Dialog`:index:
========================================
Extracted title: Add Named Restore Point Dialog (directive removed)
Heading Conversion
RST uses underlines (and optionally overlines) to denote headings. The heading level is determined by the order in which each underline pattern first appears in the document, not by the specific punctuation character used.
Simple underline heading:
Main Title
==========
Section
-------
Heading with overline and underline:
.. _coding_standards:
*************************
`Coding Standards`:index:
*************************
Sub heading 1
*************
Both convert to:
# Coding Standards
## Sub heading 1
Key features:
- Any punctuation character can be used for underlines
- First pattern encountered becomes level 1 (
#) - Second pattern becomes level 2 (
##), and so on - RST anchors and directives (
.. name:or.. _name:) are automatically stripped from the output - Inline directives (
:index:,:ref:, etc.) are removed from heading text - Headings with overline+underline are distinct from underline-only headings
Image Conversion
RST image and figure directives are converted to Markdown format:
.. image:: path/to/image.png
:alt: Image description
:width: 500px
Converts to:

Both .. image:: and .. figure:: directives are supported. Alt text is
extracted from the :alt: option if present. Other options (width, height,
etc.) are ignored as they are not part of standard Markdown image syntax.
Limitations
- Complex RST directives are not fully supported
- Only basic conversion is performed
- Advanced features (tables, code blocks with options) may not convert perfectly
- Image options like width, height, and alignment are not preserved in the Markdown output
Unsupported Formats
The following formats are not supported:
- Microsoft Word (
.doc,.docx) - OpenDocument (
.odt) - Rich Text Format (
.rtf) - Plain text (
.txt) - LaTeX (
.tex)
Handling Unsupported Files
Single file: If you specify an unsupported file directly, the tool will exit with an error.
$ pgedge-docloader --source file.txt ...
Error: unsupported file type: file.txt
Directory or glob: Unsupported files are skipped with an informational message.
$ pgedge-docloader --source ./docs ...
Processing files from: ./docs
Skipping unsupported file: ./docs/readme.txt
Skipping unsupported file: ./docs/image.png
Processed 10 file(s), skipped 2 file(s)
Checking Supported Formats
List all supported formats:
$ pgedge-docloader formats
Supported document formats:
.html
.htm
.md
.rst
Format Detection
Format detection is based solely on file extension (case-insensitive):
document.HTML→ Detected as HTMLREADME.MD→ Detected as Markdownguide.RST→ Detected as reStructuredText
Files without extensions or with unknown extensions are treated as unsupported.
Best Practices
HTML Documents
- Ensure documents have a
<title>tag for proper title extraction - Use semantic HTML for better Markdown conversion
- Avoid complex layouts that don't translate well to Markdown
Markdown Documents
- Use a single level-1 heading (
#) at the top for title extraction - Place YAML frontmatter before the title if using frontmatter
- Follow standard Markdown syntax for best results
reStructuredText Documents
- Use standard RST heading underlines
- Avoid complex directives that may not convert well
- Test conversion with sample documents
Future Format Support
Potential formats for future support:
- Microsoft Word (
.docx) - OpenDocument (
.odt) - EPUB (
.epub) - AsciiDoc (
.adoc)
Next Steps
- Usage - Learn how to process documents
- Configuration - Set up column mappings