Converting and Loading HTML Documents
Extensions: .html, .htm
During an HTML conversion:
- documents are converted to Markdown using a robust HTML-to-Markdown converter
- the title is extracted from the <title> tag and prepended as a # heading
- HTML entities are automatically decoded (e.g., &mdash; becomes —)
- all HTML headings are shifted down one level:
  - <h1> → ## (level 2)
  - <h2> → ### (level 3)
  - <h3> → #### (level 4)
  - and so on...
- paragraphs, lists, links, and basic formatting are preserved
- scripts, styles, and other non-content elements are stripped
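The title extraction, entity decoding, and heading shift described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the converter's actual code: the helper names `extract_title` and `shifted_heading_prefix` are hypothetical, and the regular expression is a simplification that a real HTML parser would replace. The source does not say what happens to <h6>, so the sketch simply adds one `#` to every level.

```python
import html
import re
from typing import Optional


def shifted_heading_prefix(level: int) -> str:
    """Markdown prefix for an HTML heading level, shifted down by one.

    level=1 stands for <h1>, which maps to "##" because "#" is
    reserved for the document title.
    """
    return "#" * (level + 1)


def extract_title(html_text: str) -> Optional[str]:
    """Return the entity-decoded text of the first <title>, or None."""
    match = re.search(
        r"<title[^>]*>(.*?)</title>", html_text, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        return None
    # html.unescape turns entities such as &mdash; into their characters.
    return html.unescape(match.group(1)).strip()
```

For example, `shifted_heading_prefix(2)` gives `###`, and `extract_title` applied to `<title>A &amp; B</title>` gives `A & B`.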
Example
Input HTML:
<!DOCTYPE html>
<html>
<head>
<title>My Document — Getting Started</title>
</head>
<body>
<h1>Introduction</h1>
<p>This is a <strong>sample</strong> document.</p>
<h2>Overview</h2>
<p>More content here.</p>
</body>
</html>
Extracted title: My Document — Getting Started
Converted Markdown:
# My Document — Getting Started
## Introduction
This is a **sample** document.
### Overview
More content here.
Note how the title from <title> becomes #, <h1> becomes ##, and <h2> becomes ###.
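To make the mapping in this example concrete, here is a toy converter built on Python's standard `html.parser` module. It is a sketch under stated assumptions, not the converter this documentation describes: the class name `MarkdownSketch` and the function `convert` are hypothetical, only <title>, <h1> to <h6>, <p>, and <strong> are handled, and blocks are joined with single newlines to mirror the listing above.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MarkdownSketch(HTMLParser):
    """Toy converter: <title>, headings, <p>, <strong>; skips script/style."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &mdash; in text.
        super().__init__(convert_charrefs=True)
        self.title = None
        self.blocks = []       # finished Markdown blocks, in document order
        self._prefix = None    # prefix of the open block, None if none is open
        self._in_title = False
        self._text = []        # text pieces of the open block or title
        self._skip = 0         # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title, self._text = True, []
        elif tag in HEADINGS:
            # Shift down one level: <h1> gets "##", <h2> gets "###", ...
            self._prefix, self._text = "#" * (int(tag[1]) + 1) + " ", []
        elif tag == "p":
            self._prefix, self._text = "", []
        elif tag == "strong" and self._prefix is not None:
            self._text.append("**")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "title" and self._in_title:
            self.title = "".join(self._text).strip()
            self._in_title = False
        elif tag == "strong" and self._prefix is not None:
            self._text.append("**")
        elif (tag == "p" or tag in HEADINGS) and self._prefix is not None:
            self.blocks.append(self._prefix + "".join(self._text).strip())
            self._prefix = None

    def handle_data(self, data):
        # Keep text only inside the title or an open block, never in scripts.
        if self._skip == 0 and (self._in_title or self._prefix is not None):
            self._text.append(data)


def convert(html_text: str) -> str:
    """Return Markdown with the <title> prepended as a level-1 heading."""
    parser = MarkdownSketch()
    parser.feed(html_text)
    parser.close()
    lines = [f"# {parser.title}"] if parser.title else []
    return "\n".join(lines + parser.blocks)
```

Calling `convert` on the input HTML from this example yields the five Markdown lines shown under "Converted Markdown". Lists, links, and the other preserved elements mentioned earlier are deliberately left out to keep the sketch short.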