Best Practices

Before running extraction, check that each document meets the expectations for its format.

HTML Documents

HTML documents should:

  • have a <title> tag for proper title extraction.
  • use semantic HTML (headings, lists, tables) so the content converts cleanly to Markdown.
  • avoid complex layouts that don't translate well to Markdown.
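
As a rough pre-flight check for the first point, the sketch below uses Python's standard-library html.parser to see whether a document has a non-empty <title>. The names TitleFinder and extract_html_title are hypothetical helpers made up for this illustration; they are not part of any extraction tool, and the tool's own title detection may behave differently.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_html_title(html_text):
    """Return the stripped title text, or None if it is missing or empty."""
    finder = TitleFinder()
    finder.feed(html_text)
    finder.close()
    title = (finder.title or "").strip()
    return title or None
```

A document for which extract_html_title returns None is a candidate for adding a <title> before extraction.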

Markdown Documents

Markdown documents should:

  • use a single level-1 heading (#) at the top of each file for title extraction.
  • place YAML frontmatter, if used, at the very top of the file, before the title.
  • follow standard Markdown syntax for best results.
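
The heading and frontmatter points above can be checked with a small script. This is a minimal sketch, assuming ATX-style headings ("# Title") and frontmatter delimited by "---" lines; check_markdown is a hypothetical helper, not part of any extraction tool, and it does not recognise Setext headings (a title underlined with "===").

```python
import re

H1_PATTERN = re.compile(r"^#\s+\S")          # "# Title", but not "## Sub"
FENCE_PATTERN = re.compile(r"^\s*(```|~~~)")  # start or end of a fenced code block


def check_markdown(text):
    """Return a list of problems found; an empty list means the file looks fine."""
    lines = text.splitlines()
    problems = []
    body_start = 0

    # Frontmatter, if present, must open on the very first line with '---'.
    if lines and lines[0].strip() == "---":
        closing = [i for i in range(1, len(lines)) if lines[i].strip() == "---"]
        if not closing:
            return ["frontmatter block is never closed"]
        body_start = closing[0] + 1

    h1_lines = []
    in_fence = False
    for index in range(body_start, len(lines)):
        line = lines[index]
        if FENCE_PATTERN.match(line):
            in_fence = not in_fence  # ignore '#' comments inside code fences
            continue
        if not in_fence and H1_PATTERN.match(line):
            h1_lines.append(index)

    if len(h1_lines) != 1:
        problems.append(
            f"expected exactly one level-1 heading, found {len(h1_lines)}"
        )

    # The first non-blank line after any frontmatter should be the title.
    first_content = next(
        (i for i in range(body_start, len(lines)) if lines[i].strip()), None
    )
    if h1_lines and first_content != h1_lines[0]:
        problems.append("level-1 heading is not the first content in the file")

    return problems
```

For example, check_markdown(open("page.md", encoding="utf-8").read()) returns an empty list for a file that has optional frontmatter followed by a single "# Title" line (page.md is a placeholder file name).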

reStructuredText Documents

reStructuredText documents should:

  • use standard RST heading underline formats.
  • avoid complex directives that may not convert well.
  • be checked by converting a few sample documents first.
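
One common heading problem in RST is an underline shorter than the title text above it. The sketch below is a heuristic that flags such lines; find_short_underlines and is_adornment are hypothetical helpers for illustration, the set of adornment characters is a commonly used subset rather than the full list RST allows, and a real RST parser remains the authoritative check.

```python
# Punctuation characters commonly used for RST heading underlines (a subset).
ADORNMENT_CHARS = set("=-~^\"'`#*+:._")


def is_adornment(line):
    """True if the line is one punctuation character repeated at least twice."""
    return len(line) >= 2 and len(set(line)) == 1 and line[0] in ADORNMENT_CHARS


def find_short_underlines(text):
    """Return (line_number, title) pairs where the underline is shorter than the title."""
    lines = text.splitlines()
    short = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        # Skip blank lines and overlines; only text lines can be titles.
        if not title.strip() or is_adornment(title):
            continue
        if is_adornment(underline) and len(underline) < len(title):
            short.append((i + 1, title))  # 1-based line number of the underline
    return short
```

Running this over a few sample documents before conversion gives a quick list of headings worth fixing by hand.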