Introduction to Python Markdown

Python Markdown is a solid, well-structured implementation of Markdown. It is not especially fast, first because Python itself isn’t very fast (compared to Perl), and second because it takes a multiphase approach. With no extensions enabled, there are 17 steps (called processors) in four phases, and each processor examines the entire document. Extensions may add processors to one or more phases; for example, the footnotes extension appears in three of the four phases listed below (every phase except the block parser).
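
To make the cost of the multiphase approach concrete, here is a minimal sketch of such a pipeline. It is a toy, not Python Markdown's actual code: the Pipeline class, the PHASES names, and the register method are all hypothetical, and every processor takes and returns a plain string (the real phases hand over lines, then an ElementTree, then text again).

```python
from typing import Callable, Dict, List

# Hypothetical phase names mirroring the four phases described above.
PHASES = ["preprocessors", "blockprocessors", "treeprocessors", "postprocessors"]

# Simplification: every toy processor maps a whole document to a whole document.
Processor = Callable[[str], str]


class Pipeline:
    """Toy multiphase pipeline in which every processor sees the whole document."""

    def __init__(self) -> None:
        self.phases: Dict[str, List[Processor]] = {name: [] for name in PHASES}

    def register(self, phase: str, processor: Processor) -> None:
        # An extension may call this for one phase or for several.
        self.phases[phase].append(processor)

    def count(self) -> int:
        # Total number of processors across all phases.
        return sum(len(procs) for procs in self.phases.values())

    def convert(self, text: str) -> str:
        for name in PHASES:
            for processor in self.phases[name]:
                text = processor(text)  # one full pass over the document
        return text


pipeline = Pipeline()
pipeline.register("preprocessors", lambda t: t.replace("\r\n", "\n"))
pipeline.register("postprocessors", lambda t: t.strip() + "\n")
```

Each registered processor adds one more full pass, so the running time grows with the number of processors as well as with the size of the document.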

The lists below include both standard processors, identified by their tags in bold, and extensions, which are italicized. Python Markdown currently ships with 16 standard extensions, plus one more named extra, a special-purpose extension that imports seven of the most useful ones; these are marked (Extra) in the lists below.

Phase 1: Pre-processors munge the input text:

  • normalize_whitespace: normalize whitespace for consistent parsing
  • gentoc_remove: remove a Table of Contents section generated by genTOC (mine)
  • meta: process a metadata block
  • auc_headers: Automatic Underline Headers preprocessor (mine)
  • fenced_code: process and pygmentize fenced code blocks (Extra)
  • html_block (if safeMode is not ‘escape’)
  • footnotes: process footnotes (Extra)
  • abbr: process abbreviations (Extra)
  • reference: remove reference definitions from text and store for later use
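
As an illustration of what a pre-processor does, the sketch below imitates two of the steps above, normalize_whitespace and reference. It is an assumption-laden toy, not the library's code: the function names, the simplified REFERENCE_RE pattern, and the four-column tab width are mine.

```python
import re
from typing import Dict, List, Tuple

# Simplified, hypothetical pattern for a reference definition such as:
#   [id]: http://example.com/ "Optional Title"
REFERENCE_RE = re.compile(r'^ {0,3}\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')


def normalize_whitespace(text: str) -> List[str]:
    """Unify line endings, expand tabs, and return the text as a list of lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.expandtabs(4).split("\n")


def strip_references(lines: List[str]) -> Tuple[List[str], Dict[str, Tuple[str, str]]]:
    """Remove reference definitions from the lines and keep them for later use."""
    references: Dict[str, Tuple[str, str]] = {}
    kept: List[str] = []
    for line in lines:
        match = REFERENCE_RE.match(line)
        if match:
            ref_id, url, title = match.groups()
            # Reference ids are matched case-insensitively, so store them lowercased.
            references[ref_id.lower()] = (url, title or "")
        else:
            kept.append(line)
    return kept, references
```

The stored references are not used in this phase; a later step that resolves links such as [text][id] would look them up.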

Phase 2: Block parser processors parse the high-level structural elements of the pre-processed text into an ElementTree:

  • admonition: Adds rST-style admonitions
  • markdown_in_html: Process markdown inside <html> blocks (Extra)
  • empty: process blocks that are empty or start with an empty line
  • indent: process children of list items
  • def_list (indent): process indentations for definition lists (Extra)
  • auc_headers: process Automatic Underline Headers (mine)
  • code: process code blocks
  • tables: process Markdown tables (Extra)
  • hashheader: process hash headers (ATX headers)
  • setextheader: process Setext-style headers
  • hr: Process horizontal rules (HTML <hr>)
  • sane_lists: sanely process ordered and unordered lists
  • def_list: process definition lists (Extra)
  • olist: process ordered list blocks
  • ulist: process unordered list blocks
  • quote: process block-quote blocks
  • paragraph: process Paragraph blocks
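
The order of the block processors matters: each block is offered to the processors in turn, and paragraph comes last because it accepts anything. The sketch below shows that idea with three toy processors that build an ElementTree. It is only an approximation under my own assumptions: the regular expressions are simplified, the "div" root element is arbitrary, and each processor is a single function, whereas Python Markdown's block processors are classes with separate test and run methods.

```python
import re
import xml.etree.ElementTree as etree


def hashheader(block: str, parent: etree.Element) -> bool:
    """Handle a single-line ATX header such as '## Title ##'."""
    match = re.match(r"^(#{1,6})\s+(.*?)\s*#*$", block)
    if not match:
        return False
    heading = etree.SubElement(parent, "h%d" % len(match.group(1)))
    heading.text = match.group(2)
    return True


def hr(block: str, parent: etree.Element) -> bool:
    """Handle a horizontal rule such as '---' or '* * *'."""
    if not re.match(r"^ {0,3}([-*_])( ?\1){2,} *$", block):
        return False
    etree.SubElement(parent, "hr")
    return True


def paragraph(block: str, parent: etree.Element) -> bool:
    """Catch-all: accepts every block, so it must be tried last."""
    etree.SubElement(parent, "p").text = block
    return True


# Tried in order; the first processor that accepts the block wins.
BLOCK_PROCESSORS = [hashheader, hr, paragraph]


def parse_blocks(text: str) -> etree.Element:
    """Split the text on blank lines and build a tree from the blocks."""
    root = etree.Element("div")
    for block in re.split(r"\n\s*\n", text.strip()):
        for processor in BLOCK_PROCESSORS:
            if processor(block, root):
                break
    return root
```

Calling parse_blocks("# Title\n\nSome text.\n\n---\n") yields a root element with h1, p, and hr children, in that order.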

Phase 3: Tree processors are run against the ElementTree:

  • footnotes: process footnotes (Extra)
  • autoxref: Link [inline references] or [inline references] to headers (mine)
  • codehilite: Pygmentize code blocks
  • inline: apply inlines such as italics, bold, code, etc.
    • backtick
    • escape
    • reference
    • link
    • image_link
    • image_reference
    • short_reference
    • autolink
    • automail
    • linebreak
    • html
    • entity
    • wikilinks (extension)
    • not_strong
    • em_strong
    • strong_em
    • strong
    • emphasis
    • emphasis2
    • smart_strong (extension)
    • smarty (extension)
    • nl2br (extension)
  • attrlist: add attributes from Markdown to HTML objects (Extra)
  • inline: called a second time?
  • toc: add IDs to headers and create a Table of Contents
  • prettify: add linebreaks to the HTML document
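
A tree processor receives the whole ElementTree and may modify it in place. As a rough sketch of what a step like toc has to do, the code below walks the tree, gives each header a unique id, and collects the entries for a table of contents. The slugify rules, the "_1" suffix for duplicates, and the function names are my assumptions, not the extension's actual behavior.

```python
import re
import xml.etree.ElementTree as etree
from typing import List, Tuple


def slugify(value: str) -> str:
    """Turn header text into an id: drop punctuation, lowercase, hyphenate."""
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[\s-]+", "-", value)


def add_header_ids(root: etree.Element) -> List[Tuple[int, str, str]]:
    """Set an id on every h1..h6 and return (level, id, title) entries."""
    toc: List[Tuple[int, str, str]] = []
    used = set()
    for element in root.iter():
        if not re.fullmatch(r"h[1-6]", element.tag):
            continue
        title = "".join(element.itertext())
        base = slugify(title) or "section"
        slug, n = base, 1
        while slug in used:  # make duplicate ids unique
            slug = "%s_%d" % (base, n)
            n += 1
        used.add(slug)
        element.set("id", slug)
        toc.append((int(element.tag[1]), slug, title))
    return toc
```

Because the ids are assigned on the tree, they survive serialization, and the returned entries are enough to render a nested list of links.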

Phase 4: Post-processors are run against the text after the ElementTree has been serialized into text:

  • raw_html: Restore raw html to the document
  • h1h2_uplinks: Add “TOC” and “Top” links to H1 and H2 headers
  • amp_substitute: Restore valid entities
  • footnotes: footnotes post-processor (Extra)
  • unescape: Restore escaped chars
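
The raw_html step implies that raw HTML was set aside earlier and replaced with placeholders, which the post-processor swaps back once the tree has been serialized. Here is a minimal sketch of that stash-and-restore idea. The HtmlStash class, the PLACEHOLDER format, and the paragraph unwrapping are assumptions made for illustration, not the library's implementation.

```python
from typing import List

# Hypothetical placeholder format; control characters are unlikely in real input.
PLACEHOLDER = "\x02html:%d\x03"


class HtmlStash:
    """Holds raw HTML blocks that were removed from the text in an earlier phase."""

    def __init__(self) -> None:
        self.blocks: List[str] = []

    def store(self, html: str) -> str:
        # Remember the block and return the placeholder that stands in for it.
        self.blocks.append(html)
        return PLACEHOLDER % (len(self.blocks) - 1)


def restore_raw_html(text: str, stash: HtmlStash) -> str:
    """Post-processor: put the stashed HTML back into the serialized text."""
    for index, html in enumerate(stash.blocks):
        placeholder = PLACEHOLDER % index
        # A placeholder that ended up alone in a paragraph loses the <p> wrapper.
        text = text.replace("<p>%s</p>" % placeholder, html)
        text = text.replace(placeholder, html)
    return text
```

Keeping the raw HTML out of the document until this point means none of the earlier processors can mangle it.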