commarkk

The project of late has been commarkk. It’s an ambitious attempt to implement the CommonMark specification using Perl and the parsing strategy outlined in Appendix A of the spec.

It started when I did another round of maintenance on my venerable notes2html application, in an attempt to make it play nicely with both MultiMarkdown (slow, but has nice features like tables and automatic cross-references) and cmark (a CommonMark implementation written in C: fast, but lacking MultiMarkdown’s feature set). That work ballooned the program from 550 lines of Perl code to almost 1100, and it still wasn’t working properly.

In an attempt to fix the issues with notes2html I decided to have a go at implementing the block structure parser as outlined in the CommonMark spec. It turned out to be more work than I had expected. A lot more work!

Block parser

Nearly three weeks have passed, and I’ve done little else in my ample spare time but work on the block parser. It’s almost complete. To date I’ve implemented:

  • Headings at levels 1 through 6 in ATX mode (prefixed with ‘#’), setext (Structure Enhanced Text) mode (underlined with ‘===’ or ‘---’), and AUC (Arbitrary Underline Character) mode, which came from notes2html
  • Thematic breaks (e.g. ‘* * *’, which renders in HTML as <hr>)
  • Ordered and unordered lists, at multiple levels of indentation
  • Indented code blocks
  • Fenced code blocks (starting and ending with a line of backticks or tildes)
  • Link reference definitions
  • Seven different types of raw HTML blocks
  • Definition lists (a MultiMarkdown extension)
  • Comment lines prefixed with % (seen on discussion forums but not implemented elsewhere)
  • Metadata blocks (a MultiMarkdown extension)
  • Blockquotes at multiple levels of indentation, which are very tricky because they can encapsulate pretty much all other types listed here
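
To give a flavor of what recognizing these block types involves, here is a minimal sketch of line-level recognizers for three of them: ATX headings, thematic breaks, and fence openers. It is written in Python rather than commarkk’s Perl, and it is not the commarkk code; the names (classify and the three patterns) are hypothetical. The patterns follow the CommonMark rules as I read them (up to three spaces of indentation, one to six ‘#’ characters followed by whitespace or end of line, and so on), but the sketch ignores context such as open blockquotes and lists, and does not strip a closing run of ‘#’ characters.

```python
import re

# Up to 3 spaces of indent, 1-6 '#', then whitespace + content or end of line.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*))?$")
# Three or more of the same character (-, _ or *), optionally spaced out.
THEMATIC_BREAK = re.compile(r"^ {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}$")
# Three or more backticks or tildes, followed by an optional info string.
FENCE_OPEN = re.compile(r"^ {0,3}(`{3,}|~{3,})[ \t]*(.*)$")


def classify(line):
    """Return a tuple describing what kind of block this line could start."""
    line = line.rstrip("\n")
    m = ATX_HEADING.match(line)
    if m:
        return ("heading", len(m.group(1)), (m.group(2) or "").strip())
    if THEMATIC_BREAK.match(line):
        return ("thematic_break",)
    m = FENCE_OPEN.match(line)
    # A backtick fence may not have backticks in its info string.
    if m and not (m.group(1)[0] == "`" and "`" in m.group(2)):
        return ("fence", m.group(1), m.group(2).strip())
    return ("other",)


print(classify("## Block parser"))
print(classify("* * *"))
print(classify("```perl"))
```

Note that a line such as ‘---’ is classified here as a thematic break; whether it is really a setext underline depends on whether a paragraph is open, which is exactly the kind of context this sketch leaves out.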

Still to do are tables (from MultiMarkdown), image links, and some additional types of inline reference links used in MultiMarkdown: footnotes, glossary, and citation. There are also extended text links, which are unique to commarkk.

As of this writing, the parser is about 645 lines of code when line-level comments, tracing code, and most blank lines are removed. 70 lines are used by code that analyzes link references, which I intend to break out into a separate module because the inline parser will need it as well.

(By violating a lot of Perl conventions and increasing the line width to about 120 characters, I can actually get the code down to 370 lines!)

Still to do: inline parser, renderer, and wrapper

And that’s just the block parser. There are other pieces that need to be done as well:

  • The inline parser, which recognizes boldface and emphasis markers (and possibly strikethrough and underline), and locates inline links with the goal of turning them into proper links with text and destination.
  • The renderer, which takes the result of the block and inline parse operations and generates a document. For now I intend only to write an HTML renderer; I don’t know enough LaTeX to write one for that document processor!
  • A wrapper routine that accepts input from the user, calls the parsers and the user’s renderer of choice (“any renderer you want, so long as it’s HTML”), and writes the resulting stream to the user’s selected output (stdout, string, file handle, or file).
  • A wrapper program that does the CSS work that notes2html currently does.
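
The way these pieces fit together can be sketched as follows. This is a Python illustration of the planned pipeline, not commarkk code (which is Perl and not yet written for these stages). Every name here (parse_blocks, parse_inlines, render_html, RENDERERS, convert) is hypothetical, and the two parsers are deliberately trivial placeholders: the block stage only splits paragraphs on blank lines, and the inline stage passes text through unchanged.

```python
import html
import io


def parse_blocks(text):
    """Placeholder block parser: blank-line-separated chunks become paragraphs."""
    return [{"type": "paragraph", "text": " ".join(chunk.split())}
            for chunk in text.split("\n\n") if chunk.strip()]


def parse_inlines(blocks):
    """Placeholder inline parser: returns the blocks unchanged."""
    return blocks


def render_html(blocks):
    """Render each paragraph block as an escaped <p> element."""
    return "".join(f"<p>{html.escape(b['text'])}</p>\n" for b in blocks)


# "Any renderer you want, so long as it's HTML."
RENDERERS = {"html": render_html}


def convert(text, renderer="html", out=None):
    """Parse, render, and deliver the result.

    out may be None (return a string), an object with a write() method
    (such as sys.stdout or an open file handle), or a file path.
    """
    if renderer not in RENDERERS:
        raise ValueError(f"unknown renderer: {renderer}")
    result = RENDERERS[renderer](parse_inlines(parse_blocks(text)))
    if out is None:
        return result
    if hasattr(out, "write"):
        out.write(result)
    else:
        with open(out, "w", encoding="utf-8") as fh:
            fh.write(result)
    return None


# Usage: render to a string, then to an in-memory file handle.
as_string = convert("Hello\nworld\n\nA & B")
buffer = io.StringIO()
convert("Speed: fast!", out=buffer)
```

Keeping the renderers in a lookup table means a LaTeX renderer, should one ever appear, would only need a new table entry rather than a change to the wrapper.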

Speed: fast!

A goal of commarkk has been to write it for speed. One reason MultiMarkdown scales so poorly is that every single line of input is tested against every possible Markdown block and inline type. By contrast, commarkk tries as far as possible to identify only the processing that needs to be done for a given line or block, then goes ahead and does it.

But that makes for tricky programming, because there are many special cases in the specification. Some block types can interrupt a paragraph, while at other times most processing can be bypassed altogether, such as when a line is simply being appended to a current paragraph, or an HTML or code block is being processed. Blank lines can close current blocks, or not, depending on where certain text margins are on the line. So there’s a lot of examination of the processing flow, seeing what’s being worked on right now, and how the next line (which might close a current block and/or introduce a new one) might affect it.
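
The idea of doing only the necessary work per line can be illustrated with a small sketch. It is in Python rather than Perl, and it is an assumption about how such a dispatch could look, not a description of commarkk’s internals. The table CANDIDATES and the function candidate_checks are hypothetical names. The sketch covers two shortcuts: looking at the first significant character to decide which recognizers are worth running, and testing for nothing but the closing fence while a fenced code block is open. It ignores indentation, paragraph continuation, and container blocks such as blockquotes.

```python
import re

FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})[ \t]*(.*)$")

# Hypothetical table: first significant character -> recognizers worth trying.
CANDIDATES = {
    "#": ["atx_heading"],
    ">": ["blockquote"],
    "`": ["fenced_code"],
    "~": ["fenced_code"],
    "<": ["html_block"],
    "[": ["link_reference_definition"],
    "%": ["comment_line"],
    "*": ["thematic_break", "bullet_list"],
    "-": ["thematic_break", "setext_underline", "bullet_list"],
    "_": ["thematic_break"],
    "+": ["bullet_list"],
    "=": ["setext_underline"],
}


def candidate_checks(lines):
    """For each line, list the recognizers that would have to run.

    An empty list means the line is blank or plain paragraph text.
    """
    open_fence = None
    plan = []
    for line in lines:
        m = FENCE.match(line)
        if open_fence is not None:
            # Inside a fenced code block only the closing fence matters.
            plan.append(["closing_fence"])
            if (m and m.group(1)[0] == open_fence[0]
                    and len(m.group(1)) >= len(open_fence)
                    and not m.group(2).strip()):
                open_fence = None
            continue
        stripped = line.strip()
        if not stripped:
            plan.append([])
            continue
        first = stripped[0]
        if first.isdigit():
            plan.append(["ordered_list"])
        else:
            plan.append(list(CANDIDATES.get(first, [])))
        # Remember an opening fence so later lines take the fast path.
        if m and not (first == "`" and "`" in m.group(2)):
            open_fence = m.group(1)
    return plan


sample = ["# Title", "Some text", "```", "# not a heading",
          "* not a list", "```", "", "- item"]
for line, checks in zip(sample, candidate_checks(sample)):
    print(repr(line), checks)
```

In this sample, the two lines inside the fence would each cost a single test, and the plain text line would cost none beyond the table lookup; the ‘- item’ line is the expensive case, because three block types can begin with a hyphen.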

But the payoff is speed. The entry [Elapsed time for MultiMarkdown to format a large file] shows that my 2016 Olinia notebook takes almost 11 minutes to render using MultiMarkdown on my Dell laptop. At this stage of the project, the commarkk block parser can parse the entire file in two seconds.

The Raspberry Pi 1 can parse the file in 66 seconds (compared with 12,145 for MultiMarkdown), and the Raspberry Pi 3 can do it in 8.5 seconds (compared with 3,640 for MultiMarkdown). cmark, on the other hand, can do it in 0.13 seconds!
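
To put those figures in proportion, the short Python snippet below computes the speedup factors from the times quoted above. The only assumption is the Dell figure for MultiMarkdown: “almost 11 minutes” is entered as 660 seconds, so that ratio is approximate. Bear in mind the comparison is not like for like, since the commarkk times cover the block parser only while the MultiMarkdown times are for a full render.

```python
# Elapsed times in seconds: (MultiMarkdown, commarkk block parser).
# 660 s stands in for "almost 11 minutes" and is an approximation.
timings = {
    "Dell laptop": (660.0, 2.0),
    "Raspberry Pi 1": (12145.0, 66.0),
    "Raspberry Pi 3": (3640.0, 8.5),
}

speedup = {machine: mmd / commarkk
           for machine, (mmd, commarkk) in timings.items()}

for machine, factor in speedup.items():
    print(f"{machine}: about {factor:.0f}x faster")
# Dell laptop: about 330x faster
# Raspberry Pi 1: about 184x faster
# Raspberry Pi 3: about 428x faster
```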