Markdown extension notes; new extension h1h2_uplinks

A decent home for my extensions

This week I figured out how to put my Python Markdown extensions into a directory of my choosing instead of forcing them to be in /usr/lib/python#.#/site-packages/markdown/extensions.

First up, I couldn’t use a directory path containing markdown/extensions (my initial attempt was /home/brian/projects/python/markdown/extensions). All attempts to import a module failed:

[brian@sparrow ~]$ export PYTHONPATH='/home/brian/projects/python'
[brian@sparrow ~]$ python3
>>> import markdown
>>> import markdown.extensions.auc_headers
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'markdown.extensions.auc_headers'

After trial and error, I determined that a same-named package directory under PYTHONPATH is picked up for importing once it contains a file named __init__.py, but with the major side effect of hiding the package of that name that would otherwise be found later on sys.path.
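The shadowing behaviour can be reproduced without touching a real installation. The sketch below is only an illustration: demo_pkg, site and mine are made-up names standing in for markdown, site-packages and a PYTHONPATH entry, and it asks importlib's PathFinder where a name would be found instead of actually importing anything.

```python
import os
import shutil
import tempfile
from importlib.machinery import PathFinder


def make_dir(*parts, init):
    """Create a directory and, if init is true, an empty __init__.py in it."""
    path = os.path.join(*parts)
    os.makedirs(path, exist_ok=True)
    if init:
        open(os.path.join(path, "__init__.py"), "w").close()
    return path


def origin_of(name, search_path):
    """Return the file that `name` would be loaded from, or None."""
    spec = PathFinder.find_spec(name, search_path)
    return spec.origin if spec else None


base = tempfile.mkdtemp()
site = os.path.join(base, "site")  # plays the role of site-packages
mine = os.path.join(base, "mine")  # plays the role of a PYTHONPATH entry

# The "installed" package, with an extensions subpackage.
make_dir(site, "demo_pkg", init=True)
make_dir(site, "demo_pkg", "extensions", init=True)

# A same-named local directory without __init__.py is passed over.
make_dir(mine, "demo_pkg", init=False)
without_init = origin_of("demo_pkg", [mine, site])

# Once it has __init__.py it wins, and the installed subpackage is hidden.
make_dir(mine, "demo_pkg", init=True)
with_init = origin_of("demo_pkg", [mine, site])
shadow = PathFinder.find_spec("demo_pkg", [mine, site])
hidden = PathFinder.find_spec(
    "demo_pkg.extensions", shadow.submodule_search_locations
)

print(without_init)  # a path under .../site/
print(with_init)     # a path under .../mine/
print(hidden)        # None

shutil.rmtree(base)
```

Submodules are only searched for inside the package directory that won the lookup, which is why the installed extensions subpackage becomes unreachable once the local directory takes over.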

In the end I set up /home/brian/projects/python/markdown_extensions. Note that the directory is markdown_extensions, as opposed to extensions being a subdirectory under markdown. I could have used pretty much anything other than projects/markdown/, such as projects/md/xtn; the only requirement is that I don’t replicate the name of an existing module directory.

The next piece was making a minor change to the extensions themselves. Instead of importing markdown modules in this fashion:

from __future__ import absolute_import
from __future__ import unicode_literals
from . import Extension
from ..preprocessors import Preprocessor
from ..blockprocessors import BlockProcessor
from ..util import etree
import re

I do the following instead:

from __future__ import absolute_import
from __future__ import unicode_literals
from markdown import Extension
from markdown.preprocessors import Preprocessor
from markdown.blockprocessors import BlockProcessor
from markdown.util import etree
import re

The final part was updating jmd to determine the correct path to use for the extensions:

# Track down the path to the markdown extensions
for PY_PATH in $(echo -e "import sys\nfor p in sys.path: print(p)" | python3)
do
    [ -d "$PY_PATH/markdown/extensions" ] && { MARKDOWN_EXT_PATH="$PY_PATH/markdown/extensions"; break; }
done
[ "$MARKDOWN_EXT_PATH" ] || die "Unable to locate path to markdown/extensions"

# Append extensions to MARKDOWN_PY:
#   "markdown.extensions" is in $MARKDOWN_EXT_PATH
#   "markdown_extensions" is in /home/brian/projects/python
# "extra" includes abbr, attr_list, def_list, fenced_code, footnotes, tables, smart_strong
# "headerid" is not in the list because "toc" does this work now
export PYTHONPATH="/home/brian/projects/python"
for EXTENSION in extra meta sane_lists smarty toc urilfy \
    gentoc_remove auc_headers autoxref h1h2_uplinks
do
    [ -f "$MARKDOWN_EXT_PATH/$EXTENSION.py" ] && X="markdown.extensions" || X="markdown_extensions"
    MARKDOWN_PY="$MARKDOWN_PY -x $X.$EXTENSION"
done
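The prefix selection in that loop can be sketched in Python as well. This is an illustration only, not part of jmd: extension_args and its parameters are hypothetical names, and the demonstration runs against a throwaway directory rather than a real markdown/extensions directory.

```python
import os
import tempfile


def extension_args(names, builtin_dir,
                   builtin_pkg="markdown.extensions",
                   local_pkg="markdown_extensions"):
    """Build '-x package.name' arguments, one pair per extension name."""
    args = []
    for name in names:
        # An extension counts as bundled if <name>.py exists in builtin_dir.
        bundled = os.path.isfile(os.path.join(builtin_dir, name + ".py"))
        package = builtin_pkg if bundled else local_pkg
        args += ["-x", "{}.{}".format(package, name)]
    return args


# Demonstration against a temporary directory holding two "bundled" modules.
with tempfile.TemporaryDirectory() as builtin_dir:
    for bundled_name in ("extra", "toc"):
        open(os.path.join(builtin_dir, bundled_name + ".py"), "w").close()
    demo_args = extension_args(["extra", "toc", "h1h2_uplinks"], builtin_dir)

print(" ".join(demo_args))
# -x markdown.extensions.extra -x markdown.extensions.toc -x markdown_extensions.h1h2_uplinks
```

Anything not found among the bundled modules falls through to the local package, so a misspelled extension name only surfaces later, when markdown tries to load it.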

From the h1h2_uplinks module:

On every H1 and H2 header in a document, add a pair of links: one with the label “TOC” that goes to the Table of Contents entry for the header, and another with the label “Top” that goes to the top of the file. These links make it easier to navigate a large document, and are especially useful on touch screen devices.

This works by first adding id attributes to entries in the table of contents, to serve as the targets for the TOC links mentioned above. Since I can’t tell which entries are <h1> and <h2> (well, I could but I would have to do two passes over the entire file) I simply added IDs to all of them:

<div id="toc" class="toc">
<ul>
<li><a id="toc-0001" href="#week-of-october-22-28">Week of October 22-28</a><ul>
<li><a id="toc-0002" href="#markdown-extensions">Markdown extensions</a><ul>
<li><a id="toc-0003" href="#a-decent-home-for-my-extensions">A decent home for my extensions</a></li>
</ul>
</li>
<li><a id="toc-0004" href="#new-extension-h1h2_uplinks">New extension: h1h2_uplinks</a></li>
<li><a id="toc-0005" href="#pygments">Pygments</a></li>
</ul>
</li>
<li><a id="toc-0006" href="#week-of-october-15-21">Week of October 15-21</a><ul>
<li><a id="toc-0007" href="#python-markdown">Python Markdown</a></li>
<li><a id="toc-0008" href="#python-markdown-extensions">Python Markdown extensions</a><ul>
<li><a id="toc-0009" href="#gentoc_remove">gentoc_remove</a></li>
<li><a id="toc-0010" href="#auc_headers-arbitrary-underline-character-headers">auc_headers (Arbitrary Underline Character headers)</a></li>
<li><a id="toc-0011" href="#autoxref">autoxref</a></li>
<li><a id="toc-0012" href="#autoxref-extension">autoxref extension</a></li>
</ul>
</li>
</ul>
</li>
  <!-- (485 lines skipped) -->
<li><a id="toc-0332" href="#titanitechcom-prank-successfailure-list">Titanitech.com prank success/failure list</a></li>
<li><a id="toc-0333" href="#computer-assistance-client-list">Computer Assistance: Client List</a></li>
<li><a id="toc-0334" href="#current-medications">Current medications</a></li>
</ul>
</li>
</ul>
</div>

The above is output from markdown, but it’s horribly formatted. Not that it matters: HTML is processed as a stream and doesn’t need whitespace or line breaks. They’re useful for humans, not computers. Here’s the same text after being run through tidy -i --wrap:

<h2 id="table-of-contents">Table of Contents</h2>
<div id="toc" class="toc">
  <ul>
    <li>
      <a id="toc-0001" href="#week-of-october-22-28" name="toc-0001">Week of October 22-28</a>
      <ul>
        <li>
          <a id="toc-0002" href="#markdown-extensions" name="toc-0002">Markdown extensions</a>
          <ul>
            <li><a id="toc-0003" href="#a-decent-home-for-my-extensions" name="toc-0003">A decent home for my extensions</a></li>
          </ul>
        </li>
        <li><a id="toc-0004" href="#new-extension-h1h2_uplinks" name="toc-0004">New extension: h1h2_uplinks</a></li>
        <li><a id="toc-0005" href="#pygments" name="toc-0005">Pygments</a></li>
      </ul>
    </li>
    <li>
      <a id="toc-0006" href="#week-of-october-15-21" name="toc-0006">Week of October 15-21</a>
      <ul>
        <li><a id="toc-0007" href="#python-markdown" name="toc-0007">Python Markdown</a></li>
        <li>
          <a id="toc-0008" href="#python-markdown-extensions" name="toc-0008">Python Markdown extensions</a>
          <ul>
            <li><a id="toc-0009" href="#gentoc_remove" name="toc-0009">gentoc_remove</a></li>
            <li><a id="toc-0010" href="#auc_headers-arbitrary-underline-character-headers" name="toc-0010">auc_headers (Arbitrary Underline Character headers)</a></li>
            <li><a id="toc-0011" href="#autoxref" name="toc-0011">autoxref</a></li>
            <li><a id="toc-0012" href="#autoxref-extension" name="toc-0012">autoxref extension</a></li>
          </ul>
        </li>
      </ul>
    </li>
        <!-- (653 lines skipped) -->
        <li><a id="toc-0332" href="#titanitechcom-prank-successfailure-list" name="toc-0332">Titanitech.com prank success/failure list</a></li>
        <li><a id="toc-0333" href="#computer-assistance-client-list" name="toc-0333">Computer Assistance: Client List</a></li>
        <li><a id="toc-0334" href="#current-medications" name="toc-0334">Current medications</a></li>
      </ul>
    </li>
  </ul>
</div>

The next part was updating the <h1> and <h2> headings. Output from markdown usually is as follows:

<h1 id="week-of-october-22-28">Week of October 22-28</h1>
<h2 id="markdown-extensions">Markdown extensions</h2>

The h1h2_uplinks extension updates them to:

<div class="h1 header">
  <h1 id="week-of-october-22-28">Week of October 22-28</h1>
  <div class="up-links"><a href="#toc-0001">TOC</a> | <a href="#toc">Top</a></div>
</div>
<div class="h2 header">
  <h2 id="markdown-extensions">Markdown extensions</h2>
  <div class="up-links"><a href="#toc-0002">TOC</a> | <a href="#toc">Top</a></div>
</div>
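The transformation of a single heading line can be sketched as a small function. This is a simplified illustration rather than the extension's own code: wrap_heading and RE_HEADING are hypothetical names, and the "Top" link targets #toc to match the sample output above.

```python
import re

# Matches an <h1> or <h2> opening tag that carries an id attribute.
RE_HEADING = re.compile(r'\s*<(?P<tag>h[12]) id="(?P<id>[^"]*)"')


def wrap_heading(line, toc_target):
    """Wrap one <h1>/<h2> line in a div that also holds the TOC/Top links."""
    match = RE_HEADING.match(line)
    if not match:
        return line  # not an h1/h2 line: leave it untouched
    return "\n".join([
        '<div class="{} header">'.format(match.group("tag")),
        "  " + line.strip(),
        '  <div class="up-links"><a href="#{}">TOC</a> | '
        '<a href="#toc">Top</a></div>'.format(toc_target),
        "</div>",
    ])


wrapped = wrap_heading(
    '<h2 id="markdown-extensions">Markdown extensions</h2>', "toc-0002"
)
print(wrapped)
```

Lines that are not H1 or H2 headings pass through unchanged, so the function can be applied to every line of the document.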

The final piece is some CSS to position the up-links div:

div.header { position: relative; }
div.up-links { text-align: right; position: absolute; bottom: 2px; right: 0px; }
div.h1 div.up-links { bottom: 25%; right: 5px; }
div.up-links a { color: grey; }

Here’s the extension’s code. The first part is the extension object itself, which has a property called h1h2_id that’s used to stash information between the calls to the tree processor and the post-processor:

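# Imports inferred from the names used in this module (the listing omits them).
# The code targets the Python Markdown 2.x extension API.
import re

from markdown.extensions import Extension
from markdown.extensions.toc import slugify
from markdown.postprocessors import Postprocessor
from markdown.treeprocessors import Treeprocessor
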
class H1H2uplinksExtension(Extension):
    """ Uplinks Extension. """

    def extendMarkdown(self, md, md_globals):
        """ Add pieces to Markdown. """
        md.registerExtension(self)

        # h1h2_id maps slugified H1/H2 headers (eg 'this-is-a-header') to their
        # IDs (eg, 'toc-####'). We need to stash the dict somewhere that's
        # accessible from both the tree processor and the post-processor, so
        # we make it a property of the Extension object.
        self.h1h2_id = dict()

        ## Add the tree processor to the list
        md.treeprocessors.add(
            "h1h2_uplinks", H1H2uplinksTreeprocessor(self),"<inline"
        )

        # Insert the post-processor after inserting raw HTML
        md.postprocessors.add(
            "h1h2_uplinks", H1H2uplinksPostprocessor(self), ">raw_html"
        )
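The way the dictionary is shared can be shown without markdown at all. In this sketch DemoExtension, DemoCollector and DemoConsumer are hypothetical stand-ins for the extension, the tree processor and the post-processor; the point is only that both processors hold a reference to the same dict object.

```python
class DemoExtension:
    """Hypothetical stand-in for the extension object; owns the shared dict."""

    def __init__(self):
        self.h1h2_id = dict()


class DemoCollector:
    """Hypothetical stand-in for the tree processor: fills the dict."""

    def __init__(self, extobj):
        self.h1h2_id = extobj.h1h2_id  # same dict object, not a copy

    def run(self, slugs):
        for number, slug in enumerate(slugs, start=1):
            self.h1h2_id[slug] = "toc-{:04d}".format(number)


class DemoConsumer:
    """Hypothetical stand-in for the post-processor: reads the dict."""

    def __init__(self, extobj):
        self.h1h2_id = extobj.h1h2_id

    def run(self, slug):
        return self.h1h2_id.get(slug)


ext = DemoExtension()
collector = DemoCollector(ext)
consumer = DemoConsumer(ext)

collector.run(["week-of-october-22-28", "markdown-extensions"])
looked_up = consumer.run("markdown-extensions")
print(looked_up)  # toc-0002
```

This only works because the collector mutates the dict in place. If it rebound its attribute to a new dict, the consumer would keep looking at the old, empty one.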

The tree processor identifies the H1 and H2 headers and makes note of them in the h1h2_id dictionary:

class H1H2uplinksTreeprocessor(Treeprocessor):
    """
    Track down the H1 and H2 headers, assign them IDs (e.g. 'toc-0001') and store
    the header/ID mappings in 'h1h2_id'
    """
    call_num = 0

    def __init__(self, extobj):
        self.h1h2_id = extobj.h1h2_id

    def run(self, doc):
        self.call_num = self.call_num + 1

        h1h2_counter = 0
        for elem in doc:
            if elem.tag in ('h1', 'h2'):
                h1h2_counter = h1h2_counter + 1
                target = "toc-{:04d}".format(h1h2_counter)
                self.h1h2_id[slugify(str(elem.text), '-')] = target
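The same walk can be tried on a hand-built tree using only the standard library. This is a sketch under assumptions: simple_slug is a simplified, ASCII-only stand-in for the slugify helper (not the library's implementation), map_headings is a hypothetical name, and the sample document is wrapped in a <div> root for the purposes of the demonstration.

```python
import re
import xml.etree.ElementTree as etree


def simple_slug(text, separator="-"):
    """Simplified stand-in for a slugify helper (ASCII input assumed)."""
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", separator, text)


def map_headings(root):
    """Map the slug of each top-level <h1>/<h2> to a 'toc-NNNN' target."""
    targets = {}
    counter = 0
    for elem in root:  # direct children only, as in the tree processor
        if elem.tag in ("h1", "h2"):
            counter += 1
            targets[simple_slug(elem.text or "")] = "toc-{:04d}".format(counter)
    return targets


doc = etree.fromstring(
    "<div>"
    "<h1>Week of October 22-28</h1>"
    "<h2>Markdown extensions</h2>"
    "<h3>A decent home for my extensions</h3>"
    "<p>Body text</p>"
    "</div>"
)
heading_targets = map_headings(doc)
print(heading_targets)
# {'week-of-october-22-28': 'toc-0001', 'markdown-extensions': 'toc-0002'}
```

Because the counter only advances on H1 and H2 elements, the H3 heading and the paragraph are skipped and receive no entry.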

Finally the post-processor adds id attributes to the H1 and H2 headers in the table of contents, and adds the up-links <div> wrappers to the headers in the main body of the file:

class H1H2uplinksPostprocessor(Postprocessor):
    """ Add "id=" attributes to H1 and H2 level entries in the Table of Contents  """
    call_num = 0

    def __init__(self, extobj):
        self.h1h2_id = extobj.h1h2_id

    def run(self, text):
        """
        Add 'id' attributes to links in the Table of Contents

        Parameter:
        * text: the HTML stream generated by markdown

        Return: Updated text

        Note: this method is actually called twice: once when the Table of
        Contents is generated, and again when the document is complete. This
        method runs only on the second call; ergo, if the 'toc' extension is
        not loaded, this extension won't do anything.
        """

        self.call_num = self.call_num + 1
        if self.call_num == 1:
            return text

        # Locate the <div> holding the Table of Contents
        match = re.search(r'(?P<toc><div class="toc">.*?</div>)', text, re.DOTALL)
        if not match:
            return text

        start, end = match.span(1)
        RE_A = re.compile(r'(?P<P1> *(?:<li>)<a )(?:id="toc-[0-9]{4}" )?(?P<P2>href="#(?P<id>.*?)">.*</a>.*)')
        RE_H1_H2 = re.compile(r' *<(?P<h1_h2>h[12]) id="(?P<id>.*?)"')

        new_doc = []
        # Phase 1: add 'id' attributes to items in the Table of Contents.
        for line in match.group('toc').split("\n"):
            # If line is '<li><a href="...">...</a>', add 'id="toc-####"' to it
            m = RE_A.match(line)
            if m and m.group('id') in self.h1h2_id:
                line = '{}id="{}" {}'.format(
                    m.group('P1'), self.h1h2_id[m.group('id')], m.group('P2')
                )
            # Add this line to the new TOC code
            new_doc.append(line)

        # Phase 2: add 'up-links' div to h1 and h2 elements in the HTML.
        # Slice at exactly text[end:] and text[:start] so nothing is lost when
        # the TOC is not surrounded by newlines.
        toc_html = '\n'.join(new_doc)
        new_doc = []
        for line in text[end:].split('\n'):
            match = RE_H1_H2.match(line)
            if match and match.group('id') in self.h1h2_id:
                new_doc.append('<div class="{} header">'.format(match.group('h1_h2')))
                new_doc.append('  ' + line)
                new_doc.append('  <div class="up-links"><a href="#{}">TOC</a> | '
                    '<a href="#top">Top</a></div>'.format(self.h1h2_id[match.group('id')]))
                new_doc.append('</div>')
            else:
                new_doc.append(line)

        return text[:start] + toc_html + '\n'.join(new_doc)
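Phase 1 can be exercised on its own with a couple of TOC lines. The sketch below reuses the RE_A pattern from the listing; tag_toc_lines and the sample data are hypothetical, and only entries whose slug appears in the mapping receive an id.

```python
import re

# Same pattern as RE_A in the post-processor, split over two lines.
RE_A = re.compile(
    r'(?P<P1> *(?:<li>)<a )(?:id="toc-[0-9]{4}" )?'
    r'(?P<P2>href="#(?P<id>.*?)">.*</a>.*)'
)


def tag_toc_lines(toc_html, targets):
    """Add id="toc-NNNN" to TOC links whose slug is a key of `targets`."""
    out = []
    for line in toc_html.split("\n"):
        m = RE_A.match(line)
        if m and m.group("id") in targets:
            line = '{}id="{}" {}'.format(
                m.group("P1"), targets[m.group("id")], m.group("P2")
            )
        out.append(line)
    return "\n".join(out)


sample_toc = "\n".join([
    '<div class="toc">',
    "<ul>",
    '<li><a href="#week-of-october-22-28">Week of October 22-28</a><ul>',
    '<li><a href="#pygments">Pygments</a></li>',
    "</ul>",
    "</li>",
    "</ul>",
    "</div>",
])
tagged = tag_toc_lines(sample_toc, {"week-of-october-22-28": "toc-0001"})
print(tagged)
```

The optional id group in the pattern means a line that already carries an id is rewritten rather than given a second one, so running the function twice yields the same result.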