Python Markdown extensions: gentoc_remove, auc_headers, autoxref


This week I wrote the following extensions to Python Markdown.

gentoc_remove

This one was pretty straightforward. Here’s the code:

/home/brian/projects/python-markdown-extensions/gentoc_remove.py

#!/usr/bin/python
"""
genTOC Table of Contents remover Extension for Python-Markdown
==============================================================

Removes a text-mode Table of Contents entry from the input document and
replaces it with "[TOC]" for the 'toc' extension.

Expects to find "Table of Contents" within the first 100 lines of the input
file, with either an ATX '#'/'##' header or a Setext/AUC underline.

License: [BSD](http://www.opensource.org/licenses/bsd-license.php)
"""

from __future__ import absolute_import
from __future__ import unicode_literals
from . import Extension
from ..preprocessors import Preprocessor
import re

class GenTOCRemoveExtension(Extension):
    """ GenTOCRemove Extension. """

    def extendMarkdown(self, md, md_globals):
        """ Add pieces to Markdown. """
        md.registerExtension(self)

        # Insert the preprocessor after whitespace normalisation
        md.preprocessors.add(
            "gentoc_remove", GenTOCRemovePreprocessor(self), ">normalize_whitespace"
        )

class GenTOCRemovePreprocessor(Preprocessor):
    """ Remove a Table of Contents section generated by genTOC """

    def run(self, lines):
        """
        Remove a Table of Contents section generated by genTOC

        Keywords:
        * lines: A list of lines of text

        Return: A list of lines of text with the Table of Contents replaced with '[TOC]'
        """

        RE_ToC = re.compile('Table of Contents', re.IGNORECASE)
        RE_ToC_line = re.compile(r'\s*\d+\s+.+$')
        RE_blank_line = re.compile(r'\s*$')
        i = -1
        toc_start = -1  # First available line after '[TOC]' marker
        toc_end = 0     # Last line of the ToC, before blank lines
        skip_AUC_line = False   # AUC = "Arbitrary Underline Character"
        for l in lines:
            i = i + 1
            # Use search() so ATX-style lines ('# Table of Contents') are found too
            if RE_ToC.search(l):
                # We found a candidate ToC line
                if re.match(r'\s*#', l):
                    toc_start = i + 1
                    continue
                elif i + 1 < len(lines) and re.match(r'\s*(([^\s])(\2+))\s*$', lines[i+1]):
                    toc_start = i + 2
                    skip_AUC_line = True
                    continue

            # Stop processing if no "Table of Contents" after 100 lines
            if toc_start == -1 and i > 100:
                break

            # If we're in a Table of Contents section, check for a valid ToC line
            # (e.g. "  55  This is a heading"). If it's non-blank and not a ToC
            # line, we've found the end of the Table of Contents
            if toc_start > -1 and not RE_blank_line.match(l):
                if RE_ToC_line.match(l):
                    toc_end = i
                elif skip_AUC_line:
                    skip_AUC_line = False
                else:
                    break

        if toc_start > -1:
            lines.insert(toc_start, "[TOC]")
            del lines[toc_start+1 : toc_end+2]

        return lines

def makeExtension(*args, **kwargs):
    """ Return an instance of the GenTOCRemoveExtension """
    return GenTOCRemoveExtension(*args, **kwargs)
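
To see what it does, here is the kind of rewrite the preprocessor performs; the genTOC-style table of contents below is a made-up sample, not output from any particular document:

Table of Contents
=================

   1  Introduction
  15  Installation
  42  Usage

becomes

Table of Contents
=================
[TOC]

with the numbered entries (and the blank line before them) removed, and the [TOC] marker left in place for the toc extension to expand.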

auc_headers (Arbitrary Underline Character headers)

Processes header entries whose “underline” characters go beyond the standard scheme of “===” for level 1 and “---” for level 2. Underline characters are assigned heading levels in the order they are first encountered.

In order for a header to be included, its underline must start at the same column as the header, consist entirely of the same character, and be the same length as the header (plus or minus one character).

However, original style setext headers consisting of:

  • a blank line, followed by
  • a line of text, followed by
  • a line consisting of a single (or multiple) ‘=’ or ‘-‘ starting at column one (after accounting for leading blockquote markers)

are left to the setext handler. These “short style” headers are always handled as level 1 or 2, regardless of when they were encountered in the document. For example:

This is an AUC Level 1 Header
*****************************
This is an AUC Level 2 Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> This is an AUC Level 3 Header
> =============================
This is an AUC Level 4 Header
-----------------------------
This is an original markdown level 1 header
===
> >This is an original markdown level 2 header
> >-
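
To run a document through it, one would load the extension like any other third-party Markdown extension. A minimal sketch, assuming the module is importable as auc_headers and provides the usual makeExtension() hook:

import markdown

with open('document.md') as f:
    text = f.read()

# 'auc_headers' is assumed to be on the Python path; Markdown calls its
# makeExtension() to build the extension instance.
html = markdown.markdown(text, extensions=['auc_headers'])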

This extension was a bit more complicated, because I had to write both a pre-processor and a block processor. The pre-processor runs quite early in the process to find headers that use underlines consisting of tildes (~~~) or backticks, because another pre-processor can mistake them for code fence markers. Then the block processor locates the AUC-underlined headers and changes them to ATX headers prefixed with 1 to 6 # characters.
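
The heart of that conversion is easy to sketch. This is not the extension’s actual code, just an illustration of the idea: the column and length checks described above are omitted, and the levels dict (underline character mapped to heading level, filled in as characters are first seen) is assumed to be maintained by the block processor:

def to_atx(header, underline, levels):
    """Rewrite 'header' plus its 'underline' line as a single ATX header.

    levels maps each underline character to a heading level (1-6),
    assigned in the order the characters are first encountered.
    """
    char = underline.strip()[0]
    level = levels.setdefault(char, min(len(levels) + 1, 6))
    return '%s %s' % ('#' * level, header.strip())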

autoxref

Automatically creates header links in the document, allowing one to do something like the following:

autoxref extension

Using the autoxref extension, all headers are automatically set up as link targets, allowing one to link to a section of the document by including the header’s text within square brackets (as I did in this sentence).
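
For instance, given a hypothetical document with a “Configuration” heading (and assuming something like the toc extension is also loaded so that headers get id attributes), the paragraph

See [Configuration] for the full list of options.

is rewritten, in effect, to

See [Configuration](#configuration) for the full list of options.

which the inline processor then turns into an ordinary anchor link.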

This one was by far the trickiest to write. I had at least two false starts before I determined I could implement it as a tree processor, and that it had to run just before the inline processor. Once I had that figured out, the code was rather straightforward:

#!/usr/bin/python
import re

from ..treeprocessors import Treeprocessor
from .toc import slugify  # In Markdown 2.6+, slugify lives in the 'toc' extension
class AutoxrefTreeProcessor(Treeprocessor):
    """ Automatically create links to headers referenced in a block """

    # Detect an inline reference
    RE = re.compile(r'(?P<lead>.*?)\[(?P<ref>[^]]+)\](?P<tail>..)?')

    def __init__(self, md):
        super(AutoxrefTreeProcessor, self).__init__(md.parser)
        self.md = md


    def _process_match(self, match, headers):
        """ Process an instance of [text that might be a header link] """
        lead = match.group('lead')
        ref = slugify(match.group('ref'), '-')
        tail = match.group('tail')
        text = match.group('lead') + '[' + match.group('ref') + ']'
        if (not tail or (tail[0:1] != '(' or tail == '[]')) \
                and ref in headers:
            text = "%s(#%s)" % (text, ref)
        if tail and tail != '[]':
            text = text + tail
        pos = tail and match.end(3) or match.end(2)+1
        return text, pos

    def run(self, doc):
        """
        Seeks out inline references in the form [reference] or [reference][]
        and attempts to resolve them to headings in the document.

        Parameters:

        * doc: The ElementTree root of the parsed document

        """
        # Get a list of headings, slugified
        headers = {}
        for elem in doc.iter():
            if elem.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                headers[slugify(elem.text, '-')] = None

        # Look for [link references] in various elements.
        for elem in doc.iter():
            if not elem.tag in ('p', 'li', 'td', 'dt', 'dd', 'article', 'caption'):
                continue
            if not elem.text:
                continue

            # Loop through all occurrences of '[link reference]' in the element
            new_text = ''
            p = 0          # Last known position within elem.text
            for m in re.finditer(self.RE, elem.text):
                # (Account for text skipped from the last match)
                if p < m.start(1):
                    new_text = new_text + elem.text[p:m.start(1)]
                # Process this match
                text, p = self._process_match(m, headers)
                new_text = new_text + text
            # Append any trailing text
            if p < len(elem.text):
                new_text = "%s%s" % (new_text, elem.text[p:])
            elem.text = new_text
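
The listing above shows only the tree processor itself. Wiring it into Markdown looks much like the other extensions, with the processor registered just ahead of the built-in 'inline' tree processor. A sketch (not the extension’s actual registration code), assuming the same Extension import used in gentoc_remove.py:

class AutoxrefExtension(Extension):
    """ Autoxref Extension. """

    def extendMarkdown(self, md, md_globals):
        """ Add pieces to Markdown. """
        md.registerExtension(self)

        # Register the tree processor immediately before the built-in
        # 'inline' tree processor, so the links it writes are still parsed.
        md.treeprocessors.add(
            "autoxref", AutoxrefTreeProcessor(md), "<inline"
        )

def makeExtension(*args, **kwargs):
    """ Return an instance of the AutoxrefExtension """
    return AutoxrefExtension(*args, **kwargs)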