Python Markdown extensions: gentoc_remove, auc_headers, autoxref
This week I wrote the following extensions to Python Markdown.
gentoc_remove
This one was pretty straightforward. Here’s the code:
/home/brian/projects/python-markdown-extensions/gentoc_remove.py
```python
#!/usr/bin/python
"""
genTOC Table of Contents remover Extension for Python-Markdown
==============================================================

Removes a text-mode Table of Contents entry from the input document
and replaces it with "[TOC]" for the 'toc' extension.

Expects to find "Table of Contents" within the first 100 lines of the
input file, with either an ATX '#'/'##' header or a Setext/AUC
underline.

License: [BSD](http://www.opensource.org/licenses/bsd-license.php)

"""

from __future__ import absolute_import
from __future__ import unicode_literals
from . import Extension
from ..preprocessors import Preprocessor
import re


class GenTOCRemoveExtension(Extension):
    """ GenTOCRemove Extension. """

    def extendMarkdown(self, md, md_globals):
        """ Add pieces to Markdown. """
        md.registerExtension(self)
        # Insert the preprocessor after whitespace normalisation
        md.preprocessors.add(
            "gentoc_remove",
            GenTOCRemovePreprocessor(self),
            ">normalize_whitespace"
        )


class GenTOCRemovePreprocessor(Preprocessor):
    """ Remove a Table of Contents section generated by genTOC """

    def run(self, lines):
        """ Remove a Table of Contents section generated by genTOC

        Keywords:

        * lines: A list of lines of text

        Return: A list of lines of text with the Table of Contents
        replaced with '[TOC]'

        """
        RE_ToC = re.compile('Table of Contents', re.IGNORECASE)
        RE_ToC_line = re.compile(r'\s*\d+\s+.+$')
        RE_blank_line = re.compile(r'\s*$')

        i = -1
        toc_start = -1  # First available line after '[TOC]' marker
        toc_end = 0     # Last line of the ToC, before blank lines
        skip_AUC_line = False  # AUC = "Arbitrary Underline Character"
        for l in lines:
            i = i + 1
            # search(), not match(), so ATX headers such as
            # "# Table of Contents" are found too
            if RE_ToC.search(l):
                # We found a candidate ToC line
                if re.match(r'\s*#', l):
                    toc_start = i + 1
                    continue
                elif re.match(r'\s*(([^\s])(\2+))\s*$', lines[i+1]):
                    toc_start = i + 2
                    skip_AUC_line = True
                    continue
            # Stop processing if no "Table of Contents" after 100 lines
            if toc_start == -1 and i > 100:
                break
            # If we're in a Table of Contents section, check for a valid
            # ToC line (e.g. "  55 This is a heading").  If it's
            # non-blank and not a ToC line, we've found the end of the
            # Table of Contents
            if toc_start > -1 and not RE_blank_line.match(l):
                if RE_ToC_line.match(l):
                    toc_end = i
                elif skip_AUC_line:
                    skip_AUC_line = False
                else:
                    break
        if toc_start > -1:
            lines.insert(toc_start, "[TOC]")
            del lines[toc_start+1 : toc_end+2]
        return lines


def makeExtension(*args, **kwargs):
    """ Return an instance of the GenTOCRemoveExtension """
    return GenTOCRemoveExtension(*args, **kwargs)
```
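To see the core logic in isolation, here is a standalone sketch of what the preprocessor does; this is my own simplified rewrite for illustration, not the extension’s actual code:

```python
import re

def remove_gentoc(lines):
    """Simplified sketch of the preprocessor's core loop: find the
    "Table of Contents" header, drop the numbered entries below it,
    and leave a single '[TOC]' marker in their place."""
    RE_ToC_line = re.compile(r'\s*\d+\s+.+$')   # e.g. "  55 A heading"
    RE_blank = re.compile(r'\s*$')
    toc_start, toc_end, skip_underline = -1, 0, False
    for i, line in enumerate(lines):
        if toc_start == -1 and re.search('Table of Contents', line, re.I):
            if re.match(r'\s*#', line):          # ATX-style header
                toc_start = i + 1
                continue
            if i + 1 < len(lines) and re.match(r'\s*(\S)\1*\s*$', lines[i + 1]):
                toc_start = i + 2                # underlined header
                skip_underline = True
                continue
        if toc_start > -1 and not RE_blank.match(line):
            if RE_ToC_line.match(line):
                toc_end = i                      # still inside the ToC
            elif skip_underline:
                skip_underline = False           # the underline itself
            else:
                break                            # first real body line
    if toc_start > -1:
        lines.insert(toc_start, "[TOC]")
        del lines[toc_start + 1:toc_end + 2]
    return lines

doc = ["# Table of Contents", "",
       "  1 Introduction", "  5 Usage", "",
       "# Introduction"]
print(remove_gentoc(doc))
# ['# Table of Contents', '[TOC]', '', '# Introduction']
```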
auc_headers (Arbitrary Underline Character headers)
Process header entries with “underline” characters beyond the standard “===” for level 1 and “---” for level 2. Underline characters are assigned levels in the order they are encountered.
In order for a header to be included, its underline must start at the same column as the header, consist entirely of a single repeated character, and be the same length as the header (plus or minus one character).
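That matching rule can be expressed as a small predicate. The following is my own illustrative sketch (ignoring blockquote markers), not code from the extension:

```python
def is_auc_underline(header, underline):
    """Check whether `underline` is a valid AUC underline for `header`:
    same starting column, a single repeated non-space character, and a
    length within one character of the header text."""
    h, u = header.rstrip(), underline.rstrip()
    h_text, u_text = h.lstrip(), u.lstrip()
    same_column = len(h) - len(h_text) == len(u) - len(u_text)
    one_char = len(set(u_text)) == 1 and not u_text.isspace()
    close_enough = abs(len(u_text) - len(h_text)) <= 1
    return same_column and one_char and close_enough
```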
However, original style setext headers consisting of:
- a blank line, followed by
- a line of text, followed by
- a line consisting of one or more ‘=’ or ‘-’ characters starting at column one (after accounting for leading blockquote markers)
are left to the setext handler. These “short style” headers are always handled as level 1 or 2, regardless of where they are encountered in the document. For example:
```
This is an AUC Level 1 Header
*****************************

This is an AUC Level 2 Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> This is an AUC Level 3 Header
> =============================

This is an AUC Level 4 Header
-----------------------------

This is an original markdown level 1 header
===

> >This is an original markdown level 2 header
> >-
```
This extension was a bit more complicated, because I had to write both a pre-processor and a block processor. The pre-processor runs quite early in the process to find headers that use underlines consisting of tildes (~~~) or backticks, because another pre-processor can mistake them for code fence markers. Then the block processor locates the AUC-underlined headers and changes them to ATX headers prefixed with 1 to 6 # characters.
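The block processor’s rewrite step can be sketched roughly like this; this is a simplified stand-in for the real processor, ignoring blockquotes and the short setext form:

```python
import re

def auc_headers(lines):
    """Assign each underline character a header level in order of first
    appearance, and rewrite each two-line header as an ATX header."""
    levels, out, i = {}, [], 0
    while i < len(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ''
        m = re.match(r'(\S)\1*\s*$', nxt)
        if (lines[i].strip() and m
                and abs(len(nxt.rstrip()) - len(lines[i].rstrip())) <= 1):
            # First-seen underline character gets the next level (max 6)
            level = levels.setdefault(m.group(1), min(len(levels) + 1, 6))
            out.append('#' * level + ' ' + lines[i].strip())
            i += 2                       # skip the underline line
        else:
            out.append(lines[i])
            i += 1
    return out

print(auc_headers([
    'This is an AUC Level 1 Header',
    '*****************************',
    '',
    'This is an AUC Level 2 Header',
    '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~',
]))
# ['# This is an AUC Level 1 Header', '', '## This is an AUC Level 2 Header']
```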
autoxref
Automatically creates header links in the document, allowing one to do something like the following:
autoxref extension
Using the autoxref extension, all headers are automatically set up as link targets, allowing one to link to a section of the document by including the header’s text within square brackets (as I did in this sentence).
This one was by far the trickiest to write. I had at least two false starts before I determined I could implement it as a tree processor, and that it had to run just before the inline processor. Once I had that figured out, the code was rather straightforward:
```python
#!/usr/bin/python
import re

# Imports added for context: in Python-Markdown 2.x, Treeprocessor and
# slugify live in these modules
from markdown.treeprocessors import Treeprocessor
from markdown.extensions.headerid import slugify


class AutoxrefTreeProcessor(Treeprocessor):
    """ Automatically create links to headers referenced in a block """

    # Detect an inline reference
    RE = re.compile(r'(?P<lead>.*?)\[(?P<ref>[^]]+)\](?P<tail>..)?')

    def __init__(self, md):
        super(AutoxrefTreeProcessor, self).__init__(md.parser)
        self.md = md

    def _process_match(self, match, headers):
        """ Process an instance of [text that might be a header link] """
        lead = match.group('lead')
        ref = slugify(match.group('ref'), '-')
        tail = match.group('tail')
        text = match.group('lead') + '[' + match.group('ref') + ']'
        if (not tail or (tail[0:1] != '(' or tail == '[]')) \
                and ref in headers:
            text = "%s(#%s)" % (text, ref)
        if tail and tail != '[]':
            text = text + tail
        pos = tail and match.end(3) or match.end(2)+1
        return text, pos

    def run(self, doc):
        """ Seeks out inline references in the form [reference] or
        [reference][] and attempts to resolve them to headings in the
        document.

        Parameters:

        * doc: The ElementTree root of the document

        """
        # Get a list of headings, slugified
        headers = {}
        for elem in doc.iter():
            if elem.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                headers[slugify(elem.text, '-')] = None

        # Look for [link references] in various elements.
        for elem in doc.iter():
            if elem.tag not in ('p', 'li', 'td', 'dt', 'dd', 'article',
                                'caption'):
                continue
            if not elem.text:
                continue
            # Loop through all occurrences of '[link reference]' in the
            # element
            new_text = ''
            p = 0  # Last known position within elem.text
            for m in re.finditer(self.RE, elem.text):
                # (Account for text skipped from the last match)
                if p < m.start(1):
                    new_text = new_text + elem.text[p:m.start(1)]
                # Process this match
                text, p = self._process_match(m, headers)
                new_text = new_text + text
            # Append any trailing text
            if p < len(elem.text):
                new_text = "%s%s" % (new_text, elem.text[p:])
            elem.text = new_text
```
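Outside the tree processor, the reference-matching idea boils down to something like the following sketch. The `slugify` here is a simplified stand-in, not the one from Python-Markdown, and `link_refs` is my own illustrative function:

```python
import re

def slugify(text, sep='-'):
    """Stand-in for Python-Markdown's slugify: lowercase, spaces -> sep."""
    return re.sub(r'\s+', sep, text.strip().lower())

# Same reference-detecting pattern as the extension uses
RE = re.compile(r'(?P<lead>.*?)\[(?P<ref>[^]]+)\](?P<tail>..)?')

def link_refs(text, headers):
    """Append '(#slug)' to any [reference] whose slug names a heading,
    leaving already-linked references untouched."""
    out, p = '', 0
    for m in RE.finditer(text):
        piece = m.group('lead') + '[' + m.group('ref') + ']'
        tail = m.group('tail')
        if slugify(m.group('ref')) in headers and (not tail or tail[0] != '('):
            piece += '(#%s)' % slugify(m.group('ref'))
        if tail and tail != '[]':
            piece += tail
        out, p = out + piece, m.end()
    return out + text[p:]

print(link_refs('See the [autoxref extension] section.', {'autoxref-extension'}))
# See the [autoxref extension](#autoxref-extension) section.
```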