Python Markdown: ideas, bugs; extensions percent_comments and linebreak_plus

INFO level logging showing Markdown progress (abandoned)

The challenge: create a way to log information at INFO level about Markdown’s progress, with newlines where I want them instead of being applied to every log line.

Logging has formatters, handlers, and filters:

  • Formatters determine how the message is set up for display
  • Handlers determine where log messages are written (console, file, stream …)
  • Filters provide finer grained determination for what messages to display

Ergo, I need a special formatter. Next, how do I set up an INFO level logger that uses my formatter when I want it to, but uses the “standard” formatter for everything else?

First, I can also set up a custom logging level. The default levels are:

Level      Numeric value
CRITICAL   50
ERROR      40
WARNING    30
INFO       20
DEBUG      10
NOTSET     0

From the python-docs/howto/logging file:

Levels can also be associated with loggers, being set either by the developer or through loading a saved logging configuration. When a logging method is called on a logger, the logger compares its own level with the level associated with the method call. If the logger’s level is higher than the method call’s, no logging message is actually generated. This is the basic mechanism controlling the verbosity of logging output.

(Put another way, if the method call’s level is >= the logger’s level, then a logging message is generated and sent to the logging system. For example, a logger level of 30 will display messages generated by logging.warning, logging.error, and logging.critical, but not those from logging.info or logging.debug.)

(“Method call” refers to debug(), info(), warning(), etc.)

If the overall logging level is 25, then logger.warning will generate a log message (because 25 < WARNING[30]) but logger.info will not (because 25 > INFO[20]).
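
The comparison can be checked without emitting anything. This is a minimal sketch; the logger name threshold_demo is made up for the illustration, and isEnabledFor() is the standard library call that reports whether a message at a given level would pass the logger's threshold.

```python
import logging

logger = logging.getLogger("threshold_demo")  # hypothetical logger name
logger.setLevel(25)  # sits between INFO (20) and WARNING (30)

# isEnabledFor() reports whether a call at that level would produce a record.
warning_passes = logger.isEnabledFor(logging.WARNING)  # 30 >= 25
info_passes = logger.isEnabledFor(logging.INFO)        # 20 < 25

print(warning_passes, info_passes)
```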

Next, can I set a custom formatter for level 21? Maybe. It’s worth noting that the setFormatter method belongs to the handler, not the logger. Does this mean I can set up a custom handler for level 21? I think so. I set up a handler that outputs to the console, add a custom formatter to it, then tell that handler to handle messages at level 21, not 20. (In practice this didn’t work: I got duplicate INFO messages because the two handlers each wrote a message to the console.)

Next, can I get the formatter to not output a newline? The logging-cookbook file shows how one can set up a custom class for a message with a __str__ method that is invoked when the logger calls str() on that object. I should be able to set up that class in the main file (markdown/__init__.py) and import it into the classes where I need to invoke it.
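
One alternative I did not try at the time: in Python 3.2 and later, StreamHandler has a terminator attribute that controls what is appended to each message. The sketch below assumes Python 3 only (Python 2's handler has no such attribute); the logger name and the in-memory stream are stand-ins for illustration. Setting propagate to False is one way to avoid a second copy of each message coming from the root logger.

```python
import io
import logging

stream = io.StringIO()                 # stands in for the console
handler = logging.StreamHandler(stream)
handler.terminator = ""                # the default terminator is "\n"
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("no_newline_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False               # do not hand records to the root logger as well
logger.addHandler(handler)

logger.info("Preprocessors:")
logger.info(" NormalizeWhitespace,")
logger.info(" HtmlBlock")

output = stream.getvalue()
print(repr(output))
```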

Finally, is there an equivalent to Perl’s $| variable to make stdout and stderr non-buffering? Well, I may not have to worry about this; a comment at StackOverflow indicates stdout in Python is always non-buffered.

After working at this for a while, I determined the following:

  • At true INFO level (20), I don’t want any progress messages appearing, only regular INFO messages
  • At level 25 I want progress messages and regular INFO messages appearing, but don’t want WARNING messages appearing.
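  • I need a special level called progress at level 25; then I can call logger.progress for these messages. Level comparison alone will not hide them at INFO level, though: a logger set to INFO (20) passes every message at level 20 or higher, including 25, so keeping them out would take a filter rather than a threshold.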
  • However, the logging document says this:

    … it is possibly a very bad idea to define custom levels if you are developing a library. That’s because if multiple library authors all define their own custom levels, there is a chance that the logging output from such multiple libraries used together will be difficult for the using developer to control and/or interpret, because a given numeric value might mean different things for different libraries.

  • Given that, it’s rather strange that the logging system allows for “in between” levels such as 15 and 25, because practically they’re of no use. Possibly it’s a case of designing for the future, only to discover the future didn’t turn out as expected.
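
For reference, here is a sketch of what a custom level looks like in practice. PROGRESS, OnlyLevel, and the logger name are hypothetical names invented for this example; addLevelName(), Logger.log(), and Filter are standard library. It also shows that a threshold alone cannot single out one level, whereas a filter on the handler can.

```python
import io
import logging

PROGRESS = 25  # hypothetical custom level between INFO (20) and WARNING (30)
logging.addLevelName(PROGRESS, "PROGRESS")

class OnlyLevel(logging.Filter):
    """Pass only records logged at exactly one level (illustrative helper)."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno == self.level

stream = io.StringIO()  # stands in for the console
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
handler.addFilter(OnlyLevel(PROGRESS))

logger = logging.getLogger("progress_demo")
logger.setLevel(logging.INFO)   # the threshold lets INFO, PROGRESS and WARNING through
logger.propagate = False
logger.addHandler(handler)

logger.info("regular info")            # dropped by the filter
logger.log(PROGRESS, "Preprocessors")  # kept
logger.warning("a warning")            # dropped by the filter

output = stream.getvalue()
print(output, end="")
```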

I could, of course, bypass the logging mechanism altogether and simply write what I need to stderr. But even that has issues:

  • In Python 2, stderr is unbuffered
  • Starting in Python 3.0, stderr is buffered, but that can be overridden by starting the interpreter with -u or using the PYTHONUNBUFFERED environment variable
  • The help text for python2 indicates it honours PYTHONUNBUFFERED as well.
  • Starting in Python 3.3, the print() function accepts a flush keyword argument; passing flush=True forces the output to be written immediately
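
Whatever the buffering rules of a given interpreter, an explicit flush after each write sidesteps them. A minimal Python 3 sketch follows; the progress() helper is a made-up name, and the demonstration writes to an in-memory buffer rather than the real stderr so the result can be inspected.

```python
import io
import sys

def progress(text, stream=None):
    """Write text with no added newline and flush it right away."""
    if stream is None:
        stream = sys.stderr
    stream.write(text)
    stream.flush()

# Demonstration against an in-memory stream instead of the real stderr.
buffer = io.StringIO()
progress("Preprocessors:", buffer)
progress(" NormalizeWhitespace,", buffer)
progress(" HtmlBlock\n", buffer)
captured = buffer.getvalue()
print(captured, end="")
```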

This is actually a decent argument for programming in Perl. Perl 5 solved a lot of these issues twenty years ago–Perl 5.004 was released in May 1997. By contrast, there is still a lot of Python 2 code out there, and (for Markdown at least) the expectation is that libraries should support both Python 2 and Python 3.

Issue in spantable.py when working with Python Markdown 2.6.11

Curiously, markdown_py-3 is version 2.6.7 of Python Markdown, while the version installed with Python 2 is a more up-to-date 2.6.11. Between the two versions a few things were changed, which broke the spantable extension I require for better table formatting. The issue was that an older version of the XML Element module did not support a default keyword on an attribute get method:

colspan = cell_obj.get('colspan', default='1')

I had to update a couple of instances of that to read:

colspan = cell_obj.get('colspan')
if not colspan:
    colspan = 1

Finally there was an issue where Python 2 and Python 3 disagreed about being able to convert something to an int, so I had to work around that:

if text is None:
    if c is not None:
        colspan = c.get('colspan')
        if not colspan:
            colspan = 1
        try:                            # Added
            colspan = int(colspan)      # (Works in v2, not in v3)
        except TypeError:               # Added
            pass                        # Added
        c.set('colspan', str(colspan + 1))
    else:
        # if this is the first cell, then fall back to creating an empty cell
        text = ''
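
In hindsight, the attribute lookup and the int conversion could live in one small helper. This is only a sketch and not what spantable.py actually contains; get_colspan is a hypothetical name, and it relies on nothing beyond Element.get() and Element.set() from the standard library.

```python
import xml.etree.ElementTree as etree

def get_colspan(cell):
    """Return the cell's colspan as an int, defaulting to 1."""
    value = cell.get('colspan')       # None when the attribute is absent
    try:
        return int(value)
    except (TypeError, ValueError):   # None, or text that is not a number
        return 1

cell = etree.Element('td')
first = get_colspan(cell)             # attribute missing, so the default applies
cell.set('colspan', str(first + 1))   # attributes are stored as strings
second = get_colspan(cell)
print(first, second)
```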

Progress logging using sys.stderr.write

I first had to determine how to set up a progress or show_progress parameter and get it propagated to the Markdown instance. Note that Markdown much prefers to use kwargs (keyword arguments).

When run as a command line program, the process flow is:

__main__.py::run() -> __init__.py::markdownFromFile() -> convertFile() -> convert()

When run as a module, the process flow is:

__init__.py::markdown() -> convert()

The process flow for convert is:

Preprocessors -> BlockProcessors -> Treeprocessors -> serialize -> Postprocessors

The actual argument parsing is done in the __init__ function in the __init__.py file. Keyword parameters are stored as properties in the markdown object; ergo, self.show_progress will return True or False.
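
The pattern is roughly the following. MarkdownSketch is a hypothetical stand-in written for this note, not the real markdown.Markdown class, and it shows only how a keyword argument ends up as an attribute with a safe default.

```python
class MarkdownSketch(object):
    """Hypothetical stand-in for markdown.Markdown; only the kwargs handling is shown."""

    def __init__(self, **kwargs):
        # A missing keyword falls back to False, so existing callers are unaffected.
        self.show_progress = bool(kwargs.get('show_progress', False))

quiet = MarkdownSketch()
chatty = MarkdownSketch(show_progress=True)
print(quiet.show_progress, chatty.show_progress)
```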

Now I need a place to stash the terminal width. One possibility is to put the progress logging into a class and store the width as a property of the instance, determined in the class’s __init__ function. I put this class into its own module and stored it in markdown_extensions/show_progress.py:

#!/usr/bin/python
"""
Progress display for Python Markdown
====================================

Note: This file is in the markdown_extensions directory because it seems to be
a good home for it. However, this is not a regular extension and cannot be
included simply by adding '-x markdown_extensions.show_progress' on the command
line. This code needs additional support in __init__.py, __main__.py, and
blockparser.py in order to work. See "Adding show_progress to Python Markdown"
at the end of this file for details.

When enabled, this code displays Markdown's progress on stderr as it goes
through its various phases and processors:

  >>> import sys
  >>> sys.path.append('/home/brian/projects/python')
  >>> import markdown
  >>> markdown.markdown("This is a test", show_progress = True)
  Preprocessors: NormalizeWhitespace, HtmlBlock, Reference
  Block processors: Empty, ListIndent, Code, HashHeader, SetextHeader, HR,
    OList, UList, BlockQuote, Paragraph
    Processing 2 blocks
  Tree processors: Inline, Prettify
  Post-processors: RawHtml, AndSubstitute, Unescape
  '<p>This is a test</p>'

Or from the command line:

  [user@host ~]$ export PYTHONPATH='/home/brian/projects/python'
  [user@host ~]$ echo "This is a test" | markdown_py-3 --progress
"""
import sys
import re

class ShowProgress:
    """Display phases and processor names as they are run"""

    def __init__(self, show_progress):
        self.show_progress = show_progress
        if show_progress:
            self.width, height = self._getTerminalSize()
            self.RE_NORMALIZE = re.compile(r'(([Pp]re|[Bb]lock|[Tt]ree|[Pp]ost)?[Pp]rocessor)?\'>')
            self.comma = ''
            self.number_of_blocks = None

    def phase(self, phase_name):
        """ Display a phase: Pre, Block, Tree, or Post Processors """
        if not self.show_progress:
            return
        if phase_name:
            if phase_name != 'Preprocessors':
                sys.stderr.write("\n")
            sys.stderr.write("%s:" % phase_name)
        else:
            sys.stderr.write("\n")
        sys.stderr.flush()
        self.pos = len(phase_name or '')    # tolerate None as well as ''
        self.comma = ''

    def processor(self, p_obj):
        """ Display the name of the passed processor object  """
        if not self.show_progress:
            return
        x = self.RE_NORMALIZE.sub('', str(p_obj.__class__).split('.')[-1])
        if self.pos + len(x) + 2 > self.width:
            sys.stderr.write("%s\n " % self.comma)
            self.pos = 0
            self.comma = ''
        sys.stderr.write("%s %s" % (self.comma, x))
        sys.stderr.flush()
        self.comma = ','
        self.pos = self.pos + len(x) + 2

    def block_count(self, blocks):
        """ Display the number of blocks being processed by the block processors """
        if self.show_progress and not self.number_of_blocks:
            self.number_of_blocks = len(blocks)
            sys.stderr.write("\n  Processing %i blocks" % self.number_of_blocks)
            sys.stderr.flush()

# (81 lines pertaining to determining terminal size deleted)
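
On Python 3.3 and later, the standard library can stand in for most of that deleted terminal-size code. The sketch below is not the code that was removed; terminal_width is a hypothetical helper, and shutil.get_terminal_size() is not available on Python 2, which is why the original module carried its own implementation.

```python
import shutil

def terminal_width(default=80):
    """Return the terminal width in columns, or `default` if it cannot be determined."""
    columns, _lines = shutil.get_terminal_size(fallback=(default, 24))
    return columns

width = terminal_width()
print(width)
```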

As noted in the source code above, this isn’t a regular extension. Other Markdown modules have to be patched in order for it to work. There’s some work to accept a -p or --progress parameter in __main__.py, and to accept a show_progress parameter in __init__.py, and after that the critical stuff is (for example):

import markdown_extensions.show_progress
    # (lines skipped)
    # (self.show_progress is the value of the show_progress parameter: True/False)
    progress = markdown_extensions.show_progress.ShowProgress(self.show_progress)
    # (lines skipped)
    progress.phase('Preprocessors')
    self.lines = source.split("\n")
    for prep in self.preprocessors.values():
        progress.processor(prep)
        self.lines = prep.run(self.lines)

Idea for specifying extensions more concisely

Currently, extensions must be specified by passing a complete path to the extension:

>>> import sys
>>> sys.path.append('/home/brian/projects/python')
>>> import markdown
>>> markdown.markdown('text', extensions = ['markdown.extensions.extra',
...  'markdown.extensions.toc', 'markdown_extensions.auc_headers' ...])

export PYTHONPATH='/home/brian/projects/python'
markdown_py-3 -x markdown.extensions.extra \
  -x markdown_extensions.gfm_tasklist -x markdown.extensions.meta \
  -x markdown.extensions.sane_lists -x markdown.extensions.smarty \
  -x markdown_extensions.spantable -x markdown.extensions.toc \
  -x markdown_extensions.urlize -x markdown_extensions.gentoc_remove \
  -x markdown_extensions.percent_comments -x markdown_extensions.auc_headers \
  -x markdown_extensions.linebreak_plus -x markdown_extensions.autoxref \
  -x markdown_extensions.toc_fixer filename.md >filename.html

I thought there might be a way to improve this.

  • On the command line, provide a comma separated list of extension names
  • When calling markdown as a Python module, pass a list of names
  • For each name, prefix with markdown.extensions. and try to load it
  • If that fails, loop through entries in extensions_mod_prefix; prefix the extension name with the entry and try to load it.
  • If that fails, give up and raise an error

>>> import sys
>>> sys.path.append('/home/brian/projects/python')
>>> import markdown
>>> markdown.markdown('text', extensions_mod_prefix = ['markdown_extensions'],
... extensions = ['extra', 'toc', 'auc_headers'])

export PYTHONPATH='/home/brian/projects/python'
markdown_py-3 -m markdown_extensions -x extra,gfm_tasklist,meta,sane_lists \
  -x smarty,spantable,toc,urlize,gentoc_remove,percent_comments,auc_headers \
  -x linebreak_plus,autoxref,toc_fixer

Right now it’s just an idea. In practice, if one needs to load up a lot of extensions, on the command line I use a shell script, and when running as a Python module I’d probably write a wrapper.
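
The lookup order described above could be sketched like this. resolve_extension is a hypothetical helper, not part of Python Markdown, and the demonstration borrows the standard library's json package as a stand-in prefix so it runs whether or not Markdown is installed.

```python
import importlib

def resolve_extension(name, prefixes=()):
    """Return the first importable module for `name`, trying each prefix in order."""
    candidates = ['markdown.extensions.' + name]
    candidates += ['%s.%s' % (prefix, name) for prefix in prefixes]
    for module_name in candidates:
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue  # not found under this prefix; try the next one
    raise ImportError("No extension %r found in: %s" % (name, ', '.join(candidates)))

# 'json' plays the role of a user-supplied prefix such as 'markdown_extensions'.
module = resolve_extension('decoder', prefixes=['json'])
print(module.__name__)

try:
    resolve_extension('no_such_extension_xyz', prefixes=['json'])
    lookup_failed = False
except ImportError:
    lookup_failed = True
```

One caveat with this approach: catching ImportError also hides import failures inside an extension that does exist, so a real implementation would probably want to distinguish the two cases.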

Here’s a prototype --help that I wrote for this. Most of the options are already in markdown_py; I added the second -x line and the -m line.

Usage: markdown_py-3 [options] [INPUTFILE]
       (STDIN is assumed if no INPUTFILE is given)

A Python implementation of John Gruber's Markdown.
https://pythonhosted.org/Markdown/

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -f OUTPUT_FILE, --file=OUTPUT_FILE
                        Write output to OUTPUT_FILE. Defaults to STDOUT.
  -e ENCODING, --encoding=ENCODING
                        Encoding for input and output files.
  -s SAFE_MODE, --safe=SAFE_MODE
                        Deprecated! 'replace', 'remove' or 'escape' HTML tags
                        in input
  -o OUTPUT_FORMAT, --output_format=OUTPUT_FORMAT
                        'xhtml1' (default), 'html4' or 'html5'.
  -n, --no_lazy_ol      Observe number of first item of ordered lists.
  -x EXTENSION, --extension=EXTENSION
                        Load extension EXTENSION.
  -x NAME[,NAME...], --extensions NAME[,NAME...]
                        Load list of extension names separated by commas. When
                        resolving names, they are first prefixed with
                        'markdown.extensions', and if not found, prefixes from
                        -m/--extensions_mod_prefix are tried as well.
  -m PREFIX[:PREFIX ...], --extensions_mod_prefix PREFIX[:PREFIX ...]
                        One or more module prefixes to prepend to extension
                        names when searching for them, in addition to the
                        built-in name 'markdown.extensions'. For example, if
                        you have extensions in '/usr/local/lib/md_py_extn',
                        you can pass '-m md_py_extn' (note that you also need
                        to set PYTHONPATH='/usr/local/lib')
  -c CONFIG_FILE, --extension_configs=CONFIG_FILE
                        Read extension configurations from CONFIG_FILE.
                        CONFIG_FILE must be of JSON or YAML format. YAML format
                        requires that a python YAML library be installed. The
                        parsed JSON or YAML must result in a python dictionary
                        which would be accepted by the 'extension_configs'
                        keyword on the markdown.Markdown class. The extensions
                        must also be loaded with the `--extension` option.
  -q, --quiet           Suppress all warnings.
  -v, --verbose         Print all warnings.
  -p, --progress        Show markdown progress.
  --noisy               Print debug messages.

I did make a change similar to this, but put it into the jmd wrapper script instead.

Markdown extension percent_comments

#!/usr/bin/python
"""
Percent Comments Extension for Python-Markdown
==============================================

Treats lines beginning with '%' in columns 1, 2, or 3 in the document as internal
comments and removes them from the document so they do not appear in the
output.  Handles such lines found within blockquotes and tables but ignores
them within fenced code blocks.

Such comments are useful if you maintain a document intended for public
consumption but you want to include information that's useful only for
you or other people who need to make changes to it. For example:

    % Information in the following table is gleaned from this file:
    % NODE"VMS12"::SYS$DISK2:[DOCUMENTS.INTERNAL]NETWORK.TXT

"""

from __future__ import absolute_import
from __future__ import unicode_literals
from markdown import Extension
from markdown.blockprocessors import BlockProcessor
from markdown.postprocessors import Postprocessor
from random import randint
from hashlib import sha1
import re

class PercentCommentsExtension(Extension):
    """ PercentComments Extension. """

    def extendMarkdown(self, md, md_globals):
        """ Add Percent Comments Block processor to Markdown. """
        md.registerExtension(self)

        # Placeholder text to indicate an element that became empty because it
        # was a "%" comment. Append a random hex string to eliminate the
        # possibility of text being replaced by chance. Because this needs to be
        # accessible from both the BlockProcessor and the Postprocessor, we
        # make it a property of the extension itself.
        sha = sha1()
        sha.update(randint(0, 9999999).__str__().encode('UTF-8'))
        self.EMPTY_ELEMENT = "EMPTY_DUE_TO_PERCENT_COMMENT_%s" % sha.hexdigest().upper()[0:8]

        md.parser.blockprocessors.add('percent_comments',
            PercentCommentsBlockProcessor(md.parser, self), '<paragraph')

        # Insert the post-processor before inserting raw HTML
        md.postprocessors.add('percent_comments',
            PercentCommentsPostprocessor(self), '<raw_html')

class PercentCommentsBlockProcessor(BlockProcessor):
    """ Remove text elements that have a % in column 1, 2, or 3 """
    RE_TEST = re.compile(r'^ ? ? ?\\?%', re.MULTILINE)
    RE_REMOVE = re.compile(r'^ ? ? ?%[^\n]+\n?', re.MULTILINE)
    RE_UNESCAPE = re.compile(r'\\%')
    sw = False      # True = we processed this block on the previous call

    def __init__(self, md_parser, extobj):
        super(PercentCommentsBlockProcessor, self).__init__(md_parser)
        self.EMPTY_ELEMENT = extobj.EMPTY_ELEMENT

    def test(self, parent, block):
        if self.sw:
            self.sw = False
            return False
        return bool(self.RE_TEST.search(block))

    def run(self, parent, blocks):
        raw_block = blocks.pop(0)
        raw_block = self.RE_REMOVE.sub('', raw_block)
        raw_block = self.RE_UNESCAPE.sub('%', raw_block)
        if len(raw_block):
            # Block contains additional lines. Add to master blocks for later.
            blocks.insert(0, raw_block)
            self.sw = True
        elif not parent.tag == 'div':
            blocks.insert(0, self.EMPTY_ELEMENT)


class PercentCommentsPostprocessor(Postprocessor):
    """ Remove HTML elements made empty by PercentCommentsBlockProcessor """

    def __init__(self, extobj):
        super(PercentCommentsPostprocessor, self).__init__()
        self.EMPTY_ELEMENT = extobj.EMPTY_ELEMENT

    def run(self, text):
        """ Remove lines of the format \n<xx>EMPTY_DUE_TO_PERCENT_COMMENT_XXXXXXXX</xx>  """
        return re.sub(r'\n<([a-z]+)>%s</\1>' % self.EMPTY_ELEMENT, '', text,
            flags=re.MULTILINE)


def makeExtension(*args, **kwargs):
    return PercentCommentsExtension(*args, **kwargs)
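
To see what the two substitutions in run() do to a block, here is a standalone check that uses only the re module. The patterns are copied from the listing above; the sample block is made up.

```python
import re

# Patterns copied from PercentCommentsBlockProcessor.
RE_REMOVE = re.compile(r'^ ? ? ?%[^\n]+\n?', re.MULTILINE)
RE_UNESCAPE = re.compile(r'\\%')

block = (
    "First visible line\n"
    "% internal note, dropped\n"
    "  % indented note, dropped\n"
    "\\% escaped, kept as a literal percent line\n"
    "Last visible line"
)

# Remove comment lines first, then turn each escaped \% into a plain %.
cleaned = RE_UNESCAPE.sub('%', RE_REMOVE.sub('', block))
print(cleaned)
```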

Markdown extension linebreak_plus

#!/usr/bin/python
"""
Linebreak-Plus Extension for Python-Markdown
============================================

In addition to having two spaces at the end of a line mark a line break, adds
support for backslash (John MacFarlane's CommonMark) and "<space><underscore>"
(my idea--I think it looks nicer) at the end of a line.

"""

from __future__ import absolute_import
from __future__ import unicode_literals
from markdown import Extension
from markdown.inlinepatterns import SubstituteTagPattern

LINE_BREAK_PLUS_RE = r'(\\| _)\n'

class LinebreakPlusExtension(Extension):

    def extendMarkdown(self, md, md_globals):
        linebreak_plus_tag = SubstituteTagPattern(LINE_BREAK_PLUS_RE, 'br')
        md.inlinePatterns.add('linebreak_plus', linebreak_plus_tag, '>linebreak')

def makeExtension(*args, **kwargs):
    return LinebreakPlusExtension(*args, **kwargs)
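
The pattern itself can be tried outside Markdown. This sketch approximates the effect with a plain re.sub; the real extension produces a br element through SubstituteTagPattern rather than doing a string substitution, and the sample text is made up.

```python
import re

LINE_BREAK_PLUS_RE = r'(\\| _)\n'   # same pattern as in the extension

# One line ends with a backslash, one with space + underscore.
text = "Roses are red\\\nViolets are blue _\nSugar is sweet"
html = re.sub(LINE_BREAK_PLUS_RE, '<br />\n', text)
print(html)
```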

Running my Markdown extensions in Python 2

First, my markdown_extensions directory needed an empty file in it named __init__.py before Python Markdown could import files from it.

Then my H1H2_Uplinks extension failed:

File "/home/brian/markdown_extensions/h1h2_uplinks.py", line 63, in run
  self.h1h2_id[slugify(str(elem.text), '-')] = target
File "/usr/lib/python2.7/site-packages/markdown/extensions/headerid.py", line 93, in slugify
  value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
TypeError: must be unicode, not str

According to Ned Batchelder’s Pragmatic Unicode pages:

Just as in Python 2, Python 3 has two string types, one for Unicode and one for bytes, but they are named differently.

Now the “str” type that you get from a plain string literal stores unicode, and the “bytes” types stores bytes. You can create a bytes literal with a b prefix.

So “str” in Python 2 is now called “bytes,” and “unicode” in Python 2 is now called “str”. This makes more sense than the Python 2 names, since Unicode is how you want all text stored, and byte strings are only for when you are dealing with bytes.

That makes the above error obvious: the function str(elem.text) in Python 3 creates and returns Unicode text. But in Python 2 it returns a simple byte string, which causes slugify to raise an error because it’s expecting Unicode text.

Now I have a problem. The source text I’m dealing with can be either straight ASCII or Unicode. It doesn’t matter for Python 3, because it implicitly works with it as Unicode. But not so much in Python 2.

I changed the above problem code to read:

self.h1h2_id[slugify(elem.text, '-')] = target

The change is that I removed the str() call that wrapped elem.text. I had it there originally to force it to Unicode, but further testing on both Python 2 and Python 3 shows it already is Unicode.

I may have to revisit this if I start getting errors (again) about non-Unicode text being passed to slugify.
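
If that happens, a small coercion helper is one possible fix. This is a sketch under the assumption that any byte strings are UTF-8 encoded; to_text is a hypothetical name and is not part of Python Markdown. It relies on bytes being an alias for str on Python 2.6 and later.

```python
import unicodedata

def to_text(value, encoding='utf-8'):
    """Return `value` as Unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):   # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value

# The same normalize/encode steps that slugify performs in the traceback above.
normalized = unicodedata.normalize('NFKD', to_text(b'Caf\xc3\xa9 menu'))
ascii_only = normalized.encode('ascii', 'ignore')
print(ascii_only)
```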