Elapsed time for MultiMarkdown to format a large file

I’ve known for a while now that MultiMarkdown is an inefficient program that scales poorly. As a test, I used the notebook I wrote at Olinia for the year 2016, which has 35,629 lines and weighs in at just over 1.6 MB. After removing all but the first line of the Table of Contents from the file, I used a script to successively cut the file in half until the number of lines was reduced to under 50. I then re-generated the Table of Contents and used MultiMarkdown to format each file.

Elapsed time for MultiMarkdown to format files approximately doubling in size

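| File | Lines | raven | Dell Laptop | penguin | HP Pavilion | Raspberry Pi 3 | Acer T180 | Raspberry Pi 1 |
|------|------:|------:|------------:|--------:|------------:|---------------:|----------:|---------------:|
| N | 35 | 0.0s | 0.0s | 0.3s | 0.0s | 0.5s | 0.0s | 1.5s |
| O | 69 | 0.0s | 0.1s | 0.0s | 0.1s | 0.8s | 0.0s | 1.9s |
| P | 140 | 0.1s | 0.2s | 0.1s | 0.1s | 1.4s | 0.1s | 3.0s |
| Q | 284 | 0.1s | 0.2s | 0.1s | 0.3s | 1.3s | 0.2s | 4.6s |
| R | 566 | 0.2s | 0.4s | 0.2s | 0.5s | 2.0s | 0.5s | 7.0s |
| S | 1130 | 0.5s | 0.9s | 0.4s | 0.8s | 6.3s | 1.7s | 11.3s |
| T | 2268 | 1.6s | 2.6s | 1.9s | 2.8s | 18.7s | 0.8s | 34.2s |
| U | 4517 | 3.1s | 5.2s | 4.6s | 6.8s | 31.1s | 8.0s | 1m19.5s |
| V | 9017 | 12.7s | 23.1s | 42.2s | 54.3s | 2m30.0s | 1m38.3s | 9m47.1s |
| W | 17914 | 42.2s | 1m08.5s | 1m45.7s | 1m59.9s | 7m04.9s | 13m46.6s | 21m13.0s |
| Y | 35629 | 7m08.3s | 10m50.8s | 20m33.7s | 21m17.0s | 60m41.8s | 1h54m43.4s | 3h22m25.5s |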
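
To put a number on how poorly the times scale, two entries from the table can be converted to seconds and divided. The following is a minimal sketch and was not part of the original test; the helper names `to_seconds` and `ratio` are made up for illustration, and it assumes an `awk` is available.

```bash
#!/bin/bash
# Hypothetical helper: convert a duration such as "42.2s", "7m08.3s",
# or "1h54m43.4s" (the formats used in the table) to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        h = 0; m = 0
        # Peel off an optional hours part, then an optional minutes part.
        if (match(t, /[0-9]+h/)) {
            h = substr(t, RSTART, RLENGTH - 1)
            t = substr(t, RSTART + RLENGTH)
        }
        if (match(t, /[0-9]+m/)) {
            m = substr(t, RSTART, RLENGTH - 1)
            t = substr(t, RSTART + RLENGTH)
        }
        sub(/s$/, "", t)            # what remains is the seconds part
        printf "%.1f\n", h * 3600 + m * 60 + t
    }'
}

# Hypothetical helper: how many times longer the second duration is.
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", b / a }'
}

# raven, file W (17914 lines) versus file Y (35629 lines)
ratio 42.2s 7m08.3s
```

For raven, going from file W to file Y roughly doubles the line count but takes about ten times as long, well beyond the fourfold increase that quadratic scaling would give.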

Here’s the script I used to generate the files and run MultiMarkdown:

#!/bin/bash
LINE_COUNT=$(wc -l <Notebook.b.2016.text)
ALPHABET="YWVUTSRQPONMLKJIHGFEDCBA"
TEMP_FN="/r/Notebook.2016.tmp"

# REMINDER: Remove all but the first line of the Table of Contents from
# Notebook.b.2016.text

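# Start at double the real count because the loop halves LINE_COUNT
# before using it, so the first file (Y) is the whole notebook.
LINE_COUNT=$((LINE_COUNT * 2))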
N=-1
while [ $LINE_COUNT -gt 50 ]
do
    N=$((N+1)); L=${ALPHABET:$N:1}
    TO_FN="/r/Notebook.2016.$L.text"

    LINE_COUNT=$((LINE_COUNT/2))
    echo "$N $TO_FN ($LINE_COUNT lines)"
    head -n $LINE_COUNT Notebook.b.2016.text >$TEMP_FN

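    # Regenerate the Table of Contents in the truncated file.
    genTOC.pl $TEMP_FN &>/dev/null
    # Locate the TOC body: it starts two lines after the last
    # "Table of Contents" line and ends at the last entry line.
    TOC_FIRST_LINE=$(grep -n '^Table of Contents' $TEMP_FN |
        tail -n 1 | sed 's/\([0-9]\+\).*/\1/')
    TOC_FIRST_LINE=$((TOC_FIRST_LINE + 2))
    TOC_LAST_LINE=$(grep -n '^ \+[0-9]\+ \+[A-Z]' $TEMP_FN |
        tail -n 1 | sed 's/\([0-9]\+\).*/\1/')
    # Indent the TOC body by two spaces and write the test file.
    sed "${TOC_FIRST_LINE},${TOC_LAST_LINE}s/^/  /" $TEMP_FN >$TO_FN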
done
rm -f $TEMP_FN

export TIMEFORMAT='real %1lR (%1R seconds)'
cd /r
for FILE in Notebook.2016.[A-Z].text
do
    echo -e "_____\n"
    echo $FILE
    OUT_FN=${FILE/.2016/}; OUT_FN=${OUT_FN/.text/.html}
    time MultiMarkdown.pl $FILE >$OUT_FN
done
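
The halving loop in the script can be tried on its own to see which line counts it asks `head` for. This is only a sketch: the function name `halving_sequence` is made up for illustration, it touches no files, and 35629 is used as a stand-in starting count.

```bash
#!/bin/bash
# Hypothetical helper: print the line counts the halving loop would use,
# given a starting line count, without creating any files.
halving_sequence() {
    local count=$(( $1 * 2 ))       # doubled up front, as in the script
    while [ "$count" -gt 50 ]; do
        count=$(( count / 2 ))      # integer division, rounds down
        echo "$count"
    done
}

halving_sequence 35629
```

With that starting count the loop produces 11 values, ending at 34, which lines up with the 11 files Y through N. The counts in the table are taken after the Table of Contents has been regenerated, so they need not match these values exactly.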