Storing a list of files in sqlite and dbm, using Python

As part of my attempt to familiarise myself with Python, on Saturday 13 October I rewrote the Perl program that reads file names from stdin and puts them into a database. This version creates both an old-school dbm file (a key/value store: in this case, the key is a number and the value is a file name) and a sqlite3 file.

#!/usr/bin/python3
import sys
import dbm.gnu
import sqlite3

# Create the DBM database; 'n' means 'truncate and open for read/write'
db = dbm.open('allfiles.dbm', 'n', mode=0o644)

# Create the sqlite3 database. sqlite3.connect() has no mode argument like
# open()'s 'r/w/rw/rb+', so I first open the file for writing, truncate and
# close it, then reopen it with sqlite3.
fh = open('allfiles.sq3', 'w')
fh.truncate(0)
fh.close()
conn = sqlite3.connect('allfiles.sq3')
c = conn.cursor()
# 'INTEGER PRIMARY KEY' (not 'INT') makes id an alias for SQLite's rowid,
# so inserting NULL assigns the next sequential id automatically.
c.execute("CREATE TABLE allfiles (id INTEGER PRIMARY KEY, filepath TEXT)")

i = 0
while True:
    try:
        filepath = input()
    except UnicodeDecodeError:
        print("Unable to decode input")
        continue    # skip the undecodable line
    except EOFError: 
        break
    filepath = filepath.rstrip()

    i = i + 1
    if i % 100 == 0:
        sys.stdout.write("  {0}\r".format(i))
        sys.stdout.flush()

    # Add an "i: filepath" entry into the dbm file
    db[str(i)] = filepath

    # Insert a row into the sqlite database. The "(filepath,)" wraps the
    # string in a one-element tuple, because execute() expects a sequence of
    # parameters; passing the bare string raises "sqlite3.ProgrammingError:
    # Incorrect number of bindings supplied."
    c.execute("INSERT INTO allfiles VALUES(NULL, ?)", (filepath,))
db.close()
conn.commit()
conn.close()
print("Loaded {} file names into the database".format(i))

The Python program is considerably longer than the Perl program, primarily because Python encourages readability over terseness. It’s also handling two databases simultaneously.

I ran the program using the following command line:

find /var -type f 2>/dev/null |
  egrep --text -v '/(bin|dev|lib|lib64|proc|r|run|sbin|sys)/' |
  ./x.py

Reading stdin

The standard way to iterate over stdin in Python is:

for filepath in sys.stdin:
    do_stuff()

But that construct killed the loop when it read in a file name that was in ISO-8859 encoding: Python croaked because it believed the input was invalid UTF-8. So I had to use a different loop construct to handle the error:

while True:
    try:
        filepath = input()
    except UnicodeDecodeError:
        print("Unable to decode input")
        continue
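An alternative I didn’t use here is to bypass the text layer entirely and decode each raw line yourself, replacing undecodable bytes instead of discarding the whole line. A sketch (the decode_line helper is my own name, not part of the original program):

```python
import sys

def decode_line(raw: bytes) -> str:
    # Replace invalid byte sequences with U+FFFD instead of
    # raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace").rstrip("\n")

def read_paths():
    # sys.stdin.buffer yields undecoded bytes, so we control
    # how decoding errors are handled.
    for raw in sys.stdin.buffer:
        yield decode_line(raw)
```

This keeps the partially garbled file name in the database rather than dropping it, which may or may not be what you want.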

No ++ operator

Python doesn’t have the ++ operator, so I couldn’t write

if ++i % 100 == 0:

Instead I used:

i = i + 1
if i % 100 == 0:
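Python does at least have augmented assignment, which is the closest thing to ++; and for counting loop iterations, enumerate() is the idiomatic tool (the list below is just an illustration):

```python
# Augmented assignment: the closest Python gets to ++.
i = 0
i += 1

# Idiomatically, enumerate() supplies the counter itself, pairing
# each item with a running index (here starting at 1).
names = ["a.txt", "b.txt", "c.txt"]
numbered = list(enumerate(names, start=1))
```

With enumerate() the counting code in the main loop could disappear entirely.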

Dynamic but strong typing

    db[str(i)] = filepath

The db object is handled like a Python dictionary, but its keys and values must be strings or bytes. i started life as an integer. Perl would automatically stringify it when needed, but Python won’t, hence the use of the str() function.
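A quick demonstration of the same point, using the portable dbm front end and a throwaway file name of my choosing:

```python
import dbm

# 'n' truncates and opens for read/write, as in the main program.
db = dbm.open("typing-demo.dbm", "n")
db[str(1)] = "somefile.txt"   # db[1] = ... would raise TypeError
value = db["1"]               # values always come back as bytes
db.close()
```

Note that whatever you store, dbm hands values back as bytes, so a later reader may need to .decode() them.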

str.format() replaces printf

    print("Loaded {} file names into the database".format(i))

Python doesn’t have a printf() function, only print() (which is implemented as a function, so the parentheses are required). Instead, the string object has a format() method that does much the same thing as printf.
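format() also accepts printf-style field specifications, so little is lost in the translation (the values below are just examples):

```python
count = 672525

msg = "Loaded {} file names".format(count)   # plain substitution
padded = "{:>8}".format(count)               # right-aligned, width 8 (like %8d)
grouped = "{:,}".format(count)               # thousands separators
```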

(not in this script) Ternary operator

Perl:

print "Loaded $i file name", $i == 1 ? '' : 's', " into the database\n";

Python:

print("Loaded {} file name{} into the database".format(i, '' if i==1 else 's'))

sqlite files are about a third smaller than gdbm files

I couldn’t get the program to run on Python 2 on penguin, so I created a file list there using the following:

find / -type f 2>/dev/null |
  egrep --text -v '/(bin|dev|lib|lib64|proc|r|run|sbin|sys)/' | 
  gzip >allfiles.text.gz

Then I copied allfiles.text.gz to sparrow and ran this:

$ time gunzip -c allfiles.text.gz | ./x.py 
Unable to decode input
Unable to decode input
Unable to decode input
Loaded 672525 file names into the database

real    0m17.750s
user    0m13.365s
sys     0m4.736s

The two database files created were:

-rw-r--r-- 1 brian brian 93,855,744 Oct 14 02:59 allfiles.dbm
-rw-rw-r-- 1 brian brian 62,194,688 Oct 14 02:59 allfiles.sq3

I’m not sure how gdbm creates and indexes its hashes, but sqlite3 seems to do a far more efficient job of it. Part of it is likely that sqlite is indexing the id field as an integer (probably 4 bytes), while gdbm first creates an 8-byte hash and stores that. But that accounts for 2 MB of data; it doesn’t explain what the other 29 MB are used for.

Be that as it may, extracting a single key from the gdbm file is easy:

# The 'count' command returns "There are N items in the database."
FILE_COUNT=$(echo count | gdbmtool allfiles.dbm | cut -f3 -d' ')
echo fetch $((${RANDOM}${RANDOM}${RANDOM} % FILE_COUNT)) | gdbmtool allfiles.dbm
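The sqlite3 equivalent of that random fetch might look like the sketch below; random_filepath is my own helper name, and it assumes the ids are the contiguous 1..N values the loader assigned:

```python
import random
import sqlite3

def random_filepath(db_path):
    """Fetch one randomly chosen path from the allfiles table."""
    conn = sqlite3.connect(db_path)
    try:
        c = conn.cursor()
        (count,) = c.execute("SELECT COUNT(*) FROM allfiles").fetchone()
        row_id = random.randint(1, count)   # ids run 1..count
        (path,) = c.execute("SELECT filepath FROM allfiles WHERE id = ?",
                            (row_id,)).fetchone()
        return path
    finally:
        conn.close()
```

Unlike the gdbmtool pipeline, there’s no text output to parse, and the parameter binding sidesteps any quoting issues in the file names.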