bdb-tool, a program for working with the Berkeley database

After noticing that GNU dbm performance, at least for inserts, drops off after about 75 million records, I decided to see how well Berkeley DB did. So I signed up for an Oracle account (using a unique email address) and downloaded a copy of the latest version.

It’s distributed as source code; compiling and installing it was quite straightforward:

dist/configure --prefix=/opt/bdb-18.1.32
make && sudo make install

Then came a surprise. GNU dbm ships with a handy command-line program (gdbmtool) to perform ad-hoc database operations, but no such program is distributed with the Berkeley database.

So I immediately set about adapting the GNU program for the Berkeley database.

No existing Berkeley database tool program

Before launching into a potentially long development process, I thought I should check that someone else hadn’t already done this.

Quite to my surprise, there don’t appear to be any existing projects like this. A couple of results on Stack Overflow suggest using the db_dump utility to see the contents of a database. One promising lead was BMT (Berkeley DB Management Tool), but it was last updated in February 2013 by being overwritten with a different piece of source code! Its website at bmt.sourceforge.net indicates the project is closed, with no date indicating when this happened.

Can I convert gdbmtool to use Berkeley database?

Initially I thought I should use the Berkeley “legacy” interface (designed for the original Unix dbm program and inherited by ndbm and GNU dbm) to ease the transition to Berkeley. So I mapped the gdbm calls to Berkeley calls:

GNU Database Manager         Berkeley Database
gdbm_open                    Historic dbm_open()
gdbm_close                   Historic dbm_close()
gdbm_store                   Historic dbm_store()
gdbm_fetch                   Historic dbm_fetch()
gdbm_delete                  Historic dbm_delete()
gdbm_firstkey                Historic dbm_firstkey()
gdbm_nextkey                 Historic dbm_nextkey()
gdbm_strerror                db_strerror()
gdbm_version                 db_version()
gdbm_errno                   Variable - should be able to use as is
gdbm_dump                    Don’t implement - see db_dump program
gdbm_load                    Don’t implement - see db_load program
gdbm_recover                 Don’t implement - see db_recover program
gdbm_avail_block_validate    No directly applicable function
gdbm_avail_block_valid_p     No directly applicable function
gdbm_avail_list_size         No directly applicable function
gdbm_count                   No directly applicable function
gdbm_count_t                 No directly applicable function
gdbm_db_strerror             No directly applicable function
gdbm_debug_flags             No directly applicable function
gdbm_debug_parse_state       No directly applicable function
gdbm_debug_printer           No directly applicable function
gdbm_debug_token             No directly applicable function
gdbm_file_seek               No directly applicable function
gdbm_full_read               No directly applicable function
gdbm_get_bucket              No directly applicable function
gdbm_hash                    No directly applicable function
gdbm_hash_key                No directly applicable function
gdbm_option                  No directly applicable function
gdbm_print_avail_list        No directly applicable function
gdbm_print_bucket_cache      No directly applicable function
gdbm_recovery                No directly applicable function
gdbm_reorganize              No directly applicable function
gdbm_setopt                  No directly applicable function
gdbm_syserr                  Used only in recover_handle() function

Based on the above, I concluded I’d be able to get the following legacy features to work, provided the tool’s original code doesn’t use unexpected gdbm functions when processing these commands:

  • open/close
  • store/fetch/delete
  • first/next
  • count (maybe)
  • define/unset
  • status
  • dump/import (by calling the utilities)
  • version

Removing vestiges of the NDBM/GDBM interface

However, after working with the C source to remove unusable functions and change gdbm calls to db calls, I figured I should just eliminate the gdbm functions and make this program work with the native Berkeley database functions. That meant tracking down things within the code that relied on the GDBM interface and converting them to the Berkeley way of doing things, or eliminating them altogether.

One of the more problematic items was datum. It’s a small structure in the old database code that represents a key or record:

typedef struct {
    char *dptr;
    int dsize;
} datum;

It’s used as the data type for several variables and functions:

bdb-tool.h:
   227:     datum dat;
   256: struct bdb_arg *bdb_arg_datum (datum *, struct locus *);
   348: void datum_format (FILE *fp, datum const *dat, struct dsegm *ds);
   349: int datum_scan (datum *dat, struct dsegm *ds, struct kvpair *kv);

bdb-tool.c:
    39: datum key_data;                                 /* Current key */
    40: datum return_data;                              /* Current data */
   560:         datum key;
   561:         datum data;
   566:                 datum nextkey = DB->nextkey(db_handle, key);
   795:                 N_("define datum structure") },
  1000: struct bdb_arg *bdb_arg_datum(datum *dat, struct locus *loc)
  1184:         datum d;
  1193:         datum d;
  1207: char *argtypestr[] = { "string", "datum", "k/v pair" };

datconv.c:
   216: datum_format (FILE *fp, datum const *dat, struct dsegm *ds)
   315: datum_scan_notag (datum *dat, struct dsegm *ds, struct kvpair *kv)
   411: datum_scan_tag (datum *dat, struct dsegm *ds, struct kvpair *kv)
   418: datum_scan (datum *dat, struct dsegm *ds, struct kvpair *kv)

The problem here is that the native Berkeley code doesn’t return datum type data from its DB->get() function. It does, however, return a DBT type:

/* Key/data structure -- a Data-Base Thang. */
struct __db_dbt {
    void     *data;         /* Key/data */
    u_int32_t size;         /* key/data length */

    u_int32_t ulen;         /* RO: length of user buffer. */
    u_int32_t dlen;         /* RO: get/put record length. */
    u_int32_t doff;         /* RO: get/put record offset. */

    void *app_data;

#define DB_DBT_APPMALLOC        0x0001  /* Callback allocated memory. */
#define DB_DBT_BULK             0x0002  /* Internal: Insert if duplicate. */
#define DB_DBT_DUPOK            0x0004  /* Internal: Insert if duplicate. */
#define DB_DBT_ISSET            0x0008  /* Lower level calls set value. */
#define DB_DBT_MALLOC           0x0010  /* Return in malloc'd memory. */
#define DB_DBT_MULTIPLE         0x0020  /* References multiple records. */
#define DB_DBT_PARTIAL          0x0040  /* Partial put/get. */
#define DB_DBT_REALLOC          0x0080  /* Return in realloc'd memory. */
#define DB_DBT_READONLY         0x0100  /* Readonly, don't update. */
#define DB_DBT_STREAMING        0x0200  /* Internal: DBT is being streamed. */
#define DB_DBT_USERCOPY         0x0400  /* Use the user-supplied callback. */
#define DB_DBT_USERMEM          0x0800  /* Return in user's memory. */
#define DB_DBT_BLOB             0x1000  /* Alias DB_DBT_EXT_FILE. */
#define DB_DBT_EXT_FILE         0x1000  /* Data item is an external file. */
#define DB_DBT_BLOB_REC         0x2000  /* Internal: Blob database record. */
    u_int32_t flags;
};

So it looks like I can do the following:

GDBM Berkeley
datum DBT
datum->dptr DBT->data
datum->dsize DBT->size

That was actually very close; the one exception was that when creating a DBT instance (in bdb_arg_datum()) I needed to set flags to 0.
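
The mapping, including setting flags to 0, can be sketched roughly as follows. This is an illustration only, not the code in bdb-tool: datum_to_dbt is a hypothetical helper name, and sketch_dbt is a cut-down stand-in for DBT (field names copied from the struct above) so that the fragment compiles without Berkeley DB installed. Real code would include db.h and use DBT directly.

```c
#include <stdint.h>
#include <string.h>

/* The old GDBM/NDBM key-or-record structure, as shown earlier. */
typedef struct {
    char *dptr;
    int dsize;
} datum;

/* Stand-in for Berkeley DB's DBT (assumption: field names copied from
 * the struct above; real code would use DBT from <db.h>). */
typedef struct {
    void    *data;
    uint32_t size;
    uint32_t ulen;
    uint32_t dlen;
    uint32_t doff;
    void    *app_data;
    uint32_t flags;
} sketch_dbt;

/* Hypothetical helper: point a DBT-like structure at a datum's bytes.
 * Zeroing the whole structure first leaves flags (and the other
 * bookkeeping fields) at 0. Nothing is copied; dst borrows src's buffer. */
void datum_to_dbt(const datum *src, sketch_dbt *dst)
{
    memset(dst, 0, sizeof *dst);
    dst->data = src->dptr;
    /* A negative dsize makes no sense as a length; treat it as empty. */
    dst->size = src->dsize > 0 ? (uint32_t)src->dsize : 0;
}
```

Clearing the whole structure with memset, rather than assigning flags alone, also resets ulen, dlen, and doff, which seemed the safer habit for a structure that is reused across calls.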

GDBM functions return data, Berkeley functions return success/fail

One noticeable change is how the two database engines return data. In GNU dbm, functions typically return handles or a datum, with errors being noted by a NULL return and an error value stored in gdbm_errno.

Berkeley DB, however, requires the programmer to pass the address of a data area which will receive the returned information, and the functions return a status code: zero means success, negative codes are database errors, and positive codes are regular errno values.
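
Here is a minimal sketch of that sign convention, with zero taken as success. The names rc_kind, classify_rc, and describe_rc are made up for this example and are not part of Berkeley DB or bdb-tool. In the real program db_strerror() already turns either kind of code into a message, so this only shows how a caller can tell the cases apart.

```c
#include <stdio.h>
#include <string.h>

enum rc_kind { RC_OK, RC_DB_ERROR, RC_SYSTEM_ERROR };

/* Classify a Berkeley DB style return code by its sign. */
enum rc_kind classify_rc(int rc)
{
    if (rc == 0)
        return RC_OK;
    return rc < 0 ? RC_DB_ERROR : RC_SYSTEM_ERROR;
}

/* Write a one-line description of rc into buf and return buf. */
char *describe_rc(int rc, char *buf, size_t len)
{
    switch (classify_rc(rc)) {
    case RC_OK:
        snprintf(buf, len, "success");
        break;
    case RC_DB_ERROR:
        /* Database-specific code; real code would ask db_strerror(). */
        snprintf(buf, len, "database error %d", rc);
        break;
    case RC_SYSTEM_ERROR:
        /* Positive codes are ordinary errno values. */
        snprintf(buf, len, "system error %d: %s", rc, strerror(rc));
        break;
    }
    return buf;
}
```

A caller would typically store the result of each database call in an int, test it against 0, and only then look at the data area it passed in.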

Work still to be done

Today’s Berkeley database bears only a superficial resemblance to the original dbm code written back in 1979 and used by ndbm and GNU dbm. It has a lot more functionality:

  • Additional database types: in addition to Btree, there’s Hash, Heap, Queue, and Recno
  • Bulk insert, read, and delete
  • Multiple databases in a single file
  • Ability to handle duplicate keys
  • Cursors
  • Foreign keys
  • Secondary indexes
  • Record locking
  • Transactions
  • Replication
  • Logging and recovery
  • Encryption
  • Environments (encapsulates a group of databases, metadata, and cache)
  • Partitions (splitting large databases into multiple files)
  • Slices (splitting large databases for processing by multiple cores)
  • Java and C++ support

Here’s a list of things that really should be added to bdb-tool to make it work with Berkeley database as it’s currently implemented:

  • Allow a database to be opened using any type
    • put works differently depending on the database type
  • Support additional create and open options
  • Berkeley DB supports bulk put, read, and delete operations; there should be a way to implement this with (say) a bulk-put function
  • Berkeley DB can have multiple databases within a single file; bdb-tool should support this.
    • Question: Is there a way to open a file and scan it for databases?
    • Answer: Yes. See Opening multiple databases in a single file: “The database type should be specified as DB_UNKNOWN and the database must be opened read-only. The handle that is returned from such a call is a handle on a database whose key values are the names of the databases stored in the database file and whose data values are opaque objects.”
  • Berkeley DB supports transactions, which may be useful when doing multiple put or delete operations.
  • Berkeley DB has database environments, which have a comprehensive set of functions associated with them. bdb-tool should support at least a basic set of these functions.
  • Berkeley DB has cursors. However, I may abstract cursors away behind the functionality for first/next/list.
  • DB->stat_print() to display statistics about a database
  • Specify a password to use for an encrypted database (command-line switch and tool command)
  • I need to get the program to build using automake
  • I need to reinstate language support. Messages are found in the following files:
    • bdb-tool.c
    • input-file.c
    • lex.c
    • parseopt.c
    • util.c
  • I need to re-write the man page
  • bdb-tool -h still has lines for “Report bugs to” and “gdbm home page” that need updating
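
For the first item in the list above (opening a database using any type), here is a rough sketch of how the tool might turn a type name typed by the user into a database type. Everything in it is hypothetical: tool_dbtype, parse_dbtype, and the accepted spellings are invented for illustration. The enum is a local stand-in so the fragment compiles on its own; the real program would map onto the DBTYPE constants from db.h instead.

```c
#include <ctype.h>
#include <stddef.h>

/* Local stand-in for the database types listed earlier (assumption:
 * the real tool would use DBTYPE values from <db.h> instead). */
enum tool_dbtype {
    TOOL_BTREE, TOOL_HASH, TOOL_HEAP, TOOL_QUEUE, TOOL_RECNO, TOOL_UNKNOWN
};

/* Case-insensitive string equality, written out to stay within ISO C. */
static int same_word(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/* Hypothetical: map a user-supplied type name to a type value.
 * Returns TOOL_UNKNOWN for NULL or unrecognised names. */
enum tool_dbtype parse_dbtype(const char *name)
{
    static const struct {
        const char *name;
        enum tool_dbtype type;
    } table[] = {
        { "btree", TOOL_BTREE },
        { "hash",  TOOL_HASH },
        { "heap",  TOOL_HEAP },
        { "queue", TOOL_QUEUE },
        { "recno", TOOL_RECNO },
    };
    size_t i;

    if (name == NULL)
        return TOOL_UNKNOWN;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (same_word(name, table[i].name))
            return table[i].type;
    return TOOL_UNKNOWN;
}
```

Returning a distinct "unknown" value leaves the decision to the caller: the open command could reject the name when creating a new database, or fall back to DB_UNKNOWN when opening an existing one.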

Updates needed to the language files

  • The welcome message is now:
      \nWelcome to the unofficial Berkeley Database tool. Type ? for help.\n\n
  • New message: failed to create database handle: %s
  • New message: failed to create cursor: %s
  • store is now put
  • fetch is now get