Multibase CLI

Posted by Chris Warburton

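I recently pushed a few patches to the multiformats/rust-multibase project to scratch my own itch, and I thought they would be useful to talk about. In short, these changes make the multibase command read from stdin by default, write only the requested data to stdout (no "Result: " prefix, and raw bytes rather than a list of numbers when decoding), and stop appending a trailing newline.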

Overall these make the multibase command more useful as a day-to-day tool, akin to familiar workhorses like base64 from coreutils.

What is multibase?

Multibase is a “self-identifying base encoding”: it’s a data format which prepends a prefix character to various existing encoding formats, so they can be easily identified and converted between, without requiring extra out-of-band knowledge.

For example, the ASCII bytes hello can be encoded as multibase strings in a variety of ways, depending on situation and preference, e.g. as f68656c6c6f (base16) or 732062554330674 (base8).

The important part is the first character, which indicates the base/alphabet in use; e.g. the leading 7 tells us to interpret the subsequent 32062554330674 as base8 (AKA octal).

This prevents ambiguity, since many encoding schemes have overlapping alphabets. If the above examples didn’t have a leading character to identify their base encoding (i.e. if I didn’t write them as multibase), then there would be no way to tell whether that 32062554330674 was intended to mean base8, base10, base16, etc.
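To make the prefix idea concrete, here is a rough sketch that produces the base16 and base8 forms of some bytes using only od, tr and awk (no multibase tooling involved). The function names mb_base16 and mb_base8 are made up for illustration, and padding the final base8 group with zero bits is an assumption, chosen because it reproduces the base8 string discussed above:

```shell
# Hypothetical helpers: encode stdin as a multibase string.

# base16: prefix 'f', then two lowercase hex digits per byte.
mb_base16() {
  printf 'f'
  od -An -v -tx1 | tr -d ' \t\n'
}

# base8: prefix '7', then one octal digit per group of 3 bits.
# Assumption: the last group is padded with zero bits.
mb_base8() {
  od -An -v -tu1 | awk '
    {
      # Expand every byte (printed by od as a decimal number) into 8 bits.
      for (f = 1; f <= NF; f++) {
        b = $f + 0
        for (i = 128; i >= 1; i = i / 2) {
          if (b >= i) { bits = bits "1"; b -= i } else { bits = bits "0" }
        }
      }
    }
    END {
      while (length(bits) % 3) bits = bits "0"
      out = "7"
      for (i = 1; i <= length(bits); i += 3)
        out = out (substr(bits, i, 1) * 4 + substr(bits, i + 1, 1) * 2 + substr(bits, i + 2, 1))
      printf "%s", out
    }'
}

printf 'hello' | mb_base16; echo   # f68656c6c6f
printf 'hello' | mb_base8; echo    # 732062554330674
```

Without the leading f or 7, a reader of the second output could not tell whether its digits were meant as octal, decimal or hexadecimal.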

Here’s the full table of prefixes

Note that many programming languages use a similar approach to disambiguate the base of literal numbers, e.g. it’s common to use 0x as a prefix for base16, or (less commonly) 0o for base8, 0b for base2, etc. One way that multibase differs from such schemes is that each multibase prefix is a valid symbol in the base it represents; e.g. f is valid in base16, whereas 0x is not.

Existing implementations

Multibase is a pretty simple idea (the hard part is getting people to adopt it consistently!) so there are implementations in pretty much every programming language. However, they vary in how many of the formats they support.

The rust-multibase implementation supports a large number of formats, so if you’re not tied to some other language then it’s a decent choice.

Existing tooling

Alongside library support in programming languages, I also care about CLI tools for encoding and decoding multibase data (converting to another base might be nice too, but is easy enough to accomplish with a pipe).

This is where I struggled, since I didn’t find any implementation that felt right. Don’t get me wrong, there are loads of multibase CLI tools out there; but some of them only support a limited selection of formats/bases. Others have a decent selection of formats, but are intended mostly as demos showing how to use their corresponding library, rather than “proper” tools suitable for “serious” work, which results in awkward, unscriptable interfaces.

Being dissatisfied with the current state of affairs, I decided to choose one of those awkward demo applications and turn it into something more useful as a day-to-day data-processing workhorse.

rust-multibase

I chose to work on rust-multibase, partially because it’s provided by the multiformats GitHub organisation, and hence feels better supported than some random personal project. It already supports many formats, so I would only need to fiddle with the UI. Finally (at the risk of starting flamewars), it compiles down to a fast, standalone executable.

I’ve written enough data-processing tools to know that I’ll inevitably end up running them in a shell loop one day; and spinning up an interpreter over and over again in a loop is horrible for performance.

Standalone executables also avoid bloating the runtime closure of downstream scripts, which might otherwise contain multiple redundant interpreters, etc. (especially since I use Nix).

Commandline ergonomics

rust-multibase is primarily a Rust library, but it also provides a multibase command built on that library. My main issue with that command was its poor CLI ergonomics; here are examples of what encoding and decoding the ASCII text hello used to look like, before my changes:

$ multibase encode -b base16 -i hello; echo
Result: f68656c6c6f
$ multibase decode -i 'f68656c6c6f'; echo
Result: base16, [104, 101, 108, 108, 111]

This is less than ideal, for several reasons…

Binary data requires file descriptors

The old CLI required data to be provided via a -i (--input) argument. That certainly limits the amount of data we can process; but it turns out to have deeper problems, even for small inputs.

In particular, multibase is a great way to encode binary data (i.e. bytes which aren’t limited to valid ASCII or UTF-8); but commandline arguments aren’t suitable for storing binary data. This is because they’re passed to the program as null-terminated strings, which has two consequences:

How Linux implements program arguments…

Arguments are supplied to Linux programs via the execve system call, which takes a pointer to a null-terminated array of pointers, each referring to an array of bytes. There is no mechanism to communicate the length of those byte arrays (e.g. no length prefix, or equivalent), so each one is treated as a null-terminated string: anything after its first NUL byte is ignored. Even if we invented our own approach to passing sizes, there’s no support in existing infrastructure like shells, etc.

As a result, commandline arguments are incapable of representing NUL bytes; and hence cannot store arbitrary binary data. Of course, it’s common to avoid such problems by encoding our binary with something like base64; but that would defeat the purpose of using multibase!

Instead, we need an alternative method to send data into the multibase process; especially when using encode. The usual approach is (as its name suggests) the standard input stream.

Hence, with my updates, the multibase command will now default to reading stdin (unless the existing -i or --input arguments are used), which lets it participate in pipelines, e.g.

printf 'hello' | multibase encode -b base16

and

printf 'f68656c6c6f' | multibase decode

Pipelines have another benefit for binary data. Arguments are usually built from string variables in our host/shell language; if those strings are null-terminated (like they are in Bash) then they can’t hold arbitrary binary data either.

In contrast, pipelines provide direct communication from producer to consumer, without any involvement of the host/shell language.
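The difference is easy to observe without multibase at all. The following is a small sketch using only printf, wc and tr; count_bytes is a made-up helper name, and the behaviour of the variable assignment is what I would expect from Bash specifically (other shells may treat NUL bytes differently):

```shell
# Count the bytes arriving on stdin (tr strips the padding some wc versions add).
count_bytes() {
  wc -c | tr -d ' '
}

# Three bytes, the middle one NUL, pass through a pipe untouched.
printf 'a\0b' | count_bytes        # 3

# Routing the same bytes through a shell variable is lossy: Bash drops the
# NUL (and prints a warning, silenced here); other shells may differ.
{ data=$(printf 'a\0b'); } 2>/dev/null
printf '%s' "$data" | count_bytes
```

The first count is taken directly from the pipe, so it does not depend on how the shell stores strings.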

Avoid obfuscating results

The less fatal, but more obvious, annoyance is the output being prefixed by the string "Result: ". That’s fine for interactive usage, but writing such non-data to the standard output stream makes it harder to use with other tools, e.g. forcing scripts to strip it off with tools like cut -d' ' -f2-.

I’ve altered this behaviour, so the new version of multibase only writes the requested data to stdout, with no extra fluff.

The output of decode had another problem, since the decoded data was being shown as a JSONesque array of decimal numbers like [104, 101, 108, 108, 111]. This is much less useful (and more error-prone) than a direct representation as a string of bytes. So, I changed that too!
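To give a feel for the clean-up that scripts needed with the old output, here is a sketch of the two workarounds in plain shell. Both function names are hypothetical, and the input lines are simply the old outputs shown earlier in this post:

```shell
# Hypothetical clean-up helpers for the OLD output format.

# "Result: f68656c6c6f" -> "f68656c6c6f"
strip_result_prefix() {
  cut -d' ' -f2-
}

# "Result: base16, [104, 101, 108, 108, 111]" -> the raw bytes "hello"
decimal_list_to_bytes() {
  # Keep only what is between the brackets, then put one number per line.
  sed 's/^[^[]*\[//; s/].*$//' | tr -d ' ' | tr ',' '\n' |
    while read -r n || [ -n "$n" ]; do
      [ -n "$n" ] || continue
      # %b expands \0ddd (octal) into the corresponding byte.
      printf '%b' "\\0$(printf '%o' "$n")"
    done
}

printf 'Result: f68656c6c6f\n' | strip_result_prefix
printf 'Result: base16, [104, 101, 108, 108, 111]\n' | decimal_list_to_bytes; echo
```

Every consumer having to carry around something like decimal_list_to_bytes is exactly the kind of friction that writing raw bytes avoids.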

Here’s our running example with the new version:

$ printf 'hello' | multibase encode -b base16; echo
f68656c6c6f
$ printf 'f68656c6c6f' | multibase decode; echo
hello

Notice that I’ve added some calls to echo. That’s due to the next change…

Avoiding extra characters

There was one more way that the data on stdout differed from the literal encoded or decoded contents: the addition of a trailing newline.

Whilst this can make life a bit nicer in a terminal (ensuring our prompt always appears at the start of its own line), I don’t think that’s enough to justify altering the bytestring output of decode. Then, if we don’t insert a trailing newline for decode, it seems inconsistent to do so for encode. Hence I’ve changed both modes to avoid it.
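A trailing newline is a real byte (0x0a), so appending one changes the data rather than just its appearance. A quick way to see this, sketched with od and tr only (hex_of is a made-up helper name):

```shell
# Print the bytes on stdin as one string of hex digits.
hex_of() {
  od -An -v -tx1 | tr -d ' \t\n'
}

printf 'hello'   | hex_of; echo   # 68656c6c6f
printf 'hello\n' | hex_of; echo   # 68656c6c6f0a
```

If decoding appended a newline, re-encoding the result would give the second string instead of the first.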

With this change, the encode and decode commands are now true inverses, with data round-tripping correctly between them:

$ printf 'hello' | multibase encode -b base16 | multibase decode; echo
hello
$ diff <(printf 'hello') \
       <(printf 'hello' | multibase encode -b base16 | multibase decode)
$ echo "$?"
0
$ printf 'f68656c6c6f' | multibase decode | multibase encode -b base16; echo
f68656c6c6f
$ diff <(printf 'f68656c6c6f') \
       <(printf 'f68656c6c6f' | multibase decode | multibase encode -b base16)
$ echo "$?"
0

Bonus: reproducible builds

I’ve also upstreamed a separate patch, which adds a Cargo.lock file to the project repository. I’ve written about “lock files” before, but the current Rust ecosystem seems to be stuck with them. In particular, the rustPlatform.buildRustPackage function in Nixpkgs expects a lock file, so that its dependency-fetching can be verified.

Going forward

I originally wrote these patches for multiformats/rust-multibase to enable some experiments I’ve been doing with git on ipfs. Now that the multibase command is more suited to scripting, I’m also more inclined to use multibase elsewhere, rather than hard-coding a choice like hex or base64.

One notable outstanding issue is that the multibase command isn’t streaming: it reads all data, performs the required encoding/decoding in memory, then dumps out the result. That’s not been an issue in any of my use-cases so far, but I can certainly imagine streaming being useful; I’m just not familiar enough with Rust yet to know what the best abstraction/implementation/pattern might be for that.