Multibase CLI
I recently pushed a few patches to the multiformats/rust-multibase project to scratch my own itch, and I thought they would be useful to talk about. In short, these changes do the following:
- Allow the multibase command to read data from stdin, rather than requiring commandline arguments. This is important for handling binary data, NUL bytes, etc.
- Print resulting bytes as-is, without adornment. This is useful in scripts.
- Provide a Cargo.lock file for reproducible builds, e.g. the cargoLock.lockFile argument of nixpkgs.rustPlatform.buildRustPackage.
Overall these make the multibase command more useful as
a day-to-day tool, akin to familiar workhorses like base64
from coreutils.
What is multibase?
Multibase is a “self-identifying base encoding”: it’s a data format which adds a prefix character on to various existing encoding formats, so they can be easily identified and converted-between, without requiring extra out-of-band knowledge.
For example, the ASCII bytes hello can be encoded as
multibase strings in a variety of ways, depending on situation and
preference, e.g.:
- maGVsbG8 (base64)
- zCn8eVZg (base58btc)
- f68656c6c6f (base16)
- F68656C6C6F (base16upper)
- 9448378203247 (base10)
- 732062554330674 (base8)
- 00110100001100101011011000110110001101111 (base2)
The important part is the first character, which indicates the
base/alphabet in use; e.g. the leading 7 tells us to
interpret the subsequent 32062554330674 as
base8 (AKA octal).
This prevents ambiguity, since many encoding schemes have overlapping
alphabets. If the above examples didn’t have a leading
character to identify their base encoding (i.e. if I hadn’t written them
as multibase), then there would be no way to tell whether that
32062554330674 was intended to mean base8,
base10, base16, etc.
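To make the prefix idea concrete, here is a minimal Rust sketch using only the standard library. It is not the rust-multibase API: the function names encode_base16, encode_base2 and identify are made up for illustration, and only the prefixes from the examples above are recognised.

```rust
/// Encode bytes as a multibase base16 string: the prefix 'f', then lowercase hex.
fn encode_base16(data: &[u8]) -> String {
    let mut out = String::from("f");
    for byte in data {
        out.push_str(&format!("{:02x}", byte));
    }
    out
}

/// Encode bytes as a multibase base2 string: the prefix '0', then eight bits per byte.
fn encode_base2(data: &[u8]) -> String {
    let mut out = String::from("0");
    for byte in data {
        out.push_str(&format!("{:08b}", byte));
    }
    out
}

/// Name the base indicated by the first character of a multibase string.
/// Hypothetical helper: it only knows the prefixes used in the examples above.
fn identify(encoded: &str) -> Option<&'static str> {
    match encoded.chars().next()? {
        '0' => Some("base2"),
        '7' => Some("base8"),
        '9' => Some("base10"),
        'f' => Some("base16"),
        'F' => Some("base16upper"),
        'z' => Some("base58btc"),
        'm' => Some("base64"),
        _ => None,
    }
}

fn main() {
    println!("{}", encode_base16(b"hello")); // f68656c6c6f
    println!("{}", encode_base2(b"hello"));
    // Without the leading 7, these digits could be base8, base10 or base16.
    println!("{:?}", identify("732062554330674")); // Some("base8")
}
```

The decoder never has to guess: it looks at one character, picks the alphabet, and only then interprets the rest.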
Here’s the full table of prefixes
Note that many programming languages use a similar approach to
disambiguate the base of literal numbers, e.g. it’s common to use
0x as a prefix for base16, or (less commonly)
0o for base8, 0b for
base2, etc. One way that multibase differs from such
schemes is that each multibase prefix is a valid symbol in the base it
represents; e.g. f is valid in base16, whereas
0x is not.
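That difference is easy to check in Rust, whose integer literals happen to use the 0x, 0o and 0b prefixes. The helper is_valid_digit below is a hypothetical name, wrapping the standard char::is_digit method.

```rust
/// Check whether a character is a valid digit in the given radix.
fn is_valid_digit(symbol: char, radix: u32) -> bool {
    symbol.is_digit(radix)
}

fn main() {
    // The same number (104, the ASCII code for 'h') as literals in three bases.
    assert_eq!(0x68, 104);
    assert_eq!(0o150, 104);
    assert_eq!(0b0110_1000, 104);

    // These multibase prefixes are valid digits of the base they announce...
    assert!(is_valid_digit('f', 16));
    assert!(is_valid_digit('9', 10));
    assert!(is_valid_digit('7', 8));
    assert!(is_valid_digit('0', 2));

    // ...whereas the 'x' of a 0x literal is not a base16 digit.
    assert!(!is_valid_digit('x', 16));
}
```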
Existing implementations
Multibase is a pretty simple idea (the hard part is getting people to adopt it consistently!) so there are implementations in pretty much every programming language. However, they vary in how many of the formats they support.
The rust-multibase implementation supports a large
number of formats, so if you’re not tied to some other language then
it’s a decent choice.
Existing tooling
Alongside library support in programming languages, I also care about CLI tools for encoding and decoding multibase data (converting to another base might be nice too; but is easy-enough to accomplish with a pipe).
This is where I struggled, since I didn’t find any implementation that felt right. Don’t get me wrong, there are loads of multibase CLI tools out there; but some of them only support a limited selection of formats/bases. Others have a decent selection of formats, but are intended mostly as demos showing how to use their corresponding library, rather than “proper” tools suitable for “serious” work; this results in awkward, unscriptable interfaces.
Being dissatisfied with the current state of affairs, I decided to choose one of those awkward demo applications and turn it into something more useful as a day-to-day data-processing workhorse.
rust-multibase
I chose to work on rust-multibase, partially because
it’s provided by the multiformats GitHub organisation, and hence feels
more-supported than some random personal project. It already supports
many formats, so I would only need to fiddle the UI. Finally (at the
risk of starting flamewars), it compiles down to a fast, standalone
executable.
I’ve written enough data-processing tools to know that I’ll inevitably end up running them in a shell loop one day; and spinning up an interpreter over and over again in a loop is horrible for performance.
Standalone executables also avoid bloating the runtime closure of downstream scripts, which might otherwise contain multiple redundant interpreters, etc. (especially since I use Nix).
Commandline ergonomics
rust-multibase is primarily a Rust library, but it also
provides a multibase command built on that library. My main
issue with that command was its poor CLI ergonomics; here are examples
of what encoding and decoding the ASCII text hello
used to look like, before my changes:
$ multibase encode -b base16 -i hello; echo
Result: f68656c6c6f
$ multibase decode -i 'f68656c6c6f'; echo
Result: base16, [104, 101, 108, 108, 111]
This is less than ideal, for several reasons…
Binary data requires file descriptors
The old CLI required data to be provided via a
-i (--input) argument. That certainly limits
the amount of data we can process; but it turns out to have deeper
problems, even for small inputs.
In particular, multibase is a great way to encode
binary data (i.e. bytes which aren’t limited to valid ASCII or
UTF-8); but commandline arguments aren’t suitable for storing binary
data. This is because they’re passed to the program as null-terminated
strings, which has two consequences:
- The program does not know the length of each argument. To calculate that, it must count the argument’s bytes one-by-one from the beginning.
- If the program encounters a zero byte (AKA a NUL or null byte) when reading an argument, it must not attempt to read any further; since it could go out of bounds.
How Linux implements program arguments…
Arguments are supplied to Linux programs via the execve
system call, which gives the program a pointer to a null-terminated
array of pointers, and those refer to arrays of bytes. It’s not
required for those arrays of bytes to be null-terminated
strings, but there is no standard mechanism to communicate the length of
those arrays (e.g. no length-prefix,
or equivalent). Even if we invented our own approach to passing sizes,
there’s no support in existing infrastructure like shells, etc.
As a result, commandline arguments are incapable of
representing NUL bytes; and hence cannot store arbitrary binary
data. Of course, it’s common to avoid such problems by encoding our
binary with something like base64; but that would defeat
the purpose of using multibase!
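The truncation can be sketched in Rust with just the standard library. The helper as_c_string_sees_it is a made-up name that mimics how a null-terminated reader scans for the end of a string; CString::new is the standard constructor for C-compatible strings, and it refuses input containing a NUL byte.

```rust
use std::ffi::CString;

/// Return the bytes a null-terminated reader would see:
/// everything before the first NUL byte, or all of it if there is none.
fn as_c_string_sees_it(data: &[u8]) -> &[u8] {
    match data.iter().position(|&byte| byte == 0) {
        Some(end) => &data[..end],
        None => data,
    }
}

fn main() {
    // Five bytes of "binary" data with a NUL in the middle.
    let binary = b"ab\0cd";

    // A reader that stops at the first NUL loses the last three bytes.
    assert_eq!(as_c_string_sees_it(binary), &b"ab"[..]);

    // Rust will not even build a C string from this data.
    let error = CString::new(binary.to_vec()).unwrap_err();
    assert_eq!(error.nul_position(), 2);
    println!("NUL byte found at position {}", error.nul_position());
}
```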
Instead, we need an alternative method to send data into the
multibase process; especially when using
encode. The usual approach is (as its name suggests) the
standard input stream.
Hence, with my updates, the multibase command will now
default to reading stdin (unless the existing -i or
--input arguments are used), which lets it participate in
pipelines, e.g.
printf 'hello' | multibase encode -b base16
and
printf 'f68656c6c6f' | multibase decode
This has another benefit regarding binary, since arguments are usually expressed using string variables from our host/shell language; if those are assumed to be null-terminated (like they are in Bash) then they can’t be used to express arbitrary binary data either.
In contrast, pipelines provide direct communication from producer to consumer, without any involvement of the host/shell language.
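The "argument if given, otherwise stdin" fallback might look something like the following sketch. This is not the actual rust-multibase code: read_input is a hypothetical function, and treating the argument as plain text is an assumption. Taking any Read implementation keeps the function easy to exercise without a real pipe.

```rust
use std::io::{self, Read};

/// Use the explicit argument if one was given; otherwise read every byte
/// from `reader` (which would be stdin in a real commandline tool).
fn read_input(arg: Option<&str>, mut reader: impl Read) -> io::Result<Vec<u8>> {
    match arg {
        Some(text) => Ok(text.as_bytes().to_vec()),
        None => {
            let mut buffer = Vec::new();
            reader.read_to_end(&mut buffer)?;
            Ok(buffer)
        }
    }
}

fn main() -> io::Result<()> {
    // A byte slice stands in for stdin here; a real tool would pass
    // io::stdin().lock() instead. Note the NUL and non-UTF-8 bytes.
    let piped: &[u8] = b"\x00\xffhello";

    let data = read_input(None, piped)?;
    assert_eq!(data.len(), 7);

    // An explicit argument still takes priority over the stream.
    assert_eq!(read_input(Some("hello"), piped)?, b"hello".to_vec());
    Ok(())
}
```

Because the stream is read as raw bytes into a Vec<u8>, NUL bytes and invalid UTF-8 pass through untouched.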
Avoid obfuscating results
The less fatal, but more obvious, annoyance is the output being
prefixed by the string "Result: ". That’s fine for
interactive usage, but writing such non-data to the standard output
stream makes it harder to use with other tools, e.g. forcing scripts to
strip it off with tools like cut -d' ' -f2- (which drops everything up to the first space).
I’ve altered this behaviour, so the new version of
multibase only writes the requested data to stdout, with no
extra fluff.
The output of decode had another problem, since the
decoded data was being shown as a JSONesque array of decimal numbers
like [104, 101, 108, 108, 111]. This is much less useful
(and more
error-prone) than a direct representation as a string of bytes. So,
I changed that too!
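The difference between the two output styles can be sketched in Rust as follows. The old listing happens to look like Rust's debug formatting of a byte slice, which is what debug_listing reproduces; both function names are hypothetical, and this is an illustration rather than the code in rust-multibase.

```rust
use std::io::{self, Write};

/// Format bytes as a list of decimal numbers, like the old decode output.
fn debug_listing(data: &[u8]) -> String {
    format!("{:?}", data)
}

/// Write the bytes exactly as they are: no label, no reformatting, no newline.
fn write_raw(out: &mut impl Write, data: &[u8]) -> io::Result<()> {
    out.write_all(data)?;
    out.flush()
}

fn main() -> io::Result<()> {
    let decoded = b"hello";

    // Human-oriented chatter belongs on stderr, if anywhere.
    eprintln!("{}", debug_listing(decoded)); // [104, 101, 108, 108, 111]

    // Only the data itself goes to stdout.
    let stdout = io::stdout();
    let mut out = stdout.lock();
    write_raw(&mut out, decoded)
}
```

Using write_all on the byte slice also means the output does not need to be valid UTF-8, which matters once the decoded data is arbitrary binary.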
Here’s our running example with the new version:
$ printf 'hello' | multibase encode -b base16; echo
f68656c6c6f
$ printf 'f68656c6c6f' | multibase decode; echo
hello
Notice that I’ve added some calls to echo. That’s due to
the next change…
Avoiding extra characters
There’s one more way that the data on stdout differs from the literal encoded or decoded contents: the addition of a trailing newline.
Whilst this can make life a bit nicer in a terminal (ensuring our
prompt always appears at the start of its own line), I don’t think
that’s enough to justify altering the bytestring output of
decode. Then, if we don’t insert a trailing newline for
decode, it seems inconsistent to do so for
encode. Hence I’ve changed both modes to avoid it.
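Here is a small Rust sketch of why a stray newline matters, using a deliberately strict base16 codec written from scratch. The names encode_base16 and decode_base16 are made up, and the real multibase command may treat malformed input differently.

```rust
/// Encode bytes as multibase base16: the prefix 'f', then lowercase hex.
fn encode_base16(data: &[u8]) -> String {
    let mut out = String::from("f");
    for byte in data {
        out.push_str(&format!("{:02x}", byte));
    }
    out
}

/// Decode a multibase base16 string. Anything other than the prefix 'f'
/// followed by pairs of lowercase hex digits is rejected with None.
fn decode_base16(encoded: &str) -> Option<Vec<u8>> {
    let digits = encoded.strip_prefix('f')?;
    let all_hex = digits
        .bytes()
        .all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !all_hex || digits.len() % 2 != 0 {
        return None;
    }
    (0..digits.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&digits[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let encoded = encode_base16(b"hello");

    // Without extra characters, decoding undoes encoding exactly.
    assert_eq!(decode_base16(&encoded), Some(b"hello".to_vec()));

    // A newline appended to the encoder's output is not valid base16.
    assert_eq!(decode_base16(&format!("{}\n", encoded)), None);

    // A newline appended to decoded output would change the data itself.
    assert_ne!(b"hello\n".to_vec(), b"hello".to_vec());
}
```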
With this change, the encode and decode
commands are now true inverses; with data round-tripping correctly
between them:
$ printf 'hello' | multibase encode -b base16 | multibase decode; echo
hello
$ diff <(printf 'hello') \
<(printf 'hello' | multibase encode -b base16 | multibase decode)
$ echo "$?"
0
$ printf 'f68656c6c6f' | multibase decode | multibase encode -b base16; echo
f68656c6c6f
$ diff <(printf 'f68656c6c6f') \
<(printf 'f68656c6c6f' | multibase decode | multibase encode -b base16)
$ echo "$?"
0
Bonus: reproducible builds
I’ve also upstreamed a separate patch, which adds a
Cargo.lock file to the project repository. I’ve written
about “lock
files” before, but the current Rust ecosystem seems to be stuck with
them. In particular, the rustPlatform.buildRustPackage
function in Nixpkgs expects a lock file, so that its dependency-fetching
can be verified.
Going forward
I originally wrote these patches for multiformats/rust-multibase
to enable some experiments I’ve been doing with git on ipfs. Now that the
multibase command is more suited to scripting, I’m also
more inclined to use multibase elsewhere, rather than hard-coding a
choice like hex or base64.
One notable outstanding issue is that the multibase
command isn’t streaming: it reads all data, performs the required
encoding/decoding in memory, then dumps out the result. That’s not been
an issue in any of my use-cases so far, but I can certainly imagine it
being useful; though I’m not familiar-enough with Rust yet to know what
the best abstraction/implementation/pattern might be for that.
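As a starting point, here is one possible shape for a streaming encoder, sketched for base16 only. It is an assumption-laden illustration, not a proposal for the rust-multibase API: stream_encode_base16 is a hypothetical function built on the standard Read and Write traits. Base16 is the easy case, since every input byte maps to exactly two output characters; bases that pack bits across byte boundaries would need to carry leftover bits between chunks, and bases like base10 and base58btc, which treat the whole input as one large number, don't obviously fit this pattern at all.

```rust
use std::io::{self, Read, Write};

/// Stream `input` to `output` as multibase base16, one chunk at a time,
/// so memory use stays constant regardless of input size.
fn stream_encode_base16(mut input: impl Read, mut output: impl Write) -> io::Result<()> {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    output.write_all(b"f")?; // multibase prefix for base16

    let mut chunk = [0u8; 4096];
    loop {
        let count = match input.read(&mut chunk) {
            Ok(0) => break, // end of input
            Ok(count) => count,
            Err(error) if error.kind() == io::ErrorKind::Interrupted => continue,
            Err(error) => return Err(error),
        };
        // Each byte becomes two hex digits: high nibble, then low nibble.
        let mut hex = Vec::with_capacity(count * 2);
        for &byte in &chunk[..count] {
            hex.push(HEX[(byte >> 4) as usize]);
            hex.push(HEX[(byte & 0x0f) as usize]);
        }
        output.write_all(&hex)?;
    }
    output.flush()
}

fn main() -> io::Result<()> {
    // In-memory stand-ins; a real tool would pass io::stdin().lock()
    // and io::stdout().lock() instead.
    let mut encoded: Vec<u8> = Vec::new();
    stream_encode_base16(&b"hello"[..], &mut encoded)?;
    assert_eq!(encoded, b"f68656c6c6f".to_vec());
    Ok(())
}
```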