SHA on the fly
Disclaimer
There is a better, future-proof alternative to what I’m describing, which is to choose an appropriate hash algorithm. In particular, algorithms which allow “verified streaming”, like BLAKE3, do not need this sort of hack.
The approach described below is only useful for those who cannot choose which hash algorithm to use, either due to compatibility with other software (e.g. git only supports SHA), or the need to support existing hashes. It is also unnecessary if we already know the size of the data (which is the case for e.g. pieces of a torrent), since that can be used to limit the amount of data we will accept before verifying.
This software is also just a proof-of-concept, thrown together one evening. The idea is sound, as far as I know, but the implementation is not. It’s also wildly inefficient!
Introduction
Cryptographic hashes, like SHA, allow us to fetch data from untrusted systems, like mirrors or peer-to-peer networks, safe in the knowledge that, once the download has finished, we can check whether they sent what we actually asked for. This is used extensively by software like package managers, and is a key enabler for distributed version control systems.
However, this approach to verifying downloads still has a vulnerability that could be exploited by the sender: since verification is only performed after a download has finished, a malicious sender could produce arbitrary amounts of garbage before any verification takes place, wasting our time, bandwidth and disk space.
There is a simple way to avoid this problem, based directly on the way those SHA algorithms work; but I’ve never actually seen it implemented. Hence this demo.
How SHA works
At a high-level, the SHA1 and SHA2 algorithms work in the same way:
1. Initialise their state to a constant value (the “initialisation vector”)
2. Read a fixed-size block of input data (e.g. 64 bytes). If there isn’t that much input, add padding until we reach the correct size.
3. Combine the current state and the current block (using a “compressor” function), to get the new state.
4. If there’s more input remaining, go to (2)
5. The final state is the hash
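The loop above can be sketched in a few lines of Python. Note this is purely illustrative: the `compress` function here is a stand-in (hashlib’s SHA256 over state‖block), not the real SHA compression function, the IV is a made-up constant, and the padding is simplified; only the shape of the iteration matters.

```python
import hashlib

BLOCK_SIZE = 64  # SHA1 and SHA256 both consume 64-byte input blocks

IV = b"\x00" * 32  # stand-in initialisation vector (real SHA fixes specific words)

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in compressor: any fixed function of (state, block) will do
    # for illustration; real SHA uses a dedicated compression function.
    return hashlib.sha256(state + block).digest()

def pad(data: bytes) -> bytes:
    # Simplified: zero-pad to a multiple of BLOCK_SIZE. Real SHA appends
    # a 1 bit, zeros, and the message length in bits.
    remainder = len(data) % BLOCK_SIZE
    return data + b"\x00" * (BLOCK_SIZE - remainder) if remainder else data

def toy_hash(data: bytes) -> bytes:
    state = IV                                 # step 1: start from the IV
    padded = pad(data)
    for i in range(0, len(padded), BLOCK_SIZE):
        block = padded[i:i + BLOCK_SIZE]       # step 2: next fixed-size block
        state = compress(state, block)         # step 3: fold it into the state
    return state                               # step 5: the final state is the hash
```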
Hence the hash calculation involves two sequences: one containing blocks and one containing states. Before we fetch any data, we already know the initial state (it’s the initialisation vector, which is a constant) and we know the final state (that’s the hash we will verify against). As the sender gives us more blocks of data, we can save them to disk and also feed them into the hash algorithm to get more states.
Ideally we will eventually reach the expected final state, verifying that the data is correct. Alternatively, the sender might stop sending blocks before we reach the final state we expected, indicating that the data we received is either corrupt or incomplete. Another possibility is that the sender keeps sending more and more blocks, which we feed into the hash algorithm to get more and more states, without us ever seeing the final state. This third possibility is the most dangerous, since senders can generate random junk to send us at very low cost; whilst we save it all to disk, in the hope that the hash will one day reach the final state and the sender will finish.
Downloading in reverse
To mitigate this vulnerability, we have the sender do two things: send states alongside blocks, and send those sequences in reverse order. The sender will begin by sending the final block and second-to-last state (since we already know the final state: it’s the hash we expect). We can run one iteration of the hash algorithm on those values, which should give us a state that matches our expected hash; if not, we can disconnect immediately!
If they match, we save the block, receive the preceding block and state, run one iteration of the hash function on those, and verify that the resulting state matches the second-to-last state we already received and verified. We continue this process, each time verifying the received block/state pair against the previously-received state; until we receive a state which matches the initialisation vector, which proves we’ve reached the start of the data.
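The receiver’s side of that process can be sketched as follows, using the same kind of toy compressor and made-up IV as before (hashlib standing in for the real SHA compression function): hold an expected state, check each incoming block/state pair against it, and only finish successfully once the initialisation vector is reached.

```python
import hashlib

BLOCK_SIZE = 64
IV = b"\x00" * 32  # stand-in initialisation vector

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in for the real SHA compression function.
    return hashlib.sha256(state + block).digest()

def receive_reversed(pairs, expected_hash: bytes) -> bytes:
    """`pairs` yields (block, preceding_state), final block first.
    Each pair is verified as it arrives; returns the data in file order."""
    expected = expected_hash
    blocks = []
    for block, prev_state in pairs:
        if compress(prev_state, block) != expected:
            raise ValueError("state mismatch: disconnect immediately")
        blocks.append(block)      # verified, so safe to keep
        expected = prev_state     # next, verify the pair before this one
    if expected != IV:
        raise ValueError("stream ended before reaching the IV")
    return b"".join(reversed(blocks))
```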
Demo
I’ve made a little demo you can try, which runs entirely locally:
Download backwards-sha.tar.gz demo
Unpack that archive and run nix-build to produce a
backwards-sha-client and a
backwards-sha-server. They both speak HTTP over stdio; it’s
easy enough to pipe the client’s stdout to the server’s stdin, but we’ll
need a FIFO to plug the server back into the client:
mkfifo my-fifo
< my-fifo ./result/bin/backwards-sha-client sha256 <my-sha256> |
./result/bin/backwards-sha-server <my-file> > my-fifo
Replace <my-file> with the path to a file you’d
like the server to provide; multiple paths can be given, if you like.
The server program waits for a request and, depending on which sort
of hash was requested, calculates the SHA1 or SHA256 of the given
files. If the requested hash doesn’t match any of the given files, the
server sends a 404 Not Found response. Otherwise, the server sends a 200
OK response containing that file’s contents, in reverse order and paired
with the intermediate SHA states.
Replace <my-sha256> with the SHA256 hash (in hex)
of the file that you would like the client to request; or change the
first argument to sha1 to request a file by its SHA1 hash
instead. If the requested hash matches one of the server’s files, the
client will receive a 200 OK response from the server and save the data
to a file whose name is the given hash.
For example we can run the client with arguments sha256
and, say,
deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef.
If the server has a file with that hash, it sends a 200 OK response
containing the file contents; which the client will write to a file
called
deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
in the current working directory. The SHA256 of that file will be
deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef,
since it will be an exact copy of the server’s file.
Implementation details
I’ve tried to keep this demo simple and reasonably self-contained, so the client and server do their own HTTP parsing, which is just enough to run the demo but will likely fail if given extra headers, etc.!
As described above, file data is sent as pairs containing a block and a state. We encode these using bencode, since it’s a convenient format for streaming data structures that contain sequences and byte strings.
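For illustration, here’s a minimal sketch of what one block/state pair could look like in bencode (the demo’s exact wire framing may differ): byte strings are length-prefixed with a decimal count and a colon, and a list wraps the pair between `l` and `e`.

```python
def bencode_bytes(b: bytes) -> bytes:
    # Bencode byte string: "<decimal length>:<raw bytes>"
    return str(len(b)).encode() + b":" + b

def bencode_pair(block: bytes, state: bytes) -> bytes:
    # A two-element bencode list: "l" <block> <state> "e"
    return b"l" + bencode_bytes(block) + bencode_bytes(state) + b"e"
```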
We rely on the B-Con/crypto-algorithms repo for implementations of SHA1 and SHA256. Those are not particularly secure (e.g. against timing attacks), but they are straightforward and expose the algorithms in a way that lets us perform a single step with our own states.
Since those implementations are written in C, we expose the functionality we require via a few CLI helper programs:
sha1-validate and sha256-validate are used
by the client, to perform a single hash iteration on a block
and state given as (hex-encoded) commandline arguments.
sha1-step and sha256-step are used by the
server. They take a (hex-encoded) state as argument, and read
file data from stdin, hashing it until they reach the given
state; at which point, they emit the previous
block and state (again, hex-encoded).
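In Python pseudocode, a “step” program does something like the following (with a stand-in compressor and made-up IV, since hashlib doesn’t expose real SHA internals; the C helpers use the actual compression function):

```python
import hashlib

BLOCK_SIZE = 64
IV = b"\x00" * 32  # stand-in initialisation vector

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in for the real SHA compression function.
    return hashlib.sha256(state + block).digest()

def step(data: bytes, target: bytes):
    """Scan `data` from the start, hashing block by block, until one more
    iteration would produce `target`; return that (block, state) pair."""
    state = IV
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        nxt = compress(state, block)
        if nxt == target:
            return block, state
        state = nxt
    raise ValueError("target state never reached")
```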
The server and client are written in Haskell, since that’s my go-to language for implementing little stdio-processing CLI tools like these.
Efficiency
Note that the “step” programs take O(n) time to find a block/state pair; and the server runs them O(n) times to get all of the pairs. Hence the server takes O(n²) time! That’s fine for this proof of concept; but there are a bunch of optimisations a real implementation could do to avoid that. The most obvious would be using a buffer to store block/state pairs calculated along the way, so they’re already available once we hit the end.
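That buffering idea can be sketched like so (again with a toy compressor and made-up IV standing in for real SHA internals): a single O(n) forward pass records every pair, so the server never has to re-scan the file for each state.

```python
import hashlib

BLOCK_SIZE = 64
IV = b"\x00" * 32  # stand-in initialisation vector

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in for the real SHA compression function.
    return hashlib.sha256(state + block).digest()

def pairs_in_send_order(data: bytes):
    """One O(n) forward pass: buffer every (block, preceding_state) pair,
    then reverse them, ready to send last-block-first."""
    state = IV
    pairs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        pairs.append((block, state))
        state = compress(state, block)
    return list(reversed(pairs)), state  # final state is the file's hash
```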
We could also have the client calculate more than one iteration at a time: that way, the server wouldn’t have to send all of the states; it could, say, send one state for each MiB of block data, and the client would work out the intermediate states itself.
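Client-side, that amounts to something like this (a sketch, assuming both sides agree on the chunk boundaries; toy compressor as before):

```python
import hashlib

BLOCK_SIZE = 64

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in for the real SHA compression function.
    return hashlib.sha256(state + block).digest()

def verify_chunk(chunk: bytes, start_state: bytes, expected_state: bytes) -> bool:
    """Run the compressor over a whole chunk of blocks, checking it lands on
    the previously-verified state; the server only sends `start_state` once
    per chunk, rather than one state per block."""
    state = start_state
    for i in range(0, len(chunk), BLOCK_SIZE):
        state = compress(state, chunk[i:i + BLOCK_SIZE])
    return state == expected_state
```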
The way the client writes data could be improved too. It currently prepends to the temp file by writing the received block to a new file then appending the existing data on the end. That results in a lot of copying and filesystem operations; which could be reduced by e.g. pre-allocating a larger file (perhaps sparse, if the filesystem supports that), and back-filling blocks as they arrive.
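A sketch of that back-filling idea (the function name and layout are mine, not the demo’s): pre-allocate the full size up front, then seek each verified block into its final position as it arrives.

```python
def backfill(blocks_last_first, total_size: int, path: str) -> None:
    """Pre-allocate the output file (sparse, where the filesystem allows)
    and write each verified block into its final position as it arrives,
    last block first, instead of repeatedly copying the whole file."""
    with open(path, "wb") as f:
        f.truncate(total_size)           # allocate; sparse on most filesystems
        offset = total_size
        for block in blocks_last_first:  # blocks arrive in reverse order
            offset -= len(block)
            f.seek(offset)
            f.write(block)
```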
Conclusions
There are a lot of systems out there which rely on algorithms like SHA to verify data once it’s downloaded. Those systems could be adapted to also verify during the download, which would cut off malicious actors trying to DoS us with an endless stream of bogus data. Most importantly, this could be achieved using the existing hashes!
Motivations
Decentralising
The most common way to avoid this vulnerability seems to be centralisation: connecting to a handful of blessed servers, and trusting them not to DoS us. This is certainly the case for package managers (RPM, APT, Nix, Cabal, NPM, etc.), and even for supposedly “distributed” version control systems (which tend to rely on a single “forge”).
Such software already relies on countless hashes, especially SHA, scattered across “object stores”, “lock files”, and the like. Using those existing hashes to make that software slightly more secure would be great: it would reduce the amount of trust such software needs to place in the servers it connects to, and ease some of the pressure towards centralisation.
Compatibility
IPFS avoids this vulnerability by imposing a size limit on the blocks of data exchanged between peers. That second link is where I came across this “download in reverse” idea many years ago; and since there’s been more discussion recently on the Nix and IPFS issue, I wanted to double-check whether it would actually work.
The reason I think something along these lines would be good for a system like IPFS is that it already tries to be compatible with other systems; e.g. in its use of multihash and multicodec to represent and distinguish all sorts of existing encodings. Indeed, I already use Git and Nix with IPFS. Getting rid of IPFS’s block size limit, without requiring new hashes, would increase compatibility; and unlock further opportunities for integration.
Challenges
This approach to verified transfers can be viewed in a couple of ways: as a protocol, which clients and servers must implement; or as an encoding scheme for content, akin to Base64. I think it’s more like a protocol, since the encoding only has value during a transfer. For example, a dumb client which just saves all the incoming data to a file, for later “decoding”, would defeat the entire purpose of sending things in this way: the files would be bloated due to the state information; none of that extra information would be needed for verification (since we already have all the blocks); those blocks would need to be cut out and stitched back together in the right order; and, crucially, we would still be vulnerable to the DoS-like attacks that this scheme is meant to prevent!
In contrast, viewing this as a protocol tells us that such dumb clients are simply insecure implementations, which don’t comply with the requirement to verify blocks as they arrive.
Still, however we want to think of it, the challenge remains that an approach like this will only work if it is actually implemented by senders and receivers (whether clients, servers, or P2P). I am not suggesting anyone use my implementation, especially since cryptography and networking aren’t my strong suit. Rather, as per the McDonald’s Theory, I’m hoping that someone out there will be so disgusted by my obvious mistakes that they’ll feel duty-bound to sketch out a much saner, more secure and performant alternative!
Lessons
Finally, another conclusion we can draw from all of this is for those of us building systems that use hashes for verification. Our choice of algorithm will have ramifications for the entire life of our software so, before we default to picking SHA like everyone else, it’s worth asking whether particular features of other algorithms (even ones we’ve never heard of!) might be beneficial to support natively from the outset.
That way, you can avoid having idiots like me suggesting post-hoc hacks like this!