The Backwards Compatibility of Text

Posted on by Chris Warburton

I was watching some of Matt Parker’s Stand-up Maths videos recently, and a few of them involved quirks of the modern world which are work-arounds for problems that no longer exist. In particular, the staggered layout of computer keyboards (to stop the levers of a typewriter hitting each other), and the 29.97Hz framerate of NTSC video (to prevent colour information interfering with black-and-white receivers).

These reminded me of the rabbit-hole of complexity underlying one of the seemingly-simplest data formats we have on modern computers: plain text.

UTF-8 (modern text)

The standard, modern representation of text is UTF-8, which has some cool features:
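One feature that is easy to check for yourself is backwards compatibility with ASCII: any ASCII text is already valid UTF-8, byte for byte, while other characters use multi-byte sequences whose bytes all have the top bit set. The following is a minimal Python sketch of that property; the helper name `utf8_bits` is made up for illustration.

```python
def utf8_bits(text: str) -> list[tuple[str, str]]:
    """Pair each character with its UTF-8 bytes, written out in binary."""
    return [
        (ch, " ".join(f"{byte:08b}" for byte in ch.encode("utf-8")))
        for ch in text
    ]


# Pure ASCII text encodes to exactly the same bytes under both encodings.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")

# 'A' fits in one byte with a leading 0 bit; 'é' needs two bytes and
# '€' needs three, each starting with a 1 bit so they can't be mistaken
# for ASCII characters.
for ch, bits in utf8_bits("Aé€"):
    print(ch, bits)
```

The leading bits of the first byte (`0`, `110`, `1110`) say how many bytes the character occupies, and every continuation byte starts with `10`.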

ASCII (older Latin/“western” text)

ASCII encodes the Latin alphabet (A-Z, upper- and lower-case) as well as digits, some punctuation and “control codes”:
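To make the split between printable characters and control codes concrete, here is a small Python sketch. The table only lists a handful of well-known codes, and the names `CONTROL_CODES` and `describe` are illustrative rather than taken from any library.

```python
# A few ASCII control codes (not the full set of 33).
CONTROL_CODES = {
    0x07: "BEL (ring the bell)",
    0x08: "BS (backspace)",
    0x0A: "LF (line feed)",
    0x0D: "CR (carriage return)",
    0x7F: "DEL (delete)",
}


def describe(code: int) -> str:
    """Give a short description of a 7-bit ASCII code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII only defines codes 0 to 127")
    if code in CONTROL_CODES:
        return CONTROL_CODES[code]
    if code < 0x20:
        return f"control code 0x{code:02X}"
    return repr(chr(code))


# Upper- and lower-case letters differ by a single bit (0x20).
assert ord("a") ^ ord("A") == 0x20

# DEL is all seven bits set: on punched tape, that is every hole punched.
assert 0x7F == 0b1111111

for code in (0x07, 0x0D, 0x0A, ord("A"), ord("a"), 0x7F):
    print(f"0x{code:02X}", describe(code))
```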

There are some good videos about EBCDIC (and earlier BCD) on Computerphile. Those standards are more focused on (decimal) numerals and arithmetic, rather than ‘proper’ text, although they do descend from Hollerith tabulators, which Stand-up Maths also has a video on.

The Teletypewriter Interface (TTY)

The early use of teletypewriters & punched tape doesn’t just show up in modern data formats; it’s also pervasive in command-line user interfaces, modern operating systems and programming languages:

Video Extensions

Once video terminals and terminal emulators appeared, many extensions of ASCII followed, to do fancy things like bold text, colours, etc.
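The most widespread of these extensions are the ANSI escape sequences, which begin with the ASCII ESC control code (0x1B). As a rough sketch in Python (the helper names `sgr` and `styled` are made up here), bold and coloured text can be produced like this:

```python
ESC = "\x1b"  # the ASCII "escape" control code


def sgr(*codes: int) -> str:
    """Build a 'Select Graphic Rendition' sequence, e.g. ESC [ 1 ; 31 m."""
    return f"{ESC}[{';'.join(str(code) for code in codes)}m"


def styled(text: str, *codes: int) -> str:
    """Wrap text in the given attributes, then reset them with code 0."""
    return sgr(*codes) + text + sgr(0)


# 1 = bold, 31 = red foreground, 4 = underline, 32 = green foreground.
print(styled("bold red", 1, 31))
print(styled("underlined green", 4, 32))
```

A terminal that understands these sequences will show styled text; anything else (a file, or a very old terminal) just sees a few extra bytes in the stream.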

We can even send image data over a teletypewriter connection! The most popular format is Sixel, which actually consists of control codes for the pins of a dot-matrix print head.

Hence when we show a Sixel image in a terminal emulator, we’re using an emulated serial line, which itself is emulating a telegraph line, to connect to an emulated video terminal, which itself is emulating a teletypewriter, which happens to be equipped with a 6-pin dot-matrix print head and a bell!
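To see how directly Sixel maps onto those six pins, here is a minimal Python sketch of an encoder for a monochrome bitmap. It assumes a terminal emulator with Sixel support; the function name `sixel_encode` and the use of `#` for "pixel on" are choices made for this example. Each output character describes one column of six vertically stacked pixels: take the six on/off bits (topmost pixel in the lowest bit) and add 63 to get a printable ASCII character.

```python
def sixel_encode(rows: list[str]) -> str:
    """Encode a monochrome bitmap ('#' = pixel on) as a Sixel sequence."""
    width = max((len(row) for row in rows), default=0)
    bands = []
    # Work through the image in horizontal bands, six pixel rows at a time.
    for top in range(0, len(rows), 6):
        band = rows[top:top + 6]
        chars = []
        for x in range(width):
            bits = 0
            for i, row in enumerate(band):
                if x < len(row) and row[x] == "#":
                    bits |= 1 << i  # bit 0 is the topmost pixel of the band
            chars.append(chr(63 + bits))  # '?' = no pins, '~' = all six
        bands.append("".join(chars))
    return (
        "\x1bPq"               # DCS ... q: enter Sixel mode
        "#0;2;100;100;100"     # define colour 0 as RGB 100%, 100%, 100%
        + "-".join(bands)      # '-' moves down to the next band
        + "\x1b\\"             # ST: leave Sixel mode
    )


if __name__ == "__main__":
    # A 6x6 diagonal line; only visible on terminals that support Sixel.
    diagonal = ["#" + " " * 5, " #", "  #", "   #", "    #", "     #"]
    print(sixel_encode(diagonal))
```

Real encoders also use run-length compression and multiple colour registers, which this sketch leaves out.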