---
title: The Sorry State of the Web
---

Being a programmer, I have a bunch of scripts to automate everyday tasks. Some
of those scripts provide commandline interfaces for various Web sites, which I
use both directly and from within other scripts.

Recently, one of these sites has changed its pages such that the page's markup
doesn't occur in the result of a HTTP GET of the page's URL. Instead the
resulting markup contains Javascript, which inserts new elements into the page's
DOM when executed in a browser (e.g. Firefox).

To try and fix my scripts, I tried to replace their previous use of `wget`,
`curl`, `xidel`, etc. with PhantomJS, which I've used in other projects before.

Unfortunately, even with timeouts, and spoofing the user agent and window size,
the elements weren't being fetched and inserted into the DOM, which I could
verify by dumping out screenshots.

Since it works in Firefox, but not in PhantomJS, I tried the same script using
SlimerJS, which is backed by the same Gecko rendering engine as Firefox, but
that didn't work either.

Since the only working browser I could find was Firefox, I tried using Selenium,
but that seems to be very flaky regarding versions, and I couldn't get it to run
(Firefox would load, but Selenium wouldn't connect to it).

I made a brief attempt at using SikuliX and xautomation, I tried using Firefox's
"Save Page As" feature, but that only stores the original HTML without the
modifications.

Eventually, I cobbled together a script based on the following, which seems to
work:

 - Use `xvfb-run` to launch a second script, where all the following take place
 - Run `firefox` in the background, using `timeout` to kill the process after 30
   seconds. Firefox itself is launched in "safe mode" with the desired URL.
 - Use `xdotool` to send keyboard events which close the "safe mode" information
   message, open the developer console and wait long enough for the DOM to be
   updated.
 - Again, using `xdotool`, type a jQuery-based expression into the console, to
   extract the desired text. This is fed into a call to `prompt`, which causes
   the text to appear, selected, in an overlay.
 - One final call to `xdotool` presses `ctrl+c` to copy the text to the X
   clipboard.
 - `xsel` is used to paste the text to stdout, and `echo` adds a newline.
 - We kill the background `firefox` process using its process ID, stored from
   when it was invoked.

Needless to say, this is a very convoluted and fragile approach, and it speaks
to the sorry state of the Web as a medium for the exchange of hyperlinked
documents between a proliferation of applications on disparate machines.
