PDF bibliography tools
Intro
Every time I find a document online (mostly PDFs, but postscript, sometimes HTML, etc.) I save it to my Documents directory. This directory now contains many files, on many diverse topics; from type theory, to particle physics, to AI. Whenever I recall reading some particular fact, I can usually find the reference in that directory (usually via a simple grep).
The major problem with this approach is that it’s quite tedious to actually cite any of these files, since they don’t have associated bibliographic information. I keep a “master” BibTeX file called Documents/ArchivedPapers/Bibtex.bib, which I add entries to whenever the need arises, in the following way:
- Open the relevant document
- Enter some of its details (title, author, etc.) into a search engine like Google Scholar
- Copy the most likely-looking BibTeX entry
- Paste into Bibtex.bib
- Move that document file into ArchivedPapers
- Add a localfile key pointing to the document file
In fact, some of this is made a little smoother by KBibTeX, which combines a BibTeX editor, document viewer and search engine into one tool. KBibTeX is certainly nice to use as a viewer of the documents which are already in Bibtex.bib, but unfortunately it’s still sort of clunky to do the above kind of import procedure, since it necessarily involves viewing documents which aren’t in the database yet. It certainly makes a decent effort, with Dolphin and Okular built in, but requires an awful lot of context-switching between the different “panes”/tabs.
Recently I decided to automatically import as many of these documents as possible, to see how far I could get. This document describes the various approaches I’ve taken, as well as providing handy commandline snippets which I can use in the future.
Document Properties
Each document can be considered to have a bunch of properties, which can influence how easy or hard it is to import it automatically. Here are some I’ve come across:
- Filetype: I’m only considering PDFs for now, since postscript, HTML, etc. are few enough for me to import manually.
- Scanned: PDFs of old documents, eg. many from the 1960s and earlier, will be scans; essentially, one giant image. This is difficult to handle, since it doesn’t contain any machine-readable strings of text. Some documents may be converted via OCR (optical character recognition), although there may be mis-spellings, etc. in their results.
- Metadata: PDFs can contain metadata, like author and title, in a similar way to MP3s and JPEGs. If available, this can be extracted very easily (see the pdfinfo example after this list).
- DOIs: a digital object identifier (DOI) is a form of URI which uniquely identifies a document. If a document contains its DOI on the first couple of pages, it can be extracted easily.
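For example, embedded metadata can be checked from the commandline with pdfinfo (part of poppler-utils); the filename here is just a placeholder:
# Any Title/Author/Subject fields reported here can be copied straight into BibTeX
pdfinfo Documents/example.pdf | grep -E '^(Title|Author|Subject|CreationDate):'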
Approaches
Some of these may work for you straight away, some may require tweaking, some may prove hopeless. I’d give each a try, and move on if you have too many difficulties.
Zotero
Zotero is a bibliography manager, built around Mozilla’s XUL toolkit. Making it work on NixOS is a little tricky.
Zotero has a nice workflow for importing PDF files:
- Create a database
- Add links to the PDF files we wish to import
- Select those links
- Choose “Retrieve PDF metadata”
- Export the resulting BibTeX and copy into your real BibTeX file
This will extract metadata from the PDFs, search for it online (eg. using Google Scholar) and present any BibTeX it finds. There are two major problems with this approach:
- If there is no metadata to extract, it usually fails (it tries the filename, but this may be unhelpful)
- There seems to be a request limit for Google Scholar. Even after filling in some CAPTCHAs, I couldn’t get it to work for more than a couple of dozen files.
Docear
Docear is a rather bloated application for managing “projects”, which just so happen to contain bibliographies. Its reference management is built on JabRef, but seems to work better in my experience. Similar to Zotero, this can work well for getting the “low hanging fruit”, like PDFs with existing metadata.
I’ve made a quick and dirty Nix package for Docear.
Scholar.py
scholar.py is a script for querying Google Scholar from the commandline. Whilst not the most useful thing in the world on its own, it’s great for embedding into scripts. One thing to keep in mind when calling scholar.py is that it will crash if given non-ascii characters, so you should run your strings through iconv first to transliterate them.
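A minimal sketch of that, reusing the scholar.py invocation from the “Expanding Meagre Entries” section below (the example title and the ASCII target are arbitrary, and transliteration quality depends on your locale):
# Transliterate a title to plain ASCII before searching
TITLE='Gödel, Escher, Bach: An Eternal Golden Braid'
TASC=$(echo "$TITLE" | iconv -f utf-8 -t ascii//TRANSLIT)
scholar.py -c 1 -t -A "$TASC" --citation=bt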
I’ve made a little Nix package for scholar.py.
pdfmeat
pdfmeat is a Python script which tries to extract data from a PDF.
I’ve made a little Nix package for pdfmeat, although you’ll have to download the source yourself since Google Code was giving me inconsistent hashes. I’ve also packaged its dependencies translitcodec and subdist.
pdfssa4met
This bizarrely named tool is yet another Python script for handling PDF files. Once again, I’ve packaged it for Nix.
searchtobibtex
searchtobibtex is a collection of handy scripts for extracting information from PDF files. Be warned: it includes tools to destructively rename PDF files based on its results. I tend to avoid this, and instead just get the metadata printed out, which I can act on in a subsequent “phase”.
Here’s my Nix package for searchtobibtex, as well as one for bibclean which it depends on.
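For example, the basic usage relied on below is just to pass it a title string and read whatever BibTeX it finds (the title here is only an example):
# Prints a BibTeX entry, found via CrossRef, to stdout
searchtobibtex "A Theory of Type Polymorphism in Programming"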
pdf-extract
I came across pdf-extract during my travels, although haven’t tried it (or at least, I don’t appear to have made a Nix package for it).
Processes
These are roughly the steps I followed to import as many PDFs as possible. It hasn’t caught everything, but it’s saved me a lot of time compared to the purely manual approach.
Low Hanging Fruit
Send everything through Zotero and Docear, see what sticks. I don’t trust these tools with my real BibTeX database (it’s in git, but I don’t like wading through massive diffs), so I tend to import into a fresh database, then copy/paste the result over to the real one.
Embedded DOIs
TODO: This should give easy wins, but I couldn’t extract a single DOI from my documents for some reason.
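For reference, this is the kind of loop I’d expect to work (pdftotext is part of poppler-utils, and the DOI regexp is only a rough approximation):
for FILE in Documents/*.pdf
do
# Search the first couple of pages for something DOI-shaped
DOI=$(pdftotext -f 1 -l 2 "$FILE" - 2> /dev/null |
grep -oiE '10\.[0-9]{4,}/[^[:space:]]+' |
head -n1)
[[ -z "$DOI" ]] && continue
echo -e "$FILE\t$DOI"
done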
Extractable Titles
For those tricky PDFs which don’t contain metadata or DOIs, but do contain machine-readable text, I’ve found the following process to be useful (note, I’ve had to reconstruct some commands from memory; they may need tweaking):
Loop through all files, echoing out the file name and the output of headings.py --title applied to the file. headings.py is part of pdfssa4met, and will output titles in XML. Something like this:
for FILE in Documents/*.pdf
do
T=$(headings.py --title "$FILE")
[[ -z "$T" ]] && continue
echo "<file><name>$FILE</name>$T</file>"
done | tee TITLES
This will make a file TITLES associating PDF filenames with their titles (if found).
We can use xidel to loop through these, and send each title to searchtobibtex to find a reasonable looking BibTeX entry from CrossRef. We then echo out the file name, the extracted title and any BibTeX we found, with a bunch of sentinel strings sprinkled in (FILE:, TITLE:, BIB: and ENDBIB).
# This XPath3 expression will produce tab-separated pairs of titles (with newlines stripped out) and filenames
xidel - --extract-kind=xpath3 \
--extract "//*[text()='$F']/../(pdf/title/replace(text(), '\n', '') || ' ' || name)" < TITLES |
while read -r LINE
do
# Use cut to extract the elements of each pair
TITLE=$(echo "$LINE" | cut -f 1)
FILE=$(echo "$LINE" | cut -f 2)
# Look up some BibTeX for this title
BIB=$(searchtobibtex "$TITLE")
echo -e "FILE: $FILE\nTITLE: $TITLE\nBIB:\n$BIB\nENDBIB"
done | tee BIBOUT
Make a copy of the output (eg. BIBOUT2), and process it with Emacs macros:
- Remove any FILE and TITLE lines which aren’t followed by a BibTeX entry
- Remove all BibTeX content except for the title key. Use a different sentinel (eg. GOT) for these online titles, to disambiguate from the TITLE we extracted.
- Remove all newlines from BibTeX titles
- Make all titles (extracted and BibTeX) lowercase (C-x C-l)
- Remove all non-alphabetic characters from all titles (replace-regexp); a shell equivalent of this and the lowercasing step is sketched after this list
- Put each FILE/TITLE/GOT into one line, separated by tabs, eg. FILE foo.pdf TITLE atitle GOT atitle
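For reference, the lowercasing and stripping steps are easy to reproduce outside Emacs too; this tr pipeline is a rough shell equivalent of the same normalisation (normalise is just a throwaway helper name):
# Lowercase, then delete everything which isn't a letter
normalise() {
echo "$1" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alpha:]'
}
normalise "A Title: With Punctuation!"  # gives "atitlewithpunctuation"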
Loop through these lines in a shell, echoing out those filenames where the (normalised) extracted title matches the (normalised) BibTeX title. It’s pretty likely that these have been correctly extracted and looked up:
cat BIBOUT2 | while read -r LINE
do
FILE=$(echo "$LINE" | cut -f 2)
TITLE=$(echo "$LINE" | cut -f 4)
GOT=$(echo "$LINE" | cut -f 6)
[[ "x$TITLE" = "x$GOT" ]] && echo "$FILE"
done
In Emacs again, make another copy of the extracted BibTeX results (eg. BIBOUT3) and use a macro to:
- Search for ENDBIB
- Find the preceding FILE: sentinel
- Copy the filename
- Go forward to the ENDBIB
- Append a localfile = "..." entry to the BibTeX
Next, remove all FILE, TITLE, BIB and ENDBIB lines, just leaving the raw BibTeX.
Using an Emacs macro, do the following:
- Switch to a buffer of filenames which had matching extracted/searched titles
- Cut (kill, copy then remove, whatever) the first one
- Switch to the BibTeX entries annotated with localfile keys
- Go to the start of the buffer, then search for localfile = "<PASTE FILENAME HERE>"
- Copy the surrounding BibTeX entry
- Switch to a BibTeX file (either your main one, or a temporary file)
- Go to the end of the buffer and paste the BibTeX entry
Repeat this until you’ve got entries for all of the PDFs which had successfully extracted titles. I prefer to invoke this macro manually each time, rather than specifying some number of repetitions, so that I can give each BibTeX entry a quick glance for credibility (eg. Does it even look like BibTeX? Does the title sound familiar? Does it look like the kind of document I would have saved?).
Titles in File Names
For those which weren’t caught above, try this rather manual process. List all filenames twice, tab-separated, with this format (a sketch for generating the listing follows):
FILE foo.pdf TITLE foo
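A sketch for generating that listing; the fields are tab-separated, so they line up with the cut commands used further down (note that this clobbers the earlier TITLES file of XML output):
for FILE in Documents/*.pdf
do
NAME=$(basename "$FILE" .pdf)
echo -e "FILE\t$FILE\tTITLE\t$NAME"
done > TITLES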
Manually remove lines which don’t look like they’ll be useful search terms, ie. their name probably doesn’t resemble their title (eg. 002.pdf is probably not worth searching for). For the rest, perform a bunch of regex-replaces on the TITLE parts:
- Replace punctuation with spaces
- Prefix capital letters with a space (eg. FooBar becomes Foo Bar); the sketch after this list approximates this and the previous step
- Search for [A-Z] [A-Z] and fix broken initialisms (eg. turn A S T back into AST)
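The first two of those replacements can be approximated in one pass over the TITLE field; this is only a rough sketch, and fixing broken initialisms remains a manual job:
awk -F '\t' -v OFS='\t' '{
t = $4
gsub(/[[:punct:]]/, " ", t)  # replace punctuation with spaces
gsub(/[A-Z]/, " &", t)       # prefix capital letters with a space
gsub(/  +/, " ", t)          # squeeze the resulting runs of spaces
$4 = t
print
}' TITLES
Redirect the output over a fresh copy of TITLES, or pipe it straight into the loop below.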
Run the results through searchtobibtex:
cat TITLES | while read -r LINE
do
FILE=$(echo "$LINE" | cut -f 2)
TITLE=$(echo "$LINE" | cut -f 4)
BIB=$(searchtobibtex "$TITLE")
echo -e "FILE: $FILE\nTITLE: $TITLE\nBIB\n$BIB\nENDBIB"
done
Inspect the results to remove those which are clearly unrelated, then use a macro to add localfile keys containing the file names. Remove the sentinel values, and append to the main BibTeX file as above.
Manual Title Entry
This is pretty horrible, so it’s a bit of a last resort. Loop through all file names and wrap them in a BibTeX entry containing a localfile
field. In my case, I used the key zzzz$MD5
where $MD5
is the file’s MD5 hash.
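A sketch of generating those stub entries, looping over whichever PDFs are still left unimported (md5sum comes from coreutils; I append straight onto the main file here, though a temporary file may be safer):
for FILE in Documents/*.pdf
do
MD5=$(md5sum "$FILE" | cut -d ' ' -f 1)
# One stub entry per file: key zzzz$MD5, with a localfile field pointing at the PDF
echo -e "@misc{zzzz$MD5,\n\tlocalfile = \"$FILE\"\n}\n"
done >> ArchivedPapers/Bibtex.bib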
With this in place, we can use a little Emacs Lisp to loop through the entries: for each one it opens the PDF, queries for the title, and inserts it into the BibTeX entry.
(defun take-name ()
  (interactive)
  ;; Find the next localfile key which doesn't have a title
  (re-search-forward "^@misc{zzzz.*,[\n][\t]localfile = \"[^\"]*\"[\n]")
  (forward-line -1)
  (beginning-of-line)
  (re-search-forward "localfile = ")
  ;; Copy the contents
  (forward-char) ;; Past "
  (set-mark (point))
  (end-of-line)
  (backward-char) ;; Past "
  ;; Open the file in a temporary doc-view buffer
  (let ((selection (buffer-substring-no-properties (mark) (point)))
        (title     ""))
    (with-temp-buffer
      (insert-file-contents selection)
      (doc-view-mode)
      (switch-to-buffer (current-buffer))
      (sit-for 2)
      (doc-view-fit-width-to-window)
      ;; Query for the title
      (setq title (read-from-minibuffer "Title: ")))
    ;; Insert a title key into the BibTeX
    (end-of-line)
    (insert ",\n\ttitle = \"")
    (insert title)
    (insert "\"")))
This at least saves us from having to click around GUIs to do our data entry. At the end we will have a bunch of BibTeX entries with title and localfile fields.
Expanding Meagre Entries
If all you have is a file and a title, you can look it up on Google Scholar to get more info. Note that this will give up after a few dozen, probably due to rate limiting; just try again after a while:
# Grep for the pattern which we used for our keys, getting the line numbers
grep -n "@misc{zzzz" ArchivedPapers/Bibtex.bib |
cut -d ':' -f 1 |
while read -r NUM
do
# Extract the BibTeX entry beginning at each line number (should be 4 lines: the opening/closing braces, the localfile and the title)
CTX=$(( NUM + 3 ))
ENTRY=$(head -n"$CTX" ArchivedPapers/Bibtex.bib | tail -n4)
# Get the filename and title fields
FILE=$(echo "$ENTRY" | grep localfile | sed -e 's/.*localfile = "\(.*\)".*/\1/g')
TITLE=$(echo "$ENTRY" | grep "title =" | sed -e 's/.*title = "\(.*\)".*/\1/g')
# Transliterate the title to ASCII
TASC=$(echo "$TITLE" | iconv -f utf-8 -t iso8859-1//TRANSLIT)
# Try to look up BibTeX on Google Scholar
BIB=$(scholar.py -c 1 -t -A "$TASC" --citation=bt)
# Skip if we got nothing
[[ -z "$BIB" ]] && continue
# Output if we got something
echo -e "FILE: $FILE\n$BIB"
done | tee -a SCHOLAROUT
Alternatively, you can search CrossRef, which gives less accurate results but isn’t rate-limited:
grep -n "@misc{zzzz" ArchivedPapers/Bibtex.bib |
cut -d ':' -f 1 |
while read -r NUM
do
CTX=$(( NUM + 3 ))
ENTRY=$(head -n"$CTX" ArchivedPapers/Bibtex.bib | tail -n4)
FILE=$(echo "$ENTRY" | grep localfile | sed -e 's/.*localfile = "\(.*\)".*/\1/g')
TITLE=$(echo "$ENTRY" | grep "title =" | sed -e 's/.*title = "\(.*\)".*/\1/g')
TASC=$(echo "$TITLE" | iconv -f utf-8 -t iso8859-1//TRANSLIT)
# This is the only difference to the above
BIB=$(searchtobibtex "$TASC")
[[ -z "$BIB" ]] && continue
echo -e "FILE: $FILE\n$BIB"; done | tee -a SCHOLAROUT