Monday, August 10, 2015

Demo of full-text indexing of BHL using CouchDB hosted by Cloudant

One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results. In fact, in many cases when I Google a phrase that occurs in BHL content I'm more likely to find that phrase in content from the Internet Archive, and then it's a matter of following the links to the equivalent item in BHL.

So, as an experiment I've created a live demo of what full-text search in BHL could look like. I've done this using the same infrastructure the new BioStor is built on, namely CouchDB hosted by Cloudant. Using BHL's API I've grabbed some volumes of the British Ornithological Club's Bulletin and put them into CouchDB (BHL's API serves up JSON, so this is pretty straightforward to do). I've added the OCR text for each page, and asked Cloudant to index that. This means that we can now search on a phrase in BHL (in the British Ornithological Club's Bulletin) and get a result.

I've made a quick and dirty demo of this approach and you can see it in the "Labs" section on BioStor, so you can try it here. You should see something like this:

Bhl couchdb The page image only appears if you click on the blue labels for the page. None of this is robust or optimised, but it is a workable proof-of-concept of how fill-text search could work.

What could we do with this? Well, all sorts of searches are no possible. We can search for museum specimen codes, such as 1900.2.27.13. This specimen is in GBIF (see http://bionames.org/~rpage/material-examined/www/?code=BMNH%201900.2.27.13) so we could imagine starting to link specimens to the scientific literature about that specimen. We can also search for locations (such as Mt. Albert Edward), or common names (such as crocodile).

Note that I've not completed uploading all the page text and XML. Once I do I'll have a better idea of how scalable this approach is. But the idea of having full-text search across all of BHL (or, at least the core taxonomic journals) is tantalising.

Technical details

Initially I simply displayed a list of the pages that matched the search term, together with a fragment of text with the search term highlighted. Cloudant's version of CouchDB provides these highlights, and a "group_field" that enabled me to group together pages from the same BHL "item" (roughly corresponding to a volume of a journal).

This was a nice start, but I really wanted to display the hits on the actual BHL page. To do this I grabbed the DjVu XML for each BHL page for British Ornithological Club's Bulletin, and used a XSLT style-sheet that renders the OCR text on top of the page image. You can't see the text because it I set the colour of the text to "rgba(0, 0, 0, 0)" (see http://stackoverflow.com/a/10835846) and set the "overflow" style to "hidden". But the text is there, which means you can select with the mouse and copy and paste it. This still leaves the problem of highlighting the text that matches the search term. I originally wrote the code for this to handle species names, which comprise two words. So, each DIV in the HTML has a "data-one-word" and "data-two-words" attribute set, which contains the first (and forst plus second) word in the search term, respectively. I then use a JQuery selector to set the CSS of each DIV that has a "data-one-word" or "data-two-words" attribute that matches the search term(s). Obviously, this is terribly crude, and doesn't do well if you've more than two word sin your search query.

As an added feature, I use CSS to convert the BHL page scan to a black-and-white image (works in Webkit-based browsers).