Wednesday, November 28, 2012

ZooBank data model

I'm trying to get my head around the data model used by ZooBank to store taxonomic names. To do this, I've built a graph for the species Belonoperca pylei, described by Baldwin & Smith in:
Baldwin, C. C., & Smith, W. L. (1998). Belonoperca pylei, a new species of seabass (Teleostei: Serranidae: Epinephelinae: Diploprionini) from the Cook Islands with comments on relationships among diploprionins. Ichthyological Research, 45(4), 325–339. doi:10.1007/BF02725185

After extracting some data from the ZooBank API I created a DOT file connecting the various "taxon name usages" associated with Belonoperca pylei and constructed a graph using GraphViz:
[Image: GraphViz graph of the ZooBank taxon name usages for Belonoperca pylei]
You can grab the DOT file here, and a bigger version of the image is on Flickr. I've labelled taxon names and references with plain text as well as the UUIDs that serve as identifiers in ZooBank. (Update: the original diagram had Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer [9F53EF10-30EE-4445-A071-6112D998B09B] in the wrong place, which I've now fixed.)

This is a fairly simple case of a single species, but it's already starting to look a tad complicated. We have Belonoperca pylei Baldwin & Smith, 1998 linked to its original description (doi:10.1007/BF02725185) and to the genus Belonoperca Fowler & Bean, 1930 (linked to its original publication http://biostor.org/reference/105997) as interpreted by ("sensu") Baldwin & Smith, 1998. Belonoperca Fowler & Bean 1930 sensu Baldwin & Smith 1998 is linked to the original use of that genus (i.e., Belonoperca Fowler & Bean, 1930). Then we have the species Belonoperca pylei Baldwin & Smith, 1998 as understood in Eschmeyer's 2004 checklist.

Notice that each usage of a taxon name gets linked back to a previous usage, and names are linked to higher names in a taxonomic hierarchy. When the species Belonoperca pylei was described it was placed in the genus Belonoperca; when Belonoperca was described it was placed in the family Serranidae, and so on.
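
To make the two kinds of link concrete, here is a minimal sketch of the usages described above as plain JavaScript objects. The field names (reference, previousUsage, parent) and the originalUsage helper are my own invention for illustration, not ZooBank's schema, and the keys are readable labels rather than the UUIDs ZooBank actually uses.

```javascript
// Hypothetical structure: each key is a taxon name usage, with a link back to
// the previous usage of the same name and a link up to the parent name.
var usages = {
  'Belonoperca Fowler & Bean, 1930': {
    reference: 'http://biostor.org/reference/105997',
    previousUsage: null,      // this is the original use of the genus
    parent: 'Serranidae'      // plain label only; the family usage is not modelled here
  },
  'Belonoperca Fowler & Bean, 1930 sensu Baldwin & Smith, 1998': {
    reference: 'doi:10.1007/BF02725185',
    previousUsage: 'Belonoperca Fowler & Bean, 1930',
    parent: null              // not stated in this post
  },
  'Belonoperca pylei Baldwin & Smith, 1998': {
    reference: 'doi:10.1007/BF02725185',
    previousUsage: null,      // original description of the species
    parent: 'Belonoperca Fowler & Bean, 1930 sensu Baldwin & Smith, 1998'
  },
  'Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer': {
    reference: "Eschmeyer's 2004 checklist",
    previousUsage: 'Belonoperca pylei Baldwin & Smith, 1998',
    parent: null              // not stated in this post
  }
};

// Follow previousUsage links until we reach the original usage of a name.
function originalUsage(key) {
  var current = key;
  while (usages[current].previousUsage !== null) {
    current = usages[current].previousUsage;
  }
  return current;
}
```

Calling originalUsage on the Eschmeyer usage walks back to the 1998 description, and calling it on the species' parent walks back to Fowler & Bean's original use of the genus.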

Tuesday, November 27, 2012

Fuzzy matching taxonomic names using ngrams

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users. There are various ways to tackle this problem, and Tony Rees' TAXAMATCH is perhaps the best known solution.

Cloudant supports Lucene for full text searching, but while this allows some possibility for approximate matching (by appending "~" to the search string), initial experiments suggested it wasn't going to be terribly useful. What does seem to work is to use ngrams. As a crude example, here is a CouchDB design document whose search index converts a string (in this case a taxon name) to a series of trigrams (three-letter strings), then indexes their concatenation.


{
  "_id": "_design/taxonname",
  "language": "javascript",
  "indexes": {
    "all": {
      "index": "function(doc) { if (doc.docType == 'taxonName') { var n = doc.nameComplete.length; var ngrams = []; for (var i=0; i < n-2;i++) { var ngram = doc.nameComplete.charAt(i) + doc.nameComplete.charAt(i+1) + doc.nameComplete.charAt(i+2); ngrams.push(ngram); } if (n > 2) { ngrams.push('$' + doc.nameComplete.charAt(0) + doc.nameComplete.charAt(1)); ngrams.push(doc.nameComplete.charAt(n-2) + doc.nameComplete.charAt(n-1) + '$'); } ngrams.sort(); index(\"default\", ngrams.join(' '), {\"store\": \"yes\"}); } }"
    }
  }
}

To search this view for a name I then generate trigrams for the query string (e.g., "Pomatomix" becomes "$Po Pom oma mat ato tom omi mix ix$" where "$" signals the start or end of the string) and search on that. For example, append this string to the URL of the CouchDB database to search for "Pomatomix":


_design/taxonname/_search/all?q=$Po%20Pom%20oma%20mat%20ato%20tom%20omi%20mix%20ix$&include_docs=true&limit=10
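
Generating the query string by hand gets old quickly, so here is a rough sketch of how it could be done in JavaScript, mirroring the logic in the index function. The function names trigramsFor and searchPath are made up for this note, and I'm assuming a single-word name (names containing spaces would need more thought). Note that encodeURIComponent writes "$" as "%24", which should be equivalent to the literal "$" in the URL above.

```javascript
// Hypothetical helper: split a name into trigrams, with '$' marking the
// start and end of the string (same scheme as the index function).
function trigramsFor(name) {
  var n = name.length;
  var ngrams = [];
  if (n > 2) {
    ngrams.push('$' + name.slice(0, 2));   // start-of-string marker
  }
  for (var i = 0; i < n - 2; i++) {
    ngrams.push(name.slice(i, i + 3));     // sliding window of three characters
  }
  if (n > 2) {
    ngrams.push(name.slice(n - 2) + '$');  // end-of-string marker
  }
  return ngrams;
}

// Hypothetical helper: build the path to append to the database URL.
function searchPath(name, limit) {
  var q = encodeURIComponent(trigramsFor(name).join(' '));
  return '_design/taxonname/_search/all?q=' + q +
    '&include_docs=true&limit=' + limit;
}
```

For example, trigramsFor('Pomatomix').join(' ') gives "$Po Pom oma mat ato tom omi mix ix$", the same string as above. Unlike the index function I don't sort the trigrams here, since the order of terms shouldn't matter for the query.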


Initial results are promising (searching on bigrams generated an alarming number of matches that seemed rather dubious). I need to do some more work on this, but it might be a simple and quick way to support "did you mean?" for taxonomic names.

Thursday, November 22, 2012

BioStor in the cloud

Quick note on an experimental version of BioStor that is (mostly) hosted in the cloud. BioStor currently runs on a Mac Mini and uses MySQL as the database. For a number of reasons (it's running on a Mac Mini and my knowledge of optimising MySQL is limited) BioStor is struggling a bit. It's also gathered a lot of cruft as I've worked on ways to map article citations to the rather messy metadata in BHL.

So, I've started to play with a version that runs in the cloud using my favourite database, CouchDB. The data is hosted by Cloudant, which now provides full text search powered by Lucene. Essentially, I simply take article-level metadata from BioStor in BibJSON format and push that to Cloudant. I then wrote a simple wrapper around querying CouchDB, coupled that with the DocumentCloud Viewer to display articles and citeproc-js to format the citations (not exactly fun, but someone is bound to ask for them), and we have a simple, searchable database of literature.

If you want to try the cloud-based version go to http://biostor-cloud.pagodabox.com/ (code on Github).

I've been wanting to do this for a while, partly because this is how I will implement my entry in EOL's computational data challenge, but also because CrossRef's Metadata search shows the power of finding references simply by using full text search (I've shamelessly borrowed some of the interface styling from Karl Ward's code). David Shorthouse demonstrates what you can do using CrossRef's tool in his post Conference Tweets in the Age of Information Overconsumption. Given how much time I spend trying to parse taxonomic citations and match them to articles in CrossRef's database, or BioStor, I'm looking forward to making this easier.

There are two major limitations of this cloud version of BioStor (apart from the fact it has only a subset of the articles in BioStor). The first is that the page images are still being served from my Mac Mini, so they can be a bit slow to load. I've put the metadata and the search engine in the cloud, but not the images (we're talking a terabyte or two of bitmaps).

The other limitation is that there's no API. I hope to address this shortly, perhaps mimicking the CrossRef API so if one has code that talks to CrossRef it could just as easily talk to BioStor.

Wednesday, November 21, 2012

Species wait 21 years to be described - show me the data

Benoît Fontaine et al. recently published a study concluding that the average lag time between a species being discovered and subsequently described is 21 years.

Fontaine, B., Perrard, A., & Bouchet, P. (2012). 21 years of shelf life between discovery and description of new species. Current Biology, 22(22), R943–R944. doi:10.1016/j.cub.2012.10.029

The paper concludes:

With a biodiversity crisis that predicts massive extinctions and a shelf life that will continue to reach several decades, taxonomists will increasingly be describing from museum collections species that are already extinct in the wild, just as astronomers observe stars that vanished thousands of years ago.

This is a conclusion that merits more investigation, especially as the title of the paper suggests there is an appalling lack of efficiency (or resources) in the way we describe biodiversity. So, with interest I looked at the Supplemental Information for the data:

I was hoping to see the list of the 600 species chosen at random, the publication containing their original description, and the date of their first collection. Instead, all we have is a description of the methods for data collection and analysis. Where is the data? Without the data I have no way of exploring the conclusions or asking additional questions. For example, what is the distribution of date of specimen collection in each species? One could imagine situations where a number of specimens are recently collected, prompting recognition and description of a new species, and as part of that process rummaging through the collections turns up older, unrecognised members of that species. Indeed, if it takes a certain number of specimens to describe a species (people tend to frown upon descriptions based on single specimens) perhaps what we are seeing is the outcome of a sampling process where specimens of new species are rare, they take a while to accumulate in collections, and the distribution of collection dates will have a long tail.

These are the sort of questions we could ask if we had the data, but the authors don't provide that. The worrying thing is that we are seeing a number of high-visibility papers that potentially have major implications for how we view the field of taxonomy but which don't publish their data. Another recent example is:

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551–553. doi:10.1016/j.tree.2011.07.010

Biodiversity is a big data science; it's time we insisted on that data being made available.