Wednesday, January 21, 2009

Darwin on Twitter


Well, not Darwin himself, exactly. The Evolution Directory (better known as "EvolDir") is a mailing list run by Brian Golding at McMaster University, Ontario. It's widely used by evolutionary biologists to post announcements about jobs, courses, conferences, software, and other topics of interest to the community.

In this age of spam- and administrivia-clogged inboxes I find it hard to keep track of emails (and routinely ignore most), so it occurred to me that I'd pay more attention to EvolDir if the announcements were made over Twitter. Hence, I wrote a script that monitors my email account for EvolDir emails and posts any announcements to Twitter. You can follow EvolDir by going to http://twitter.com/evoldir.

One complication was working out a URL for individual EvolDir emails. To make my life simpler I created a simple web archive of each individual email. As a by-product of this there is now an RSS feed for EvolDir, available from http://bioguid.info/services/evoldir/, if RSS is your preferred means of consuming news. I've added the EvolDir RSS feed to the Systematic Biology web site, so that visitors to that site can see the latest announcements from the evolutionary biology community.

Sunday, January 18, 2009

Equivalent author names

One problem I've encountered in building a bibliographic database is the different ways author names are written. For example, for papers I've authored my name may be written as "Roderic D. M. Page" or "R. D. M. Page". Googling this problem, I came across Dror Feitelson's paper "On identifying name equivalences in digital libraries". Feitelson addresses the issue of matching first names:
The services provided by digital libraries can be much improved by correctly identifying variants of the same name. For example, this will allow for better retrieval of all the works by a certain author. We focus on variants caused by abbreviations of first names, and show that significant achievements are possible by simple lexical analysis and comparison of names. This is done in two steps: first a pairwise matching of names is performed, and then these are used to find cliques of equivalent names. However, these steps can each be performed in a variety of ways. We therefore conduct an experimental analysis using two real datasets to find which approaches actually work well in practice. Interestingly, this depends on the size of the repository, as larger repositories may have many more similar names.
Feitelson's solution is to construct a graph of similarity between first names, then find weighted cliques grouping equivalent names. For example, given the first names "Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", and "Abe B", we create the graph below where the edges are weighted by similarity between the names:

In this example, the names "Abe Bob C", "A B C", and "Abe B" are equivalent, as are "Ace D E" and "A D", leaving "Abe F G" by itself.
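
To make the grouping concrete, here is a rough Python sketch of the general idea. It is not Feitelson's weighted clique algorithm, and it is not the PHP/C++ code behind the service: the matching rule (tokens agree if they are identical or one is an initial of the other), the unweighted comparison, and the greedy grouping are all simplifying assumptions of mine, and the function names are made up for illustration.

```python
def tokens(name):
    """Split a first-name string into lowercase tokens, ignoring full stops."""
    return name.replace(".", " ").lower().split()


def tokens_match(a, b):
    """Two tokens match if they are identical, or one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def compatible(name1, name2):
    """Compare names position by position over the shorter of the two.

    Assumption: trailing tokens missing from the shorter name are ignored,
    so "Abe B" is compatible with "Abe Bob C.".
    """
    return all(tokens_match(a, b) for a, b in zip(tokens(name1), tokens(name2)))


def group_equivalent_names(names):
    """Greedily build groups in which every pair of names is compatible.

    Each group is a clique in the (unweighted) compatibility graph. The
    result can depend on input order, which is one reason the real
    algorithm uses edge weights.
    """
    groups = []
    for name in names:
        for group in groups:
            if all(compatible(name, member) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    example = ["Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", "Abe B"]
    for group in group_equivalent_names(example):
        print(group)
```

On the six names above, this simplified rule happens to give the same three groups as in the example. With larger sets of names, ambiguous initials will make the greedy grouping much less reliable than a weighted approach.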

I've implemented Feitelson's weighted clique algorithm as a PHP script that calls a C++ program to do the clique analysis. Results can be returned in HTML or JSON. You can try the service at http://bioguid.info/services/. You can also call the service directly by sending an HTTP POST request to the URL http://bioguid.info/services/equivalent.php with these parameters:



Parameter | Value | Description
names | string | List of first names, separated by the end-of-line (\n) character
format | html or json | Format of the results
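
As a sketch of what such a call might look like from Python, using only the standard library: the URL and the two parameter names come from the description above, but the helper names are hypothetical, I'm assuming the usual form-encoded POST body and a UTF-8 response, and I make no claims about the structure of the JSON that comes back.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SERVICE_URL = "http://bioguid.info/services/equivalent.php"


def build_request(first_names, fmt="json"):
    """Build (but do not send) the POST request for a list of first names.

    Assumption: the service accepts a standard form-encoded body.
    """
    body = urlencode({
        "names": "\n".join(first_names),  # one name per line
        "format": fmt,                    # "html" or "json"
    }).encode("utf-8")
    return Request(SERVICE_URL, data=body, method="POST")


def fetch_equivalents(first_names, fmt="json"):
    """Send the request and return the raw response text.

    Assumption: the service is reachable and replies in UTF-8.
    """
    with urlopen(build_request(first_names, fmt)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch_equivalents(["Abe Bob C.", "A. B. C.", "Abe B"]))
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what would be sent before hitting the service.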

Thursday, January 15, 2009

Wikis versus Scratchpads

Yes, I know this is ultimately a case of the "genius of and", but the more I play with the Semantic Mediawiki extension the more I think this is going to be the most productive way forward. I've had numerous conversations with Vince Smith about this. Vince and colleagues at the NHM have been doing a lot of work on "Scratchpads" -- Drupal-based web sites that tend to be taxon-focussed. My worry is that in the long term this is going to create lots of silos that some poor fool will have to aggregate together to do anything synthetic with. This makes inference difficult, and also raises issues of duplication (for example, in bibliographies).

I've avoided wikis for a while because of the reliance on plain text (i.e., little structure) (see this old post of mine on Semant), but Semantic Mediawiki provides a fairly simple way to structure information, and it also provides some basic inference. This makes it possible to create wiki pages that are largely populated by database queries, rather than requiring manual editing. For example, I have queries now that will automatically populate a page about a person with that person's publications, and any taxa named after that person. The actual wiki page itself has hardly any text (basically the name of the person). That is, nobody has to manually edit the wiki page to update lists of published papers. Similarly, maps can be generated in situ using queries that aggregate localities mentioned on a wiki page with localities for GenBank sequences and specimens. Very quickly relationships start to emerge without any manual intervention. The combination of templates and Semantic Mediawiki queries seems a pretty powerful way to aggregate information. There are, of course, limitations. The queries are fairly basic, and there's not the power of something like SPARQL, but it's a start. Coupled with the ease of editing to fix the errors in the contributing databases, and the ease of redirecting to handle multiple identifiers, I think a wiki-based approach has a lot of promise.

So, I've been teasing Vince that Drupal (or another CMS) is probably the wrong approach, and that semantic wikis are much more powerful (something Gregor Hagedorn has also been arguing). Vince would probably counter that the goal of Scratchpads is to move taxonomists into the digital age by providing them with a customisable platform on which to store and display their data; hence his mission is to capture data. My focus is more on aggregating and synthesising the large amount of data we already have (and are struggling to do anything exciting with). Hence, the "genius of and". However, I still worry that when we have a world with loads of Scratchpads with overlapping data, some poor fool will still have to merge them together to make sense of it all.