Thursday, December 17, 2015

Will JSON, NoSQL, and graph databases save the Semantic Web?

OK, so the title is pure click bait, but here's the thing. It seems to me that the Semantic Web as classically conceived (RDF/XML, SPARQL, triple stores) has had relatively little impact outside academia, whereas other technologies such as JSON, NoSQL (e.g., MongoDB, CouchDB) and graph databases (e.g., Neo4J) have got a lot of developer mindshare.

In biodiversity informatics the Semantic Web has been around for a while. We've been pumping out millions of RDF documents (mostly served by LSIDs) since 2005 and, to a first approximation, nothing has happened. I've repeatedly blogged about why I think this is (see this post for a summary).

I was an early fan of RDF and the Semantic Web, but soon decided that it was far more hassle than it was worth. The obsession with ontologies, the problems of globally unique identifiers based on HTTP (httpRange-14, anyone?), and the need to get a lot of ducks in a row all made it a colossal pain. Then I discovered the NoSQL document database CouchDB, which is a JSON store that features map-reduce views rather than on-the-fly queries. To somebody with a relational database background this is a bit of a headfuck.
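To make this concrete, here is a minimal sketch of what a CouchDB view looks like (the database and field names such as "year" and "title" are made up for illustration): you define the query up front as a JavaScript map function, CouchDB builds and maintains the index, and querying is then just an HTTP GET against that precomputed view.

```javascript
// Hypothetical design document: the "articles" database and its fields are
// invented for this sketch. CouchDB stores the map function as a string and
// incrementally builds an index from whatever emit() produces.
const designDoc = {
  _id: "_design/articles",
  views: {
    by_year: {
      map: "function (doc) { if (doc.year) { emit(doc.year, doc.title); } }",
      reduce: "_count"
    }
  }
};

// Querying the view is a plain HTTP request, e.g.
//   GET /mydb/_design/articles/_view/by_year?key=2005
// which returns JSON rows from the precomputed index.
```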

But CouchDB has a great interface, can be replicated to the cloud, and is FUN (how many times can you say that about a database?). So I started playing with CouchDB for small projects, then used it to build BioNames, and more recently moved BioStor to CouchDB hosted both locally and in the cloud.

Then there are graph databases such as Neo4J, which has some really cool features such as GraphGists, a playground where you can create interactive graphs and query them (here's an example I created). Once again, this is FUN.

Another big trend over the last decade is the flight from XML and its hideous complexities (albeit coupled with great power) to the simplicity of JSON (part of the rise of JavaScript). JSON makes it very easy to pass around data in simple key-value documents (with more complexity such as lists if you need them). Pretty much any modern API will serve you data in JSON.

So, what happened to RDF? Well, along with a plethora of formats (XML, triples, quads, etc., etc.) it adopted JSON in the form of JSON-LD (see JSON-LD and Why I Hate the Semantic Web for background). JSON-LD lets you have data in JSON (which both people and machines find easy to understand) and all the complexity/clarity of having the data clearly labelled using controlled vocabularies such as Dublin Core and schema.org. This complexity is shunted off into a "@context" variable where it can in many cases be safely ignored.
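As a made-up example (the identifier and values are invented, but the vocabularies are real), a simple bibliographic record in JSON-LD might look like this: the data itself is just key-value pairs, and the "@context" maps those keys onto Dublin Core and schema.org terms.

```javascript
// Hypothetical JSON-LD record: the @id and field values are invented.
// The @context maps the short keys onto Dublin Core (dc:) and schema.org
// terms, so the same document is both plain JSON and labelled linked data.
const record = {
  "@context": {
    "dc": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/"
  },
  "@id": "http://example.org/article/123",
  "@type": "schema:ScholarlyArticle",
  "dc:title": "A new species of example frog",
  "dc:creator": "A. N. Author",
  "schema:datePublished": "2015"
};
```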

But what I find really interesting is that instead of JSON-LD being a way to get developers interested in the rest of the Semantic Web stack (e.g. HTTP URIs as identifiers, SPARQL, and triple stores), it seems that what it is really going to do is enable well-described structured data to get access to all the cool things being developed around JSON. For example, we have document databases such as CouchDB which speak HTTP and JSON, and search servers such as ElasticSearch which make it easy to work with large datasets. There are also some cool things happening with graph databases and JavaScript, such as Hexastore (see also Weiss, C., Karras, P., & Bernstein, A. (2008, August 1). Hexastore. Proc. VLDB Endow. VLDB Endowment. http://doi.org/10.14778/1453856.1453965, PDF here), where we create the six possible indexes of the classic RDF [subject,predicate,object] triple (this is the sort of thing that can also be done in CouchDB). Hence we can have graph databases implemented in a web browser!
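As a toy sketch of the hexastore idea (this is not the Hexastore code itself, and the key layout is simplified for illustration), each triple is stored under all six orderings of subject, predicate and object, so any query pattern can be answered by a prefix scan over the appropriate index:

```javascript
// Toy hexastore-style index: store one triple under all six orderings.
// The in-memory key layout here is invented for illustration only.
function indexTriple(store, s, p, o) {
  const orderings = {
    spo: [s, p, o], sop: [s, o, p],
    pso: [p, s, o], pos: [p, o, s],
    osp: [o, s, p], ops: [o, p, s]
  };
  for (const [name, key] of Object.entries(orderings)) {
    store[name] = store[name] || {};
    store[name][key.join("|")] = true;
  }
}

// Usage: a pattern such as "all objects for a given subject and predicate"
// becomes a prefix scan of the "spo" index.
const store = {};
indexTriple(store, "http://example.org/taxon/1", "dc:title", "Example frog");
```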

So, when we see large-scale "Semantic Web" applications that actually exist and solve real problems, they may well be built on technologies other than the classic Semantic Web stack. As an example, see the following paper:

Szekely, P., Knoblock, C. A., Slepicka, J., Philpot, A., Singh, A., Yin, C., … Ferreira, L. (2015). Building and Using a Knowledge Graph to Combat Human Trafficking. The Semantic Web - ISWC 2015. Springer Science + Business Media. http://doi.org/10.1007/978-3-319-25010-6_12

There's a free PDF here, and a talk online. The researchers behind this project did extensive text mining, data cleaning and linking, creating a massive collection of JSON-LD documents. Rather than use a triple store and SPARQL, they indexed the JSON-LD using ElasticSearch (notice that they generated graphs for each of the entities they care about, in a sense denormalising the data).
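A minimal sketch of that approach (the index name, identifier and document are invented, and the exact URL depends on the Elasticsearch version): each JSON-LD document is simply PUT into an index over HTTP, after which it can be searched like any other JSON.

```javascript
// Hedged sketch: pushing a JSON-LD document into ElasticSearch over HTTP.
// The "knowledge-graph" index, the id "1" and the document are all made up;
// older Elasticsearch versions use a type name in the URL instead of _doc.
const doc = {
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Example advert",
  "telephone": "+1-555-0100"
};

fetch("http://localhost:9200/knowledge-graph/_doc/1", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(doc)
})
  .then(response => response.json())
  .then(result => console.log(result));
```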

I think this is likely to be the direction many large-scale projects are going to be going. Use the Semantic Web ideas of explicit vocabularies with HTTP URIs for definitions, encode the data in JSON-LD so it's readable by developers (no developers, no projects), then use some of the powerful (and fun) technologies that have been developed around semi-structured data. And if you have JSON-LD, then you get SEO for free by embedding that JSON-LD in your web pages.

In summary, if biodiversity informatics wants to play with the Semantic Web/linked data then it seems obvious that some combination of JSON-LD with NoSQL, graph databases, and search tools like ElasticSearch are the way to go.

Wednesday, December 09, 2015

Visualising the difference between two taxonomic classifications

It's a nice feeling when work that one did ages ago seems relevant again. Markus Döring has been working on a new backbone classification of all the species that occur in taxonomic checklists harvested by GBIF. After building a new classification the obvious question arises: "how does this compare to the previous GBIF classification?" A simple question; answering it, however, is a little tricky. It's relatively easy to compare two text files -- this functionality appears in places such as Wikipedia and GitHub -- but comparing trees is a little trickier. Ordering in trees is less meaningful than in text files, which have a single linear order. In other words, as text strings "(a,b,c)" and "(c,b,a)" are different, but as trees they are the same.

Classifications can be modelled as a particular kind of tree where (unlike, say, phylogenies) every node has a unique label. For example, the tips may be species and the internal nodes may be higher taxa such as genera, families, etc. So, what we need is a way of comparing two rooted, labelled trees and finding the differences. Turns out, this is exactly what Gabriel Valiente and I worked on in this paper doi:10.1186/1471-2105-6-208. The code for that paper (available on GitHub) computes an "edit script" that gives a set of operations to convert one fully labelled tree into another. So I brushed up my rusty C++ skills (I'm using "skills" loosely here) and wrote some code to take two trees and the edit script, and create a simple web page that shows the two trees and their differences. Below is a screen shot showing a comparison between the classification of whales in Mammal Species of the World and one from GBIF (you can see a live version here).

Treediff

The display uses colours to show whether a node has been deleted from the first tree, inserted into the second tree, or moved to a different position. Clicking on a node in one tree scrolls the corresponding node in the other tree (if it exists) into view. Most of the differences between the two trees are due to the absence of fossils from Mammal Species of the World, but there are other issues such as GBIF ignoring tribes, and a few taxa that are duplicated due to spelling typos, etc.
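To give a flavour of what such a comparison involves, here is an illustrative sketch (this is not the edit-script algorithm from the paper, and the taxa are chosen purely as examples): treating each classification as a map from a node's unique label to its parent's label, every node can be classified as deleted, inserted, or moved.

```javascript
// Illustrative sketch only, not the algorithm from doi:10.1186/1471-2105-6-208.
// Each classification is a map from a node's unique label to its parent's
// label; comparing the two maps labels nodes as deleted, inserted or moved.
function diffClassifications(before, after) {
  const deleted = [], inserted = [], moved = [];
  for (const label of Object.keys(before)) {
    if (!(label in after)) deleted.push(label);
    else if (before[label] !== after[label]) moved.push(label);
  }
  for (const label of Object.keys(after)) {
    if (!(label in before)) inserted.push(label);
  }
  return { deleted, inserted, moved };
}

// Example: a genus attached to a different parent shows up as "moved".
const before = { Balaenopteridae: "Cetacea", Balaenoptera: "Balaenopteridae" };
const after  = { Balaenopteridae: "Cetacea", Balaenoptera: "Cetacea" };
console.log(diffClassifications(before, after));
// -> { deleted: [], inserted: [], moved: ["Balaenoptera"] }
```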

Tuesday, December 01, 2015

Frontiers of biodiversity informatics and modelling species distributions #GBIFfrontiers @AMNH videos

For those of you who, like me, weren't at the "Frontiers Of Biodiversity Informatics and Modelling Species Distributions" symposium held at the AMNH in New York, here are the videos of the talks and panel discussion, which the organisers have kindly put up on Vimeo with the following description:

The Center for Biodiversity and Conservation (CBC) partnered with the Global Biodiversity Information Facility (GBIF) to host a special "Symposium and Panel Discussion: Frontiers Of Biodiversity Informatics and Modelling Species Distributions" at the American Museum of Natural History on November 4, 2015.

The event kicked off a working meeting of the GBIF Task Group on Data Fitness for Use in Distribution Modelling at the City College of New York on November 5-6. GBIF convened the Task Group to assess the state of the art in the field, to connect with the worldwide scientific and modelling communities, and to share a vision of how GBIF can support them in the coming decade.

The event successfully convened a broad, global audience of students and scientists to exchange ideas and visions on emerging frontiers of biodiversity informatics. Using inputs from the symposium and from a web survey of experts, the Data Fitness task group will prepare a report, which will be open for consultation and feedback at GBIF.org and on the GBIF Community Site in December 2015.

Guest post: 10 years of global biodiversity databases: are we there yet?

This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk has been delivered and the slides are available on the web for viewing if interested. Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion, since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?", which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species, so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not without some effort to chase them down and/or make educated guesses.

In the area of taxonomic names, for animals (sensu lato) ION has gone up from 1.8m to 5.2m names (2.8m to 3.5m indexed documents) from all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone up from 0.5m species names + ?? synonyms to 1.6m species names + 1.3m synonyms over the same period; for fossils, the BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small scale and without more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) at present, and showing few signs of current active development (unless I have missed them, of course).

Comprehensive databases of taxon traits (in machine-addressable form) appear to have started with the author's own "IRMNG" genus- and species-level compendium, which was initially tailored to OBIS's needs for simply differentiating organisms into extant vs. fossil and marine vs. nonmarine. More comprehensive indexes exist for specific groups and, recently, the Encyclopedia of Life has established "TraitBank", which is making some headway, although some of the "traits", such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences", stretch the concept of a trait a little (just my two cents' worth, of course), and the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made quite a lot of progress over the past ten years, OBIS increasing its data holdings ninefold from 5.6m to 44.9m (from 38 to 1,900+ data providers) and GBIF more than tenfold from 45m to 577m records over the same period, from 300+ to over 15k providers. While these figures look healthy there are still many data gaps in holdings e.g. by location sampled, year/season, ocean depth, distance to land etc. and it is probably a fair question to ask what is the real "end point" for such projects, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global / synoptic niche modelling projects known to the author basically comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species" but it is unclear whether this corresponds to the number of completed range maps available at this time, while AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes) each of which has a point data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat map" (the same unclipped) and a "year 2100 map" showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user using an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity plus the ability to undertake web-based "what [probably] lives here" queries for either all species or for particular groups. Between these two projects (which admittedly use different modelling methodologies but both should produce useful results as a first pass) the state of synoptic taxon modelling actually appears quite good, especially since there are ongoing workshops e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions at which further progress and stumbling blocks can be discussed.

So, some questions arising:

  • Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen? (my guess: a consortium of Catalogue of Life + PaleoBioDB at some future point)
  • Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
  • Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
  • Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
  • Same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and useable (and at what scale); or, what more needs to be done to increase map quality and reliability?

Opinions, other insights welcome!