Thursday, November 21, 2013

GBIF, GitHub, and taxonomy (again)

Quick notes on yet another attempt to marry the task of editing a taxonomic classification with versioning it in GitHub.

The idea of dumping the whole GBIF classification into GitHub as a series of nested folders looks untenable. So, maybe there's another way to tackle the problem.

Let's imagine that we dump, say, the GBIF classification down to family-level as a series of nested folders (i.e., we recreate the classification on disk). For each family we then create a bunch of files and store them in that folder. For example, we could have the classification in Darwin Core Archive format (basically, delimited text). Let's also create a graph that corresponds to that classification, using a format for which we have tools available for visualising and editing.

For example, I've created a Graph Modelling Language (GML) file for the Pinnotheridae here. Using software such as yEd I can load this file, display it, and edit it. For example, below is a compact tree layout of the graph:
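GML is a simple text format, so a file like this can be generated without any special tooling. Here's a minimal sketch in Python that turns a list of parent/child name pairs into GML (the node ids are arbitrary integers, and the names are a tiny fragment of the Pinnotheridae classification):

```python
def classification_to_gml(edges):
    """Convert a list of (parent, child) name pairs into GML text."""
    names = []
    for parent, child in edges:
        for name in (parent, child):
            if name not in names:
                names.append(name)
    ids = {name: i for i, name in enumerate(names)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines.append(f'  node [ id {i} label "{name}" ]')
    for parent, child in edges:
        lines.append(f"  edge [ source {ids[parent]} target {ids[child]} ]")
    lines.append("]")
    return "\n".join(lines)

# A tiny fragment of the Pinnotheridae classification (names from this post).
edges = [
    ("Pinnotheridae", "Glassella"),
    ("Pinnotheridae", "Glassellia"),
    ("Glassellia", "Glassellia costaricana"),
]
print(classification_to_gml(edges))
```

The output opens directly in yEd, which treats any `label` attribute as the node's display text.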


This image is a bitmap; if you opened the GML file in yEd it would be interactive, and you could zoom in, alter the layout, edit the graph, etc.

Looking at the graph there are a few oddities, such as "orphan" genera that lack any species, and some names that appear very similar. For example, there is an orphan genus Glassella, and a similar genus Glassellia (note the "i") with a single species Glassellia costaricana. A little digging in BioNames shows that Glassellia is a misspelling of Glassella. The original description appears in:

E Campos, M K Wicksten (1997) A New Genus For The Central American Crab Pinnixa costaricana Wicksten, 1982 (Crustacea: Brachyura: Pinnotheridae). Proceedings of the Biological Society of Washington 110(1): 69–73.
So, we have one genus that appears twice due to a typo. Furthermore, there are nodes in the graph for the taxa Glassellia costaricana and Pinnixa costaricana, but these are the same thing (the names are synonyms, albeit Glassellia costaricana has the genus misspelt). So, we could delete Pinnixa costaricana, delete the misspelling Glassellia, fix the misspelling in Glassellia costaricana, and move it to the correctly spelt Glassella. There are other problems with this classification, but let's leave them for the moment.
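The edits just described are simple graph operations. As an illustration, here is how they might look in Python if we represent the classification as a child-to-parent mapping (the names come from the post; the structure is a simplified fragment):

```python
# Classification fragment as a child -> parent mapping.
parent = {
    "Glassella": "Pinnotheridae",        # correctly spelt, but an orphan
    "Glassellia": "Pinnotheridae",       # misspelling of Glassella
    "Glassellia costaricana": "Glassellia",
    "Pinnixa": "Pinnotheridae",
    "Pinnixa costaricana": "Pinnixa",
}

# 1. Delete the synonym Pinnixa costaricana.
del parent["Pinnixa costaricana"]

# 2. Fix the spelling of the species and move it to the correct genus.
parent["Glassella costaricana"] = "Glassella"
del parent["Glassellia costaricana"]

# 3. Delete the misspelt genus, which is now childless.
del parent["Glassellia"]

print(parent)
```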

Now, imagine that after editing I use the graph to regenerate the DWCA file, which now has the edited classification. I then commit the changes to GitHub, and anyone else (including GBIF) could grab the DWCA and, for example, replace their Pinnotheridae classification with the edited version.

We could also go further, and add what I think is a missing component of the GBIF classification, namely a link to the nomenclators. For example, in an ideal world we would have each name in the classification linked to a stable identifier for the name provided by a nomenclator, and that nomenclator would know, for example, that Pinnixa costaricana and Glassella costaricana were objective synonyms. If we had those links then we could automatically detect cases such as this where logically you can have either Pinnixa costaricana or Glassella costaricana in the same classification, but not both.

There are some wrinkles to figure out. For example, it would be nice to compute the difference between the original and edited graphs in terms of graph operations (not simply the difference as text files) so we could do things like list nodes that have been moved or deleted. I did some work on this a while back (Page, R. D., & Valiente, G. (2005). BMC Bioinformatics, 6(1), 208. doi:10.1186/1471-2105-6-208); something like that tool might do the trick.
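As a crude first approximation, if both graphs are represented as child-to-parent mappings we can list deleted, added, and moved nodes with a few set operations (note that a rename shows up here as a delete plus an add; a real edit-script algorithm, such as the one in the 2005 paper, would do better):

```python
def diff_classifications(old, new):
    """Describe the difference between two child -> parent mappings
    as graph operations rather than as a textual diff."""
    deleted = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    moved = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return {"deleted": deleted, "added": added, "moved": moved}

# Before and after the Pinnotheridae edits (simplified fragment).
old = {"Pinnixa costaricana": "Pinnixa", "Glassellia costaricana": "Glassellia"}
new = {"Glassella costaricana": "Glassella"}
print(diff_classifications(old, new))
```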

There is an element here of trying to coerce a problem into a form that existing tools can solve, but in a way that's what makes it attractive. If we can use things that already exist then we can move from talking about it to actually doing it.

Wednesday, November 20, 2013

Reaction to taxonomic reactionaries

There is a fairly scathing editorial in Nature [The new zoo. (2013). Nature, 503(7476), 311–312. doi:10.1038/503311b] that reacts to a recent paper by Dubois et al.:

Dubois, A., Crochet, P.-A., Dickinson, E. C., Nemésio, A., Aescht, E., Bauer, A. M., Blagoderov, V., et al. (2013). Nomenclatural and taxonomic problems related to the electronic publication of new nomina and nomenclatural acts in zoology, with brief comments on optical discs and on the situation in botany. Zootaxa, 3735(1), 1. doi:10.11646/zootaxa.3735.1.1

To quote the editorial:

...there might be more than a disinterested concern for scientific integrity at work here. A typical reader of the Zootaxa paper (not that there are typical readers of a 94-page work on the minutiae of nomenclature protocol) might reasonably conclude that the authors have axes to grind. Exhibits A–E: the high degree of autocitation in the Zootaxa paper; the admission that some of the authors were against the ICZN amendments; that they clearly feel that their opinions regarding the amendments have been disregarded; the ad hominem attacks on ‘wealthy’ publishers as opposed to straitened natural-history societies; and the use of emotive and occasionally intemperate language that one does not associate with the usually dry and legalistic tone of debate on this subject. (The online publisher BioMed Central, based in London, gets a particular pasting, to which it has responded; see

One of many recommendations made in the diatribe is that journals should routinely have on their review boards those expert in the business of nomenclature — in other words, a cadre of people who are, unlike ordinary mortals, qualified to interpret the mystic strictures of the code. A typical reader is again entitled to ask whom, apart from themselves, the authors think might be suitable candidates.

Ouch! But Dubois et al.'s paper pretty much deserves this reaction - it's a reactionary rant that is breathtaking in its lack of perspective. From the abstract:

As shown by several examples discussed here, an electronic document can be modified while keeping the same DOI and publication date, which is not compatible with the requirements of zoological nomenclature. Therefore, another system of registration of electronic documents as permanent and inalterable will have to be devised.

So, we have an identifier system for publications which currently has 63,793,212 registered DOIs (see CrossRef), includes key journals such as Zootaxa and ZooKeys, and which has tools to support versioning of papers (see CrossMark) but hey, let's have our own unique system. After all, zoological nomenclature is special, and our community has such a good track record of maintaining our own identifier system (LSIDs anyone?).

Now that the financial crisis faced by the ICZN has been averted (for three years at least) by a bail-out from the National University of Singapore, maybe the guardians of scientific names can focus on providing tools and services of value to the broader scientific community (or, indeed, taxonomists). As it stands, the ICZN can say little about the majority of animal names. Much better to focus on that than trying to rail against the practices of modern publishing.

Monday, November 11, 2013

Names and nomenclators: just do it already!

Quick notes on taxonomic names (again). It's a continuing source of bafflement that the biodiversity community is making a dog's breakfast of names. It seems we are forever making it more complicated than it needs to be, forever minting new acronyms that pollute the landscape without actually contributing anything useful, and forever promising shiny new tools and services without ever actually delivering them. Meanwhile people and projects that build upon names are left to deal with a mess.

It seems to me that it would be nice if we had a single place to go to get definitive information on a name, and that place would give us a unique identifier that we could use in our own databases as a way to clean up and reconcile our data. For example, if we have a bibliographic database we can map citations to DOIs and then use those to identify the articles. If we have a list of journal names, we can map those to ISSNs and clean up our data. Likewise, if we have a classification such as GBIF or NCBI, we should be able to map the names in those classifications onto standard identifiers for taxonomic names.

The frustrating thing is we already have standard identifiers for taxonomic names. Since around 2005 we have been serving LSIDs for plant and animal names. We have Index Fungorum, IPNI, ION, and ZooBank, all serving LSIDs, all serving RDF, all using the same TDWG vocabulary.

The nomenclators vary in size and scope, but between them the three major groups of multicellular eukaryotes are covered (circles proportional to number of names in each database):

There is some duplication, both within nomenclators (IPNI and ION I'm looking at you) and between nomenclators (ION and ZooBank have the same scope, although ZooBank is dwarfed by ION, anyone care to explain why we have both...?). All four databases are actively growing, partly through direct registration of new taxonomic names.

So, we're basically done, right? Surely all we need to do is harvest the LSIDs for all these names, put them into a single triple store, and wrap some basic services around them? If the nomenclators provide a list of recent changes (e.g., as an RSS feed) then we could continuously update the store with new names. Then any database or classification could reconcile its names with those in the nomenclators. They could also then augment their own records by making use of additional data the nomenclators have, such as objective synonymies and links to original descriptions. In other words, we could have a model like this:
Classifications represent a view of how taxa are related, the names associated with those taxa are stored in nomenclators. This means that classification databases like GBIF and NCBI are not in the business of managing names, they simply link to the nomenclators (in the same way that a bibliographic database can link to DOI, ISSNs, and author ids such as ORCID and VIAF).
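To make the model concrete, here is a hypothetical sketch in Python: each name in a classification carries a nomenclator identifier (the example.org LSIDs below are made up), and the nomenclator's list of objective synonyms lets us flag a classification that contains both members of a pair:

```python
# Hypothetical: each taxon in a classification carries a nomenclator
# identifier, and the nomenclator tells us which identifiers are
# objective synonyms. All LSIDs below are invented for illustration.
classification = {
    "Pinnixa costaricana": "urn:lsid:example.org:names:1001",
    "Glassella costaricana": "urn:lsid:example.org:names:1002",
}
objective_synonyms = [
    ("urn:lsid:example.org:names:1001", "urn:lsid:example.org:names:1002"),
]

def synonym_conflicts(classification, synonym_pairs):
    """Flag cases where both members of an objective synonym pair
    appear in the same classification (logically only one should)."""
    ids = set(classification.values())
    return [pair for pair in synonym_pairs
            if pair[0] in ids and pair[1] in ids]

print(synonym_conflicts(classification, objective_synonyms))
```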

We have almost all of this infrastructure in place already. In one of the unsung triumphs of TDWG we have all the nomenclators serving data in the same format using the same technology. And yet we have signally failed to do anything useful with this extraordinary resource! Instead we seem more interested in contributing more projects to the acronym soup of biodiversity informatics. All around us projects to assign and link identifiers for publications (CrossRef), data (DataCite), and people (ORCID) are taking off. The infrastructure for taxonomic names has been in place since 2005, and we could be doing the same sort of things CrossRef, DataCite and ORCID are doing in their domains. Why aren't we?

Wednesday, November 06, 2013

ZooKeys, GBIF, and GitHub: fixing Darwin Core Archives part 2

Here's another example of a Darwin Core Archive that is "broken" such that GBIF is missing some information. GBIF data set A checklist to the wasps of Peru (Hymenoptera, Aculeata) comes from Pensoft, and corresponds to the paper:
Rasmussen, C., & Asenjo, A. (2009). A checklist to the wasps of Peru (Hymenoptera, Aculeata). ZooKeys, 15(0). doi:10.3897/zookeys.15.196

As with the previous example GBIF says there are 0 georeferenced records in this dataset. This is odd, because the ZooKeys page for this article lists three supplementary files, including KML files for Google Earth. I've used one to create the image below:

GoogleEarth Image

So, clearly there is georeferenced data here. Looking at the Darwin Core Archive (which I've put on GitHub) there are a bunch of issues with this data. The occurrence.txt file has decimal latitude and longitude values with a comma rather than a decimal point, the file has some character encoding issues, and the columns with latitude and longitude data are labelled as "verbatim" fields not "decimal" fields. All of this means GBIF lacks all the point data for this dataset (over 2000 records). If we fix these problems, we get a map like this:
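The decimal-separator fix is mechanical. Here is a sketch using Python's csv module on a two-row stand-in for occurrence.txt (the coordinate values are illustrative, not taken from the actual dataset):

```python
import csv
import io

# A two-row stand-in for the problematic occurrence.txt: tab-delimited,
# comma as decimal separator, columns labelled "verbatim" not "decimal".
raw = ("id\tverbatimLatitude\tverbatimLongitude\n"
       "1\t-12,046\t-77,043\n"
       "2\t-13,517\t-71,978\n")

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
fixed = []
for row in reader:
    fixed.append({
        "id": row["id"],
        # Relabel as Darwin Core decimal fields and fix the separator.
        "decimalLatitude": float(row["verbatimLatitude"].replace(",", ".")),
        "decimalLongitude": float(row["verbatimLongitude"].replace(",", ".")),
    })
print(fixed)
```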

This illustrates one problem with publishing data, namely that the data is rarely checked in the same way a manuscript is. "Peer review of data" is a phrase that always struck me as odd, because you can only really evaluate a data set by using it. In other words, data almost demands post- rather than pre-publication review. It's only when people start trying to use the data that problems emerge.

At the same time, we could improve checking of data prior to publication. In the case of the Darwin Core Archives I've looked at so far, it would be easier to find the problems if we had a simple tool that could take a Darwin Core Archive, extract the information, and display it in various ways. If, for example, we have georeferenced records but we don't get a map, we would immediately wonder why that was, and figure out what the problem was. At the moment it seems easy to send data to GBIF, thinking you are contributing important information, whereas in fact that information never makes it onto a GBIF map.

GBIF and Github: fixing broken Darwin Core Archives

Following on from Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite here's a quick and dirty example of using GitHub to help clean up a Darwin Core Archive.

The dataset 3i - Cicadellinae Database has 2,152 species and 4,749 taxa, but GBIF says it has no georeferenced data. As a result, the map for this dataset looks like this:

Gbif 3i

I downloaded the Darwin Core Archive and was puzzled because the occurrence.txt file contained in the archive has latitude and longitude pairs for some of the records. How come there is no map? After a bit of fussing I discovered that the meta.xml file that describes the data is broken. It lists a column which doesn't appear in the data file, so everything after that column gets shifted along and hence the column headings for latitude and longitude are out of alignment with the data.

So, I loaded the Darwin Core Archive into GitHub (you can see it here), then fixed the error, and then for fun extracted the latitude and longitude pairs as a GeoJSON file. GitHub can display this on a map:
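Generating the GeoJSON is straightforward; the main trap is that GeoJSON wants coordinates in [longitude, latitude] order. A minimal sketch, with made-up sample points:

```python
import json

# Latitude/longitude pairs pulled from occurrence.txt (sample values).
points = [(-12.05, -77.04), (4.60, -74.08)]

# GeoJSON coordinates are [longitude, latitude], not [latitude, longitude].
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {},
        }
        for lat, lon in points
    ],
}
print(json.dumps(geojson, indent=2))
```

Commit the resulting .geojson file to a repository and GitHub renders it as an interactive map.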

Note that we now have a fairly extensive set of georeferenced data points for these insects, and this data hasn't made it onto a GBIF map because of a simple error in the metadata. I keep finding cases like this, which suggests that GBIF has more georeferenced data than it realises.

Saturday, November 02, 2013

Catalogue of Life and LSIDs: a catalogue of fail

I have a love/hate relationship with the Catalogue of Life (CoL). On the one hand, it's an impressive achievement to have persuaded taxonomists to share names, and to bring those names together in one place. I suspect that Frank Bisby would feel that the social infrastructure he created is his lasting legacy. The social infrastructure is arguably more impressive than the informatics infrastructure, in particular, the Catalogue has consistently failed to support globally unique identifiers for its taxa.

If you visit the CoL web pages you will see Life Science Identifiers (LSIDs) for taxa, such as for the African elephant Loxodonta africana. The rationale for using LSIDs in CoL is explained in the following paper:
Jones, A. C., White, R. J., & Orme, E. R. (2011). Identifying and relating biological concepts in the Catalogue of Life. Journal of Biomedical Semantics, 2(1), 7. doi:10.1186/2041-1480-2-7
This paper describes the implementation in great detail, but this is all for nought as CoL LSIDs don't resolve. In fact, as far as I'm aware, CoL LSIDs have not resolved since 2009. Here is a major biodiversity informatics project that seems incapable of running an LSID service. These LSIDs are appearing in other projects (e.g., Darwin Core Archives harvested by GBIF), but they are non-functioning. Anyone using these LSIDs on the assumption that they are resolvable (or, indeed, that CoL cared enough about them to ensure they were resolvable) is sadly mistaken.

Jones et al. list some projects that use CoL LSIDs, including the Atlas of Living Australia (ALA). While I have seen CoL LSIDs used by ALA in the past, it now seems that they've abandoned them. Resolving an ALA LSID such as the one for Dromaius novaehollandiae (using, say, the TDWG resolver) we see that it corresponds to the record for Dromaius novaehollandiae in the 2011 edition of the Catalogue of Life. ALA have constructed their own LSID using an internal identifier from CoL. This is the very situation working CoL LSIDs should have made unnecessary. As Jones et al. note:
Prior to the introduction of LSIDs, the CoL was criticized for using identifiers which changed from year to year [29]. The internal identifiers have never been intended to be used in other systems linking to the CoL, of course, but this criticism draws attention to the demand for persistent identifiers that are designed for use by other systems. The CoL still does not guarantee to maintain the same internal identifiers, because there appears to be no need to insist on this as a requirement, but it does now provide persistent globally unique, publicly available identifiers.
That would be fine if, in fact, the identifiers were persistent. But they aren't. Because CoL have been either unable or unwilling to support their own LSIDs, ALA has had to program around that by minting their own LSIDs for CoL content! Note that these ALA LSIDs are tied to a specific version of CoL. Record 6847559 exists in the 2011 edition, but not the latest (2013), where Dromaius novaehollandiae now has a different internal identifier.

Versioning LSIDs

One of the features of LSIDs that has caused the most heartache is versioning. Just because this feature is there doesn't mean it is necessary to use it, and yet some LSID providers insist on versioning every LSID. CoL is such an example, so with every release the LSID for every taxon changes. In my opinion, versioning is one of the most discussed and most over-rated features of any identifier. Most people, I suspect, don't want a version, they want the latest version. They want to be able to have links that will always get them to the current version. This is how Wikipedia works, and this is how DOIs work (see CrossMark). In both cases you can see the existence of other versions, and go to them if needed. But by putting versions front and centre, and by not enabling the user to simply link to the latest version, CoL have made things more complicated than they need to be.

Changing LSIDs

It needs to be understood that in relation to concepts the Catalogue is intentionally not stable, so if a client is wishing to link to a name, not a concept, the client should use any LSID available for the name (or just the name itself), not a CoL-supplied taxon LSID. It should also be noted that it is intended that deprecated concepts will be accessible via their LSIDs in perpetuity, and the metadata retrieved will include information about the concepts’ relationships to relevant current concepts (such as inclusion, etc.). - Jones et al. p. 14
Leaving aside the fact that CoL clearly has a different notion of "perpetuity" to the rest of us, the notion that identifiers change when content changes is potentially problematic. If a taxonomic concept changes CoL will mint a new LSID. While I understand the logic, imagine if other databases did this. Imagine if the NCBI decided that because the African elephant was two species instead of one (see doi:10.1126/science.1059936), they should change the NCBI tax_id of Loxodonta africana (tax_id 9785, first used in 1993) because our notion of what "Loxodonta africana" meant has now changed. Imagine the chaos this could cause downstream to all the databases that build upon the NCBI taxonomy, which would now link to an identifier the NCBI had dropped. Instead, NCBI simply added a new identifier for Loxodonta cyclotis. Yes, this means the notion of "Loxodonta africana" may now be ambiguous (if it was sequenced before 2001, did the authors sequence Loxodonta africana or Loxodonta cyclotis?), but given the choice I suspect most could live with that ambiguity (as opposed to rebuilding databases).

But, even if we accept CoL's approach of changing LSIDs if the concept changes, surely concepts that don't change should always have the same LSID (except for changes in the version at the end)? Turns out, this is not always the case. For example, here are the CoL LSIDs for Loxodonta africana from 2008 to 2013:

The core part of the LSID (the UUID highlighted in bold) has changed twice. But in each of these releases of CoL there have only been two species of Loxodonta, L. africana and L. cyclotis. How is the 2008 concept of Loxodonta africana different from the 2010, or the 2011, concept?
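An LSID has the form urn:lsid:authority:namespace:object[:revision], so checking whether two versioned LSIDs refer to the same underlying record just means comparing the object part. A small sketch (the UUID and revision strings below are invented for illustration, not real CoL identifiers):

```python
def lsid_parts(lsid):
    """Split an LSID into (authority, namespace, object, revision);
    revision is None if the LSID is unversioned."""
    parts = lsid.split(":")
    assert parts[:2] == ["urn", "lsid"], "not an LSID"
    authority, namespace, obj = parts[2], parts[3], parts[4]
    revision = parts[5] if len(parts) > 5 else None
    return authority, namespace, obj, revision

# Hypothetical versioned LSIDs in the CoL style: if the concept hasn't
# changed between editions, the object part (a UUID) should be the same.
a = "urn:lsid:catalogueoflife.org:taxon:11111111-2222-3333-4444-555555555555:col2011"
b = "urn:lsid:catalogueoflife.org:taxon:11111111-2222-3333-4444-555555555555:col2012"
print(lsid_parts(a)[2] == lsid_parts(b)[2])  # same concept if True
```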


As we start to tackle issues such as data quality and annotation, having persistent, resolvable, globally unique identifiers will matter more than ever. Shared identifiers are the glue that helps us bind diverse data together. The tragedy of LSIDs is that they could have been this glue if our community had chosen to invest even a fraction of the effort CrossRef invested in DOIs. Unfortunately we are now left with web sites and databases littered with LSIDs that simply don't work (CoL is not the only offender in this regard).

Resolvable identifiers mean we can actually get information about the things identified, as well as serving as a litmus test of the credibility of a resource (if I give you a URL and the URL doesn't work, you may doubt the value of the information on the end of that link). In a networked world, the trustworthiness of a resource is closely bound to its ability to maintain identifiers. The Catalogue of Life fails this test.


Friday, November 01, 2013

Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite

This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data.


The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.


When GBIF first loads the repository it assigns it a DOI (using, say, DataCite). Actually we assign two DOIs, one for this version of the data (e.g., 10.1234/data.v1) and one for all versions of the data, say 10.1234/data. The data is considered to be published; authorship is determined by the provider, which may be an individual, a project, an institution, etc.

Big scale annotation and cleaning

Anyone familiar with GitHub can fork the repository of data and do their own cleaning (e.g., fixing dates, latitudes and longitudes, links to taxon names, etc.).

Small scale, casual annotation

Anyone visiting the GBIF portal and noticing an error (or something that they want to comment on) does so on the portal. Behind the scenes these comments are stored as issues on the GBIF repository in GitHub. To do this GBIF can either (a) enable users with an existing GitHub account to link that to their GBIF user account, or (b) create a GitHub account for the user. The user need not actually interact directly with GitHub (a similar approach is described by Mark Holder for the social curation of phylogenetic studies).
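Mechanically this is little more than a call to GitHub's issues endpoint (POST /repos/{owner}/{repo}/issues). A sketch, with a hypothetical repository name and token, that builds the request without sending it:

```python
import json
import urllib.request

def annotation_to_issue(owner, repo, token, occurrence_id, comment):
    """Build a GitHub API request that files a portal annotation as an
    issue on the dataset's repository. The repo name and token below are
    hypothetical; the endpoint is GitHub's standard issues endpoint."""
    payload = {
        "title": f"Annotation on occurrence {occurrence_id}",
        "body": comment,
    }
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

req = annotation_to_issue("gbif", "example-dwca", "TOKEN", "occ:123",
                          "Latitude and longitude appear transposed.")
print(req.full_url)  # send with urllib.request.urlopen(req)
```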

This means all annotation, big or small, is in the open and on GitHub. There is very little programming to do; GBIF simply talks to GitHub using GitHub's API. GBIF could display known "issues" for a dataset, so portal users immediately know if any data has been flagged as problematic.

All the annotations belong to the "community", in the sense that each annotation is linked to a GitHub user (even if the user might not ever actually go to GitHub). This also means that the provider can, at any point, pull in those annotations so they can update their own data (and hence gain direct benefit from exposing it in the first place).


When GBIF decides that enough annotations have been made and resolved, the latest version of the repository is loaded into GBIF and gets a new DOI (e.g., 10.1234/data.v2). This means an analysis based on that version is citable. We add a link to the overall DOI so someone who doesn't care about versions can still cite the data.

Authorship and credit

Now we come to the fun part. The revision will include the input from a bunch of people. This will be recorded on GitHub, but that will only mean something to the handful of geeks who think GitHub is awesome. But, let's imagine that we do the following:

  1. Anyone with a GBIF account can link that to their ORCID (if you are a researcher you really should have one of these).
  2. Anyone contributing to this version of the repository gets authorship (appended to the end of the list, so the original provider is first author).
  3. GBIF uses the ORCID API to automatically load the DOI of the new version of the dataset onto the list of works for each contributor. They instantly get credit as a co-author of a citable dataset, and this appears on their ORCID profile.

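Step 3 would mean constructing a "work" record for the ORCID member API. The sketch below is a deliberately simplified payload; the real ORCID message schema is richer, and the exact field names should be checked against the current API documentation:

```python
import json

def orcid_work_payload(doi, title):
    """A simplified sketch of the JSON one might POST to the ORCID
    member API to add the dataset DOI to a contributor's record.
    Field names follow the spirit of ORCID's message format but are
    trimmed for illustration and should be verified against the API."""
    return {
        "title": {"title": {"value": title}},
        "type": "data-set",
        "external-ids": {
            "external-id": [{
                "external-id-type": "doi",
                "external-id-value": doi,
                "external-id-relationship": "self",
            }]
        },
    }

payload = orcid_work_payload("10.1234/data.v2",
                             "A checklist to the wasps of Peru (version 2)")
print(json.dumps(payload, indent=2))
```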

This approach has a number of benefits:
  1. It creates citable data.
  2. It gives credit in a way many people will recognise (authorship of a citable work that has a DOI).
  3. The annotations are freely available, there is a complete version history, and anyone can contribute at whatever scale suits them.
  4. Anyone can grab the repo at any time and load it into their own system, including the original provider, who can see what people have added to their original data.
  5. There is virtually no programming to do, no new domain-specific protocols; everything is pretty much in place. GitHub does versioning, DataCite does citable identifiers, ORCID handles identity and credit.


There are a couple of potential issues. Darwin Core Archive data files can be large, and GitHub can be less effective with large files (although it is ideally suited to the delimited-text files that Darwin Core Archive uses, see Git (and Github) for Data). One approach is to impose a limit on the size of an individual "occurrence.txt" file in the archive, so we may have multiple files, none of which is too big. Another task will be linking issues to specific occurrences (if they concern just one occurrence), as the GitHub issues will be at the level of the complete file. This could be handled in a form-based interface on GBIF that sends the occurrenceID as part of the issue report.
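Splitting a large occurrence.txt into capped chunks is easy provided the header row is repeated in each chunk, so that every file stands alone. A sketch:

```python
def split_occurrence_file(lines, max_rows):
    """Split a large delimited occurrence file into several smaller
    files, repeating the header row in each chunk so that every
    chunk is a valid file on its own."""
    header, rows = lines[0], lines[1:]
    chunks = []
    for start in range(0, len(rows), max_rows):
        chunks.append([header] + rows[start:start + max_rows])
    return chunks

# Toy example: a header plus five data rows, capped at two rows per file.
lines = ["id\tdecimalLatitude\tdecimalLongitude"] + \
        [f"{i}\t0\t0" for i in range(5)]
chunks = split_occurrence_file(lines, max_rows=2)
print(len(chunks))  # 3 chunks of at most 2 data rows each
```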


The key point of this proposal is that everything is in place already to do this. The ducks are lining up, and serious, credible projects are handling the things we need (versioning, identifiers, credit). Sometimes the smart thing is to do nothing and wait until someone else solves the problems you face. I think the waiting may be over.