Sunday, June 10, 2007

Making taxonomic literature available online

Based on my recent experience developing an OpenURL service (described here, here, and here), linking it to a reference parser and AJAX tool (see David Shorthouse's description of how he did this), and my thoughts on XMP, maybe it's time to try to articulate how these pieces could be put together to make taxonomic literature more accessible.

Details below, but basically I think we could make major progress by:

  1. Creating an OpenURL service that knows about as many articles as possible, and has URLs for freely available digital versions of those articles (see the sketch after this list).

  2. Creating a database of articles that the OpenURL service can use.

  3. Creating tools to populate this database, such as bibliography parsers (http://bioguid.info/references is a simple start).

  4. Assigning GUIDs to references.
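
To make the first item concrete, here is a minimal sketch of the kind of lookup such a service would answer. The resolver address and the shape of its response are assumptions for illustration; the point is simply that four standard OpenURL fields are enough to ask "do you have a free copy of this article?".

```python
# Minimal sketch of querying an OpenURL resolver for a free copy of an article.
# The resolver URL and the form of its response are hypothetical.
import urllib.parse
import urllib.request

RESOLVER = "http://example.org/openurl"  # hypothetical resolver endpoint


def find_article(issn, year, volume, spage):
    """Ask the resolver whether it knows a freely available copy of an article."""
    params = {
        "genre": "article",  # standard OpenURL key/value pairs
        "issn": issn,
        "date": year,
        "volume": volume,
        "spage": spage,
    }
    url = RESOLVER + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read()  # e.g. a record containing the URL of a PDF


# The Nascimento et al. paper discussed below:
# find_article("0365-4508", "2005", "63", "297")
```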



Background

Probably the single biggest impediment to basic taxonomic research is lack of access to the literature. Why is it that when I'm looking at a species in a database or a search engine result, I can't click through to the original description of that species? (See my earlier grumble about this). Large-scale efforts like the Biodiversity Heritage Library (BHL) will help, but the current focus on old (pre-1923) literature will severely limit the utility of this project. Furthermore, BHL doesn't seem to have dealt with the GUID issue, or the findability issue (how do I know that BHL has a particular paper?).

My own view is that most of this stuff is quite straightforward to deal with, using existing technology and standards, such as OpenURL and SICIs. The major limitation is availability of content, but there is a lot of stuff out there, if we know where to look.

GUIDs

Publications need GUIDs, globally unique identifiers that we can use to identify papers. There are several kinds of GUID already being used, such as DOIs and Handles. As a general GUID for articles, I've been advocating SICIs.

For example, the Nascimento et al. paper I discussed in an earlier post on frog names has the SICI 0365-4508(2005)63<297>2.0.CO;2-2. This SICI comprises the ISSN of the serial (in this example 0365-4508 is the ISSN for Arquivos do Museu Nacional, Rio de Janeiro), the year of publication (2005), the volume (63), and the starting page (297), plus various other bits of administrivia such as a check character. For most articles this combination of four elements is enough to uniquely identify the article.
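
As a sketch of how mechanical this is, the core of a SICI can be assembled from just those four elements. The snippet below is a simplified illustration: it leaves out the trailing control segment and check character defined by the ANSI/NISO Z39.56 standard (the "2.0.CO;2-2" part of the example above).

```python
def make_sici_core(issn, year, volume, spage):
    """Assemble the item-identifying core of a SICI from four bibliographic elements.

    Simplified sketch: the control segment and check character required by
    ANSI/NISO Z39.56 (e.g. "2.0.CO;2-2" in the example above) are omitted.
    """
    return f"{issn}({year}){volume}<{spage}>"


# make_sici_core("0365-4508", "2005", "63", "297")
# -> "0365-4508(2005)63<297>"
```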

SICIs can be generated easily, and are free (unlike DOIs). They don't have a resolution mechanism, but one could add support for them to an OpenURL resolver. For more details on SICIs, I have some bookmarks on del.icio.us.
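
One way to do that, sketched below, would be to accept the SICI itself as the referent identifier in an OpenURL request, carried as an info URI (info:sici/...); the resolver address here is again hypothetical.

```python
import urllib.parse

RESOLVER = "http://example.org/openurl"  # hypothetical resolver endpoint


def sici_openurl(sici):
    """Build an OpenURL request that identifies an article by its SICI."""
    params = {
        "url_ver": "Z39.88-2004",       # OpenURL 1.0 key/value format
        "rft_id": "info:sici/" + sici,  # the SICI carried as an info URI
    }
    return RESOLVER + "?" + urllib.parse.urlencode(params)


# sici_openurl("0365-4508(2005)63<297>2.0.CO;2-2")
```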

Publisher archiving

A number of scientific societies and museums already have literature online, some of which I've made use of (e.g., the AMNH's Bulletins and Novitates, and the journal Psyche). My OpenURL service knows about some 8000 articles, based on a few days' work. But my sense is that there is much more out there. All this needs is some simple web crawling and parsing to build up a database that an OpenURL service can use.
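
To give a flavour of what that crawling might look like, here is a rough sketch. The index page URL and the HTML pattern are invented; in practice each publisher's site would need its own small scraper, but the output is the same in every case: rows of (ISSN, year, volume, starting page, URL) for the resolver's database.

```python
# Rough sketch of harvesting article records from a publisher's contents page.
# The URL and the HTML pattern are invented for illustration; each site would
# need its own scraper, but all of them feed the same database.
import re
import urllib.request

INDEX_URL = "http://example.org/journal/vol63/contents.html"  # hypothetical


def harvest(index_url, issn, year, volume):
    html = urllib.request.urlopen(index_url).read().decode("utf-8", "replace")
    records = []
    # Hypothetical markup: <a href="article297.pdf">pp. 297-320</a>
    for pdf_url, spage in re.findall(r'<a href="([^"]+\.pdf)">pp\. (\d+)', html):
        records.append({
            "issn": issn,
            "date": year,
            "volume": volume,
            "spage": spage,
            "url": pdf_url,
        })
    return records  # ready to load into the OpenURL service's database
```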

Personal and communal archiving

Another, complementary approach is for people to upload papers in their collection (for example, papers they have authored or have digitised). There are now sites that make this as easy as uploading photos. For example, Scribd is a Flickr-like site where you can upload, tag, and share documents. As a quick test, I uploaded Nascimento et al., which you can see here: http://www.scribd.com/doc/99567/Nascimento-p-297320. Scribd uses Macromedia Flashpaper to display documents.

Many academic institutions have their own archiving programs, such as ePRINTS (see for example ePrints at my own institution).

The trick here is to link these to an OpenURL service. Perhaps the taxonomic community should think about a service very like Scribd, one that also updates the OpenURL service every time a new article becomes available.
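
That coupling could be as simple as the upload step pushing the article's metadata and its new home to the resolver. The endpoint and payload below are made up for illustration, but they show how little information needs to change hands.

```python
# Sketch: after a paper is uploaded, push its metadata and new URL to the
# OpenURL service so it becomes findable immediately. The endpoint and the
# payload fields are hypothetical.
import json
import urllib.request

UPDATE_ENDPOINT = "http://example.org/openurl/update"  # hypothetical


def register_upload(issn, year, volume, spage, document_url):
    record = {
        "issn": issn,
        "date": year,
        "volume": volume,
        "spage": spage,
        "url": document_url,  # e.g. the Scribd page holding the uploaded PDF
    }
    request = urllib.request.Request(
        UPDATE_ENDPOINT,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # 200 if the resolver accepted the record
```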

Summary

I suspect that none of this is terribly hard: most of the issues have already been solved, and it's just a case of gluing the bits together. I also think it's a case of keeping things simple, and resisting the temptation to build large-scale portals, etc. It's a matter of simple services that can be easily used by lots of people. In this way, I think the imaginative way David Shorthouse made my reference parser and OpenURL service trivial for just about anybody to use is a model for how we will make progress.
