Wednesday, February 15, 2017

New feature for BioStor: extracting literature cited from OCR text

At present BioStor provides a simple display of an article extracted from BHL. You get the page images, and sometimes a map and an altmetric "donut". But we can do better than this. For example, I'm starting to experiment with displaying a list of literature cited by the article. Below is a screenshot of the article A remarkable new species of Homalomena (Araceae) from Peninsular Malaysia showing the two references this article cites:

Screenshot 2017 02 15 19 28 17

These references have been extracted using some simple regular expressions written in Javascript and wrapped up in a CouchDB view. They are extracted as simple text strings; I've not made any further attempt to parse each string into authors, title, journal, etc.
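
As a rough illustration of this kind of extraction (not the actual BioStor code, which is written in Javascript inside a CouchDB view), here is a minimal Python sketch. The heading pattern, the "Surname, I. ... year" pattern, and the function name extract_references are all assumptions made for the example; real OCR text is far messier.

```python
import re

# Hypothetical patterns: a section heading, and a reference that starts with
# "Surname, I." and contains a four-digit year somewhere on its first line.
HEADING = re.compile(r"^\s*(literature cited|references|bibliography)\s*$", re.IGNORECASE)
REFERENCE_START = re.compile(r"^[A-Z][A-Za-z'\-]+,\s+(?:[A-Z]\.\s*)+.*\b(?:1[6-9]|20)\d{2}\b")


def extract_references(ocr_text):
    """Return reference strings found after a 'References'-style heading."""
    references = []
    in_section = False
    for line in ocr_text.splitlines():
        line = line.strip()
        if HEADING.match(line):
            in_section = True
            continue
        if not in_section or not line:
            continue
        if REFERENCE_START.match(line):
            references.append(line)        # a new reference begins here
        elif references:
            references[-1] += " " + line   # continuation of the previous reference
    return references
```

Each reference comes back as one plain string, with wrapped lines joined together, which matches the "simple text strings" approach described above.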

Of course, what we really want is to be able to convert these strings into clickable links to the actual reference. In the spirit of "We don't need no stinkin' parser" (see also Resolving free-form citations) I've added a little search icon that when you click on it attempts to find the reference in BioStor. In the screenshot above we've found both references in BioStor.

Obvious next steps are to add other resolvers (such as CrossRef for DOIs), do the resolution before the references are displayed (rather than wait for the user to click on the search icon), and even more usefully, display a list of articles that cite each article in BioStor (in the example above, both cited articles should "know" that they have been cited).
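
The "cited by" list is just the inverse of the "cites" list. A minimal sketch, assuming the resolved citations are available as a dictionary from article identifier to the identifiers it cites (the structure and the name build_cited_by are hypothetical):

```python
from collections import defaultdict


def build_cited_by(cites):
    """Invert {article: [cited articles]} into {cited article: [citing articles]}."""
    cited_by = defaultdict(list)
    for article, cited in cites.items():
        for target in cited:
            if article not in cited_by[target]:   # ignore duplicate citations
                cited_by[target].append(article)
    return dict(cited_by)
```

In a CouchDB setting the same inversion could presumably be done with a view keyed on the cited article, but that is an assumption rather than something tested here.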

Whether an article in BioStor has a list of citations depends on the success of the regular expressions in extracting them, and whether the database has the OCR text. The current version of BioStor didn't originally store the OCR text, so I'm slowly adding that to the references. Other examples of articles with citations include Northeast African racers of the Platyceps rhodorachis complex (Reptilia: Squamata: Colubrinae) and Synopsis of the Neotropical mantid genus Pseudacanthops Saussure, 1870, with the description of three new species (Mantodea: Acanthopidae).

Long term, adding linked citations to BioStor means we get a step closer to being able to offer readers an experience like PubMed Central (PMC), where articles in PMC are linked to articles in PMC that either cite, or are cited by, that article. I think there's a case for a PubMed Central-like service for biodiversity literature (see Possible project: A PubMed Central for taxonomy) that rescues that literature from the ghetto much of it currently resides in, and instead makes it a first class citizen of the wider digital biodiversity landscape.

Saturday, January 14, 2017

Displaying taxonomic classifications from Wikidata using d3js and SPARQL

Sahelanthropus tchadensis TM 266 01 060 1

Following on from previous posts The Semantic Web made fun: d3sparql and The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor, I've put together an example query that can be used to extract a taxonomic classification from Wikidata. The query is inspired by the http://biohackathon.org/d3sparql/ example, and uses the Wikidata property P171 ("parent taxon"), which is a subproperty of rdfs:subClassOf (the property used in the d3sparql example, which queries the UniProt taxonomy).

The following SPARQL query generates a list of nodes in the tree representing the classification of Hominini (humans, chimps, and their extinct relatives):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?root_name ?parent_name ?child_name WHERE
{
 VALUES ?root_name {"Hominini"}
 ?root wdt:P225 ?root_name .
 ?child wdt:P171+ ?root .
 ?child wdt:P171 ?parent .
 ?child wdt:P225 ?child_name .
 ?parent wdt:P225 ?parent_name .
}

Using https://query.wikidata.org/sparql as the endpoint, in http://biohackathon.org/d3sparql/ this generates the following diagram:

Screenshot 2017 01 14 11 41 55

There are some obvious issues with this classification, such as genera that lack descendant species (e.g., Cyphanthropus). Indeed, we could imagine developing SPARQL queries to flag up such errors (see A use case for RDF in taxonomy). But the availability and accessibility of Wikidata and its SPARQL interface make it a great playground in which to test how useful SPARQL is for exploring taxonomic data.
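
The same check can also be sketched outside SPARQL, on the parent-child rows the query returns. This is a heuristic only: it assumes a one-word name is a genus or higher taxon, whereas a more robust version would use the taxon rank recorded in Wikidata (property P105). The function name and sample rows are made up for illustration.

```python
def leaves_without_species(rows):
    """Flag leaf nodes with one-word names (likely genera with no species below them).

    rows is a list of (parent_name, child_name) pairs, as returned by the query above.
    """
    parents = {parent for parent, _ in rows}
    children = {child for _, child in rows}
    leaves = children - parents            # nodes that are never anyone's parent
    return sorted(name for name in leaves if len(name.split()) == 1)


# Illustrative rows only, not real query output.
rows = [
    ("Hominini", "Homo"),
    ("Homo", "Homo sapiens"),
    ("Hominini", "Cyphanthropus"),
]
print(leaves_without_species(rows))
```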

Wednesday, January 11, 2017

The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor

I've added an experimental feature to BioStor that uses data from Wikidata and Wikispecies to augment the information BioStor displays about authors. This is a crude first step towards the goal of representing all the data in BioStor as a "knowledge graph" where articles, journals, and authors are all treated as entities, all have identifiers, and we can explore relationships between those entities (e.g., citation, co-authorship, etc.). At the moment this is true of articles, which have BioStor URLs (and in many cases DOIs), and for most journals, which are identified by their ISSN. Using identifiers helps reduce ambiguity, especially if there are multiple ways to represent the same thing (e.g., all the alternative ways to write a journal name can be circumvented by using the journal's ISSN).

However, BioStor doesn't have a way to identify authors beyond simply searching for a name. As a first step to tackling this problem I've added a little widget that displays information about an author based on the name you are searching for. For example, searching for George Albert Boulenger will give you a list of publications where the author name is "George Albert Boulenger", as well as a picture of the author and some identifiers (from sources such as VIAF, ISNI, IPNI, and Wikidata):

Screenshot 2017 01 11 16 30 57

For now this widget is independent of the data in BioStor. I don't link an article to its author(s) using identifiers for those authors. Nor have I tackled the problem of clustering all the variations in people's names together into one set of names that share the same identifier (see Equivalent author names), and I don't attempt to match names to identifiers (see Reconciling author names using Open Refine and VIAF) other than by an exact text search (for details see below). At this stage I just want to get a sense of what identifiers exist for an author, and what I can learn from those identifiers. I also want to explore the potential of Wikispecies as a source of data on people and publications, and how this relates to Wikidata (for earlier thoughts on using Wikipedia for the same goal see Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library).

Wikispecies

I confess I've never really "got" Wikispecies (e.g., Wikispecies is not a database). It seems to exist in isolation from Wikipedia, which is arguably more informative about many species. But there are a couple of things Wikispecies does very well. Firstly, it is building a rich, crowd-sourced bibliography of papers on the taxonomy of many different species. Readers of iPhylo will recall how many times I've expressed frustration at the nearly evidence-free nature of many online taxonomic databases that simply have lists of names unconnected to the primary literature. Many Wikispecies pages have long lists of papers, making it a potential goldmine. Recently there has been a lot of interest in extracting bibliographic data from Wikipedia (see WikiCite). Wikispecies could also be harvested, although a major obstacle any such project faces is the lack of a consistent format for references in Wikispecies.

The other nice thing about Wikispecies is that it has articles on taxonomic authorities, and these often list publications by those authors, and also list external identifiers for those authors, such as the VIAF and ISNI identifiers used in the library world, IPNI and ZooBank identifiers used in taxonomic databases, and ORCID, which is becoming the de facto identifier for academic researchers. This information also ends up in Wikidata.

Using Wikidata to glue things together

Wikidata is an interesting project that, like Wikispecies, I've been in two minds about (see Wikidata, Wikipedia, and #wikisci). However, I've started to make more use of it recently. Inspired by the Wikidata:SPARQL query service/2016 SPARQL Workshop I decided to explore the SPARQL query interface to Wikidata. I was struck by one of the example queries involving Wikispecies, and so after a little bit of messing about came up with a query that takes the name of an author and returns some identifiers from Wikidata, as well as an image of that person if one is available. I restrict the results to people that have an article about them in Wikispecies, because I want to start exploring using those articles to make assertions about authorship. Here is a query to search for "George Albert Boulenger":

SELECT *
WHERE
{
  ?item rdfs:label "George Albert Boulenger"@en .
  ?article schema:about ?item .
  ?article schema:isPartOf <https://species.wikimedia.org/> .
  OPTIONAL {
   ?item wdt:P213 ?isni .
  }
  OPTIONAL {
   ?item wdt:P214 ?viaf .
  }
  OPTIONAL {
   ?item wdt:P18 ?image .
  }
  OPTIONAL {
   ?item wdt:P496 ?orcid .
  }
  OPTIONAL {
   ?item wdt:P586 ?ipni .
  }
  OPTIONAL {
   ?item wdt:P2006 ?zoobank .
  }
}

This query simply asks whether Wikidata has an item on this person, whether that item is linked to Wikispecies, what identifiers Wikidata has, and whether there is an image of the person. You can see the query "live" here:

I've added some code to BioStor to do this query on the fly, and display the results. So, for Boulenger we get:

Screenshot 2017 01 11 17 04 16

Here is the result for noted carcinologist Jocelyn Crane, who currently lacks identifiers:

Screenshot 2017 01 11 17 05 32

A nice surprise was Bernard Landry:

Screenshot 2017 01 11 17 07 14

Note the ORCID 0000-0002-6005-1067. Interestingly, Bernard Landry's ORCID profile doesn't list any publications, whereas we can see lists of these in BioStor and Wikispecies.
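
For readers who want to try the query themselves, here is a minimal Python sketch of doing it "on the fly" (it is not the BioStor code). It assumes the https://query.wikidata.org/sparql endpoint, the standard SPARQL JSON results format, and the usual Wikidata pattern for Wikispecies sitelinks (schema:isPartOf <https://species.wikimedia.org/>). The function names and the User-Agent string are invented for the example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Variable name -> Wikidata property, as used in the query above.
IDENTIFIERS = {"isni": "P213", "viaf": "P214", "image": "P18",
               "orcid": "P496", "ipni": "P586", "zoobank": "P2006"}


def build_query(name):
    """Build the author query for an exact English label."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    optionals = "\n".join(
        f"  OPTIONAL {{ ?item wdt:{prop} ?{var} . }}" for var, prop in IDENTIFIERS.items()
    )
    return (
        "SELECT * WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@en .\n'
        "  ?article schema:about ?item .\n"
        "  ?article schema:isPartOf <https://species.wikimedia.org/> .\n"
        f"{optionals}\n"
        "}"
    )


def flatten_bindings(results):
    """Reduce SPARQL JSON results to a list of {variable: value} dictionaries."""
    return [{var: cell["value"] for var, cell in row.items()}
            for row in results["results"]["bindings"]]


def lookup_author(name):
    """Send the query to the endpoint (needs network access)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": build_query(name), "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "author-widget-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return flatten_bindings(json.load(response))
```

Because every identifier is OPTIONAL, an author with a Wikispecies article but no identifiers still returns a row, just with fewer keys.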

Where next?

There are several obstacles to mapping the names of authors to identifiers. One is simply the lack of identifiers. This seems to be rapidly becoming less of a problem with the efforts of the library community around VIAF, the rise of ORCID for living researchers, and the creation of Wikidata items for every taxonomist in Wikispecies. The next challenge is clustering the different ways of writing the same person's name into sets that represent the same person. As discussed above, there are tools for this. Furthermore, with Wikipedia and Wikispecies we have sources of lists of publications linked to a person and their identifiers, which should simplify the task considerably. What is nice about this is that it relies on a crowd-sourcing effort which is already well-established, namely those people who, in adding articles to Wikispecies and Wikipedia, are creating a curated database of publications linked to authors. In many cases those publications are linked to BHL (the source that BioStor extracts its articles from), so many of the links between publications and people are essentially lying there, just waiting for some skilful harvesting.

Friday, December 23, 2016

Taxonomic name timelines for BHL

Given a big corpus of literature one of the fun things to do is look at how the use of a term has changed over time. When did people first use a particular word? When did one word start to replace another, etc.? Google's Ngram Viewer is perhaps the best known tool for exploring these questions.

In the context of biodiversity, doing something similar for BHL is an obvious thing to do. I've made various clunky attempts in the past (e.g., Biodiversity Heritage Library sparklines) but these all died.

Ryan Schenk (who did a lot of the user interface for my BioNames project) wrote a very stylish tool to display changes in names over time. Called "Synynyms", his tool is now defunct, but you can read about it here and the source code is on GitHub. Ryan would take a name, find synonyms, then graph the changes in use of all those names over time.

Bison bison Linnaeus 1758 synynyms 1024x675

The death of Synynyms has not gone unnoticed:

I've had a tool for my own use that searches BHL for a name and displays the results after first trying to aggregate the hits in a sensible way. For example, if there is more than one hit in a scanned volume, and those hits all fall on pages in the same article in BioStor, then I display the BioStor article, instead of a list of each hit separately. Inspired by @PhyloJCAM's question I've built a simple tool to explore the use of one or more names over time.
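
The aggregation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: hits are taken to be (volume, page) pairs, and article_for_page is a hypothetical lookup from a page to the BioStor article containing it.

```python
from collections import defaultdict


def aggregate_hits(hits, article_for_page):
    """Collapse several hits in one volume into an article when they share one.

    hits: list of (volume_id, page_id) pairs.
    article_for_page: dict mapping page_id -> article_id (pages outside any
    known article are simply absent).
    """
    by_volume = defaultdict(list)
    for volume_id, page_id in hits:
        by_volume[volume_id].append(page_id)

    results = []
    for pages in by_volume.values():
        articles = {article_for_page.get(page) for page in pages}
        if len(pages) > 1 and len(articles) == 1 and None not in articles:
            results.append(("article", articles.pop()))   # show the article once
        else:
            results.extend(("page", page) for page in pages)
    return results
```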

Located in the "labs" section of BioStor, the BHL timeline takes one or more names and searches for those names in BHL, displaying the results as a chart and a list of hits. I often use it simply to search BHL for a particular name, but you can also use it to compare names, e.g. Aspidoscelis costata and Cnemidophorus costatus:

Screenshot 2016 12 23 06 38 32

The timeline tool is pretty crude, and it's slow if there are lots of hits in BHL. So, it's not as slick as Synynyms (Ryan Schenk is a cleverer programmer than I am). Still, it is a useful way to explore BHL and discover articles that you might not have known existed.
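
Underneath, a chart like this is just a count of hits per name per year. A minimal sketch, assuming each hit has already been reduced to a (name, year) pair (the function name and input format are hypothetical):

```python
from collections import Counter


def timeline(hits):
    """Count hits per year for each name: {name: {year: count}}."""
    counts = {}
    for name, year in hits:
        counts.setdefault(name, Counter())[year] += 1
    # Sort the years so each series is ready to plot left to right.
    return {name: dict(sorted(per_year.items())) for name, per_year in counts.items()}
```

Comparing two names then amounts to plotting the two resulting series on the same axes.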

Thursday, December 22, 2016

DNA barcoding taxonomy now in GBIF

The Face of a Lupine Blue

Following on from adding DNA barcodes to GBIF I've now uploaded a taxonomic classification of DNA barcode BINs (Barcode Index Numbers). Each BIN is a cluster of similar DNA barcodes that is essentially equivalent to a species. For more details see:

Ratnasingham, S., & Hebert, P. D. N. (2013, July 8). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.), PLoS ONE. Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0066213

The data I've uploaded was obtained by screen scraping the BOLD web site for each BIN in the DNA barcode dataset (BOLD's API doesn't let me get all the information I want). In addition to the taxonomic hierarchy associated with each BIN I've also extracted any publications mentioned on the BIN page, and subsequently tried to link those to the corresponding DOI, if the publication has one. The code for all this is available on GitHub https://github.com/rdmpage/bold-bins, which also serves as the host for the Darwin Core Archive for this dataset. There's a neat trick where you can use a .gitattributes file to tell GitHub not to store certain files in the zip file it creates for the repository (see Excluding files from git archive exports using gitattributes by @fmarier).

Having done this, I've a few thoughts.

Please, please use DOIs for articles

BOLD pages for BINs often include one or more papers that published the barcodes included in that BIN. This is great, but often links to these papers are pretty strange:

If you are going to store literature in a database, treat links to articles with great care. They are often full of extraneous stuff that depends on how the user reached that article online. DOIs greatly simplify this process. Instead of a URL like http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c&s=e14bbc4146b66a051ad5cd1f5361ac2e16dc5831&systemMessage=Pay+Per+View+will+be+unavailable+for+upto+3+hours+from+06%3A00+EST+March+23rd+ (I kid you not) you should use the DOI 10.1111/j.1755-0998.2009.02650.x.
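
When the DOI is embedded in the URL, as it is here, it can often be recovered with a regular expression. The sketch below is a heuristic, not a general DOI parser: it assumes the DOI suffix ends at the next slash, which is true for this Wiley URL but not for every DOI (suffixes may legitimately contain slashes). The name extract_doi is made up for the example.

```python
import re
from urllib.parse import unquote

# "10." + registrant code + "/" + suffix, stopping at the next /, ?, # or &.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s/?#&]+")


def extract_doi(url):
    """Return the first DOI-like string in a URL, or None if there is none."""
    match = DOI_PATTERN.search(unquote(url))
    return match.group(0) if match else None
```

A safer workflow would confirm any candidate against a DOI resolver before storing it.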

Adding DOIs to these articles means GBIF will display them on the corresponding species page, for example Centromerus sylvaticus (Blackwall, 1841) has links to these two papers:

Telfer, A., deWaard, J., Young, M., Quinn, J., Perez, K., Sobel, C., … Hebert, P. (2015, August 30). Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity Data Journal. Pensoft Publishers. https://doi.org/10.3897/bdj.3.e6313
Blagoev, G. A., deWaard, J. R., Ratnasingham, S., deWaard, S. L., Lu, L., Robertson, J., … Hebert, P. D. N. (2015, July 26). Untangling taxonomy: a DNA barcode reference library for Canadian spiders. Molecular Ecology Resources. Wiley-Blackwell. https://doi.org/10.1111/1755-0998.12444

Now GBIF users can easily explore what we know about barcodes from this species by going directly to the primary literature.

Dark taxa

In an earlier post I discussed dark taxa, which are taxa that lack formal scientific names. BOLD is full of these, so many of the taxa I've added to GBIF don't have Linnean names. Instead I've used a combination of higher taxon name and the BIN itself.

Composite taxa

Having said that BINs are essentially the same as species, this need not imply that there's a one-to-one match between BINs and currently recognised species (indeed, this is one of the things that makes barcoding so interesting: its ability to discover hidden variation within taxa currently considered to be a single species). This means that some BINs will have the same name (significant variation within a species), and some BINs will have multiple names (more than one species name assigned to the same BIN). For example, BOLD:AAA2525 is a cluster of DNA barcodes with the following names attached:
  • Icaricia lupini
  • Icaricia acmon
  • Icaricia neurona
  • Plebejus lupini
  • Aricia sp. RV-2009
  • Aricia acmon
  • Plebejus acmon
  • Plebejus elvira
  • Icaricia lupini texanus
  • Icaricia lupini monticola
  • Icaricia lupini chlorina
  • Icaricia lupini lupini
  • Icaricia lupini alpicola
This cluster of names includes subspecies, synonyms (e.g. ). Looking at the phylogeny for this BIN (PDF-only), some of these names are intermingled, suggesting that some specimens might be misidentified; apparently Icaricia lupini and I. acmon are very similar:
Coutsis, J. G. (2011). The male genitalia of N American Icaricia lupini and I. acmon; how they differ from each other and how they compare to those of the other two members of the group, I. neurona and I. shasta (Lepidoptera: Lycaenidae, Polyommatiti). Phegea, 39(4), 144-151. Retrieved from http://biostor.org/reference/160269
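
Both kinds of mismatch (one BIN with several names, one name spread over several BINs) are easy to find once the data are reduced to (BIN, name) pairs. A minimal sketch under that assumption; the function name is hypothetical, and no attempt is made to treat synonyms or subspecies as the same taxon.

```python
from collections import defaultdict


def find_mismatches(assignments):
    """Find BINs with several names, and names assigned to several BINs.

    assignments: iterable of (bin_id, species_name) pairs.
    Returns (bins_with_many_names, names_with_many_bins).
    """
    names_by_bin = defaultdict(set)
    bins_by_name = defaultdict(set)
    for bin_id, name in assignments:
        names_by_bin[bin_id].add(name)
        bins_by_name[name].add(bin_id)
    many_names = {b: sorted(names) for b, names in names_by_bin.items() if len(names) > 1}
    many_bins = {n: sorted(bins) for n, bins in bins_by_name.items() if len(bins) > 1}
    return many_names, many_bins
```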

Summary

This is a first attempt to integrate DNA barcode taxonomy into GBIF, so there are going to be some issues to explore. GBIF currently assumes taxa can be easily mapped to a Linnean hierarchy. While this is ultimately likely to be true for animal COI barcodes, getting there is going to be messy while we have numerous dark taxa and/or BINs which don't match the current identifications of the voucher specimens.

Perhaps it's worth asking whether attempting to fit the results of DNA barcoding into a classical taxonomy is the best way forward. In doing so we lose much of what makes barcoding so powerful, namely a specimen-level phylogenetic tree. Maybe what we should be really thinking about is ways to explore barcoding data natively. See Notes on next steps for the million DNA barcodes map for some thoughts on how to do that.

Image from Wikimedia Commons The Face of a Lupine Blue by Ingrid Taylar.

Thursday, December 08, 2016

iBOL DNA barcodes in GBIF

I've uploaded all the COI barcodes in the iBOL public data dumps into GBIF. This is an update of an earlier project that uploaded a small subset. Now that dataset doi:10.15468/inygc6 has been expanded to include some 2.7 million barcodes. In the new GBIF portal (work in progress) the map for these barcodes looks like this:

Screenshot 2016 12 07 22 58 43

Many of these records have images of the specimens that were sequenced, and the new GBIF "gallery" feature displays these nicely, e.g.:

Screenshot 2016 12 08 10 04 00

Having done this, I've a few thoughts.

Why did I do this?

Why did I do this, or, put another way, why didn't iBOL do this already? In an ideal world, iBOL would be an active contributor to GBIF and would be routinely uploading barcodes. Since this isn't happening, I've gone ahead and uploaded the barcodes myself. From my perspective, I want as much data to be as discoverable and as accessible as possible, hence if need be I'll grab data from wherever it lives and add it to GBIF (for an earlier example see The Zika virus, GBIF, and the missing mosquitoes). A downside of this is that, long term, the relationship between data provider and GBIF may be as valuable to GBIF as the data, and simply grabbing and reformatting data doesn't, by itself, form that relationship. But in the absence of a working relationship I still need the data.

Where are the taxonomic names?

Lots of barcodes lack formal scientific names, even though in many cases BOLD has them. The data in the public dumps often lacks this information. A next logical step would be to harvest data from the BOLD API and add taxonomic names as "identifications".

Where are the sequences?

The sequences themselves aren't in GBIF, which on the one hand is not surprising as GBIF isn't a sequence database. However, I think it should be, in the sense that for a lot of biodiversity, sequences are going to be the only way forward. This includes the eukaryote barcodes, bacterial sequences, and metabarcodes. Fundamentally sequences are just strings of letters, and GBIF already handles those (e.g., taxonomic names, geographic places, etc.). Furthermore, the following paper by Bittner et al. makes a strong case that rather than knowing "what is there?" it's more important to know "what are they doing?"

Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biology Direct. Springer Nature. https://doi.org/10.1186/1745-6150-5-47

In other words, a functional approach may matter more than a purely taxonomic approach to diversity. For a big chunk of biology this is going to depend on analysing sequences. Even if we restrict ourselves to just taxonomic diversity, there is scope for expanding our notion of what we display once we have sequences and evolutionary trees, e.g. Notes on next steps for the million DNA barcodes map.

Thursday, November 24, 2016

The Semantic Web made fun: d3sparql

Screenshot 2016 11 24 10 08 22

Continuing my on-again off-again relationship with the Semantic Web, I stumbled across a cool approach to visualising the results of SPARQL queries. Toshiaki Katayama (@tktym) has put together d3sparql, a set of Javascript scripts that takes SPARQL queries and formats the results graphically using D3.

For example, given the SPARQL endpoint http://togostanza.org/sparql, the following query retrieves the NCBI classification for the tardigrade family Hypsibiidae:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?root_name ?parent_name ?child_name
FROM <http://togogenome.org/graph/uniprot>
WHERE
{
 VALUES ?root_name { "Hypsibiidae" }
 ?root up:scientificName ?root_name .
 ?child rdfs:subClassOf+ ?root .
 ?child rdfs:subClassOf ?parent .
 ?child up:scientificName ?child_name .
 ?parent up:scientificName ?parent_name .
}

By outputting the results as a list of parent-child pairs, it is straightforward to convert the output of this query into a form that D3 accepts, so we can get a tree like this:

Tree diagram: Hypsibiidae at the root, branching into the genera Hebesuncus, Diphascon, Acutuncus, Hypsibius, Borealibius, Thulinius, Isohypsibius, Halobiotus, Ramazzottius, Pseudobiotus, Astatumen, Eremobiotus, Doryphoribius, Itaquascon, Mixibius, and Platicrista, each with its species as leaves.
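
The conversion from parent-child pairs to a tree is short enough to sketch. d3sparql does this in Javascript; the Python version below only illustrates the idea and is not d3sparql's code. It assumes the nested {"name": ..., "children": [...]} shape that D3 hierarchy examples conventionally use, and that the rows contain no cycles or duplicate pairs.

```python
def build_tree(rows, root_name):
    """Turn (parent_name, child_name) rows into a nested name/children dictionary."""
    children = {}
    for parent, child in rows:
        children.setdefault(parent, []).append(child)

    def node(name):
        entry = {"name": name}
        kids = children.get(name)
        if kids:                                   # leaves get no "children" key
            entry["children"] = [node(kid) for kid in kids]
        return entry

    return node(root_name)
```

Serialising the result with json.dumps gives a structure that a D3 tree layout should be able to read directly.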

The ability to quickly generate trees, charts, and maps from SPARQL queries makes things a lot easier. We can play around a little and explore things. The strength (and challenge) of SPARQL is that it is very open-ended: you can more or less develop queries to do anything. Being able to visualise the results will help guide that exploration.

The code for d3sparql is on GitHub. One "gotcha" is that the cached examples and external Javascript libraries aren't included. I've forked the repository here and added the missing files, so that if you grab that version it works straight out of the box.