Thursday, May 08, 2008

Fixing GBIF

The more I play with GBIF the more I come across some spectacular errors. Here's one small example of what can go wrong, and how easy it would be to fix at least some of the errors in GBIF. This is topical given that the recent review of EOL highlighted the importance of vetting and cleaning data.

The frog Boophis periegetes features in a recent study of DNA barcoding (doi:10.1186/1742-9994-2-5). The sequences from this study (AY848605-9) aren't georeferenced in GenBank, but in iPhylo they are courtesy of Metacarta's web services. The sequences are located in Madagascar.

Finding errors

Curious about the frog I did a search in iSpecies and got the following map:


Oops, the frog is found in the middle of the South Atlantic(!), and in Brazil(!?).
These specimen records are provided by the MCZ, Harvard. Looking at the latitude and longitude co-ordinates, it's clear that there has been a comedy of errors. In the case of MCZ A-119852 the longitude is west instead of east; for MCZ A-119850 and MCZ A-119851 the latitude and longitude have been swapped, and the longitude is west instead of east (again). If we make these changes, the specimens go back to Madagascar (the rectangle on the SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.






Interestingly, the DiGIR records all list the country as Madagascar, so for any specimen in GBIF it would be trivial to test:
  1. Do the co-ordinates for the specimen fall inside the bounding box for the country?
  2. If not, will they if we change sign (i.e., hemispheres) and/or swap latitude and longitude?

These checks would be easy to implement, and would probably catch a lot of the more egregious errors in GBIF.
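
As a rough illustration of the two tests above, here is a minimal Python sketch. It is not GBIF's code: the bounding box for Madagascar is only an approximation that I have assumed for the example, the function names `in_box` and `suggest_fix` are hypothetical, and the co-ordinates in the usage note are made up rather than the actual MCZ values.

```python
from itertools import product

# Approximate bounding box for Madagascar, assumed for illustration only:
# (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
COUNTRY_BOXES = {"Madagascar": (-26.0, -11.5, 43.0, 51.0)}


def in_box(lat, lon, box):
    """Test 1: does the point fall inside the country's bounding box?"""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def suggest_fix(lat, lon, country):
    """Test 2: try swapping and/or negating the co-ordinates.

    Returns (steps, lat, lon) for the first combination that lands inside
    the bounding box, where steps lists the changes applied (an empty list
    means the record was already fine). Returns None if nothing works.
    """
    box = COUNTRY_BOXES[country]
    # The unchanged record is tried first, then the simplest changes.
    for swap, flip_lat, flip_lon in product((False, True), repeat=3):
        new_lat, new_lon = (lon, lat) if swap else (lat, lon)
        steps = ["swap latitude and longitude"] if swap else []
        if flip_lat:
            new_lat = -new_lat
            steps.append("negate latitude")
        if flip_lon:
            new_lon = -new_lon
            steps.append("negate longitude")
        if in_box(new_lat, new_lon, box):
            return steps, new_lat, new_lon
    return None
```

For a made-up record at latitude -18.9, longitude -47.5 with country "Madagascar", `suggest_fix` reports that negating the longitude puts the point inside the box. For latitude -47.5, longitude -18.9 it reports a swap followed by negating the longitude. A bounding box is a crude test (it will accept points in the sea near the coast), so a suggested fix should be flagged for the curator to confirm rather than applied automatically.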

Fixing errors
What will be interesting is whether these records will be fixed. I have sent feedback via GBIF's web site, as well as sending an email to the MCZ. I'll let readers know what happens.

Ground truth

Lastly, those interested in the frog itself may find the iSpecies search frustrating as the link returned by Google Scholar leads to a page in Ingenta saying:
This title is now published by Blackwell Publishing and can be found here www.ingentaconnect.com/content/bsc/zoj

Nope, the paper in question is actually at ScienceDirect (doi:10.1006/zjls.1995.0040). This paper describes the species, and gives the latitude and longitude of the collection localities (correctly).

13 comments:

Anonymous said...

I also don't get why GBIF reports georeference errors in its dataset event logs (like "Geospatial issues: longitude probably negated") without fixing them.

And in the case of MCZ A-119852 it's even weirder: the coordinates fall outside the "darwin:Country", but here they follow the coordinates, and switch the country to Brazil? I would guess a filled in "darwin:Country" is more reliable than coordinates?

Roderic Page said...

Many thanks to José Rosado at the MCZ, who sent me an email just now saying "I have made the changes in our database and have noticed some additional mistakes in these series which will be corrected."
Now, the question is how long these corrections take to migrate to the DiGIR provider, and then to GBIF.

Anonymous said...

Thank you Rod for this blog.

In fact the 2 steps you describe above are exactly how GBIF process data currently, and we maintain bounding boxes for countries to 'sanity check' whether a point lies in a stated country (and this only really works for terrestrial species - we are investigating Marine shape files).

With regard to the frog in the South Atlantic, it was originally served with only a Lat/Long and therefore based on the geospatial data alone, there was nothing to say that this was not correct (of course we could start the 'known distribution' conversation here...).
- In the database that will be put online next week, you will see that this record will be flagged with suspected Lat/Long errors, and it will not appear on any maps, since it is now served with a country.
- In the database that will be put online the month after, if the provider has altered the records, then it will be placed on Madagascar correctly, after we reindex them.

The GBIF portal was developed on the notion that it is an index of biodiversity data available online, and therefore will not modify records, but rather facilitate a quality reporting mechanism to the data curator. Until stable network-wide record identifiers are commonplace, modifying data that is shared through a network in multiple routes can cause obvious problems.

Tim (GBIF)

Roderic Page said...

Thought I added this comment last night, but it disappeared. José updated the DiGIR provider last night, and it now serves the correct co-ordinates. For example, here's the record for MCZ A-119850.

Roderic Page said...

Tim, I'm puzzled by your comment

"With regard to the frog in the South Atlantic, it was originally served with only a Lat/Long and therefore based on the geospatial data alone, there was nothing to say that this was not correct (of course we could start the 'known distribution' conversation here...)."

because the original DiGIR record harvested by GBIF (see the page for occurrence 33352972) has the country as "Madagascar", hence GBIF had enough information for a sanity check.

This suggests that the sanity checks aren't working. I've come across other examples (e.g., channel catfish with country listed as Canada or the USA, but with co-ordinates placing them in Antarctica).

I completely agree with the need for stable identifiers. As wonderful as DiGIR is, its failure to address this issue is a major limitation.

Anonymous said...

Hi Rod,
When we first indexed the occurrence record for the frog in question, there was no country with the record - hence my saying there was nothing to imply the lat/long were incorrect.
Then, later on there was a country, but wrong Lat/Long (e.g. last week it was like this) and thus next week you will see that we do flag it and omit it from maps.
Now they have fixed the coordinates (last 48 hours), next time we index them it will all be correct. We are always around a month to 2 months behind due to the delays of harvesting and batch processing which we need to do thanks to the data size.
GBIF are working on proposals to reduce the latency seen after a provider changes a record to the change being seen on the GBIF portal, but the benefits of this will not be shown until the end of 2008.

With regards to the channel catfish:
http://data.gbif.org/occurrences/searchWithTable.htm?c[0].s=20&c[0].p=0&c[0].o=13543761&c[1].s=19&c[1].p=0&c[1].o=37.8E,76.8S,37.9E,76.7S

Again, these 2 records have no country provided from the source (DiGIR in this case), and all we have done is deduce it from the stated coordinates and declare that "country inferred from coordinates". In other words, we are not changing a record, but aiding its discoverability, albeit incorrectly in this case...

Please keep reporting errors Rod, so the network can address them.

As an aside: Did I ever tell you that your "transitive reduction" blog describes exactly how we build the 'backbone taxonomy' that the occurrence records hang from? (I can't find the blog anymore though, so I assume you removed it)

Tim (GBIF)

Roderic Page said...

Tim, maybe I'm a bit confused, but I've been assuming that the information displayed in the occurrence view is what GBIF has in its cache (i.e., I'm looking at the data GBIF hold, not what is currently served from the original source). Hence, when I look at the map occurrences I see no country record for the frog, but for each individual record (e.g., 33352972) I see "Madagascar". If I click on Retrieve original record from data provider, then I talk to the DiGIR provider directly (which has the newly corrected co-ordinates). As it stands these three views -- (1) occurrence table, (2) occurrence detail, and (3) DiGIR provider -- are out of sync with each other. I'd assumed that 1 and 2 were the same thing (GBIF's local copy of the data). I hope you can see why I'm confused.

The "transitive reduction" post is still online.

Sounds like you guys need a blog so we can follow the cool stuff you are doing.

Anonymous said...

You are correct when you observe that there are 3 views. At GBIF we have harvesting and processing, and it is not always possible to process everything harvested in time for database roll over. Currently we process when a harvest is complete. We aim to have all processing done, but really we are always limited for processing time. In this frog instance, it just so happens that the version that is online has out-of-sync harvested and processed views, as we rolled over mid-harvest. I understand this confusion, and can only say that we will try and improve this - the streamlined harvest proposal will go a very long way to eliminate this.
Thanks
Tim

Anonymous said...

And see the point map for the African freshwater stingray, Dasyatis garouaensis, plotted in FishBase with data from GBIF. That is more than "a comedy of errors" -- it amounts to ridiculous! GBIF data is nearly useless and I would never trust them anyway.

Roderic Page said...

Mauro,

I'm puzzled by your comment, because I can't find any georeferenced records for Dasyatis garouaensis in GBIF.

I eventually got the FishBase map to load, and yes this looks a mess, but it's hard to figure out why because the links to the original GBIF records are all broken (such as this one). This is a nice example of where simply linking to URLs is a recipe for disaster (in the same way, all of GBIF's links from GenBank are broken now).

If these links were stable identifiers we could have followed this up and figured out why the FishBase map has this stingray scattered all over the planet, and GBIF itself doesn't.

I suspect the problem may lie with FishBase and not GBIF.

Roderic Page said...

Tim tells me that the frog has finally swum home, and all specimens are now safely in Madagascar.

Anonymous said...

Dear Rod,

Thanks for your comment. The records for Dasyatis garouaensis retrieved from FishBase are then mistakenly being attributed to GBIF "Portal": see here.

I apologize for the harsh tone of my previous message (but just think about how you would feel if you wanted to make use of these wonderful online biodiversity databases to do some real science and stumbled on something like that mess...). To be sure, other data I have retrieved from GBIF looks nice (*and* truly useful!).

Keep up the excellent job on this (and the iSpecies) blog!

BTW, is there any prospect of resurrecting the Taxonomic Search Engine?

Cheers,
