Monday, February 02, 2009

Wiki modelling - Part 3

I rather skirted around the notion of "taxonomic concepts" in the previous post, partly because it's easy to end up trying to have a concept for each utterance ever made by a taxonomist, and that doesn't seem, er, scalable. So, I have a more limited view of a taxonomic concept, namely a name attached to some data. For example, I think the NCBI Taxonomy provides useful taxonomic concepts, in that names are explicitly linked to data, such as sequences:

Having data means we can make inferences that have some basis, other than trying to figure out what a taxonomist "meant".

However, things start to get a little messy once I try and extract more information out of NCBI GenBank. Some time ago I pointed out the potential utility of host association records in GenBank. In some (many?) cases the host taxa won't be in GenBank, so the link will be between DNA sequence and taxon name. This is, of course, a simplification. It would be nice to model things more accurately. For example, a parasite will typically be obtained from a host organism, so it might be nice if, say, we had voucher specimen codes for both parasite and host, and could model the link as one between organisms (or samples of/from organisms). However, this is unlikely to be feasible in most cases, hence we have sequences linked to names:
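To make the simplification concrete, here is a minimal Python sketch of a sequence linked to a host name rather than to a host organism. It is an illustration only, not the model used for the wiki: the class names, field names, and the `link_host` helper are all hypothetical, and the name lookup table stands in for whatever name-to-taxon resolution is available.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Sequence:
    """A sequence record; field names are illustrative, not GenBank's."""
    accession: str
    organism_name: str                # name the sequence is filed under
    host_name: Optional[str] = None   # free-text host name, if recorded
    voucher: Optional[str] = None     # specimen code, usually unavailable


@dataclass(frozen=True)
class HostLink:
    """Link from a sequence to a host *name*, not to a host specimen."""
    sequence: Sequence
    host_name: str
    host_taxon_id: Optional[int]      # None when the host has no taxon record


def link_host(seq: Sequence, name_to_taxon_id: Dict[str, int]) -> Optional[HostLink]:
    """Build a host link for a sequence, resolving the host name if possible.

    `name_to_taxon_id` is an assumed lookup keyed by lower-cased names.
    """
    if seq.host_name is None:
        return None
    key = seq.host_name.strip().lower()
    # An unresolved host still yields a link, just one that ends at a name.
    return HostLink(seq, seq.host_name, name_to_taxon_id.get(key))
```

In this sketch a host that has no sequence of its own (and so no taxon id) still produces a `HostLink`, with `host_taxon_id` left as `None`, which is the "DNA sequence to taxon name" case described above. The `voucher` field is there only to show where an organism-level link would attach if specimen codes were available.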

2 comments:

Pete DeVries said...

The NCBI taxon id works in a way similar to the GeoSpecies IDs in that the ID stays the same despite changes in nomenclature. I would use this except that there is no NCBI ID for most species. There needs to be a sequence for there to be an NCBI ID.

In this sense, the NCBI ID and GeoSpecies ID are compatible types of OTUs.

It seems that the genomics people are already using forms of the NCBI ID as species URIs.

This gets into what makes a good species URI.

I think that there is a species and different scientists publish taxonomic hypotheses for that species.

It seems that most others see this the other way around.

They see it from this perspective: there are publications (each with a slightly different meaning for the species they discuss). They point these at each other, merge them, etc., and create an identifier.

Is this a useful, meaningful identifier? Maybe for publications and as additional information related to species concepts, but not something I would use to tie specimens to an operational taxonomic unit (OTU).

Related topic: I have TreeBASE IDs in the GeoSpecies database, but I have not figured out the best way to express them. Do you have a standard URI?

Roderic Page said...

Pete,

NCBI ids are widely used as taxon URIs, partly because they are stable (something biodiversity databases have -- until recently -- failed to ensure), and partly because they are connected to data that people care about - sequences. They also have the advantage of providing a single "view" of what each taxon is (although that view may change over time).

I haven't thought about this much, but I would lean to the idea that taxonomic names should have URIs (they are essentially tags), that we are interested in information tagged with those names, and that some people/agencies/organisations will provide a view on what is (or isn't) included in a particular taxon. Those views that people find useful (e.g., NCBI taxon ids) will get used. I personally don't find debates about the existence of taxa terribly illuminating.

Re TreeBASE, the nearest I've come across are the URL API links described at http://www.treebase.org/treebase/urlapi.html