On Wed, Jul 21, 2010 at 2:36 PM, Jakob <jakob.voss@s1999.tu-chemnitz.de> wrote:
Hi,
Talking about identifiers for bibliographic records I just want to stress one crucial point:
This gives us the following key, guaranteed to be unique: KangHsuKrajbich20091011b
There is absolutely no such thing as a "guaranteed unique identifier" that can be derived from existing metadata. You will *always* have false positives (different publications get the same identifier [1]) and false negatives (the same publication gets different identifiers [2]). Fuzzy identifiers occur even when they are created by the publisher or the author himself (for instance duplicate ISBNs for definitely different editions or even totally different books). If you argue about identifiers, please keep in mind that you are *always* talking about heuristics, not about something "unique per se". Existing identifiers only differ in their ratio of false positives to false negatives.
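A minimal sketch of the two failure modes, assuming a hypothetical key rule (surnames plus year) and invented records; neither the function `derived_key` nor the sample names come from any real system:

```python
def derived_key(authors, year):
    """Hypothetical rule: up to three surnames, 'EtAl' if there are more, then the year."""
    surnames = [a.split(",")[0].strip().replace(" ", "") for a in authors]
    key = "".join(surnames[:3])
    if len(surnames) > 3:
        key += "EtAl"
    return key + str(year)

# False positive: two different objects (say, a printed paper and its
# presentation slides) by the same author in the same year share one key.
paper_key = derived_key(["Smith, J."], 2009)
slides_key = derived_key(["Smith, J."], 2009)
false_positive = paper_key == slides_key

# False negative: the same publication, recorded once with the preprint year
# and once with the print year and a different spelling of the surname.
preprint_key = derived_key(["Müller, A."], 2009)
printed_key = derived_key(["Mueller, A."], 2010)
false_negative = preprint_key != printed_key
```

Tightening the rule (adding title words, a date, a suffix) shifts the balance between the two kinds of error but does not remove either.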
The only way you may get unique identifiers is to assign your own identifiers that are *not* derived from the content - such as auto-incremented record ids in a database. Even then they are not unique if you change the content, because the identity of the object may change. An MD5 or SHA sum of the full content [3] or the version id in a versioning database (like MediaWiki) is unique but not practical if you want to change content. A solution to this problem is to let people decide in every single case what an identifier looks like and when it should change (example: Wikipedia article titles). But then the identifiers are not permanent (records may split, join and be renamed).
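The trade-off between an assigned id and a content hash can be sketched as follows. This is only an illustration: the record strings, the in-memory counter standing in for a database sequence, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import itertools

# Stand-in for a database's auto-increment column (assumption for this sketch).
_id_counter = itertools.count(1)

def assign_id():
    """Return the next record id; it is independent of the record's content."""
    return next(_id_counter)

def content_hash(record_text):
    """Return a SHA-256 hex digest of the full record text."""
    return hashlib.sha256(record_text.encode("utf-8")).hexdigest()

original = "Brooks: No Silver Bullet (1986)"
corrected = "Brooks: No Silver Bullet (1987)"  # same record after an edit

record_id = assign_id()               # stays the same across the edit
hash_before = content_hash(original)
hash_after = content_hash(corrected)  # differs, so the hash "identifier" changed
```

The assigned id survives the edit but says nothing about whether the record still describes the same object; the hash pins down one exact version and breaks on any change.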
That's the way it is. You have to decide which problem to solve with an identifier and then be aware of its limitations. As Brooks [3] wrote, there is no silver bullet - so there is no silver identifier.
Cheers,
Jakob
[1] For instance if you have a common name and a general title or if you want to distinguish the printed version and the presentation slides of the same publication etc.
[2] For instance different ways to abbreviate and/or write the name of an author and/or title, different years (year of preprint vs year of printed version) etc.
[3] See http://en.wikipedia.org/wiki/No_Silver_Bullet which cites an article that was published in 1986 and 1987, and probably reprinted in another year - so what's the identifier? ;-)
Hi Jakob,
I would like to counter this point with the following rule: There is always a way to adjudicate ambiguity. It is easy to create a rule that works in 90% of cases:
Author1Author2Author3EtAl10
It is easy to modify this rule to work in 99% of cases:
Author1Author2Author3EtAl20101011b
Modifying the rule to work in 100% of cases requires a community of users to adjudicate the relatively small number of special cases.
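One way to picture this, as a rough sketch only: a registry hands out a letter suffix whenever a base key is already taken, and a table of human decisions overrides the automatic rule for the special cases. The class `KeyRegistry`, the record ids, and the reading of the trailing letter as a collision suffix are assumptions made for illustration, not a description of an existing system.

```python
import string
from collections import defaultdict

class KeyRegistry:
    """Hypothetical registry: automatic suffixes plus manually adjudicated keys."""

    def __init__(self):
        self._used = defaultdict(int)  # base key -> number of suffixes handed out
        self.overrides = {}            # record id -> key decided by the community

    def assign(self, record_id, base_key):
        # A human decision always wins over the automatic rule.
        if record_id in self.overrides:
            return self.overrides[record_id]
        n = self._used[base_key]
        if n >= len(string.ascii_lowercase):
            raise ValueError("more than 26 collisions; needs manual adjudication")
        self._used[base_key] += 1
        return base_key + string.ascii_lowercase[n]

registry = KeyRegistry()
first = registry.assign("rec-1", "KangHsuKrajbich20091011")   # suffix 'a'
second = registry.assign("rec-2", "KangHsuKrajbich20091011")  # suffix 'b'

# Special case settled by hand: rec-3 is judged to be the same work as rec-2.
registry.overrides["rec-3"] = second
third = registry.assign("rec-3", "KangHsuKrajbich20091011")
```

The automatic part only guarantees that keys differ; deciding that two records are the same publication still has to come from the override table, which is where the community does the work.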
Brian