On Wed, Jul 21, 2010 at 2:42 AM, Daniel Kinzler daniel@brightbyte.de wrote:
- The first three author names separated by slashes
Why not separate by pluses? They don't form part of names either, and they don't cause problems with wiki page titles.
I like this... however, how would you represent this in a URL? Also note that using pluses in page names doesn't work with all server configurations, since the plus sign has a special meaning in URLs.
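To make the plus-sign concern concrete, here is a minimal Python sketch using the standard `urllib.parse` module. The key string is taken from the examples in this thread; how a given wiki server actually decodes page names is an assumption and varies by configuration.

```python
from urllib.parse import quote, unquote, unquote_plus

key = "Kang+Hsu+Krajbich+2009"

# Under form/query-string decoding rules, "+" is read as a space,
# so the key no longer round-trips.
decoded_as_form = unquote_plus(key)

# Plain percent-decoding leaves "+" untouched.
decoded_as_path = unquote(key)

# Percent-encoding the plus signs removes the ambiguity, at the
# cost of a much less readable URL.
encoded = quote(key, safe="")

print(decoded_as_form)
print(decoded_as_path)
print(encoded)
```

Whether a server treats the path portion like the first or the second case is exactly the configuration dependence mentioned above.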
- Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to MMDDYYYY or similar.
I don't think it is a good idea to change one key as a function of updates on another, except for a generic disambiguation tag.
I agree. And if you *have* to use the full date, use YYYYMMDD, not the other way around, please.
Since the slashes are somewhat cumbersome, perhaps we need not make them mandatory, but can similarly use them only when they are necessary in order to "escape" a name. In the case that none of the authors has a slash in their name - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
Example keys generated by this algorithm:
KangHsuKrajbichEtAl2009
Kang+Hsu+Krajbich+2009+the+wick+in or Kang+Hsu+Krajbich+2009+twi
Both seem good, though I would suggest adopting a convention of ignoring any leading "the" and "a", to get a more distinctive three-word suffix.
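A rough sketch of that convention in Python follows. The function names, the inclusion of "an" in the article list, and the full example title are assumptions made for illustration; the thread only gives the first three title words.

```python
import re

# Leading words to skip. "the" and "a" come from the suggestion above;
# "an" is an added assumption.
LEADING_ARTICLES = {"the", "a", "an"}

def title_words(title, count=3):
    """Return the first `count` title words after dropping leading articles."""
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    while words and words[0] in LEADING_ARTICLES:
        words.pop(0)
    return words[:count]

def key_with_title(authors, year, title, abbreviate=False):
    """Build a plus-separated key from up to three authors, year, and title words."""
    words = title_words(title)
    if abbreviate:
        suffix = "".join(word[0] for word in words)  # initials only
    else:
        suffix = "+".join(words)
    parts = list(authors[:3]) + [str(year)]
    if suffix:
        parts.append(suffix)
    return "+".join(parts)

authors = ["Kang", "Hsu", "Krajbich"]
title = "The Wick in the Candle of Learning"  # assumed title for illustration
print(key_with_title(authors, 2009, title))
print(key_with_title(authors, 2009, title, abbreviate=True))
```

With the leading "the" skipped, the long form ends in `wick+in+the` and the short form in `wit`, rather than `the+wick+in` and `twi`.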
Of course, it does not have to be _exactly_ three authors, nor three words from the title, and it does not solve the John Smith (or Zheng Wang) problem.
It also doesn't solve issues with transliteration: Merik Möller may become "Moeller" or "Moller", Jakob Voß may become "Voss" or "Vosz" or even "VoB", etc. In the case of Chinese names, it's often not easy to decide which part is the last name.
To avoid this kind of ambiguity, I suggest automatically applying some type of normalization and/or hashing. There is quite a bit of research about this kind of normalization out there, generally with the aim of detecting duplicates. Perhaps we can learn from bibsonomy.org; have a look at how they do it: http://www.bibsonomy.org/help/doc/inside.html.
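As one possible shape for such a normalization step, here is a small Python sketch built on the standard `unicodedata` module. It is not Bibsonomy's method, just an illustration; the function name `normalize_name` is hypothetical.

```python
import unicodedata

def normalize_name(name):
    """Fold a surname to a plain lowercase ASCII-like form for comparison."""
    # casefold() lowercases and also maps "ß" to "ss".
    folded = name.casefold()
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that is not a letter or digit (spaces, hyphens, slashes).
    return "".join(ch for ch in stripped if ch.isalnum())

for name in ["Möller", "Moller", "Moeller", "Voß", "Voss"]:
    print(name, "->", normalize_name(name))
```

This maps "Möller" and "Moller" to the same string, and "Voß" and "Voss" to the same string, but "Moeller" still comes out different. Catching that variant would need language-specific rules or fuzzy matching on top.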
Gotta love open source university research projects :)
-- daniel
Hey Daniel,
Bibsonomy seems to suffer from the same problem as CiteULike: URLs which convey no meaning. An example URL ID from CiteULike is 2434335, and one from Bibsonomy is 29be860f0bdea4a29fba38ef9e6dd6a09. I hope to continue to steer the conversation away from that direction. These IDs guarantee uniqueness, but I believe that we can create keys that both guarantee uniqueness and convey some meaning to humans. Consider that this key will be embedded in wiki articles any time a source is cited. It's important that it make some sense.
Plus signs and slashes in the key appear to be cumbersome. Perhaps we can avoid this by truncating last names that involve a slash to either the portion before or after the slash.
Changing the key seems to be a bad idea, so we want keys that are unique from the start. That means we should use the full date, in YYYYMMDD form as suggested by Daniel.
In the event that multiple sources are published by the same set of authors on the same day, we can use a, b, c disambiguation.
This gives us the following key, guaranteed to be unique: KangHsuKrajbich20091011b
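A minimal Python sketch of this scheme is below. The function names are hypothetical, and two details are assumptions not settled in the thread: every key always carries a letter starting at "a" (so no key ever has to change later), and a slashed surname is truncated to the part before the slash.

```python
import datetime
import string

def surname_part(last_name):
    # Assumption: keep the portion before the slash.
    return last_name.split("/")[0]

def base_key(last_names, published):
    """Up to three surnames in CamelCase plus the date as YYYYMMDD."""
    authors = "".join(surname_part(name) for name in last_names[:3])
    return authors + published.strftime("%Y%m%d")

def assign_key(last_names, published, existing_keys):
    """Pick the first free a, b, c... suffix and record the key as taken."""
    base = base_key(last_names, published)
    for letter in string.ascii_lowercase:
        candidate = base + letter
        if candidate not in existing_keys:
            existing_keys.add(candidate)
            return candidate
    raise ValueError("more than 26 sources share the key " + base)

existing = set()
day = datetime.date(2009, 10, 11)
first = assign_key(["Kang", "Hsu", "Krajbich"], day, existing)
second = assign_key(["Kang", "Hsu", "Krajbich"], day, existing)
print(first, second)
```

Under these assumptions the second source by the same authors on the same day receives the key shown above, and the first source's key never changes.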
Brian