On Wed, Jul 21, 2010 at 2:42 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
>> 1) The first three author names separated by slashes
> why not separate by pluses? they don't form part of names either, and
> don't cause problems with wiki page titles.

I like this... however, how would you represent this in a URL? Also note that
using pluses in page names doesn't work with all server configurations, since
plus has a special meaning in URLs.
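
To make the URL concern concrete, here is a minimal Python sketch using only the
standard library. The key string is just the example from this thread; how a
given wiki or web server actually decodes pluses is an assumption and depends
on its configuration.

```python
from urllib.parse import quote, unquote, unquote_plus

key = "Kang+Hsu+Krajbich+2009"

# To carry a literal plus safely, it has to be percent-encoded as %2B.
encoded = quote(key, safe="")
print(encoded)

# Form-style (query string) decoding treats an unescaped plus as a space,
# which silently changes the key.
print(unquote_plus(key))

# Plain percent-decoding restores the original key from the encoded form.
print(unquote(encoded))
```

So a plus-separated key is only safe if every component that touches the URL
agrees on which of these decodings applies.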

>> 3) Some or all of the date. For instance, if there is only one source by
>> this set of authors that year, we can just use YYYY. However, once another
>> source by that set of authors is added, the key should change to MMDDYYYY
>> or similar.
> I don't think it is a good idea to change one key as a function of
> updates on another, except for a generic disambiguation tag.

I agree. And if you *have* to use the full date, use YYYYMMDD, not the other way
around, please.

>> Since the slashes are somewhat cumbersome, perhaps we need not make them
>> mandatory, but similarly use them only when they are necessary in order to
>> "escape" a name. In the case that none of the authors has a slash
>> in their name - the dominant case - we can stick to the easily legible and
>> nicely compact CamelCase format.
>>
>> Example keys generated by this algorithm:
>>
>> KangHsuKrajbichEtAl2009
> Kang+Hsu+Krajbich+2009+the+wick+in
> or
> Kang+Hsu+Krajbich+2009+twi

Both seem good, though I would suggest adopting a convention of ignoring any
leading "the" and "a", to get a more distinctive 3-word suffix.
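
A rough Python sketch of that convention, under some assumptions: the function
name `title_suffix` is made up for illustration, only "the" and "a" are
skipped, punctuation is not handled, and the title string below is a stand-in
that extends the "the wick in" fragment quoted above.

```python
LEADING_WORDS = {"the", "a"}  # assumption: only these two are ignored


def title_suffix(title, n_words=3, initials=False):
    """Build a key suffix from the first n_words of a title,
    skipping any leading 'the' or 'a'."""
    words = title.lower().split()
    # Drop leading articles only; later ones are kept.
    while words and words[0] in LEADING_WORDS:
        words.pop(0)
    chosen = words[:n_words]
    if initials:
        return "".join(w[0] for w in chosen)
    return "+".join(chosen)


example_title = "The wick in the candle"  # stand-in title for illustration
print(title_suffix(example_title))
print(title_suffix(example_title, initials=True))
```

With the leading "the" dropped, the suffix starts at "wick" rather than
spending one of its three words on an article.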

> Of course, it does not have to be _exactly_ three authors, nor three
> words from the title, and it does not solve the John Smith (or Zheng
> Wang) problem.

It also doesn't solve issues with transliteration: Merik Möller may become
"Moeller" or "Moller", Jakob Voß may become "Voss" or "Vosz" or even "VoB",
etc. In the case of Chinese names, it's often not easy to decide which part is
the last name.
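
To show how far simple Unicode folding gets with these names, here is a small
Python sketch. It is only an illustration under stated assumptions: the
function name `normalize_name` is invented here, and this is not claimed to be
what BibSonomy or any other system does.

```python
import unicodedata


def normalize_name(name):
    """Fold a last name to lowercase ASCII letters (illustrative only)."""
    # casefold() lowercases and maps the German sharp s to "ss".
    folded = name.casefold()
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    # Keep only plain ASCII letters, which drops the combining marks.
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())


for variant in ["Möller", "Moller", "Moeller", "Voß", "Voss", "Vosz", "VoB"]:
    print(variant, "->", normalize_name(variant))
```

This maps "Möller" and "Moller" to the same string, and likewise "Voß" and
"Voss", but "Moeller", "Vosz" and "VoB" still come out different. Catching
those would need language-specific rules or fuzzy matching on top.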

To avoid this kind of ambiguity, I suggest automatically applying some type of
normalization and/or hashing. There is quite a bit of research about this kind
of normalization out there, generally with the aim of detecting duplicates.
Perhaps we can learn from bibsonomy.org; have a look at how they do it:
<http://www.bibsonomy.org/help/doc/inside.html>.

Gotta love open source university research projects :)

-- daniel

Hey Daniel,

Bibsonomy seems to suffer from the same problem as CiteULike: URLs that convey no meaning. An example URL id from CiteULike is 2434335, and one from Bibsonomy is 29be860f0bdea4a29fba38ef9e6dd6a09. I hope to continue to steer the conversation away from that direction. These IDs guarantee uniqueness, but I believe that we can create keys that both guarantee uniqueness and convey some meaning to humans. Consider that this key will be embedded in wiki articles any time a source is cited. It's important that it make some sense.

Plus signs and slashes in the key appear to be cumbersome. Perhaps we can avoid this by truncating last names that involve a slash to either the portion before or after the slash. 

Changing the key seems to be a bad idea, so we want a key system that is unique from the start. That means we should use the full date, YYYYMMDD, as suggested by Daniel.

In the event that multiple sources are published by the same set of authors on the same day, we can use a, b, c disambiguation. 

This gives us the following key, guaranteed to be unique: KangHsuKrajbich20091011b
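
The scheme can be sketched in Python roughly as follows. This is only a sketch under some assumptions of mine: the function name `make_key` and the `existing_keys` set are hypothetical, names with a slash are truncated to the part before the slash, and the first source on a given day gets no suffix while later ones get a, b, c, so that no existing key ever has to change.

```python
from datetime import date
import string


def make_key(last_names, published, existing_keys):
    """Build a key from up to three last names plus YYYYMMDD,
    adding a letter suffix only if the key is already taken."""
    parts = []
    for name in last_names[:3]:
        # Assumption: keep the portion before any slash in the name.
        name = name.split("/")[0]
        # Capitalize the first letter for CamelCase, leave the rest alone.
        parts.append(name[:1].upper() + name[1:])
    base = "".join(parts) + published.strftime("%Y%m%d")
    if base not in existing_keys:
        return base
    # Same authors, same day: try a, b, c, ... until one is free.
    for letter in string.ascii_lowercase:
        candidate = base + letter
        if candidate not in existing_keys:
            return candidate
    raise ValueError("no free suffix left for " + base)


keys = set()
for _ in range(3):
    key = make_key(["Kang", "Hsu", "Krajbich"], date(2009, 10, 11), keys)
    keys.add(key)
    print(key)
```

Under these assumptions the third source from the same authors on the same day ends up as KangHsuKrajbich20091011b, and the two earlier keys stay as they were.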

Brian