On 21 Jul 2010, at 09:42, Daniel Kinzler wrote:
This seems best to me of what's proposed so far.
Both seem good, though I would suggest adopting a convention of ignoring any
leading "the" and "a", to get a more distinctive three-word suffix.
While that's a good idea, we'd then have to know all the "indistinctive"
words in all languages (Die, Der, La, L', ...).
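To make the language problem concrete, here is a minimal sketch of such a convention. The per-language article table, the elision prefixes and the function name distinctive_words are all made up for illustration, and the table is deliberately incomplete:

```python
# Hypothetical, deliberately incomplete table of leading articles per language.
LEADING_ARTICLES = {
    "en": {"the", "a", "an"},
    "de": {"der", "die", "das"},
    "fr": {"le", "la", "les"},
}

# Hypothetical elided articles that are glued to the following word.
ELIDED_PREFIXES = {"fr": ("l'",)}


def distinctive_words(title, lang, count=3):
    """Return up to `count` leading title words, skipping one leading article."""
    words = title.split()
    if words:
        first = words[0].casefold()
        if first in LEADING_ARTICLES.get(lang, set()):
            # Drop a free-standing article such as "The" or "Die".
            words = words[1:]
        else:
            for prefix in ELIDED_PREFIXES.get(lang, ()):
                if first.startswith(prefix):
                    # Cut an elided article such as "L'" off the first word.
                    words[0] = words[0][len(prefix):]
                    break
    return [w for w in words if w][:count]
```

Note that the result depends on knowing the language of the title: "Die" is dropped from a German title but has to stay in an English one such as "Die Hard".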
There are still going to be duplicates, alas...
Of course, it does not have to be _exactly_ three authors, nor three
words from the title, and it does not solve the John Smith (or Zheng
It also doesn't solve issues with transliteration: Merik Möller may become
"Moeller" or "Moller", Jakob Voß may become "Voss" or "Vosz" or even "VoB",
etc. In the case of Chinese names, it's often not easy to decide which part is the
To avoid this kind of ambiguity, I suggest automatically applying some type of
normalization and/or hashing. There is quite a bit of research about this kind
of normalization out there, generally with the aim of detecting duplicates.
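As a rough illustration of what I mean (not how any existing system does it), here is a minimal sketch using only the Python standard library. The function names normalize and short_hash, the choice of SHA-1 and the eight-character length are assumptions made up for this example:

```python
import hashlib
import unicodedata


def normalize(text):
    """Fold case and diacritics and keep only letters and digits."""
    folded = text.casefold()  # also maps "ß" to "ss"
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop the combining marks left over after decomposition (e.g. the umlaut dots).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in stripped if ch.isalnum())


def short_hash(text, length=8):
    """Hash the normalized form, so that trivially different spellings collide."""
    digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
    return digest[:length]
```

With this, "Möller" and "Moller" normalize to the same string, as do "Voß" and "Voss". It does not unify "Moeller" or "Vosz" with them; that would need language-specific rules on top.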
Perhaps we can learn from bibsonomy.org; have a look at how they do it: