Georgi Kobilarov wrote:
In this particular one, it's two articles about the same topic, but
there could be some cases where the two articles are about something
different.
Yes, such as
http://en.wikipedia.org/wiki/FROG
and
http://en.wikipedia.org/wiki/Frog
I agree that this can be annoying. One has to make sure not to lose the
case information (as happened to me with lookup.dbpedia.org once, which
ended up merging FROG and Frog).
But what do you suggest to do about that, Paul? Should Wikipedia make URLs
case-insensitive and then enforce disambiguation with ()?
If Wikipedia were my site, I'd do two things:
(i) map all case-variant forms to a single form (New yOrK cITy -> New
York City); "FROG" gets renamed to "FROG Cipher" or "Frog (Cipher)"
(ii) do a permanent redirect from variant forms to the canonical form
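
A minimal sketch of points (i) and (ii), assuming titles arrive already
URL-decoded. The title list, build_index() and resolve() are hypothetical
names for illustration, not anything Wikipedia or dbpedia actually runs:

```python
# Hypothetical canonical titles, after "FROG" has been renamed.
CANONICAL_TITLES = ["New York City", "Frog", "FROG Cipher"]


def build_index(titles):
    """Map each case-folded title to its single canonical form."""
    index = {}
    for title in titles:
        key = title.casefold()
        if key in index:
            # Two titles differ only by case: one must be renamed first.
            raise ValueError(f"{title!r} collides with {index[key]!r}")
        index[key] = title
    return index


INDEX = build_index(CANONICAL_TITLES)


def resolve(requested_title):
    """Return (HTTP status, canonical title) for a requested title."""
    canonical = INDEX.get(requested_title.casefold())
    if canonical is None:
        return 404, None        # no such page under any casing
    if canonical == requested_title:
        return 200, canonical   # already the canonical form
    return 301, canonical       # permanent redirect from the variant


print(resolve("New yOrK cITy"))
print(resolve("FROG"))
```

Building the index fails loudly if two titles differ only by case, which
is what forces the rename in (i) before the redirects in (ii) can be
unambiguous.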
I think what dbpedia is doing is reasonable considering the situation.
My own system for handling generic databases has both a VARBINARY
and a VARCHAR field for dbpedia URLs/labels. It does a case-insensitive
lookup first, and if that fails, looks at the alternatives that turn
up. It's also got some heuristics for dealing with redirects,
disambiguation, and all that. In the big picture I see "naming and
identity" as a specific functional module for this kind of system...
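
A rough sketch of the two-column lookup described above, using SQLite
from Python as a stand-in: a BLOB column plays the role of VARBINARY and
a TEXT column with COLLATE NOCASE plays the role of a case-insensitive
VARCHAR. The table and function names are hypothetical, and treating
"fails" as "did not yield exactly one match" is an assumption. Note that
SQLite's NOCASE only folds ASCII letters.

```python
import sqlite3


def make_db(labels):
    """Build an in-memory table holding each label in both forms."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource ("
        " label_exact BLOB,"                # byte-wise, case-sensitive
        " label_ci TEXT COLLATE NOCASE)"    # case-insensitive comparison
    )
    conn.executemany(
        "INSERT INTO resource VALUES (?, ?)",
        [(label.encode("utf-8"), label) for label in labels],
    )
    return conn


def lookup(conn, label):
    """Case-insensitive lookup first; fall back to the alternatives."""
    rows = conn.execute(
        "SELECT label_ci FROM resource WHERE label_ci = ?", (label,)
    ).fetchall()
    candidates = [row[0] for row in rows]
    if len(candidates) == 1:
        return candidates               # unambiguous match
    # Ambiguous or empty: prefer a byte-exact match if there is one.
    exact = conn.execute(
        "SELECT label_ci FROM resource WHERE label_exact = ?",
        (label.encode("utf-8"),),
    ).fetchall()
    if exact:
        return [row[0] for row in exact]
    return sorted(candidates)           # leave the choice to the caller


conn = make_db(["FROG", "Frog", "New York City"])
print(lookup(conn, "new york city"))
print(lookup(conn, "FROG"))
print(lookup(conn, "frog"))
```

Redirect and disambiguation heuristics would sit on top of the candidate
list returned in the ambiguous case; they are not shown here.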