[Mediawiki-l] Character equivalence
Lars Aronsson
lars at aronsson.se
Fri Oct 21 12:35:14 UTC 2005
Kyle Moore wrote:
> This is because google, understandably, is a bit smarter than mediawiki,
> Google has a list of equivalent (or at least similar) characters so that
This is a fact, but not understandable. I'm not sure if you're
talking about MediaWiki's own search or Wikipedia's use of Lucene,
a free software search engine. But both ought to have the same
capabilities as Google. Accent neutrality is as central to search
as case (capitalization) neutrality.
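
As a rough illustration of treating accents the same way as case, here is a minimal Python sketch of a search key. It is not how MediaWiki, Lucene or Google match text; the names fold and matches are made up for this example, and it relies only on the standard unicodedata module.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a case- and accent-insensitive key for text (illustrative helper)."""
    # NFKD decomposition splits a precomposed letter such as "á" into
    # the base letter "a" followed by a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is the Unicode-aware counterpart of lower().
    return stripped.casefold()


def matches(query: str, title: str) -> bool:
    """Substring match on folded keys (hypothetical search predicate)."""
    return fold(query) in fold(title)
```

With this key, a query for kalla would match a title containing Källa or KÁLLA, in the same way that it already matches Kalla.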
Google's character equivalence classes (a = á = ä) are sometimes
too generous. In German and Swedish, a and ä ought to be treated
as different characters, even though a and á are the same. For
example, Kalla and Källa are two different words. Google offers
the plus operator to override the default behaviour. A Google
search for +Kalla will not find Källa or Kálla.
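
One way to picture a less generous equivalence class is a fold that strips accents except for letters a given language treats as distinct. The sketch below is an assumption-laden illustration: the set DISTINCT_SV, the helper names, and the reading of the plus operator as "exact match apart from case" are all choices made for this example, not a description of Google's behaviour.

```python
import unicodedata

# Letters treated as separate from their base letter in Swedish.
# This set is an assumption for the example; German would use ä, ö, ü.
DISTINCT_SV = frozenset("åäöÅÄÖ")


def strip_accents(text, keep=frozenset()):
    """Remove accents, except on characters listed in keep."""
    out = []
    # NFC first, so that protected letters appear as single code points.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


def search_key(text, keep=frozenset()):
    """Case-insensitive key with language-specific protected letters."""
    return strip_accents(text, keep).casefold()


def term_matches(term, word, keep=DISTINCT_SV):
    """Compare a query term with a word; a leading + disables accent folding."""
    if term.startswith("+"):
        # Assumed semantics: only case differences are ignored.
        return term[1:].casefold() == word.casefold()
    return search_key(term, keep) == search_key(word, keep)
```

Under these assumptions, kalla matches Kálla but not Källa, while passing an empty keep set reproduces the generous behaviour where all three are equal.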
In fact, some users without a German keyboard will type AE instead
of Ä. It could be argued that any vowel followed by an E should
be treated as equivalent to the first vowel alone (a = ä = ae) in
searches.
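
The a = ä = ae idea can be sketched as one more folding step. The example below is only an illustration: loose_key is a made-up name, and limiting the rule to a, o and u (the usual German transliterations) rather than every vowel is an assumption of this sketch.

```python
import re
import unicodedata

# A base vowel followed by "e", as in the transliterations ae, oe, ue.
# Restricting this to a, o, u is an assumption made for the example.
_VOWEL_E = re.compile(r"([aou])e")


def loose_key(text):
    """Fold case, strip accents, then collapse ae/oe/ue to a/o/u."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return _VOWEL_E.sub(r"\1", base)
```

Käse, Kaese and Kase all produce the same key here. The rule is deliberately lossy: it also collapses the oe in poet, so a real search would probably use it to widen recall rather than as the only key.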
I would have thought that basic string functions such as strcmp(),
strcasecmp(), strcoll(), wcscasecmp() and qsort() are now
universally based on a collation defined by the locale and are
thus able to distinguish between big differences (alpha < beta)
and small ones (alpha = Alpha = älpha), and that this capability
would be reflected through higher level languages such as PHP.
We're not in the 1980s anymore.
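
For reference, in C only strcoll() and strxfrm() consult the locale's collation order; strcmp() compares raw byte values. The distinction between big and small differences can be sketched without any locale support as a multi-level key, loosely in the spirit of collation levels. Everything below is a simplified illustration: collation_key and compare are hypothetical names, and the ordering is by code point, not a real locale's collation.

```python
import unicodedata


def collation_key(word):
    """Return a (primary, secondary, tertiary) key for word.

    primary:   base letters only, case folded  (alpha vs. beta)
    secondary: accents kept, case folded       (alpha vs. älpha)
    tertiary:  accents and case kept           (alpha vs. Alpha)
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), decomposed.casefold(), decomposed)


def compare(a, b, strength=1):
    """Three-way comparison using only the first strength levels of the key."""
    ka = collation_key(a)[:strength]
    kb = collation_key(b)[:strength]
    return (ka > kb) - (ka < kb)
```

At strength 1, alpha, Alpha and älpha compare as equal while alpha still sorts before beta; raising the strength to 2 or 3 brings the accent and case differences back. Python also exposes locale.strcoll() and locale.strxfrm(), but their results depend on which locales are installed, which is why this sketch avoids them.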
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se