[Mediawiki-l] Character equivalence

Lars Aronsson lars at aronsson.se
Fri Oct 21 12:35:14 UTC 2005


Kyle Moore wrote:

> This is because google, understandably, is a bit smarter than mediawiki,
> Google has a list of equivalent (or at least similar) characters so that

That is true, but I don't find it understandable.  I'm not sure 
whether you're talking about MediaWiki's own search or Wikipedia's 
use of Lucene, a free software search engine, but both ought to 
have the same capabilities as Google.  Accent neutrality is as 
central to search as case (capitalization) neutrality.

Google's character equivalence classes (a = á = ä) are sometimes 
too generous.  In German and Swedish, a and ä ought to be treated 
as different characters, even though a and á are the same.  For 
example, Kalla and Källa are two different words.  Google offers 
the plus operator to override the default behaviour:  a Google 
search for +Kalla will not find Källa or Kálla.
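
As a rough illustration of the equivalence classes described above, 
here is a minimal Python sketch.  It is not how Google, MediaWiki or 
Lucene implement this.  The names KEEP_DISTINCT, fold() and matches() 
are hypothetical, and the set of letters kept distinct is an 
assumption for Swedish and German.  The leading "+" is modelled as 
"case-insensitive but accent-sensitive", which is also an assumption.

```python
import unicodedata

# Assumption: letters that a Swedish or German index should NOT fold
# onto their base letter. Normalized to NFC so lookups are reliable.
KEEP_DISTINCT = frozenset(unicodedata.normalize("NFC", "äöåüÄÖÅÜ"))


def fold(text, keep=KEEP_DISTINCT):
    """Lower-case text and strip accents, except for letters in keep."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            # Protected letter: keep the accent, only drop the case.
            out.append(ch.lower())
            continue
        # Decompose (á -> a + combining acute) and drop combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(base.lower())
    return "".join(out)


def matches(query, word, keep=KEEP_DISTINCT):
    """Compare folded forms; a leading '+' disables accent folding."""
    if query.startswith("+"):
        return query[1:].casefold() == word.casefold()
    return fold(query, keep) == fold(word, keep)
```

With these definitions, matches("Kalla", "Kálla") is true and 
matches("Kalla", "Källa") is false, while passing keep=frozenset() 
gives the more generous behaviour in which a = á = ä.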

In fact, some users without a German keyboard will type AE instead 
of Ä.  It could be argued that any vowel followed by an E should 
be treated as equivalent to the first vowel alone (a = ä = ae) in 
searches.
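
The "vowel followed by E" argument can be sketched the same way.  
The following is only an illustration under stated assumptions:  the 
function names strip_accents() and loose_key() are made up for this 
example, and the vowel set [aeiouy] is my own choice.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks, so that ä, á and a all become a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Assumption: "any vowel" means one of a, e, i, o, u, y.
_VOWEL_E = re.compile(r"([aeiouy])e")


def loose_key(word):
    """Search key under which a, ä and ae compare as equal."""
    base = strip_accents(word).casefold()
    # Collapse "vowel + e" to the vowel alone (ae -> a, ue -> u, ...).
    return _VOWEL_E.sub(r"\1", base)
```

Here loose_key("Käse"), loose_key("Kaese") and loose_key("Kase") all 
give "kase".  The rule is deliberately crude:  it also turns "poet" 
into "pot", so a real search engine would probably apply it only as 
a fallback or only for selected languages.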

I would have thought that basic string functions such as strcmp(), 
strcasecmp(), strcoll(), wcscasecmp() and qsort() are now 
universally based on a collation defined by the locale and are 
thus able to distinguish between big differences (alpha < beta) 
and small ones (alpha = Alpha = älpha), and that this capability 
would be reflected through higher level languages such as PHP. 
We're not in the 1980s anymore.
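
To make the distinction between big and small differences concrete, 
here is a hand-rolled Python sketch of a multi-level sort key.  It 
only approximates the idea.  It does not consult the locale, it is 
not what strcoll() does internally, and the names collation_key() 
and difference_level() are hypothetical.  A real implementation 
would more likely rely on the locale's collation tables or on a 
library such as ICU.

```python
import unicodedata


def collation_key(text):
    """Return a (primary, secondary, tertiary) key for text.

    primary   ignores accents and case  (alpha vs beta differ here)
    secondary ignores case only         (alpha vs älpha differ here)
    tertiary  ignores nothing           (alpha vs Alpha differ here)
    """
    nfd = unicodedata.normalize("NFD", text)
    primary = "".join(c for c in nfd if not unicodedata.combining(c)).casefold()
    secondary = nfd.casefold()
    tertiary = nfd
    return (primary, secondary, tertiary)


def difference_level(a, b):
    """Return 0 if a and b are identical, else the first level (1-3)
    at which their keys differ. Level 1 is a "big" difference."""
    levels = zip(collation_key(a), collation_key(b))
    for level, (x, y) in enumerate(levels, start=1):
        if x != y:
            return level
    return 0
```

A search would then treat two words as equal whenever 
difference_level() is not 1, while sorted(words, key=collation_key) 
still gives a stable order among words that differ only by accent 
or case.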


-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se


