I hacked up the fulltext search/index code a bit to work on UTF-8
despite MySQL's lack of direct support: a Language::stripForSearch()
function is called to do any necessary mangling of character sets before
we store the indexable version of the text.
For Esperanto, Polish, Russian, Czech, and Korean I set it to just fold
the text to lowercase (so search is case-insensitive) and then convert
all UTF-8 sequences into hex strings, which MySQL won't mistreat.
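The idea can be sketched roughly like this (in Python rather than MediaWiki's actual PHP, and with a made-up function name; the real stripForSearch() may differ in details such as prefixing or delimiting the hex):

```python
def strip_for_search(text: str) -> str:
    """Fold case, then hex-encode non-ASCII characters so a
    bytes-oriented fulltext index sees only safe ASCII 'words'.
    (A sketch of the approach, not the actual MediaWiki code.)"""
    out = []
    for ch in text.lower():
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Replace the character with the hex of its UTF-8 bytes,
            # e.g. 'ĉ' (0xC4 0x89) becomes 'c489'.
            out.append(ch.encode("utf-8").hex())
    return "".join(out)
```

Since the hex digits are plain ASCII letters and numbers, MySQL's fulltext tokenizer indexes them like any other word; the same transformation is applied to the query at search time, so matching still works.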
For Chinese and Japanese, things are a bit more complicated, as there is
no word spacing in the original text but the fulltext search works on
words. For Chinese I just set it to put spaces around every character;
it needs a lot of tweaking, but it sort of works: searching a single
character works great, but multi-character sequences don't behave as
expected.
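The per-character spacing amounts to something like this Python sketch (again, the real code is PHP, and the real character test is presumably broader than the single Unicode block checked here):

```python
def space_cjk(text: str) -> str:
    """Put spaces around every CJK ideograph so each character
    becomes its own 'word' for the fulltext indexer.
    Sketch only: covers just the basic CJK Unified Ideographs block."""
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")
        else:
            out.append(ch)
    # Collapse the doubled spaces between adjacent ideographs.
    return " ".join("".join(out).split())
```

This explains the observed behavior: each character is its own index term, so a single-character query matches directly, while a multi-character query is treated as several independent words rather than one phrase.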
For Japanese, I have it divide up the text at boundaries around chunks
of the same type of character (hiragana, katakana, or kanji), which does
a pretty good first approximation of dividing at the right place. It
could probably use some more work as well: when searching a word or
short phrase that divides across character types (e.g. 'furansugo',
which mixes katakana and kanji), results may not be as expected.
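The boundary-splitting heuristic might look like this Python sketch (the classification ranges are simplified, ignoring half-width katakana and the rarer extension blocks):

```python
from itertools import groupby

def script_class(ch: str) -> str:
    """Classify a character as hiragana, katakana, kanji, or other.
    Simplified ranges; a sketch of the idea, not the actual code."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def segment_japanese(text: str) -> str:
    """Insert a space wherever the script class changes, so each run
    of same-type characters becomes one indexable 'word'."""
    return " ".join(
        "".join(run) for _, run in groupby(text, key=script_class)
    )
```

For example, フランス語 ('furansugo', French language) splits into a katakana run フランス and the kanji 語, so the whole word never appears in the index as a single term, which is exactly why such mixed-script queries misbehave.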
-- brion vibber (brion @ pobox.com)