Hey,
How hard would it be to come up with these word-stem normalizers for other languages (i.e. did you base Esperanto off of another similar language or did you come up with it yourself relatively easily)? Is there a good description somewhere on how to come up with them?
That may require some linguistic ability and some coding ability :) There are several stemmers floating around; one of them, Snowball (http://snowball.tartarus.org/), is used by various open-source software. It supports English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian and Finnish. Maybe it's possible to adapt rules from there.
Cheers, Domas
Domas Mituzas wrote:
various opensource software - snowball (http://snowball.tartarus.org/). It has English, French, Spanish, Portuguese, Italian, German, Dutch,
I have had bad experiences with Snowball for the German language. For example, the word "Vater" (father) becomes "vat" (a whisky label :-), "Mutter" (mother) morphs into "mutt" (a mail program), and "Müller" (miller) changes into "mull" or, when the umlaut 'ü' is converted to "ue", into "muell" (waste). These are ridiculous semantic results, which blur search results considerably.
On the other hand, many plural words like "Autos" or "Fotos" (cars, photos) are not reduced to the desired singular form by Snowball.
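To make the two failure modes concrete, here is a minimal sketch of a naive suffix stripper. It is NOT Snowball's German algorithm; the suffix list, the `min_stem` threshold and the name `toy_stem` are all made up for illustration. It only shows how blind suffix removal can both over-stem ("Vater") and under-stem ("Autos").

```python
# Toy suffix stripper, for illustration only. This is not Snowball:
# the suffix list and the minimum stem length are assumptions.
SUFFIXES = ("ern", "er", "en", "es", "e")


def toy_stem(word, min_stem=3):
    """Lowercase, fold umlauts, then strip the first matching suffix."""
    w = word.lower().replace("ä", "a").replace("ö", "o").replace("ü", "u")
    for suffix in SUFFIXES:
        # Only strip if enough of the word remains afterwards.
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for word in ("Vater", "Mutter", "Müller", "Autos", "Fotos"):
        print(word, "->", toy_stem(word))
```

Because the rule knows nothing about the words themselves, "Vater" loses a suffix that belongs to its stem, while the plural "-s" in "Autos" is left alone simply because no rule covers it.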
Therefore I decided to build my full-text database "joda" without stemming. The cost is low: only a few megabytes more disk space are needed for the BTree, which handles the first level of the retrieval process. The performance loss is nearly immeasurable. Search results are considerably better (sharper).
For retrieval, a wildcard at the end of a word (*) helps in most cases (at least in German) and is a tool that every user understands and accepts. Maybe there are better stemming tools than Snowball for the German language, but in practice there is no big need for them: please note that most search terms are nouns or proper names, which often need no stemming or do not even tolerate any stemming.
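As a rough sketch of the trailing-wildcard idea: the terms are indexed unstemmed and kept in sorted order, so a query like "auto*" becomes a range scan starting at the prefix. This is not joda's code; a sorted Python list with `bisect` stands in for the BTree, and `prefix_search` and `TERMS` are hypothetical names.

```python
import bisect


def prefix_search(sorted_terms, query):
    """Look up a query in a sorted list of unstemmed, lowercased terms.

    A trailing '*' turns the query into a prefix search; otherwise
    only an exact match is returned.
    """
    if query.endswith("*"):
        prefix = query[:-1]
        # All terms sharing the prefix are contiguous in sorted order.
        start = bisect.bisect_left(sorted_terms, prefix)
        matches = []
        for term in sorted_terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        return matches
    i = bisect.bisect_left(sorted_terms, query)
    if i < len(sorted_terms) and sorted_terms[i] == query:
        return [query]
    return []


# Hypothetical sample index; a real one would be built from the documents.
TERMS = sorted(["auto", "autobahn", "autos", "foto", "fotos", "mutter", "vater"])

if __name__ == "__main__":
    print(prefix_search(TERMS, "auto*"))
    print(prefix_search(TERMS, "vat"))
```

Without the wildcard, "vat" finds nothing and "vater" finds only itself, so the user decides how wide the match should be.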
Cheers
jo
wikitech-l@lists.wikimedia.org