I've got my own list of more language-focused not-necessarily-great ideas, in order of my current desire to work on them:
- Mirandese (mwl) analysis plugin built from Portuguese and French parts, plus a stop list provided by an mwl editor
- plugin to merge high surrogates and low surrogates that get split up by the Chinese analyzer
- plugin to do automatic homoglyph corrections
- plugin to do transliteration for languages where it is relatively easy (Serbian was on the list, but it’s already done!—and for very simple mappings this is just a char map)
- look into ways of automatically generating a stemmer from Wiktionary conjugation/declension data (maybe start with Estonian?)
- compare the analyzers for the top 5-10 wiki languages by volume, and look for ways to increase consistency among them
- develop a different statistical approach to detect wrong keyboard typing and build a search-only filter to generate alternative tokens—for Russian/English, Hebrew/English, OR one hand on wrong home row
- update RelForge with some additional metrics I’ve been collecting
- project WordNet or another thesaurus/ontology onto short strings (e.g., Commons descriptions, Wikipedia titles) to determine useful thesaurus terms and prune the rest
- recheck differences in unpacked vs monolithic analyzers (eliminating our automatic upgrades, which are 98% likely to have caused the diffs)
- “Bollywood detector”—identify and map Bollywood movie names into multiple scripts
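
To make the surrogate-merging idea a bit more concrete, here is a rough Python sketch of the core logic. It is not the plugin itself (which would presumably be a token filter inside the search engine); it assumes the split halves arrive as adjacent single-character tokens, and the function names are made up for illustration.

```python
# Sketch only: assumes the analyzer emits each half of a surrogate pair
# as its own one-character token, right next to its partner.

def is_high_surrogate(token: str) -> bool:
    """True if the token is a single high (leading) surrogate."""
    return len(token) == 1 and 0xD800 <= ord(token) <= 0xDBFF


def is_low_surrogate(token: str) -> bool:
    """True if the token is a single low (trailing) surrogate."""
    return len(token) == 1 and 0xDC00 <= ord(token) <= 0xDFFF


def merge_surrogates(tokens: list[str]) -> list[str]:
    """Rejoin adjacent high+low surrogate tokens into one character."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if (
            is_high_surrogate(token)
            and i + 1 < len(tokens)
            and is_low_surrogate(tokens[i + 1])
        ):
            # Standard UTF-16 decoding arithmetic for a surrogate pair.
            high = ord(token) - 0xD800
            low = ord(tokens[i + 1]) - 0xDC00
            merged.append(chr(0x10000 + (high << 10) + low))
            i += 2
        else:
            # Ordinary tokens and unpaired surrogates pass through unchanged.
            merged.append(token)
            i += 1
    return merged
```

For example, `merge_surrogates(["\ud840", "\udc00", "中"])` gives back two tokens, U+20000 (a CJK Extension B character) followed by 中. Unpaired surrogates are left alone rather than dropped, on the theory that a filter like this shouldn't silently eat input.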