Greetings Deb/Trey/Erik,
I'd enjoy joining the discussions on these hackathon topics as well.
Specifically, I'd like to see if I can help improve WMF's search relevance using additional machine learning techniques/ML packages.
Thanks, --justin
On Wed, May 2, 2018 at 8:53 AM, Deborah Tankersley <dtankersley@wikimedia.org> wrote:
Nice stuff!
Should we set up a meeting to talk more in depth about this, as we're about 2 weeks out from the Hackathon right now?
Cheers,
Deb
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
On Wed, May 2, 2018 at 8:39 AM, Trey Jones <tjones@wikimedia.org> wrote:
I've got my own list of more language-focused not-necessarily-great ideas, in order of my current desire to work on them:
- Mirandese (mwl) analysis plugin built from Portuguese and French
parts, plus a stop list provided by an mwl editor
- plugin to merge high surrogates and low surrogates that get split
up by the Chinese analyzer
- plugin to do automatic homoglyph corrections
- plugin to do transliteration for languages where it is relatively
easy (Serbian was on the list, but it’s already done!—and for very simple mappings this is just a char map)
- look into ways of automatically generating a stemmer from
Wiktionary conjugation/declension data (maybe start with Estonian?)
- compare the analyzers for the top 5-10 wiki languages by volume,
and look for ways to increase consistency among them
- develop a different statistical approach to detect wrong keyboard
typing and build a search-only filter to generate alternative tokens—for Russian/English, Hebrew/English, OR one hand on wrong home row
- update RelForge with some additional metrics I’ve been collecting
- project Wordnet or other thesaurus/ontology onto short strings
(e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful thesaurus terms and prune the rest
- recheck differences in unpacked vs monolithic analyzers
(eliminating our automatic upgrades, which are 98% likely to have caused the diffs)
- “Bollywood detector”—identify and map Bollywood movie names into
multiple scripts
I was planning to work on the Mirandese analysis plugin and maybe one of the next three on the list. But if anyone wants to collaborate on any of the others, I'm happy to do so.
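To illustrate the "this is just a char map" case from the transliteration item above, here is a minimal Python sketch, assuming lowercase Serbian Cyrillic-to-Latin as the example mapping. The names `SERBIAN_CYRILLIC_TO_LATIN` and `transliterate` are hypothetical, and lowercasing before mapping is an assumption made to keep the table small; a real analysis plugin would look different.

```python
# Hypothetical sketch: transliteration as a plain character map.
# Assumption: input is lowercased first, so only lowercase letters are mapped.
SERBIAN_CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# str.maketrans accepts single-character keys mapped to strings of any length,
# which covers the one-to-two cases such as "љ" -> "lj".
_TABLE = str.maketrans(SERBIAN_CYRILLIC_TO_LATIN)


def transliterate(text: str) -> str:
    """Lowercase the text and map each Cyrillic letter to its Latin form.

    Characters that are not in the table (Latin letters, digits,
    punctuation) pass through unchanged.
    """
    return text.lower().translate(_TABLE)


if __name__ == "__main__":
    print(transliterate("Београд"))   # beograd
    print(transliterate("Његош"))     # njegoš
    print(transliterate("Novi Sad"))  # novi sad
```

A one-directional table like this only works where the mapping is context-free; languages that need context or dictionary lookups would not fit the "relatively easy" category.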
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
With the hackathon coming up, I thought we could ponder what could be done while there. I've been constructing a list of horrible ideas over the last couple of weeks:
Discovery mailing list Discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery