With the hackathon coming up I thought we could ponder what could be done while there. I've been constructing a list of horrible ideas over the last couple weeks:
Web UI for cirrus debug/devel features:
- Settings dump
- Mappings dump
- Copyable version of settings+mappings suitable for creating an index with curl
- cirrusDumpQuery
- cirrusDumpResult
- cirrusExplain
- cirrusUserTesting
The top-level idea is to make all of these easy to access. It could be a userscript run on-page in the wiki, or an SPA served from Tool Labs (or even people.wikimedia.org).
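As a rough sketch of the first idea, a tool could simply construct Special:Search URLs with one of the cirrus debug flags appended and fetch the JSON. The flag names below are the real CirrusSearch URL parameters; the helper itself and the base-URL handling are assumptions for illustration.

```python
from urllib.parse import urlencode

# Hypothetical helper: build a search URL with a CirrusSearch debug flag.
DEBUG_FLAGS = {"cirrusDumpQuery", "cirrusDumpResult", "cirrusExplain"}

def build_debug_url(base, query, flag):
    if flag not in DEBUG_FLAGS:
        raise ValueError("unknown debug flag: %s" % flag)
    # fulltext=1 forces a search rather than a title jump
    qs = urlencode({"search": query, "fulltext": "1"})
    return "%s/index.php?%s&%s" % (base.rstrip("/"), qs, flag)

url = build_debug_url("https://en.wikipedia.org/w", "hackathon", "cirrusDumpQuery")
```

A userscript version would do the same thing client-side and render the returned JSON in a collapsible tree.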
============
Docker setup to initialize Elasticsearch, import the latest cirrus dump, and attach a Kibana instance for a UI. Probably with a modified mapping more amenable to Kibana inspection.
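For the import step: the cirrus dumps (dumps.wikimedia.org/other/cirrussearch/) are newline-delimited JSON already in Elasticsearch _bulk format, alternating action and source lines. A minimal sketch of the chunking half of an importer, with the HTTP POST to _bulk and the exact dump filename left out as details to fill in:

```python
import itertools

def bulk_chunks(lines, docs_per_chunk=500):
    """Yield newline-joined chunks of action+source line pairs,
    each sized for one _bulk request (each doc is two lines)."""
    pair_count = docs_per_chunk * 2
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, pair_count))
        if not chunk:
            return
        # _bulk bodies must end with a trailing newline
        yield "\n".join(chunk) + "\n"
```

Each yielded chunk would then be POSTed to the local cluster's /_bulk endpoint before pointing Kibana at the resulting index.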
============
Some script to manage Elasticsearch shard allocation manually via the API? Pointless, but perhaps fun.
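Such a script would mostly be a thin wrapper around the cluster reroute API's move command; the index and node names below are placeholders:

```python
import json

def move_shard_command(index, shard, from_node, to_node):
    """Build the body for a manual shard move via POST /_cluster/reroute."""
    return {"commands": [{"move": {
        "index": index, "shard": shard,
        "from_node": from_node, "to_node": to_node}}]}

body = json.dumps(move_shard_command("enwiki_content", 3, "elastic1001", "elastic1002"))
```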
===========
Phabricator-formatted export for Jupyter notebooks
- Problem: images?
  - It seems they would need to be uploaded separately and then referenced in the final output.
  - There is an API for this, but then we can't just emit something to paste into a field; the whole export would need to happen over the API.
- Better, but worse: data URIs would be great, but I don't know if Phabricator is built for megabyte-sized posts. It also doesn't support data URIs. Browsers also hate it when you copy/paste excessive amounts of data.
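Ignoring the image problem, the text side is straightforward, since an .ipynb file is just JSON. A minimal sketch converting a parsed notebook dict to remarkup, handling only markdown and code cells:

```python
# Built dynamically to avoid a literal triple-backtick inside this example.
FENCE = "`" * 3

def notebook_to_remarkup(nb):
    """Convert a parsed .ipynb dict to Phabricator remarkup text.
    Markdown cells pass through; code cells get remarkup code fences."""
    out = []
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            out.append(src)
        elif cell.get("cell_type") == "code":
            out.append("%slang=python\n%s\n%s" % (FENCE, src, FENCE))
    return "\n\n".join(out)
```

Output cells, and especially image outputs, are exactly the part this sketch punts on.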
==========
Custom implementation to find similar images in Commons:
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=re...
- http://www.deepideas.net/building-content-based-search-engine-quantifying-si...
- Convert each image into a feature vector
- Use clustering to generate an image signature
- Find k nearest neighbors via Earth Mover's Distance (EMD); can use the pyemd library
  - It's very non-obvious how the signature + weights get plugged into pyemd
  - EMD is expensive; no clue how this would scale to millions of images
- This would probably perform poorly; it's more interesting as a way to understand some of the history of similar-image retrieval
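For intuition about what EMD actually computes: in one dimension, over histograms on the same bins, it reduces to the summed absolute difference of the CDFs. A real implementation over image signatures would use pyemd with a full ground-distance matrix; this toy version is only to build intuition:

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms over the same
    unit-spaced bins: the total |CDF(p) - CDF(q)| across bins."""
    assert len(p) == len(q)
    cdf_diff, total = 0.0, 0.0
    for a, b in zip(p, q):
        cdf_diff += a - b
        total += abs(cdf_diff)
    return total

# Moving all mass one bin over costs 1.0:
emd_1d([1.0, 0.0], [0.0, 1.0])
```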
=========
https://github.com/beniz/deepdetect.git ?
- Use a pre-trained ML model to detect objects in images and then label those objects.
- Compare the sets of detected objects to judge image similarity; could probably extend this with color information.
- Do we actually have a use case for finding images similar to other images? Perhaps on upload?
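One cheap way to compare detected-object sets, assuming the detector returns label:confidence maps (the labels and scores below are made up), is weighted Jaccard similarity:

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two {label: confidence} dicts."""
    labels = set(a) | set(b)
    inter = sum(min(a.get(l, 0.0), b.get(l, 0.0)) for l in labels)
    union = sum(max(a.get(l, 0.0), b.get(l, 0.0)) for l in labels)
    return inter / union if union else 0.0

weighted_jaccard({"cat": 0.9, "grass": 0.4}, {"cat": 0.8, "dog": 0.3})
```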
==========
Elasticsearch cluster balance simulator
- Simulate and evaluate how cluster balancing performs under various conditions.
- No way this could be done in a weekend hackathon. It would probably also be completely wrong, simulating some idealized cluster that doesn't act like ours.
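To show how idealized such a thing would be: the whole toy version is greedy bin-packing of shards onto the least-loaded node and measuring the spread. The hard part a real simulator would need, and this ignores, is modeling Elasticsearch's actual allocation deciders:

```python
import heapq

def simulate_balance(shard_sizes, n_nodes):
    """Greedily assign shards (largest first) to the least-loaded node;
    return the load spread (max node load - min node load)."""
    nodes = [(0.0, i) for i in range(n_nodes)]
    heapq.heapify(nodes)
    for size in sorted(shard_sizes, reverse=True):
        load, i = heapq.heappop(nodes)
        heapq.heappush(nodes, (load + size, i))
    loads = [load for load, _ in nodes]
    return max(loads) - min(loads)
```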
==========
Prototype Lire plugin for Elasticsearch
- Lire = Lucene Image REtrieval
- I know nothing about it other than that it exists.
- A plugin already exists that plugs it into Solr, so how hard could it be?
- Maybe try it out standalone on some small test set to see what it does.
I've got my own list of more language-focused not-necessarily-great ideas, in order of my current desire to work on them:
- Mirandese (mwl) analysis plugin built from Portuguese and French parts, plus a stop list provided by an mwl editor
- plugin to merge high and low surrogates that get split up by the Chinese analyzer
- plugin to do automatic homoglyph corrections
- plugin to do transliteration for languages where it is relatively easy (Serbian was on the list, but it's already done! For very simple mappings this is just a char map)
- look into ways of automatically generating a stemmer from Wiktionary conjugation/declension data (maybe start with Estonian?)
- compare the analyzers for the top 5-10 wiki languages by volume, and look for ways to increase consistency among them
- develop a different statistical approach to detecting wrong-keyboard typing, and build a search-only filter to generate alternative tokens (for Russian/English, Hebrew/English, or one hand on the wrong home row)
- update RelForge with some additional metrics I've been collecting
- project WordNet or another thesaurus/ontology onto short strings (e.g., Commons descriptions, Wikipedia titles) to determine useful thesaurus terms and prune the rest
- recheck differences in unpacked vs. monolithic analyzers (eliminating our automatic upgrades, which are 98% likely to have caused the diffs)
- "Bollywood detector": identify and map Bollywood movie names into multiple scripts
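The homoglyph-correction idea really can start as a plain character map. A tiny illustrative subset folding a few common Cyrillic lookalikes to Latin (a real plugin would also need to decide per-token whether folding is appropriate):

```python
# Keys are Cyrillic letters that are visually identical to Latin ones.
CYR_TO_LAT = str.maketrans({
    "\u0430": "a",  # а
    "\u0435": "e",  # е
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0441": "c",  # с
    "\u0445": "x",  # х
    "\u0443": "y",  # у
})

def fold_homoglyphs(token):
    """Fold a small set of Cyrillic homoglyphs to their Latin lookalikes."""
    return token.translate(CYR_TO_LAT)
```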
I was planning to work on the Mirandese analysis plugin and maybe one of the next three on the list. But if anyone wants to collaborate on any of the others, I'm happy to do so.
Trey Jones Sr. Software Engineer, Search Platform Wikimedia Foundation
On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
Nice stuff!
Should we set up a meeting to talk more in depth about this, as we're about 2 weeks out from the Hackathon right now?
Cheers,
Deb
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
On Wed, May 2, 2018 at 8:39 AM, Trey Jones <tjones@wikimedia.org> wrote:
Discovery mailing list Discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
Greetings Deb/Trey/Erik,
I'd enjoy joining the discussions on these hackathon topics also.
Specifically, I'd like to see if I can help improve the WMF's search relevance using additional machine learning techniques/packages.
Thanks, --justin
On Wed, May 2, 2018 at 8:53 AM, Deborah Tankersley <dtankersley@wikimedia.org> wrote:
Deb: We talked about some of these in our Wednesday meeting, but didn't do much deciding or prioritizing. After that, at the hackathon travel meeting, Rachel reminded us that the hackathon is "a community-focused event" and that we as WMF staff should be "supporting, connecting, and helping volunteer and affiliate developers." So I think I'm going to update my hackathon participation info to include a link to the list of projects I want to work on, and hope that someone from outside the WMF contacts me about something.

On the learning side, I've already gotten David to agree to help me with some of the technical bits I need for some of my proposed projects, either before or at the hackathon (yay!). I also hope that the "Tell me why your search sucks" sign will encourage people to stop and chat with us. I figure random people chatting with us about search, and anyone who wants to work with us, would take precedence over any other projects we might prefer to work on at the hackathon, though I plan to fall back to my list if I run out of other things to do or people to talk to.
Justin: We can definitely talk about ways to keep improving the ML ranking (or other ML approaches for search). I don't know if there's time during the hackathon to pull something together; I guess it depends on how complex it is. More broadly (and Erik can speak more definitively about this), I'd say that while there's always some ML-related work going on in the background, our Q4 goals https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q4#Program_1:_Make_knowledge_more_easily_discoverable are less about learning-to-rank/ML, so there may not be much bandwidth for complex projects in the short term. That said, I'm gathering ideas for NLP applications for search, which often overlap with ML applications, so if you have any ideas (or if anyone else does!), please share them, here or off-list.
—Trey
Trey Jones Sr. Software Engineer, Search Platform Wikimedia Foundation
On Wed, May 2, 2018 at 1:09 PM, Justin Ormont <justin.ormont@gmail.com> wrote:
Yes! The "Tell me why your search sucks" sign was a success last year, and I'm looking forward to seeing/hearing all the cool questions folks will ask this time! :)
I also just got a chance to look at Erik's slides (final version https://upload.wikimedia.org/wikipedia/commons/4/4c/From_Clicks_to_Models_The_Wikimedia_LTR_Pipeline.pdf) that he presented at Haystack and I think it might be cool to reprise that presentation in a breakout session...if Erik is up for it. :)
Cheers,
Deb
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
On Wed, May 2, 2018 at 1:06 PM, Trey Jones <tjones@wikimedia.org> wrote:
I have some ideas for potential hackathon projects for myself:
Finish https://commons.wikimedia.org/wiki/User:TabulistBot, which essentially provides persistent delayed-query functionality. There are a bunch of requests for something like that.
===============================
Figure out if and how we could do https://phabricator.wikimedia.org/T179879: different capabilities (mainly timeouts) for some OAuth'ed users.
===============================
Set up data collection for labeling Wikidata prefix searches, so that we can do ML processing. Basically, we need a way to know which search result the user selected, and right now that's not easy to figure out from the logs.
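In outline, the missing piece is a join between prefix-search events and the click that followed in the same session. The event shape here (session_id, ts, query, results, title) is entirely hypothetical; the real logs not having anything this convenient is the point of the task:

```python
def label_selections(searches, clicks):
    """Pair each search event with the first later click in the same
    session whose title appeared in that search's result list."""
    by_session = {}
    for c in clicks:
        by_session.setdefault(c["session_id"], []).append(c)
    labeled = []
    for s in searches:
        for c in by_session.get(s["session_id"], []):
            if c["ts"] >= s["ts"] and c["title"] in s["results"]:
                labeled.append((s["query"], c["title"]))
                break
    return labeled
```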
===============================
Try to make some progress on https://phabricator.wikimedia.org/T190454: making Wikidata prefix search cover both the item and article namespaces. This would probably require diving deep into the guts of the UI, so I'm not sure I can do it, but it would be fun to try.
===============================
Try to dig into Blazegraph's guts and fix some of the annoying bugs, like https://phabricator.wikimedia.org/T168876.
===============================
Work on the geo-lookup service described in https://phabricator.wikimedia.org/T179991.
===============================
Port Yuri's implementation of tabular data binding to WDQS (https://phabricator.wikimedia.org/T181319); this goes well with the TabulistBot task above.
That's what I have thought about so far, now I need to choose what I actually want to do :)