With the hackathon coming up I thought we could ponder what could be done while we're there. I've been constructing a list of horrible ideas over the last couple of weeks:

Web UI for cirrus debug/devel features:

- Settings dump

- Mappings dump

- Copyable version of settings+mappings suitable for creating an index with curl

- cirrusDumpQuery

- cirrusDumpResult

- cirrusExplain

- cirrusUserTesting

Top level idea is to make it easy to access all of these things. Could be a userscript run on-page in the wiki. Could be an SPA run from tool labs (or even people.wikimedia.org).
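
As a rough sketch of what such a UI would wrap: most of these features are just query parameters or API actions, so the core is URL building. The snippet below only constructs URLs and fetches nothing. The host, the helper names, the `/w/` paths and the API action name used in the usage note are assumptions to check against the CirrusSearch version in use.

```python
from urllib.parse import urlencode

# Debug flags from the list above. Passing them as plain query parameters
# on a search request is an assumption of this sketch.
DEBUG_FLAGS = (
    "cirrusDumpQuery",
    "cirrusDumpResult",
    "cirrusExplain",
    "cirrusUserTesting",
)


def search_debug_url(base, query, flag, value="1"):
    """Build a Special:Search URL with one cirrus debug flag attached."""
    if flag not in DEBUG_FLAGS:
        raise ValueError(f"unknown debug flag: {flag}")
    params = {"title": "Special:Search", "search": query, flag: value}
    return f"{base}/w/index.php?{urlencode(params)}"


def dump_api_url(base, action):
    """Build an api.php URL for a settings or mappings dump.

    The action name is supplied by the caller; this sketch does not
    verify that the wiki actually exposes it.
    """
    return f"{base}/w/api.php?{urlencode({'action': action, 'format': 'json'})}"
```

For example, `search_debug_url("https://example.org", "foo bar", "cirrusDumpQuery")` gives a link a userscript could open in a new tab, and `dump_api_url("https://example.org", "cirrus-settings-dump")` would do the same for a settings dump, assuming that action name is right.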

============

Docker setup to initialize elasticsearch, import the latest cirrus dump, and attach a kibana instance for a UI. Probably with a modified mapping more amenable to kibana inspection.

============

Some script to manage elasticsearch shard allocation manually via the API? Pointless, but perhaps fun.
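
A minimal sketch of the building block such a script would need, assuming the cluster reroute endpoint (`POST /_cluster/reroute`) with a `move` command is the API in question. The function name, index name and node names are made up for illustration, and `dry_run` is on by default so nothing actually moves.

```python
import json
from urllib.request import Request


def move_shard_request(es_url, index, shard, from_node, to_node, dry_run=True):
    """Prepare (but do not send) a reroute request that moves one shard."""
    body = {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }
    # dry_run asks elasticsearch to report the resulting state without applying it.
    url = f"{es_url}/_cluster/reroute" + ("?dry_run=true" if dry_run else "")
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Automatic rebalancing would presumably have to be turned off first, otherwise the cluster just moves things back.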

===========

Phabricator-formatted export for jupyter

- problem: images?

-- Seems we would need to upload them separately and then reference them in the final output

-- There is an API for this, but then we can't just emit something to paste into a field; the whole export needs to happen over the API.

- better, but worse: data URIs would be great. But I dunno if phab is built for megabyte-sized posts. They also don't support data URIs. Browsers also hate it when you copy/paste excessive amounts of data.
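
The text-only half of this is easy enough to sketch. The function below takes an already-parsed notebook (the JSON from an .ipynb file) and emits something pasteable. It assumes markdown cells can pass through mostly unchanged and that phab accepts a backtick fence with `lang=python`; both are assumptions to verify. Image outputs just become a placeholder line, which is exactly the unsolved part above. The function names are made up.

```python
FENCE = "`" * 3  # backtick fence; assumed to be accepted by phab as a code block


def _text(value):
    """Notebook JSON stores text as either one string or a list of lines."""
    return value if isinstance(value, str) else "".join(value)


def notebook_to_remarkup(nb):
    """Convert a parsed .ipynb dict into a single pasteable string."""
    parts = []
    for cell in nb.get("cells", []):
        source = _text(cell.get("source", ""))
        if cell.get("cell_type") == "markdown":
            # Assumption: markdown is close enough to what phab expects.
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}lang=python\n{source}\n{FENCE}")
            for out in cell.get("outputs", []):
                if out.get("output_type") == "stream":
                    text = _text(out.get("text", "")).rstrip("\n")
                    parts.append(f"{FENCE}\n{text}\n{FENCE}")
                elif "image/png" in out.get("data", {}):
                    # Images cannot be inlined; they would need the upload API.
                    parts.append("(image output omitted: needs a separate upload)")
    return "\n\n".join(parts)
```

Usage would be `notebook_to_remarkup(json.load(open(path)))`, with the result copied into the task description.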

==========

Custom implementation to find similar images in Commons:

- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=rep1&type=pdf

- http://www.deepideas.net/building-content-based-search-engine-quantifying-similarity/

- Convert image into a feature vector

- Use clustering to generate an image signature

- Find k-nearest-neighbors via Earth Mover's Distance (EMD); can use the pyemd library.

- It's not at all obvious how the signature + weights get plugged into pyemd

- EMD is expensive, no clue how this would scale to millions of images

- This would probably perform poorly; it's more interesting as a way to understand some of the history of similar image retrieval
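
One plausible way to plug signatures into pyemd, as a sketch rather than a tested recipe: `pyemd.emd` wants two histograms over the same bins plus a ground distance matrix. Two signatures can be forced into that shape by treating the union of their cluster centers as the bins and zero-padding each side. The signature layout below, a list of (center, weight) pairs, and the function name are my assumptions. The code only builds the inputs and needs nothing beyond the standard library.

```python
import math


def emd_inputs(sig_a, sig_b):
    """Turn two image signatures into histogram/distance-matrix form.

    Each signature is a list of (center, weight) pairs, where center is a
    tuple of floats (a cluster centroid in feature space).
    """
    centers = [c for c, _ in sig_a] + [c for c, _ in sig_b]
    total_a = sum(w for _, w in sig_a)
    total_b = sum(w for _, w in sig_b)
    # Normalize so both sides carry the same total mass, then zero-pad:
    # the bins of A hold no mass in B's histogram and vice versa.
    hist_a = [w / total_a for _, w in sig_a] + [0.0] * len(sig_b)
    hist_b = [0.0] * len(sig_a) + [w / total_b for _, w in sig_b]
    # Ground distance: Euclidean distance between every pair of centers.
    dist = [[math.dist(p, q) for q in centers] for p in centers]
    return hist_a, hist_b, dist
```

With numpy and pyemd installed, the three results would be converted to float64 arrays and passed as `emd(first_histogram, second_histogram, distance_matrix)`. The distance matrix grows with the square of the combined cluster count, and that is per image pair, so this does nothing for the scaling worry.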

=========

https://github.com/beniz/deepdetect.git ?

- Use pre-trained ML models to detect objects in images and then label those objects.

- Can compare the similarity of detected objects to find similar images. Can probably extend with color information

- Do we actually have a use case for images similar to other images? Perhaps on upload?

==========

Elasticsearch cluster balance simulator

- Simulate and evaluate how the cluster balancing performs under various simulated conditions

- No way this could be done in a weekend hackathon. It would probably be completely wrong as well, and simulate some idealized cluster that doesn't act like ours.
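
To make "idealized" concrete, here is about the smallest thing that could be called a balance simulator. It is a toy greedy rebalancer that only looks at shard counts per node. It is not how elasticsearch decides anything (no per-index weights, disk usage, or allocation rules), and every name in it is made up.

```python
def rebalance(nodes, threshold=1):
    """Greedily even out shard counts across nodes.

    nodes maps a node name to the list of shard ids it holds and is
    modified in place. Returns the moves as (shard, source, destination).
    """
    if threshold < 1:
        # With a threshold of 0 a single odd shard would bounce forever.
        raise ValueError("threshold must be at least 1")
    moves = []
    while True:
        src = max(nodes, key=lambda n: len(nodes[n]))
        dst = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[src]) - len(nodes[dst]) <= threshold:
            return moves
        shard = nodes[src].pop()
        nodes[dst].append(shard)
        moves.append((shard, src, dst))
```

Starting from one full node and two empty ones, e.g. `rebalance({"a": [0, 1, 2, 3, 4, 5], "b": [], "c": []})`, the number of moves returned is a crude cost measure. A real simulator would need to swap in the cluster's actual weight function to say anything useful.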

==========

Prototype Lire plugin for elasticsearch

- Lire = Lucene Image REtrieval

- I know nothing about it, other than it exists

- A plugin already exists that plugs it into solr, so how hard could it be?

- Maybe try it out standalone with some small test set to see what it does