On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst <g.sorst(a)findologic.com> wrote:
Hi list,
as part of a lecture on Information Retrieval I am giving we work a lot
with Simple Wikipedia articles. It's a great data set because it's
comprehensive and not domain specific so when building search on top of it
humans can easily judge result quality, and it's still small enough to be
handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The
idea is to look at result clicks from an internal search engine,
feed them into a Machine Learning model and adjust the ranking accordingly so
that the top-clicked results actually rank best. We will be using Solr LTR for
this purpose.
I forgot to mention, Solr is great but you might want to consider
Elasticsearch as well. We release weekly dumps of the production search
indices in Elasticsearch bulk input format (one JSON document per line) at
https://dumps.wikimedia.org/other/cirrussearch/ and have co-developed an
LTR plugin for Elasticsearch that is fairly similar to the Solr one at
http://elasticsearch-learning-to-rank.readthedocs.io/en/latest/. Your task
might be easier since this is already put together, but if you are familiar
with Solr it probably wouldn't be too hard to convert the Elasticsearch
format into Solr batch format.
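
As a rough illustration of that conversion, here is a minimal Python sketch.
It assumes the usual bulk layout, where an action line such as
{"index": {"_id": "..."}} is followed by a line holding the document itself,
and it assumes the documents carry "title" and "text" fields. The function
name bulk_to_solr_docs and the choice of fields are hypothetical, so check
them against an actual dump before relying on this.

```python
import json
from typing import Dict, Iterable, Iterator, Sequence


def bulk_to_solr_docs(
    lines: Iterable[str],
    fields: Sequence[str] = ("title", "text"),  # assumed field names
) -> Iterator[Dict[str, object]]:
    """Turn bulk-format lines into flat dicts suitable for a Solr JSON update.

    Assumes lines alternate between an "index" action line and the
    document source line that belongs to it.
    """
    pending_id = None  # id taken from the most recent action line
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        obj = json.loads(raw)
        if pending_id is None:
            # Action line: remember the document id for the next line.
            action = obj.get("index")
            if action is None or "_id" not in action:
                raise ValueError("expected an 'index' action line with an _id")
            pending_id = str(action["_id"])
        else:
            # Source line: keep only the requested fields that are present.
            doc: Dict[str, object] = {"id": pending_id}
            for name in fields:
                if name in obj:
                    doc[name] = obj[name]
            pending_id = None
            yield doc


if __name__ == "__main__":
    sample = [
        '{"index": {"_type": "page", "_id": "42"}}',
        '{"title": "Salzburg", "text": "A city in Austria.", "namespace": 0}',
    ]
    # A JSON array of documents is one of the formats Solr accepts for updates.
    print(json.dumps(list(bulk_to_solr_docs(sample)), indent=2))
```

The resulting list can be serialized as a JSON array and sent to a Solr
update handler; the matching Solr schema is left out here.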
I would love to base this on Simple Wikipedia data since it would fit well
into the rest of the lecture. Unfortunately, I could not find that data.
The closest I came is
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
but this neither covers Simple Wikipedia nor specifies internal search
queries.
Did I miss something? Is this data available somewhere? Can I produce it
myself from raw data? Ideally I would need (query-document) pairs with the
number of occurrences.
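
To make the desired shape concrete, here is a small Python sketch of how raw
click events could be aggregated into (query, document, count) rows. The
input format (an iterable of (query, clicked document) tuples), the function
name count_click_pairs and the lowercasing of queries are assumptions made
for illustration only; no such log is implied to exist.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_click_pairs(
    clicks: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, int]]:
    """Aggregate (query, document) click events into occurrence counts.

    Queries are trimmed and lowercased so that trivial variants are
    counted together; document titles are left unchanged.
    """
    counts = Counter((query.strip().lower(), doc) for query, doc in clicks)
    rows = [(query, doc, n) for (query, doc), n in counts.items()]
    # Most frequent pairs first; ties are broken alphabetically.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


if __name__ == "__main__":
    events = [
        ("Salzburg", "Salzburg"),
        ("salzburg ", "Salzburg"),
        ("mozart", "Wolfgang Amadeus Mozart"),
    ]
    for query, doc, n in count_click_pairs(events):
        print(f"{query}\t{doc}\t{n}")
```

Rows of this kind could then be turned into relevance judgments for LTR
training, for example by treating higher counts as stronger relevance.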
Thank you!
Georg
--
*Georg M. Sorst I CTO*
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics