On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst g.sorst@findologic.com wrote:
Hi list,
As part of a lecture I am giving on Information Retrieval, we work a lot with Simple Wikipedia articles. It's a great data set because it's comprehensive and not domain-specific, so humans can easily judge result quality when building search on top of it, and it's still small enough to be handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a machine learning model, and adjust the ranking accordingly so that the most-clicked results actually rank best. We will be using Solr LTR for this purpose.
This is actually how our production search ranking is built for around the top 20 sites by search volume that we host. Simple Wikipedia isn't one of the sites we currently use machine-learned ranking for, though. Because of that, we do have the data you need, but the problem is that the actual search queries are considered PII (Personally Identifiable Information) and not something I can release publicly. It may be possible to release aggregated data sets that don't include the actual search terms, but at that point I don't think the data would be useful to you anymore.
I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find that data. The closest I came is https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this neither covers Simple Wikipedia nor specifies internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query-document) pairs with the number of occurrences.
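For illustration, the (query, document) pairs with occurrence counts described above could be produced from a raw click log roughly as follows. This is only a minimal sketch: the event format (one (query, clicked title) tuple per click), the lowercasing of queries, the function name, and the sample rows are all assumptions made up for the example, not the actual Wikimedia log schema or real data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate_clicks(events: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Count how often each (query, document) pair was clicked.

    `events` is a hypothetical click log: one (query, clicked_title) tuple per click.
    """
    # Normalize queries (assumption: case and surrounding whitespace are not significant).
    counts = Counter((query.strip().lower(), doc) for query, doc in events)
    # Sort by descending count, then by query and document for a stable order.
    return sorted(
        ((q, d, n) for (q, d), n in counts.items()),
        key=lambda row: (-row[2], row[0], row[1]),
    )


# Made-up sample events, for illustration only.
sample_events = [
    ("solar system", "Solar System"),
    ("Solar System", "Solar System"),
    ("solar system", "Sun"),
    ("mozart", "Wolfgang Amadeus Mozart"),
]

pairs = aggregate_clicks(sample_events)
for query, doc, n in pairs:
    print(f"{query}\t{doc}\t{n}")
```

Rows in this shape (query, document, count) are one plausible starting point for deriving relevance labels for a learning-to-rank model, although how counts map to labels is a separate design decision.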
Thank you!
Georg
--
Georg M. Sorst | CTO
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics