Erik, is there any documentation or further reading available on the machine ranking used for Wikipedia? This sounds very interesting!
And can you elaborate on how the aggregated search queries are PII?
Georg

Georg Sorst <g.sorst@findologic.com> wrote on Mon, 5 Mar 2018 at 20:31:

Hi all,

sorry for this messy post - I forgot to subscribe to the list so I can't directly reply to your responses.

Nuria:
> Datasets do not include simple wiki, they are calculated for a few wikis
> some of which are not very large so you might be able to use them.

Is the raw data available? Can I compute the clickstream myself?

Erik:
> This is actually how our production search ranking is built for around the
> top 20 sites by search volume that we host. Simple wikipedia isn't one of
> those we currently use machine ranking for though.

Awesome! Is there more info available somewhere? Algorithms used etc., maybe even source code?

> Because of that we do have the data you need, but the problem will be that the actual search
> queries are considered PII (Personally Identifiable Information) and not
> something I can release publicly. It may be possible to release aggregated
> data sets that don't include the actual search terms, but at that point I
> don't think the data will be useful to you anymore.

I think I'm fine with query-document pairs. Isn't that sufficiently aggregated to not be considered PII?
Thank you!
Georg

Georg Sorst <g.sorst@findologic.com> wrote on Wed, 28 Feb 2018 at 12:17:

Hi list,

as part of a lecture on Information Retrieval I am giving, we work a lot with Simple Wikipedia articles. It's a great data set because it's comprehensive and not domain-specific, so when building search on top of it humans can easily judge result quality, and it's still small enough to be handled by a regular computer.

This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a Machine Learning model, and adjust search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.

I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find that data. The closest I came is https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this neither covers Simple Wikipedia nor specifies internal search queries.

Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query-document) pairs with the number of occurrences.

Thank you!
Georg

Georg M. Sorst I CTO
Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: g.sorst@findologic.com
www.findologic.com
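For illustration, the (query-document) pairs with occurrence counts asked for in the original message could be derived from a raw click log roughly as follows. This is only a minimal sketch: the input format (a list of query / clicked-page-title tuples), the function name `count_query_document_pairs`, and the sample rows are all hypothetical and do not reflect any actual Wikimedia data set or schema.

```python
from collections import Counter


def count_query_document_pairs(click_events):
    """Aggregate raw (query, clicked_document) events into pair counts."""
    counts = Counter()
    for query, document in click_events:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same query are counted together.
        normalized_query = " ".join(query.lower().split())
        counts[(normalized_query, document)] += 1
    return counts


# Hypothetical sample events, invented purely for illustration.
sample_events = [
    ("solar system", "Solar System"),
    ("Solar  System", "Solar System"),
    ("solar system", "Sun"),
    ("moon", "Moon"),
]

pair_counts = count_query_document_pairs(sample_events)

# Print one tab-separated line per pair, most frequent first.
for (query, document), occurrences in pair_counts.most_common():
    print(f"{query}\t{document}\t{occurrences}")
```

Counts of this shape are what a click-based relevance signal for learning to rank would typically start from; whether pairs aggregated this way still count as PII is the open question raised in the thread.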
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics