Hi list,
As part of a lecture on Information Retrieval that I am giving, we work a lot with Simple Wikipedia articles. It's a great data set because it's comprehensive and not domain-specific, so humans can easily judge result quality when building search on top of it, and it's still small enough to be handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a Machine Learning model, and adjust the ranking accordingly so that the most-clicked results actually rank best. We will be using Solr's Learning to Rank (LTR) module for this purpose.
I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find such data. The closest I found is
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this neither covers Simple Wikipedia nor includes internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query, document) pairs, each with its number of occurrences, i.e. how often a given document was clicked for a given query.
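To make the target format concrete, here is a minimal sketch of the aggregation I have in mind. It assumes a hypothetical raw log of (search query, clicked article title) events; the sample rows and the names `raw_clicks` and `count_pairs` are made up for illustration and do not correspond to any published data set.

```python
from collections import Counter

# Hypothetical raw click events: (search query, clicked article title).
# These rows are invented for illustration only.
raw_clicks = [
    ("solar system", "Solar System"),
    ("Solar System ", "Solar System"),
    ("solar system", "Sun"),
    ("planet", "Planet"),
]


def count_pairs(events):
    """Count occurrences of each (query, document) pair.

    Queries are lower-cased and stripped of surrounding whitespace so that
    trivially different spellings of the same query are counted together.
    """
    return Counter((query.strip().lower(), doc) for query, doc in events)


pair_counts = count_pairs(raw_clicks)

# Emit one tab-separated line per pair: query, document, count.
for (query, doc), n in pair_counts.most_common():
    print(f"{query}\t{doc}\t{n}")
```

Counts in this shape could then serve as a relevance signal when building training data for a ranking model, although how best to turn clicks into labels is a separate question.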
Thank you!
Georg