Short answer: no. This data is not publicly available in a form that would
let you compute the dataset yourself, as it is private data.
Thanks,
Nuria
On Mon, Mar 5, 2018 at 11:31 AM, Georg Sorst <g.sorst(a)findologic.com> wrote:
Hi all,
sorry for this messy post - I forgot to subscribe to the list so I can't
directly reply to your responses.
Nuria:
Datasets do not include Simple Wikipedia; they are calculated for a few
wikis, some of which are not very large, so you might be able to use them.
Is the raw data available? Can I compute the clickstream myself?
Erik:
This is actually how our production search ranking is built for around the
top 20 sites by search volume that we host. Simple Wikipedia isn't one of
those we currently use machine ranking for though.
Awesome! Is there more information available somewhere, such as the
algorithms used or maybe even source code?
Because of that we do have the data you need, but the problem will be
that the actual search
queries are considered PII (Personally Identifiable Information) and not
something I can release publicly. It may be possible to release aggregated
data sets that don't include the actual search terms, but at that point I
don't think the data will be useful to you anymore.
I think I'm fine with query-document pairs. Isn't that sufficiently
aggregated to not be considered PII?
Thank you!
Georg
On Wed, Feb 28, 2018 at 12:17, Georg Sorst <g.sorst(a)findologic.com> wrote:
Hi list,
as part of a lecture on Information Retrieval I am giving we work a lot
with Simple Wikipedia articles. It's a great data set because it's
comprehensive and not domain specific so when building search on top of it
humans can easily judge result quality, and it's still small enough to be
handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The
idea is to look at result clicks from an internal search engine, feed
them into a Machine Learning model and adjust the ranking accordingly so
that the top-clicked results actually rank best. We will be using Solr LTR
for this purpose.
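To make the idea concrete, here is a minimal sketch of one way click counts could be turned into graded relevance labels, the kind of target a learning-to-rank model is usually trained on. Everything in it is an assumption for illustration: the function name `clicks_to_labels`, the 0 to 4 grade scale, and the sample counts are hypothetical and do not come from any real data set. It also ignores position bias (results shown higher get clicked more regardless of quality), which a real setup would need to address.

```python
from collections import defaultdict
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (query, document title)


def clicks_to_labels(pair_counts: Dict[Pair, int], max_grade: int = 4) -> Dict[Pair, int]:
    """Map click counts to integer grades, relative to each query's top result."""
    # Find the click count of the most-clicked document for every query.
    top_per_query: Dict[str, int] = defaultdict(int)
    for (query, _doc), clicks in pair_counts.items():
        top_per_query[query] = max(top_per_query[query], clicks)

    labels: Dict[Pair, int] = {}
    for (query, doc), clicks in pair_counts.items():
        top = top_per_query[query]
        # Scale clicks against the query's top result and floor to a grade;
        # a query whose documents all have zero clicks gets grade 0.
        labels[(query, doc)] = (clicks * max_grade) // top if top else 0
    return labels


# Hypothetical aggregated click counts, made up for this example.
sample_counts = {
    ("solar system", "Solar System"): 40,
    ("solar system", "Sun"): 10,
    ("solar system", "Moon"): 1,
    ("python", "Python (programming language)"): 7,
}

labels = clicks_to_labels(sample_counts)
for (query, doc), grade in labels.items():
    print(f"{query}\t{doc}\t{grade}")
```

The labels would then be combined with per-document ranking features to train the model; scaling per query keeps popular queries from dominating the grade scale.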
I would love to base this on Simple Wikipedia data since it would fit
well into the rest of the lecture. Unfortunately, I could not find that
data. The closest I came is
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this
neither covers Simple Wikipedia nor specifies internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it
myself from raw data? Ideally I would need (query-document) pairs with the
number of occurrences.
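As an illustration of the format asked for above, the following sketch aggregates a raw click log into (query, document) pairs with occurrence counts. The log layout, the function name `count_query_document_pairs`, and the sample events are assumptions made up for this example; no such raw log is publicly available.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_query_document_pairs(click_log: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each (query, clicked document) pair occurs."""
    counts: Counter = Counter()
    for query, document in click_log:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same query are counted together.
        normalized = " ".join(query.lower().split())
        counts[(normalized, document)] += 1
    return counts


# Hypothetical click events: (search query, title of the clicked article).
sample_log = [
    ("solar system", "Solar System"),
    ("Solar  System", "Solar System"),
    ("solar system", "Sun"),
    ("python", "Python (programming language)"),
]

pair_counts = count_query_document_pairs(sample_log)
for (query, document), occurrences in pair_counts.most_common():
    print(f"{query}\t{document}\t{occurrences}")
```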
Thank you!
Georg
--
Georg M. Sorst | CTO
Jakob-Haringer-Str. 5a | 5020 Salzburg | T.: +43 662 456708
E.: g.sorst(a)findologic.com
www.findologic.com
Follow us on: XING <https://www.xing.com/profile/Georg_Sorst>, Facebook
<http://www.facebook.com/Findologic/>, Twitter
<https://twitter.com/findologic>
See us at Internet World on March 6 and 7, 2018, in Hall A6, Booth E130, in
Munich! Book an appointment here:
<beratung(a)findologic.com?subject=Internet%20World%20M%C3%BCnchen>
See us at SHOPTALK from March 18 to 21 in Las Vegas! Book an appointment
here: <beratung(a)findologic.com?subject=SHOPTALK%20Las%20Vegas>
See us at SOM on April 18 and 19, 2018, in Hall 7, Booth G.17, in Zurich!
Book an appointment here:
<beratung(a)findologic.com?subject=SOM%20Z%C3%BCrich>
Visit our homepage here: <http://www.findologic.com>
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics