Re: [Analytics] Wikipedia internal search clickstream

14 Mar 2018

Sorry for the delayed response; I've been out for the last week. Responses
inline.

On Mon, Mar 12, 2018 at 1:27 AM, Georg Sorst <g.sorst(a)findologic.com> wrote:

...
  Erik,

 is there some documentation / further reading available on the machine
 ranking used for Wikipedia? This sounds very interesting!

 The code for managing all the data and training models is in
https://github.com/wikimedia/search-mjolnir. This is a PySpark application
that starts with the logged click data and transforms it into trained
models. The models are currently trained using XGBoost, but we are
considering LightGBM as a replacement. Click data is collected
separately, with some processing of web request logs to match up search
requests with their clicks.

And can you elaborate on how the aggregated search queries are PII?
...

 The problem is that any aggregation of search queries that is meant to be
used to learn a ranking function needs to include the original query
string. That string is then not aggregated; it is passed straight through
from the user's keyboard to the output data. We unfortunately don't have the
kind of search volume, and don't keep records long enough (only 90 days),
to set an arbitrary minimum number of unique sessions issuing a query
and still have data that is representative of the whole. For example, on
English Wikipedia, which is by far the most popular, only 60% of search
sessions involve a query that was issued more than 10 times in the last 90
days. And 10 times is *way* too low for public release (I'm not sure where
a reasonable cutoff might be, but it's certainly not 10).
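
To make that trade-off concrete, here is a minimal Python sketch of how one
might measure the share of search sessions that survives a minimum-sessions
cutoff. The (session_id, query) input shape and the name `session_coverage`
are assumptions made for illustration; they are not the real log schema or
anything taken from our code.

```python
from collections import defaultdict


def session_coverage(events, min_sessions):
    """Fraction of sessions that issued at least one query which was seen
    in at least `min_sessions` distinct sessions.

    `events` is an iterable of (session_id, query) pairs. This shape is a
    hypothetical stand-in for real search logs.
    """
    sessions_by_query = defaultdict(set)
    queries_by_session = defaultdict(set)
    for session_id, query in events:
        sessions_by_query[query].add(session_id)
        queries_by_session[session_id].add(query)

    if not queries_by_session:
        return 0.0

    # Queries that are common enough to pass the cutoff.
    kept = {q for q, s in sessions_by_query.items() if len(s) >= min_sessions}

    # A session counts as covered if any of its queries passed the cutoff.
    covered = sum(1 for qs in queries_by_session.values() if qs & kept)
    return covered / len(queries_by_session)


# Made-up events, purely for illustration.
events = [
    ("s1", "cat"), ("s2", "cat"), ("s3", "cat"),
    ("s3", "rare bird species"), ("s4", "my home address"),
]
print(session_coverage(events, min_sessions=1))  # 1.0
print(session_coverage(events, min_sessions=3))  # 0.75
```

Raising `min_sessions` removes exactly the rare queries that are most likely
to identify someone, and the sessions that issued only such queries drop out
of the data with them.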

Thank you!

...
  Georg

 Georg Sorst <g.sorst(a)findologic.com> wrote on Mon, 5 March 2018 at
 20:31:

> Hi all,
>
> sorry for this messy post - I forgot to subscribe to the list so I can't
> directly reply to your responses.
>
> Nuria:
>
> > Datasets do not include simple wiki; they are calculated for a few
> > wikis, some of which are not very large, so you might be able to use
> > them.
>
> Is the raw data available? Can I compute the clickstream myself?
>
> Erik:
>
> > This is actually how our production search ranking is built for around
> > the top 20 sites by search volume that we host. Simple wikipedia isn't
> > one of those we currently use machine ranking for though.
>
> Awesome! Is there more info available somewhere? Algorithms used etc.
> maybe even source code?
>
We use a DBN (Chapelle, 2009) to transform click stream data into labeled
search result data, and then LambdaMART for the final ranking model. The
link to mjolnir, which does the training, is given above.
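
As a rough illustration of the click-to-label step, here is a sketch of the
*simplified* DBN variant, which has closed-form estimates: every result at
or above the last click counts as examined, attractiveness is clicks over
examinations, and satisfaction is last-clicks over clicks. This is not the
mjolnir implementation. The input shape, the function name, and the choice
to skip sessions without clicks are all assumptions made for the example.

```python
from collections import defaultdict


def simplified_dbn_relevance(sessions):
    """Estimate relevance per (query, doc) with the simplified DBN.

    `sessions` is an iterable of (query, results, clicked), where `results`
    is the ranked list of doc ids shown and `clicked` is the set of doc ids
    that were clicked. Hypothetical input shape.
    """
    examined = defaultdict(int)
    clicks = defaultdict(int)
    last_clicks = defaultdict(int)

    for query, results, clicked in sessions:
        clicked_positions = [i for i, d in enumerate(results) if d in clicked]
        if not clicked_positions:
            continue  # simplifying choice: ignore sessions with no clicks
        last = clicked_positions[-1]
        # Everything at or above the last click is treated as examined.
        for i, doc in enumerate(results[: last + 1]):
            key = (query, doc)
            examined[key] += 1
            if doc in clicked:
                clicks[key] += 1
                if i == last:
                    last_clicks[key] += 1

    relevance = {}
    for key, n_examined in examined.items():
        attractiveness = clicks[key] / n_examined
        satisfaction = last_clicks[key] / clicks[key] if clicks[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


# Made-up sessions for one query.
sessions = [
    ("cat", ["A", "B", "C"], {"A"}),
    ("cat", ["A", "B", "C"], {"B"}),
    ("cat", ["A", "B", "C"], {"A", "B"}),
]
for key, score in sorted(simplified_dbn_relevance(sessions).items()):
    print(key, round(score, 3))
```

The resulting scores can then be bucketed into graded labels and handed to
a LambdaMART trainer. Note that every label is still keyed by the original
query string.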

...
> > Because of that we do have the data you need, but the problem will be
> > that the actual search queries are considered PII (Personally
> > Identifiable Information) and not something I can release publicly. It
> > may be possible to release aggregated data sets that don't include the
> > actual search terms, but at that point I don't think the data will be
> > useful to you anymore.
>
> I think I'm fine with query-document pairs. Isn't that sufficiently
> aggregated to not be considered PII?
>
As mentioned above, the query is the hard part. Query strings contain
arbitrary information, and if you want to build a ranking function you have
to have those original queries to do feature collection.
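
A toy Python sketch of why that is: even the simplest query-document
features are computed from the raw query text. The function name and the
two features below are hypothetical, chosen only for illustration; real
feature collection would normally run against the search index rather than
plain strings.

```python
def query_title_features(query, title):
    """Compute two toy features that both depend on the raw query string."""
    q_terms = query.lower().split()
    t_terms = set(title.lower().split())
    matched = sum(1 for term in q_terms if term in t_terms)
    return {
        "query_terms": len(q_terms),
        "title_match_fraction": matched / len(q_terms) if q_terms else 0.0,
    }


print(query_title_features("solar system planets", "Solar System"))
```

Replace the query with a hash or an opaque id and neither feature can be
computed any more.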

...
  Thank you!
  Georg

 Georg Sorst <g.sorst(a)findologic.com> wrote on Wed, 28 Feb 2018 at
 12:17:

  Hi list,

 as part of a lecture on Information Retrieval that I am giving, we work a
 lot with Simple Wikipedia articles. It's a great data set because it's
 comprehensive and not domain-specific, so when building search on top of it
 humans can easily judge result quality, and it's still small enough to be
 handled by a regular computer.

 This year I want to cover the topic of Machine Learning for search. The
 idea is to look at result clicks from an internal search engine, feed
 them into a Machine Learning model and adjust search accordingly so that
 the top-clicked results actually rank best. We will be using Solr LTR for
 this purpose.

 I would love to base this on Simple Wikipedia data since it would fit
 well into the rest of the lecture. Unfortunately, I could not find that
 data. The closest I came is
 https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this
 neither covers Simple Wikipedia nor specifies internal search queries.

 Did I miss something? Is this data available somewhere? Can I produce it
 myself from raw data? Ideally I would need (query-document) pairs with the
 number of occurrences.
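
 For illustration, the kind of data set asked for here could be built from
 a click log along the following lines. This is only a sketch: the
 (query, clicked document) log format, the whitespace and case
 normalization, and the function name are assumptions, not an existing
 data format.

```python
from collections import Counter


def count_query_doc_pairs(click_log):
    """Count clicks per (normalized query, document) pair.

    `click_log` is an iterable of (query, clicked_doc) tuples, which is a
    hypothetical format.
    """
    counts = Counter()
    for query, doc in click_log:
        # Lowercase and collapse whitespace so trivial variants are merged.
        normalized = " ".join(query.lower().split())
        counts[(normalized, doc)] += 1
    return counts


# Made-up click log.
click_log = [
    ("Solar  System", "Solar System"),
    ("solar system", "Solar System"),
    ("solar system", "Sun"),
]
for (query, doc), n in count_query_doc_pairs(click_log).most_common():
    print(query, doc, n)
```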

 Thank you!
 Georg
 --
 *Georg M. Sorst | CTO*
 FINDOLOGIC GmbH

 Jakob-Haringer-Str. 5a | 5020 Salzburg | T.: +43 662 456708
 E.: g.sorst(a)findologic.com
 www.findologic.com | Follow us on: XING
 <https://www.xing.com/profile/Georg_Sorst>, Facebook
 <http://www.facebook.com/Findologic/>, Twitter
 <https://twitter.com/findologic>


 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics
