Hi list,
As part of a lecture on Information Retrieval that I am giving, we work a lot with Simple Wikipedia articles. It's a great data set: it is comprehensive and not domain-specific, so humans can easily judge result quality when building search on top of it, and it is still small enough to be handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a Machine Learning model, and adjust the search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.
I would love to base this on Simple Wikipedia data since it would fit well into the rest of the lecture. Unfortunately, I could not find that data. The closest I came is https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this neither covers Simple Wikipedia nor specifies internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it myself from raw data? Ideally I would need (query-document) pairs with the number of occurrences.
Thank you! Georg
Did I miss something? Is this data available somewhere?
You can find more information about the clickstream datasets here: https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
The datasets do not include Simple Wikipedia. They are calculated for a few wikis, some of which are not very large, so you might be able to use them.
On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst g.sorst@findologic.com wrote:
Thank you! Georg
--
Georg M. Sorst | CTO, FINDOLOGIC
Jakob-Haringer-Str. 5a | 5020 Salzburg | T.: +43 662 456708 | E.: g.sorst@findologic.com | www.findologic.com
Follow us on: XING https://www.xing.com/profile/Georg_Sorst | facebook http://www.facebook.com/Findologic/ | Twitter https://twitter.com/findologic
See you at Internet World on 06.03. & 07.03.2018 in Hall A6, Stand E130 in Munich, at SHOPTALK from 18 to 21 March in Las Vegas, and at SOM on 18.04. & 19.04.2018 in Hall 7, Stand G.17 in Zurich! To arrange a meeting, write to beratung@findologic.com. Our homepage: http://www.findologic.com
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst g.sorst@findologic.com wrote:
This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a Machine Learning model, and adjust the search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.
This is actually how our production search ranking is built for around the top 20 sites by search volume that we host. Simple Wikipedia isn't one of the sites we currently use machine ranking for, though. Because of that we do have the data you need, but the problem is that the actual search queries are considered PII (Personally Identifiable Information) and are not something I can release publicly. It may be possible to release aggregated data sets that don't include the actual search terms, but at that point I don't think the data would be useful to you anymore.
On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst g.sorst@findologic.com wrote:
This year I want to cover the topic of Machine Learning for search. The idea is to look at result clicks from an internal search engine, feed them into a Machine Learning model, and adjust the search accordingly so that the top-clicked results actually rank best. We will be using Solr LTR for this purpose.
I forgot to mention: Solr is great, but you might want to consider Elasticsearch as well. We release weekly dumps of the production search indices in Elasticsearch bulk input format (one JSON document per line) at https://dumps.wikimedia.org/other/cirrussearch/ and have co-developed an LTR plugin for Elasticsearch that is fairly similar to the Solr one; see http://elasticsearch-learning-to-rank.readthedocs.io/en/latest/. Your task might be easier since this is already put together, but if you are familiar with Solr it probably wouldn't be too hard to convert the Elasticsearch format into Solr batch format.
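To give an idea of what such a conversion could look like, here is a minimal Python sketch. It assumes the usual Elasticsearch bulk layout, in which an action line like {"index": {"_id": ...}} is followed by a line holding the document itself. The field names `title` and `text`, the target field `id`, and the helper name `bulk_to_solr_docs` are assumptions for illustration; check the actual dump and your Solr schema before relying on them.

```python
import json
from typing import Iterable, Iterator


def bulk_to_solr_docs(lines: Iterable[str]) -> Iterator[dict]:
    """Pair each bulk action line with the document line that follows it."""
    action = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        obj = json.loads(raw)
        if action is None:
            # First line of a pair: the bulk action, e.g. {"index": {"_id": "42"}}.
            action = obj
            continue
        # Second line of a pair: the document source.
        doc_id = action.get("index", {}).get("_id")
        action = None
        # Assumed field names; only a few fields are copied over.
        yield {"id": doc_id, "title": obj.get("title"), "text": obj.get("text")}


# Tiny hand-written sample in bulk layout (not taken from a real dump).
sample = [
    '{"index": {"_type": "page", "_id": "42"}}',
    '{"title": "Machine learning", "text": "Machine learning is ..."}',
]
solr_docs = list(bulk_to_solr_docs(sample))
print(json.dumps(solr_docs, indent=2))
```

The resulting list is a JSON array of documents, which is a shape Solr's JSON update handler accepts. The sketch only handles index actions, which is an assumption about what the dumps contain.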
Hi all,
sorry for this messy post - I forgot to subscribe to the list so I can't directly reply to your responses.
Nuria:
The datasets do not include Simple Wikipedia. They are calculated for a few wikis, some of which are not very large, so you might be able to use them.
Is the raw data available? Can I compute the clickstream myself?
Erik:
This is actually how our production search ranking is built for around the top 20 sites by search volume that we host. Simple Wikipedia isn't one of the sites we currently use machine ranking for, though.
Awesome! Is there more info available somewhere? Algorithms used etc. maybe even source code?
Because of that we do have the data you need, but the problem is that the actual search queries are considered PII (Personally Identifiable Information) and are not something I can release publicly. It may be possible to release aggregated data sets that don't include the actual search terms, but at that point I don't think the data would be useful to you anymore.
I think I'm fine with query-document pairs. Isn't that sufficiently aggregated to not be considered PII?
Thank you! Georg
Short answer: no. This data is not publicly available in a form that would let you compute the dataset yourself, as it is private data.
Thanks,
Nuria
On Mon, Mar 5, 2018 at 11:31 AM, Georg Sorst g.sorst@findologic.com wrote:
Is the raw data available? Can I compute the clickstream myself?
Erik,
is there some documentation / further reading available on the machine ranking used for Wikipedia? This sounds very interesting!
And can you elaborate on how the aggregated search queries are PII?
Thank you! Georg
Sorry for the delayed response, I've been out the last week. Responses inline.
On Mon, Mar 12, 2018 at 1:27 AM, Georg Sorst g.sorst@findologic.com wrote:
Erik,
is there some documentation / further reading available on the machine ranking used for Wikipedia? This sounds very interesting!
The code for managing all the data and training models is in https://github.com/wikimedia/search-mjolnir. This is a PySpark application that starts with the logged click data and transforms it into trained models. The models are currently trained using XGBoost, but we are considering LightGBM as a replacement. Collecting click data is done separately, with some processing of web request logs to match up search requests with their clicks.
And can you elaborate on how the aggregated search queries are PII?
The problem is that any aggregation of search queries that is meant to be used to learn a ranking function needs to include the original query string. That string is then not aggregated; it is passed straight through from the user's keyboard to the output data. Unfortunately, we don't have the kind of search volume, and don't keep records long enough (only 90 days), to impose a minimum number of unique sessions issuing a query and still have data that is representative of the whole. For example, on English Wikipedia, which is by far the most popular, only 60% of search sessions involve a query that was issued more than 10 times in the last 90 days. And 10 times is *way* too low for public release (I'm not sure where a reasonable cutoff might be, but it's certainly not 10).
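To make the coverage argument concrete, here is a small Python sketch of the kind of calculation behind a figure like the 60% above. It is an illustration only: the log layout of (session id, query) tuples, the function name `session_coverage`, and counting distinct sessions per query are all assumptions, not a description of how the production numbers were computed.

```python
from collections import defaultdict


def session_coverage(log, min_sessions):
    """Share of sessions that issued at least one query seen in more than
    `min_sessions` distinct sessions."""
    sessions_per_query = defaultdict(set)
    for session_id, query in log:
        sessions_per_query[query].add(session_id)

    # Queries frequent enough to pass the (hypothetical) release threshold.
    frequent = {q for q, s in sessions_per_query.items() if len(s) > min_sessions}

    all_sessions = {session_id for session_id, _ in log}
    covered = {session_id for session_id, query in log if query in frequent}
    return len(covered) / len(all_sessions)


# Made-up log: one popular query and two queries that occur only once.
log = [
    ("s1", "paris"),
    ("s2", "paris"),
    ("s3", "paris"),
    ("s3", "my rare query"),
    ("s4", "another rare one"),
]
coverage = session_coverage(log, min_sessions=2)
print(coverage)
```

Raising `min_sessions` shrinks the set of frequent queries, so the share of sessions that survive the filter drops, which is the trade-off described above.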
Awesome! Is there more info available somewhere? Algorithms used etc. maybe even source code?
We use a DBN (Dynamic Bayesian Network) click model (Chapelle and Zhang, 2009) to transform clickstream data into labeled search result data, and then LambdaMART for the final ranking model. The link to mjolnir, which does the training, is above.
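As a rough illustration of how a click model turns clicks into labels, here is a Python sketch of the simplified DBN variant, in which the user is assumed to examine every result down to the last click. It is not mjolnir's implementation. The session layout, the function name `sdbn_relevance`, the decision to skip sessions without clicks, and the absence of any smoothing are all assumptions made to keep the example short.

```python
from collections import defaultdict


def sdbn_relevance(sessions):
    """sessions: iterable of (query, ranked doc ids, set of clicked doc ids).

    Returns {(query, doc): attractiveness * satisfaction}.
    """
    shown = defaultdict(int)      # times the doc was examined
    clicked = defaultdict(int)    # times the doc was clicked
    satisfied = defaultdict(int)  # times the doc was the last click
    for query, ranking, clicks in sessions:
        click_positions = [i for i, doc in enumerate(ranking) if doc in clicks]
        if not click_positions:
            continue  # assumption: sessions without clicks are ignored
        last = click_positions[-1]
        # Everything up to and including the last click counts as examined.
        for i, doc in enumerate(ranking[: last + 1]):
            key = (query, doc)
            shown[key] += 1
            if doc in clicks:
                clicked[key] += 1
                if i == last:
                    satisfied[key] += 1

    relevance = {}
    for key, n_shown in shown.items():
        n_clicked = clicked[key]
        if n_clicked == 0:
            relevance[key] = 0.0
            continue
        attractiveness = n_clicked / n_shown
        satisfaction = satisfied[key] / n_clicked
        relevance[key] = attractiveness * satisfaction
    return relevance


# Made-up sessions for one query with three results A, B, C.
sessions = [
    ("ml", ["A", "B", "C"], {"A"}),
    ("ml", ["A", "B", "C"], {"A", "B"}),
    ("ml", ["A", "B", "C"], {"B"}),
]
relevance = sdbn_relevance(sessions)
print(relevance)
```

The scores would still need to be mapped to graded labels before training a ranker. Note that C never appears in the output because it was never at or above a last click, which shows how sparse clicks leave many documents unlabeled.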
I think I'm fine with query-document pairs. Isn't that sufficiently aggregated to not be considered PII?
As mentioned above, the query is the hard part. Query strings contain arbitrary information and if you want to build a ranking function you have to have those original queries to do feature collection.
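A small Python sketch may help show why the raw query text is needed: ranking features are computed from the query string together with the document. The two features, the function names `features` and `to_training_line`, and the example label are hypothetical. The output line follows the "label qid:N index:value" text layout commonly used for learning-to-rank training data.

```python
def features(query, title):
    """Two toy features computed from the query text and a document title."""
    q_terms = query.lower().split()
    t_terms = set(title.lower().split())
    matched = sum(1 for term in q_terms if term in t_terms)
    return [
        matched / len(q_terms) if q_terms else 0.0,      # share of query terms in the title
        1.0 if query.lower() == title.lower() else 0.0,  # exact title match
    ]


def to_training_line(label, qid, feats):
    """Format one (query, document) pair as 'label qid:N 1:v1 2:v2 ...'."""
    cols = " ".join(f"{i}:{v:g}" for i, v in enumerate(feats, start=1))
    return f"{label} qid:{qid} {cols}"


# The label 3 is a made-up relevance grade for this pair.
line = to_training_line(3, 1, features("machine learning", "Machine learning"))
print(line)
```

Without the query string, neither feature can be computed, so a data set that only identifies queries by an opaque id would not be enough for feature collection.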
Hi Erik,
Erik Bernhardson ebernhardson@wikimedia.org wrote on Wed, 14 Mar 2018 at 15:34:
The code for managing all the data and training models is in https://github.com/wikimedia/search-mjolnir. This is a PySpark application that starts with the logged click data and transforms it into trained models. The models are currently trained using XGBoost, but we are considering LightGBM as a replacement. Collecting click data is done separately, with some processing of web request logs to match up search requests with their clicks.
Great stuff, thank you!
And can you elaborate on how the aggregated search queries are PII?
The problem is that any aggregation of search queries that wants to be
used to learn a ranking function needs to be provided the original query string. That string is then not aggregated, it is passed straight through from the users keyboard to the output data. We unfortunately don't have the kind of search volume, and don't keep long enough records (only 90 days) , to place arbitrary limits for minimum unique sessions issuing a query, and still have data that is representative of the whole. For example on english wikipedia, which is by far the most popular, only 60% of search sessions involve a query that was issued more than 10 times in the last 90 days. And 10 times is *way* too low for public release (I'm not sure where a reasonable cutoff might be, but its certainly not 10).
Thank you!
Georg
Georg Sorst g.sorst@findologic.com wrote on Mon, 5 Mar 2018 at 20:31:
Hi all,
sorry for this messy post - I forgot to subscribe to the list so I can't directly reply to your responses.
Nuria:
The datasets do not include Simple Wikipedia. They are calculated for a few wikis, some of which are not very large, so you might be able to use them.
Is the raw data available? Can I compute the clickstream myself?
Erik:
This is actually how our production search ranking is built for around the top 20 sites by search volume that we host. Simple Wikipedia isn't one of those we currently use machine ranking for, though.
Awesome! Is there more info available somewhere? Algorithms used etc. maybe even source code?
We use a DBN (Chapelle, 2009) to transform click stream data into labeled search result data, and then LambdaMART for the final ranking model. The link to mjolnir, which does the training, is above.
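As a rough illustration of how clicks can be turned into relevance labels, the sketch below implements the simplified DBN variant described by Chapelle and Zhang, in which the user is assumed to examine every result down to the last click. This is not mjolnir's code: the session format, the function name, the lack of smoothing priors, and the choice to skip sessions without clicks are assumptions made to keep the example short.

```python
from collections import defaultdict


def simplified_dbn(sessions):
    """Estimate relevance per (query, doc) from click sessions.

    sessions: iterable of (query, results, clicked), where results is the
    list of doc ids in the order presented and clicked is a set of doc ids.
    Returns {(query, doc): attractiveness * satisfaction}.
    """
    shown = defaultdict(int)        # examined impressions (at or above last click)
    clicks = defaultdict(int)       # clicks on the document
    last_clicks = defaultdict(int)  # times the document was the last click

    for query, results, clicked in sessions:
        positions = [i for i, doc in enumerate(results) if doc in clicked]
        if not positions:
            continue  # assumption: sessions without clicks carry no signal here
        last = positions[-1]
        for i, doc in enumerate(results[: last + 1]):
            key = (query, doc)
            shown[key] += 1
            if doc in clicked:
                clicks[key] += 1
            if i == last:
                last_clicks[key] += 1

    relevance = {}
    for key, impressions in shown.items():
        attractiveness = clicks[key] / impressions
        satisfaction = last_clicks[key] / clicks[key] if clicks[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


# Hypothetical sessions for one query with a fixed result order.
sessions = [
    ("ml", ["A", "B", "C"], {"B"}),
    ("ml", ["A", "B", "C"], {"A", "B"}),
    ("ml", ["A", "B", "C"], {"A"}),
]
for key, score in sorted(simplified_dbn(sessions).items()):
    print(key, round(score, 3))
```

The resulting scores could then be bucketed into graded labels and fed to a LambdaMART trainer. Note that the input is whole sessions with result order, not aggregated click counts.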
Because of that we do have the data you need, but the problem will be that the actual search queries are considered PII (Personally Identifiable Information) and not something I can release publicly. It may be possible to release aggregated data sets that don't include the actual search terms, but at that point I don't think the data will be useful to you anymore.
I think I'm fine with query-document pairs. Isn't that sufficiently aggregated to not be considered PII?
As mentioned above, the query is the hard part. Query strings contain arbitrary information, and if you want to build a ranking function you have to have those original queries to do feature collection.
Just for my understanding (not a Machine Learning expert yet :) ): I would need (query -> document) pairs such as ("machine learning" -> https://en.wikipedia.org/wiki/Machine_learning) and how often each of these pairs has occurred, right? Even if this pair has only occurred once, how is this PII? Or do I need more than just (query -> document)?
Thank you so much, this is all very enlightening! Georg
Thank you!
Georg
On Mar 15, 2018 6:41 AM, "Georg Sorst" g.sorst@findologic.com wrote:
Just for my understanding (not a Machine Learning expert yet :) ): I would need (query -> document) pairs such as ("machine learning" -> https://en.wikipedia.org/wiki/Machine_learning) and how often each of these pairs has occurred, right? Even if this pair has only occurred once, how is this PII? Or do I need more than just (query -> document)?
You only need the (query, document) pairs; the PII is the query string itself. To start at the beginning of the pipeline with clicks, you don't want click counts but search sessions. Each search session here is a query, a list of matched documents in the order presented, and timestamped clicks on those results. That data can be aggregated to click counts if you want, but I think that would lose very important position bias information from the sessions.
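A minimal sketch of the session record described above, and of what is lost when it is collapsed into click counts. The class name, field names, and sample data are hypothetical and chosen only for illustration; they are not the schema used in production.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    query: str
    results: list                               # doc titles in the order presented
    clicks: list = field(default_factory=list)  # (unix_timestamp, doc) tuples


def click_counts(sessions):
    """Collapse sessions into (query, doc) -> number of clicks."""
    counts = Counter()
    for session in sessions:
        for _, doc in session.clicks:
            counts[(session.query, doc)] += 1
    return counts


def click_positions(sessions):
    """Keep the rank (1-based) at which each clicked doc was shown."""
    return [
        (session.query, doc, session.results.index(doc) + 1)
        for session in sessions
        for _, doc in session.clicks
    ]


# Same query and same clicked article, but shown at different ranks.
sessions = [
    SearchSession(
        "machine learning",
        ["Machine_learning", "Learning", "Machine"],
        [(1520000000, "Machine_learning")],
    ),
    SearchSession(
        "machine learning",
        ["Learning", "Machine", "Machine_learning"],
        [(1520000100, "Machine_learning")],
    ),
]

print(click_counts(sessions))
print(click_positions(sessions))
```

The aggregated counts treat both clicks as equal, while the per-session view shows that one click happened at rank 1 and the other at rank 3. A click on a result far down the list is usually a stronger signal than a click on the top result, which is the position bias information a click model needs.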
We consider all submitted search queries to potentially be PII. In the past, when we have released lists of search queries, we have had multiple people under NDA review them and remove PII, which typically filters out around 10% of the queries. This means any kind of phone number, serial number, or non-notable address. We also remove searches for any specific URL, and names of non-notable companies and non-notable people (those that don't have wiki articles and aren't mentioned prominently in any other article). Some time ago, before any current members of the search platform team were involved, there was a release of unfiltered search queries. It had to be taken down almost immediately after the community reported PII leakage with specific examples from the release.
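For illustration only, a crude automated pre-filter for two of the categories listed above (URLs and long digit sequences such as phone or serial numbers) could look like the sketch below. The patterns and the seven-digit threshold are assumptions of this example; names and addresses cannot be caught this way, so it would not replace the manual review described above.

```python
import re

# Assumed heuristics, not an actual Wikimedia filter.
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)
# Seven or more digits, optionally separated by spaces, dashes, dots or parentheses.
LONG_NUMBER_PATTERN = re.compile(r"(?:\d[\s\-().]*){7,}")


def looks_sensitive(query):
    """Flag queries containing a URL or a long digit sequence."""
    return bool(URL_PATTERN.search(query) or LONG_NUMBER_PATTERN.search(query))


queries = [
    "machine learning",
    "https://example.org/some/page",
    "call 555 0100 1234",
    "world war 2",
]
kept = [q for q in queries if not looks_sensitive(q)]
print(kept)
```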