CCing the WMF Search and Discovery mailing list (https://lists.wikimedia.org/mailman/listinfo/discovery)
On Wed, Aug 17, 2016 at 6:00 AM, Felix Engelmann fengelmann@uni-koblenz.de wrote:
Hi everybody,
I’m currently writing my bachelor thesis at the University of Koblenz, Germany. The goal is to improve Wikipedia search by exploiting the text structure of Wikipedia articles. To conduct unbiased user studies, I need real-world queries so I can compare the novel algorithms against the currently used ones. Are there any existing query logs that I can use for this purpose?
Thanks for your help!
Felix Engelmann

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi Felix,
There was recently a discussion about releasing raw queries, and the decision was made by WMF not to release raw queries for privacy reasons. Personally, I support that decision because the risks seem to far outweigh the benefits. The staff from Discovery may be able to provide you with more detail or alternatives, but I would say that the odds of WMF releasing raw data are low.
Sometimes WMF allows access to sensitive data if an NDA is signed. In this case, I feel that the risks are too high even for that to be allowed. That's a personal opinion only; the official answer will come from WMF.
Pine
On Aug 17, 2016 08:39, "Tilman Bayer" tbayer@wikimedia.org wrote:
--
Tilman Bayer
Senior Analyst, Wikimedia Foundation
IRC (Freenode): HaeB

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery
The best I could offer you, Felix, is a very small subset of queries that have been manually reviewed for release. These queries are in our result-grading platform, Discernatron. You will need to first log in at https://discernatron.wmflabs.org/ and then visit https://discernatron.wmflabs.org/scores/all?json=1. This will output a list of query results that have been graded; from that you can extract the individual queries that were used. You may also be able to use these scores for an nDCG calculation. Unfortunately, the list of graded queries is very small: there are 95 unique queries and 4219 scored result pages.
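For anyone unfamiliar with nDCG, here is a minimal sketch of the calculation you could run on graded scores like Discernatron's. The example grades and the 0-3 scale are hypothetical, not the actual Discernatron JSON format; you would first need to map its fields into per-query relevance lists yourself.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevant results ranked higher count more,
    # with a log2 discount by position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_scores, k=10):
    """nDCG@k for one query: actual DCG divided by the ideal (sorted) DCG."""
    actual = dcg(ranked_scores[:k])
    ideal = dcg(sorted(ranked_scores, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for one query, in the order a search
# algorithm ranked the results.
scores = [3, 2, 3, 0, 1]
print(ndcg(scores))
```

A perfect ranking scores exactly 1.0, so averaging nDCG over all 95 queries would let you compare a new ranking algorithm against the current one on the same graded set.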
On Wed, Aug 17, 2016 at 9:51 AM, Pine W wiki.pine@gmail.com wrote:
Hi!
We do have query logs, but they are not publicly accessible for privacy reasons. You may want to check this out though: https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
Unfortunately, WMF's policy on releasing search queries to the public is too strict. (Although there are privacy concerns, I'm sure anyone here could easily think of some simple whitelist rules. For more details, please refer to https://phabricator.wikimedia.org/T115085 or https://phabricator.wikimedia.org/T8373, or similar bugs in Phabricator.)
As a workaround, you can use other data as an approximation of what users look for (though you don't get the query itself, only the result, under the assumption that users find what they look for): https://wikimedia.org/api/rest_v1/ - page view data, or as a dump: https://dumps.wikimedia.org/other/analytics/
Other options (they have their own caveats, but you can try):
- Search for "Special:Search/QUERY" in the pagecounts-all-sites dump linked above (zcat DUMP | grep "Search/"). This can give you results such as "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418", so you know one user searched for "Jnnjjjnnnnjnjjnbnjbnjnjj" on mobile, at 2016-05-15 13:00-14:00.
- Use Google Trends.
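The zcat/grep step above could equally be done in a short script. A sketch, assuming the space-separated pagecounts line format shown in the example (project, page title, view count, bytes):

```python
from urllib.parse import unquote

def extract_search_queries(lines):
    """Pull search terms out of pagecounts lines that hit Special:Search.

    Each pagecounts line is assumed to look like:
        <project> <page_title> <view_count> <bytes>
    e.g. "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418"
    """
    prefix = "Special:Search/"
    queries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _bytes = parts
        if title.startswith(prefix):
            # Titles are percent-encoded, with underscores for spaces.
            query = unquote(title[len(prefix):]).replace("_", " ")
            queries.append((project, query, int(count)))
    return queries

sample = ["commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418"]
print(extract_search_queries(sample))
```

In practice you would feed it decompressed dump lines, e.g. `gzip.open(path, "rt")`, instead of the sample list.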
On Wed, Aug 17, 2016 at 8:18 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
-- Stas Malyshev smalyshev@wikimedia.org
The discussion around the difficulty of providing such a list (and its relative usefulness) is well summarized in Trey's notes from his research into the matter.
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
On Wed, Aug 17, 2016 at 12:58 PM, Eran Rosenthal eranroz89@gmail.com wrote:
I agree that user privacy is paramount, and people have thought of various whitelist rules and other automatic approaches to filter out personally identifiable information (PII), but they tend not to work once you dig into the data.
One caveat on the link Chris provided: I was only looking at "unsuccessful" queries. Felix seems to be after all queries, and there are plenty of successful queries that give good results that I didn't consider. All the queries that match titles and redirects would dilute (but by no means eliminate) the queries that cause privacy concerns.
I second Erik's suggestion of the Discernatron data. It's not perfect and there's not a lot of it, but it's available.
A moderate-effort way to mine for queries would be to get volunteers to let you have their Wikipedia search history. In Chrome, for example, you can get an extension that will let you view all of your browser history at once (rather than one page at a time). I searched *wikipedia special:search*, clicked "All History" and "select all", and pasted into a text file. I was able to gather almost 1,200 queries in less than a minute. My home computer yielded 130 or so (that's probably more typical; I search a lot at work, for work). 20 volunteers would get you an admittedly biased sample of ~2,000 queries. It's not a great source, but it's something.
Such a manually mined corpus *would* have the advantage of being actual human queries. We get a lot of bots, and a lot of queries that aren't something you would necessarily want to optimize to improve human users' experience.
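The copy-paste workflow Trey describes could be automated with a small script over the exported history text. A sketch, assuming the history lines contain Wikipedia search URLs in one of the two common shapes shown below (the exact URL forms in a given browser export may differ):

```python
from urllib.parse import urlparse, parse_qs, unquote

def queries_from_history(lines):
    """Extract search terms from history lines containing Wikipedia
    search URLs of either common shape:
      .../w/index.php?search=TERM&title=Special%3ASearch
      .../wiki/Special:Search/TERM
    """
    found = []
    for line in lines:
        for token in line.split():
            if "wikipedia.org" not in token:
                continue
            parsed = urlparse(token)
            params = parse_qs(parsed.query)
            if params.get("search") and params["search"][0]:
                # parse_qs already decodes %XX escapes and '+' as space.
                found.append(params["search"][0])
            elif "/wiki/Special:Search/" in parsed.path:
                term = parsed.path.split("/wiki/Special:Search/", 1)[1]
                found.append(unquote(term).replace("_", " "))
    return found

history = [
    "2016-08-17 https://en.wikipedia.org/w/index.php?search=query+logs&title=Special%3ASearch",
    "2016-08-17 https://en.wikipedia.org/wiki/Special:Search/information_retrieval",
]
print(queries_from_history(history))
```

Each volunteer could run this over their pasted history file and send back only the extracted query list, which also avoids sharing the rest of their browsing history.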
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Wed, Aug 17, 2016 at 2:10 PM, Chris Koerner ckoerner@wikimedia.org wrote:
-- Yours, Chris Koerner Community Liaison - Discovery Wikimedia Foundation
Hi!
Other options (they have their own caveats, but you can try):
- Search for "Special:Search/QUERY" in the pagecounts-all-sites dump linked above (zcat DUMP | grep "Search/"). This can give you results such as "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418", so you know one user searched for "Jnnjjjnnnnjnjjnbnjbnjnjj" on mobile, at 2016-05-15 13:00-14:00.
I wouldn't rely on this. It looks like a bug that this appears in the public data, and it will probably be gone soon. Generally, the current official answer to "where can one get logs of search queries for public use" is, AFAIK, "there's no way to get them without signing papers and going through procedures".
Hi!
See also: https://meta.wikimedia.org/wiki/Discovery/Data_access_guidelines