I agree that user privacy is paramount, and people have thought of various whitelist rules and other automatic approaches to filter out personally identifiable information (PII), but they tend not to work once you dig into the data.
One caveat on the link Chris provided: I was only looking at "unsuccessful" queries. Felix seems to be after all queries—and there are plenty of successful queries that give good results that I didn't consider. All the queries that match titles and redirects would dilute (but not at all eliminate) the queries that cause privacy concerns.
I second Erik's suggestion of the Discernatron data. It's not perfect and there's not a lot of it, but it's available.
A moderate-effort way to mine for queries would be to get volunteers to let you have their Wikipedia search history. In Chrome, for example, you can get an extension that will let you view all of your browser history at once (rather than one page at a time). I searched *wikipedia special:search*, clicked "All History", "select all" and pasted to a text file. I was able to gather almost 1200 queries in less than a minute. My home computer yielded 130 or so (that's probably more typical; I search a lot at work, for work). 20 volunteers would get you an admittedly biased sample of ~2,000 queries. It's not a great source, but it's something.
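For what it's worth, here is a rough Python sketch of how a pasted history file could be turned into a query list. It assumes things I haven't verified for any particular extension: that the text contains full URLs, and that the query sits in the `search` URL parameter. The function name `queries_from_history` is made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: the pasted history text contains full URLs, one or more per line.
URL_RE = re.compile(r"https?://\S+")


def queries_from_history(text):
    """Return the search strings found in Wikipedia search URLs in `text`."""
    queries = []
    for url in URL_RE.findall(text):
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip anything that is not a parseable URL
        host = parts.hostname or ""
        if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
            continue
        # Assumption: the query is carried in the "search" parameter.
        values = parse_qs(parts.query).get("search", [])
        if values and values[0].strip():
            queries.append(values[0].strip())
    return queries


history = """
https://en.wikipedia.org/w/index.php?search=query+logs&title=Special%3ASearch
https://www.example.com/?search=not+wikipedia
https://de.wikipedia.org/wiki/Koblenz
"""
found = queries_from_history(history)
print(found)
```

If the export only has page titles rather than URLs, the same idea applies, but the matching pattern would need to change.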
Such a manually mined corpus *would* have the advantage of being actual human queries. We get a lot of bots, and a lot of queries that aren't something you would necessarily want to optimize to improve human users' experience.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Wed, Aug 17, 2016 at 2:10 PM, Chris Koerner ckoerner@wikimedia.org wrote:
The discussion around the difficulty of providing such a list (and its relative usefulness) is well summarized in Trey's notes from his research into the matter.
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
On Wed, Aug 17, 2016 at 12:58 PM, Eran Rosenthal eranroz89@gmail.com wrote:
Unfortunately, WMF policy on releasing search queries to the public is too strict. (Although there are privacy concerns, I'm sure anyone here could easily think of some simple whitelist rules. For more details please refer to https://phabricator.wikimedia.org/T115085 or https://phabricator.wikimedia.org/T8373 or similar bugs in Phabricator.)
As a workaround, you can use other data as an approximation of what users look for (though you don't get the query itself, only the result, under the assumption that users find what they are looking for): https://wikimedia.org/api/rest_v1/ - page view data, or as a dump: https://dumps.wikimedia.org/other/analytics/
Other options (they have their own caveats but you can try):
- Search for "Special:Search/QUERY" in the pagecounts-all-sites dumps linked above (zcat DUMP | grep "Search/") - this can give you results such as "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418", so you know 1 user searched for "Jnnjjjnnnnjnjjnbnjbnjnjj" on mobile on 2016-05-15 between 13:00 and 14:00
- Use Google Trends
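The zcat/grep step in the first option can also be done in Python if you want the queries decoded and counted. This is only a sketch: it assumes each dump line has the four whitespace-separated fields seen in the example above (project, page title, request count, bytes), and it only matches the English "Special:Search/" prefix, so localized special-page names would be missed. The function names are made up for illustration.

```python
import gzip
from urllib.parse import unquote

# Assumption: only the English special-page name is matched.
PREFIX = "Special:Search/"


def parse_search_lines(lines):
    """Yield (project, query, count) for lines whose title is Special:Search/QUERY."""
    for line in lines:
        fields = line.split()
        # Assumed layout: project, page title, request count, bytes.
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        project, title, count, _size = fields
        if not title.startswith(PREFIX):
            continue
        # Titles are percent-encoded and use underscores for spaces.
        query = unquote(title[len(PREFIX):]).replace("_", " ").strip()
        if query:
            yield project, query, int(count)


def read_dump(path):
    """Stream a gzipped hourly dump file through parse_search_lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_search_lines(handle)


sample = [
    "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418",
    "en Main_Page 42 1000",
    "en Special:Search/New%20York_City 3 900",
]
results = list(parse_search_lines(sample))
for project, query, count in results:
    print(project, repr(query), count)
```

For a real file you would call `read_dump` with the path of a downloaded hourly dump. Keep in mind the same privacy caveats apply to whatever this turns up.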
On Wed, Aug 17, 2016 at 8:18 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
I’m currently writing my bachelor thesis at University Koblenz, Germany. The goal is to improve Wikipedia search by exploiting the text structure of Wikipedia articles. To conduct unbiased user studies I need real-world queries so I can compare the novel algorithms against the currently used ones. Are there any existing query logs that I can use for this purpose?
We do have query logs, but they are not publicly accessible for privacy reasons. You may want to check this out though: https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
-- Stas Malyshev smalyshev@wikimedia.org
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
-- Yours, Chris Koerner Community Liaison - Discovery Wikimedia Foundation