Hi all,
I would like to suggest a new *highly valuable* data dump for Wikipedia: the release of aggregated search query logs. I am aware that a previous release of search data was retracted due to privacy concerns. However, I believe there is a privacy-preserving approach that could still provide great value to researchers.
My proposal is to release only aggregated query data—specifically, queries that have been observed more than X times within a given day or week. The dataset could follow a simple format such as:
[day or week] [query text] [frequency]
This method would eliminate the risk of exposing personal or unique search queries. The dataset would be especially useful if released regularly (e.g., monthly) and broken down by language-specific Wikipedias.
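To make the proposal concrete, the thresholding step could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Wikimedia pipeline: the function name `aggregate_queries`, the `(day, query)` input shape, the lowercase/whitespace normalization, and the threshold value in the example are all hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def aggregate_queries(
    events: Iterable[Tuple[date, str]],
    min_count: int,
) -> List[Tuple[str, str, int]]:
    """Count (day, query) pairs and keep those seen more than min_count times.

    `events` is a hypothetical stream of (day, raw query text) records;
    `min_count` plays the role of X in the proposal.
    """
    counts: Counter = Counter()
    for day, query in events:
        # Assumed normalization: collapse whitespace and lowercase the text.
        normalized = " ".join(query.split()).lower()
        if normalized:
            counts[(day.isoformat(), normalized)] += 1

    rows = [
        (day, query, freq)
        for (day, query), freq in counts.items()
        if freq > min_count  # "observed more than X times"
    ]
    # Sort by day, then by descending frequency, then alphabetically.
    rows.sort(key=lambda row: (row[0], -row[2], row[1]))
    return rows


if __name__ == "__main__":
    sample = (
        [(date(2025, 7, 7), "Lisbon")] * 3
        + [(date(2025, 7, 7), "a rare personal query")]
    )
    # Emit rows in the proposed [day] [query text] [frequency] layout.
    for day, query, freq in aggregate_queries(sample, min_count=2):
        print(f"{day}\t{query}\t{freq}")
```

Weekly aggregation would only change the grouping key, for example by using the ISO year and week number of each date instead of the day.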
Is this the best forum for posting this suggestion?
If you have suggestions for where to direct this proposal, or ideas for an alternative approach, I would be grateful.
Best regards, -- Sérgio Nunes
Hi,
What would be the best Wikimedia channel or point of contact to try to get this moving?
Thanks for any suggestions -- Sérgio Nunes
On Mon, 7 Jul 2025 at 13:23, Sérgio Nunes sergio.nunes@fe.up.pt wrote:
Hi all,
I would like to suggest a new *highly valuable* data dump for Wikipedia: the release of aggregated search query logs. I am aware that a previous release of search data was retracted due to privacy concerns. However, I believe there is a privacy-preserving approach that could still provide great value to researchers.
My proposal is to release only aggregated query data—specifically, queries that have been observed more than X times within a given day or week. The dataset could follow a simple format such as:
[day or week] [query text] [frequency]
This method would eliminate the risk of exposing personal or unique search queries. The dataset would be especially useful if released regularly (e.g., monthly) and broken down by language-specific Wikipedias.
Is this the best forum for posting this suggestion?
If you have suggestions for where to direct this proposal, or ideas for an alternative approach, I would be grateful.
Best regards,
Sérgio Nunes
Hi Sérgio, thanks for your message. Apologies for the delayed response.
Speaking on behalf of Data Platform Engineering (home of the Search Platform team and of most of the knowledge needed to create this sort of dataset), we're not presently considering producing it, as our focus is on different problems. It would be difficult to prioritize creating and maintaining this sort of dataset.
However, could you tell us a bit more here on the list about some of the intended use cases and end users (direct and indirect) for such a dataset?
Would you like to be connected with product management to discuss your use cases further? I wouldn't want to suggest that this means the work will be prioritized, but our product management folks look for themes across the various use cases, as these help set the context for user needs on the roadmap.
Thanks! -Adam
On Thu, Jul 24, 2025 at 5:57 AM Sérgio Nunes sergio.nunes@fe.up.pt wrote:
Hi,
What would be the best Wikimedia channel or point of contact to try to get this moving?
Thanks for any suggestions
Sérgio Nunes
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org