Hi Adam,
Thanks for reaching out.

Why this matters:
Today, most researchers rely on indirect signals—social-media trends or
Google’s autosuggest API—to infer web users' interests. Direct, aggregate
search-query data would be a primary source: a real-time window into topics
that are gaining (or losing) attention both globally and within each
language community.
With such a dataset we could:
+ map emerging interests across regions and languages;
+ study the life-cycle of topics (how fast they spike and fade);
+ improve ranking algorithms by pairing queries with the results users
actually click;
+ build applications that surface underserved information needs.
And these are just a few ideas off the top of my head.

Privacy:
To avoid any risk of personal data leakage, only queries that appear more
than X times in a given day/week would be released—never unique or
low-frequency strings.
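As a rough sketch of the thresholding I have in mind (this is not an existing Wikimedia pipeline; the function name, the (day, query) input layout, the lowercase/whitespace normalization, and the threshold value are all assumptions made for illustration):

```python
from collections import Counter
from datetime import date


def aggregate_queries(log, threshold):
    """Count queries per day and keep only those seen MORE than `threshold` times.

    `log` is an iterable of (day, query_text) pairs (hypothetical layout).
    Returns sorted (day, query, frequency) rows, matching the proposed
    "[day or week] [query text] [frequency]" format.
    """
    # Assumed normalization: trim whitespace and lowercase before counting.
    counts = Counter((day, query.strip().lower()) for day, query in log)
    return sorted(
        (day, query, n) for (day, query), n in counts.items() if n > threshold
    )


# Toy data: one query seen three times, one seen only once.
log = [
    (date(2025, 7, 7), "Euro 2024"),
    (date(2025, 7, 7), "euro 2024 "),
    (date(2025, 7, 7), "euro 2024"),
    (date(2025, 7, 7), "my home address"),
]
rows = aggregate_queries(log, threshold=2)
for day, query, n in rows:
    print(day.isoformat(), query, n, sep="\t")
```

With X = 2, only the query seen three times survives; the one-off string never reaches the output.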
Releasing aggregated click-through pairs (query: clicked page, count) would
add tremendous research value without compromising user anonymity.
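A minimal sketch of how such pairs could be filtered, assuming click events arrive as (query, clicked page) tuples; the names and the threshold are hypothetical and only meant to show the idea:

```python
from collections import Counter


def aggregate_clicks(events, threshold):
    """Count (query, clicked_page) pairs and keep those seen more than `threshold` times.

    `events` is an iterable of (query, clicked_page) tuples (hypothetical layout).
    """
    counts = Counter(events)
    return {pair: n for pair, n in counts.items() if n > threshold}


# Toy data: the same query leads to two different pages.
events = [("porto", "Porto")] * 3 + [("porto", "FC Porto")]
pairs = aggregate_clicks(events, threshold=2)
```

Here the threshold applies to the pair rather than to the query alone, so a rarely clicked page is suppressed even when the query itself is frequent.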
I am happy to dive deeper or brainstorm implementation details whenever
helpful.
Cheers,
--
Sérgio Nunes
On Thu, 24 Jul 2025 at 13:09, Adam Baso <abaso(a)wikimedia.org> wrote:
> Hi Sérgio, thanks for your message. Apologies for the delayed response.
>
> Speaking on behalf of Data Platform Engineering (home of the Search
> Platform team and of most of the knowledge needed to create this sort of
> dataset), we're not presently considering producing it, as our focus is on
> different problems. It would be difficult to prioritize creating and
> maintaining this sort of dataset.
>
> However, could you tell us a bit more here on the list about some of the
> intended use cases and end users (direct and indirect) for such a dataset?
>
> Would you like to be connected with product management to discuss more
> about your use cases? I wouldn't want to suggest that it means the type of
> work will be prioritized, but our product management folks are looking for
> themes in the various use cases as they help set the context for user needs
> for the roadmap.
>
> Thanks!
> -Adam
>
> On Thu, Jul 24, 2025 at 5:57 AM Sérgio Nunes <sergio.nunes(a)fe.up.pt>
> wrote:
>
> > Hi,
> >
> > What would be the best Wikimedia interface to try to get this moving?
> >
> > Thanks for any suggestions.
> > --
> > Sérgio Nunes
> >
> >
> > On Mon, 7 Jul 2025 at 13:23, Sérgio Nunes <sergio.nunes(a)fe.up.pt> wrote:
> >
> > > Hi all,
> > >
> > > I would like to suggest a new *highly valuable* data dump for
> > > Wikipedia: the release of aggregated search query logs. I am aware
> > > that a previous release of search data was retracted due to privacy
> > > concerns. However, I believe there is a privacy-preserving approach
> > > that could still provide great value to researchers.
> > >
> > > My proposal is to release only aggregated query data, specifically
> > > queries that have been observed more than X times within a given day
> > > or week. The dataset could follow a simple format such as:
> > >
> > > [day or week] [query text] [frequency]
> > >
> > > This method would eliminate the risk of exposing personal or unique
> > > search queries. The dataset would be especially useful if released
> > > regularly (e.g., monthly) and broken down by language-specific
> > > Wikipedias.
> > >
> > >
> > > Is this the best forum for posting this suggestion?
> > >
> > > If you have suggestions for where to direct this proposal, or ideas
> > > for an alternative approach, I would be grateful.
> > >
> > > Best regards,
> > > --
> > > Sérgio Nunes
> > >
> > _______________________________________________
> > Wiki-research-l mailing list -- wiki-research-l(a)lists.wikimedia.org
> > To unsubscribe send an email to wiki-research-l-leave(a)lists.wikimedia.org