Shilad, Edgard,
replying here to this and related threads.
1) Analytics Engineering (primarily Christian Aistleitner and Erik Zachte) is currently
working on defining how pageviews are counted and extracted from the raw request logs.
This work is part of a larger effort to replace the legacy webstatscollector [1]; the
replacement will produce more accurate data and allow requests to be parsed more flexibly.
Christian and Erik should be able to comment on whether extracting referrals and search
query strings from inbound traffic falls within the initial scope of this project. The
publication of this data is a different matter (see below).
2) Releasing anonymized internal search data is not AFAIK one of the priorities the team
is currently working on. As Andrew noted in a previous thread, engineering effort aside,
further releases of private data will be subject to the new privacy policy which is
currently being discussed on Meta [2]. I don’t expect we’ll invest any effort into
anonymizing or aggregating data for the purpose of publication until the privacy policy
consultation is settled. Search is also undergoing a major overhaul [3].
3) As per Federico, a short description of all logs generated at Wikimedia (including
MediaWiki logs, page request logs, search logs and EventLogging data) can be found on
Wikitech [4].
Dario
[1]
https://wikitech.wikimedia.org/wiki/Analytics/Webstatscollector
[2]
https://meta.wikimedia.org/wiki/Privacy_policy
[3]
https://www.mediawiki.org/wiki/Search
[4]
https://wikitech.wikimedia.org/wiki/Logs
On Dec 10, 2013, at 10:09 PM, Shilad Sen <ssen(a)macalester.edu> wrote:
Greetings!
I'm a Professor at Macalester College in Minnesota, and I have been collaborating
with Brent Hecht and many students to develop a Java framework for extracting multilingual
knowledge from Wikipedia [1]. The framework is pre-alpha now, but we hope to offer a
stable release in the next month.
Given a phrase (e.g. "apple"), our library must identify the articles associated
with it. This is a probabilistic question: how likely is the phrase "apple" to refer
to the article about the fruit vs. the one about the company? This simple task
(often called Wikification or disambiguation) forms the basis of many NLP algorithms.
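To make the probabilistic framing concrete, here is a toy sketch of the standard approach: estimate P(article | phrase) from counts of how often a phrase appears as link anchor text pointing at each candidate article. All counts and titles below are invented for illustration, not real Wikipedia statistics.

```python
# Toy sketch: disambiguation via anchor-text statistics.
# anchor_counts maps a phrase to a Counter of candidate article titles,
# weighted by how often the phrase was used as link text for that article.
# These numbers are made up for illustration.
from collections import Counter

anchor_counts = {
    "apple": Counter({"Apple Inc.": 600, "Apple": 350, "Apple Records": 50}),
}

def disambiguate(phrase):
    """Return (article, probability) pairs for a phrase, most likely first."""
    counts = anchor_counts.get(phrase.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return []
    return [(article, n / total) for article, n in counts.most_common()]

print(disambiguate("apple"))
# Under these toy counts, "Apple Inc." comes out most likely at 0.6.
```

A dataset like the Google/Stanford one supplies exactly these counts; the hard part is collecting them at scale and across languages.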
Google and Stanford have released an awesome dataset to support this task [2]. It
contains the *text* of all internet hyperlinks to Wikipedia articles. This dataset makes
the problem much easier, but it has two serious deficiencies. First, it only contains
links to articles in English Wikipedia. Second, it was generated once by Google, and it is
unlikely Google will update it.
The WMF could create a similar dataset by publishing the most common inbound search
queries for all WP pages across all language editions. This dataset would enable
individuals, researchers and small companies (not just Google and Microsoft) to harness
Wikipedia data for their applications.
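To sketch what the proposed dataset might look like: aggregate inbound search queries per (language, page) pair and publish only the most common queries above a minimum-count threshold, as one crude privacy safeguard. The log rows, threshold, and field layout below are all assumptions for illustration, not the actual log format.

```python
# Hypothetical aggregation for the proposed dataset. Input rows are
# (language, page, query) tuples; output keeps only queries seen at
# least MIN_COUNT times, a simple suppression-style privacy measure.
from collections import Counter, defaultdict

MIN_COUNT = 3  # hypothetical suppression threshold

log_rows = [
    ("en", "Apple", "apple fruit"),
    ("en", "Apple", "apple fruit"),
    ("en", "Apple", "apple fruit"),
    ("en", "Apple", "rare query"),  # below threshold, suppressed
]

def aggregate(rows, top_n=10):
    """Group queries by (language, page); keep the top_n above MIN_COUNT."""
    per_page = defaultdict(Counter)
    for lang, page, query in rows:
        per_page[(lang, page)][query] += 1
    return {
        key: [(q, n) for q, n in c.most_common(top_n) if n >= MIN_COUNT]
        for key, c in per_page.items()
    }

print(aggregate(log_rows))
# {("en", "Apple"): [("apple fruit", 3)]} -- the rare query is dropped.
```

A real release would need a much more careful anonymization scheme than a bare count threshold, which is part of why the privacy discussion matters here.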
Does this seem remotely possible? I've thought a little about the engineering and
privacy issues related to the dataset. Neither is trivial, but I think both are
feasible, and I'd be happy to volunteer my engineering effort.
If you think the idea has legs, how do we develop a more formal proposal about the
dataset?
Thanks for your feedback!
-Shilad
[1]
https://github.com/shilad/wikAPIdia
[2]
http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.…
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics