Hi Alexander!
This indeed seems like an interesting project. Responding to your
suggestions:
First, I am ready to collaborate with you on making this data available as
other researchers have done in the past. I would appreciate if you let me
know which steps I need to take in order to work with you on this task.
I'd suggest you apply for a research project here[1]. The research team
will discuss the project with you, and if it gets approved, you can sign
an NDA and get access to the raw data. You can also apply for a grant
here[2].
Second, you can consider making this data available after achieving the
necessary level of confidentiality. For example, you can group request
types so that each group has at least 1000 unique IP-addresses.
There are a couple of tasks[3] in our backlog about effectively anonymizing
the pageview data for general-purpose use. We used an algorithm similar to
what you proposed. Our experience, though, is that general-purpose
anonymization is a non-trivial task. We plan to work on this in the
mid-term (actually, we have already started, see the tasks in [3]), but we
have other priorities for the next quarter. So I'd suggest again that you
apply for a specific project for the needs of your study here[1][2].
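As a rough illustration of the grouping idea discussed above, here is a
minimal Python sketch. It is not the Analytics team's actual algorithm; the
function name, the (country, language, topic, ip) tuple layout and the
OTHER bucket are all assumptions made up for this example. It counts
pageviews per group and only publishes groups seen from at least k unique
IP addresses.

```python
from collections import defaultdict

# Hypothetical catch-all bucket for groups that are too small to publish.
OTHER = ("other", "other", "other")


def aggregate_with_threshold(requests, k=1000):
    """Count pageviews per group, publishing only groups with >= k unique IPs.

    `requests` is an iterable of (country, language, topic, ip) tuples
    (an assumed layout, not the real pageview schema). Groups below the
    threshold are folded into OTHER; OTHER itself is dropped if it still
    has fewer than k unique IPs.
    """
    views = defaultdict(int)   # group -> number of pageviews
    ips = defaultdict(set)     # group -> set of distinct IPs seen

    for country, language, topic, ip in requests:
        key = (country, language, topic)
        views[key] += 1
        ips[key].add(ip)

    published = {}
    other_views = 0
    other_ips = set()
    for key, count in views.items():
        if len(ips[key]) >= k:
            published[key] = count
        else:
            other_views += count
            other_ips |= ips[key]

    # Only release the catch-all bucket if it is large enough as well.
    if len(other_ips) >= k:
        published[OTHER] = other_views
    return published
```

A threshold like this is only one ingredient. On its own it does not
guarantee anonymity (for example, when several overlapping aggregates are
published), which is part of why the general-purpose case is hard.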
Another challenge, I guess, would be categorizing the articles as
educational or entertainment. The categories in Wikipedia are a cool way to
browse, but not an exact way of clustering content. And I guess the
boundary between educational and entertainment can sometimes be fuzzy, no?
A very interesting challenge anyway.
cheers!
[1] https://meta.wikimedia.org/wiki/Research:New_project
[2] https://meta.wikimedia.org/wiki/Grants:Project
[3] https://phabricator.wikimedia.org/T114675
    https://phabricator.wikimedia.org/T118839
    https://phabricator.wikimedia.org/T118838
    https://phabricator.wikimedia.org/T118841
On Wed, Dec 14, 2016 at 5:02 PM, Alexander Ugarov <augarov(a)email.uark.edu>
wrote:
Dear members of the Analytics Team!
Please consider my request for information or collaboration. I am
conducting a research project on the international determinants of
education quality. In my view, Wikimedia statistics are a priceless
source of information on how much learning people do outside of
educational institutions.
I would like to access the data on Wikipedia pageviews by country,
language and content area to measure private learning in different
countries. My previous empirical results suggest that Wikipedia pageviews
are highly correlated with education quality. Unfortunately, the available
data does not make it possible to separate educational pageviews from pure
entertainment pageviews (for example, celebrity biographies).
I am aware that this data is currently not part of the publicly
available dataset. Please consider two options. First, I am ready to
collaborate with you on making this data available as other researchers
have done in the past. I would appreciate if you let me know which steps I
need to take in order to work with you on this task. Second, you can
consider making this data available after achieving the necessary level of
confidentiality. For example, you can group request types so that each
group has at least 1000 unique IP-addresses.
I am looking forward to hearing from you about my opportunities to use this
data. I think that it is going to be very interesting to know how much
people learn from Wikipedia, for example, in India versus Brazil and Egypt.
Do people in Indonesia learn less than people in Germany due to poor
quality school systems or low private incentives for learning? I am also
sure that many social scientists will benefit from using such
information (if you make it available) and will produce some
policy-relevant research.
Best regards,
Alexander Ugarov,
Ph. D. Candidate.
Sam M. Walton College of Business
Department of Economics
University of Arkansas
Office: ECOB260
E-mail: augarov(a)uark.edu.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation