Dear list,
I'm posting a recent conversation with Dan below, as well as a few follow-up questions.
Dan was kind enough to point out this list. I apologize that the post is "backward" (in email-thread format) due to my ignorance; I will use this list from now on.
Thanks, Daniel
----
Hi Dan
Thanks for getting back to me so quickly!
Thanks for writing. In general these questions are best asked on our public list, so other people can see and benefit from any answers: https://lists.wikimedia.org/mailman/listinfo/analytics
Thanks, I've joined this list and will ask subsequent questions there.
- pairs of pages: we have two datasets that are mentioned in this task https://phabricator.wikimedia.org/T158972 which should be very interesting for this purpose. They aren't being updated right now, and the task is to do just that. We'll probably get to that within the next 3 months, but a bunch of us are on paternity leave this summer, so things are a little slower than normal.
This seems close to what I need. From the descriptions I gather the linkage is by session. Is there also a linkage by IP (with the IPs removed, of course)?
- country data for pageviews: for privacy reasons we only allow access to this with an NDA. We have good data on it, but you need to sign this NDA and use our cluster to access it, being careful about what you report about it to the world at large. Here's information on that: https://wikitech.wikimedia.org/wiki/Volunteer_NDA
I've read this and am happy to sign an NDA. I understand it is best to be as specific as possible about the reasoning, intentions with the data, and permissions required. For me to figure this out it would be useful to know the relevant parts of the database schema, and perhaps a hint as to which data might be most interesting there. Would you be able to point me towards that?
Hope that helps, and feel free to write back to the public list in the future.
Definitely, very helpful and thank you!
Best, Daniel
On Wed, Jul 19, 2017 at 9:51 AM, Oberski, D.L. (Daniel) d.l.oberski@uu.nl wrote:
Dear Dan,
My name is Daniel Oberski, I'm an associate professor of data science methodology in the department of statistics at Utrecht University in the Netherlands.
I've been using your incredibly useful pageviews API to study correlations between the amount of interest people show in a topic (pageviews) and other data, such as political party preference over time. That has yielded some interesting results (which I have yet to write up).
However, to do a better study it would be very helpful to have slightly more information than is in the API. Specifically, it would be very useful to be able to query, for each _pair_ of pages, how many people (or IPs) viewed _both_ of those pages. That way I can find out which pages are really indicative of interest in a specific common topic, rather than just correlated by accident. In addition, I've found it hard to figure out pageviews for specific pages by country rather than language.
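As an illustration of the pair query described above, here is a minimal sketch of the count in question. It assumes a hypothetical list of (viewer id, page title) records, where the viewer id stands for an anonymised session or visitor key. The record layout and the name `co_view_counts` are made up for illustration and do not correspond to any existing Wikimedia dataset or API.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_view_counts(records):
    """Count, for each unordered pair of pages, how many viewers saw both.

    `records` is an iterable of (viewer_id, page_title) tuples. The
    viewer_id is a hypothetical anonymised key (e.g. a session id).
    """
    # Collect the distinct pages seen by each viewer; repeat views of the
    # same page by the same viewer are counted once.
    pages_by_viewer = defaultdict(set)
    for viewer_id, page in records:
        pages_by_viewer[viewer_id].add(page)

    # Every viewer contributes 1 to each pair of pages they viewed.
    # Sorting makes the pair key independent of viewing order.
    counts = Counter()
    for pages in pages_by_viewer.values():
        for pair in combinations(sorted(pages), 2):
            counts[pair] += 1
    return counts


# Toy data, invented for this example.
records = [
    ("s1", "Election"), ("s1", "Parliament"),
    ("s2", "Election"), ("s2", "Parliament"), ("s2", "Football"),
    ("s3", "Football"),
    ("s1", "Election"),  # repeat view, ignored by the set
]
pair_counts = co_view_counts(records)
```

Pairs with a high count relative to the individual page totals would be candidates for a shared topic. Note that the number of pairs grows quadratically with the pages seen per viewer, so real data would need some pruning.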
My question is, would you happen to know if there is any way to obtain this information? (It does not necessarily have to be through the API.) Or do you know if there are people to whom I might talk about this?
Thanks for reading (to) the end and best regards,
Daniel