Dear list,
I'm posting a recent conversation with Dan below, as well as a few follow-up questions.
Dan was kind enough to point out this list. I apologize that the post is "backward" (in email-thread format) due to my ignorance; I will use this list from now on.
Thanks, Daniel
----
Hi Dan
Thanks for getting back to me so quickly!
Thanks for writing. In general these questions are best asked on our public list, so other people can see and benefit from any answers: https://lists.wikimedia.org/mailman/listinfo/analytics
Thanks, I've joined this list and will ask subsequent questions there.
- pairs of pages: we have two datasets that are mentioned in this task https://phabricator.wikimedia.org/T158972 which should be very interesting for this purpose. They aren't being updated right now, and the task is to do just that. We'll probably get to that within the next 3 months, but a bunch of us are on paternity leave this summer, so things are a little slower than normal.
This seems close to what I need. From the descriptions I gather the linkage is by session. Is there also a linkage by IP (with the IPs themselves removed, of course)?
- country data for pageviews: for privacy reasons we only allow access to this with an NDA. We have good data on it, but you need to sign this NDA and use our cluster to access it, being careful about what you report about it to the world at large. Here's information on that: https://wikitech.wikimedia.org/wiki/Volunteer_NDA
I've read this and am happy to sign an NDA. I understand it is best to be as specific as possible about the reasoning, intentions with the data, and permissions required. For me to figure this out it would be useful to know the relevant parts of the database schema, and perhaps a hint as to which data might be most interesting there. Would you be able to point me towards that?
Hope that helps, and feel free to write back to the public list in the future.
Definitely, very helpful and thank you!
Best, Daniel
On Wed, Jul 19, 2017 at 9:51 AM, Oberski, D.L. (Daniel) d.l.oberski@uu.nl wrote:
Dear Dan,
My name is Daniel Oberski, I'm an associate professor of data science methodology in the department of statistics at Utrecht University in the Netherlands.
I've been using your incredibly useful pageviews API to study correlations between the amount of interest people show in a topic (pageviews) and other data such as political party preference over time. That has yielded some interesting results (which I have yet to write up).
However, to do a better study it would be very helpful to have slightly more information than is in the API. Specifically, it would be very useful to be able to query, for each _pair_ of pages, how many people (or IPs) viewed _both_ of those pages. That way I can find out which pages are really indicative of interest in a specific common topic, rather than just correlated by accident. In addition, I've found it hard to figure out pageviews for specific pages by country rather than language.
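To make the quantity concrete, here is a minimal sketch of the pairwise count I have in mind. The viewer-to-pages data and the function name `co_view_counts` are made up purely for illustration; they do not reflect any actual Wikimedia dataset or schema.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set holds the pages one viewer (or session) read.
views_by_viewer = [
    {"Party_A", "Party_B", "Election"},
    {"Party_A", "Election"},
    {"Party_B", "Football"},
]


def co_view_counts(page_sets):
    """Count, for every unordered pair of pages, how many viewers saw both."""
    counts = Counter()
    for pages in page_sets:
        # sorted() gives each pair one canonical order, so (a, b) and (b, a) merge
        for pair in combinations(sorted(pages), 2):
            counts[pair] += 1
    return counts


pair_counts = co_view_counts(views_by_viewer)
print(pair_counts[("Election", "Party_A")])  # 2
```

Comparing such counts with what independent viewing would predict is what would separate pages that share an audience from pages that are merely correlated over time.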
My question is: would you happen to know if there is any way to obtain this information? (It does not necessarily have to be through the API.) Or do you know if there are people to whom I might talk about this?
Thanks for reading (to) the end and best regards,
Daniel
Daniel,
Signing an NDA is not enough to get access to the data; you also need to be part of a formal research collaboration with our research team. They have a number of those and are not likely to accept any more soon, but you can contact them in that regard: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
Thanks,
Nuria
On Mon, Jul 24, 2017 at 6:37 AM, Daniel Oberski daniel.oberski@gmail.com wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I'll review Daniel's email and will get back to him/you on this list in the next day or so.
Leila
-- Leila Zia Senior Research Scientist Wikimedia Foundation
On Mon, Jul 24, 2017 at 7:59 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Hi Daniel,
I reviewed your request.
== Context ==
* The data you're asking for is one of the most frequently requested data-sets. We also receive quite a bit of interest in that data specifically for the general research direction you're interested in.
* Resources are highly limited on our end. Every formal collaboration will need to be created taking into account this constraint and the commitments we have already made.
== When can Research sign up for formal collaborations? ==
At least one of the conditions below should hold for us to be able to consider creating a new formal collaboration at this point in time:
* The outside research is (tightly) aligned with one of our annual plan commitments (for the period of July 1, 2017 to June 30, 2018). [1]
* A researcher on the Research team picks up a specific direction for exploration based on their expertise/interest.
* Access to the data is broadly agreed upon as strategic for humanity. Examples in this direction are rare, but to give you a sense: if there is an epidemic and we know, with some certainty, that the data we have can help control it or help in understanding the research and development in that space.
== Access to data ==
At this point, unfortunately, we cannot create a formal collaboration for your request. I hope this email conveys our disappointment at having to deliver this message. :(
The above being said, I think there is one data-set that can be helpful for your research, and that's the Wikipedia Clickstream dataset. [2] You can use that dataset to compute the transition probability of moving from one English Wikipedia article to another. The data is not refreshed frequently, but refreshing it at specific snapshots in time is something we can consider. Please work with the dataset, if you haven't already, and let us know if it can be of help to you.
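As a minimal sketch of that computation: the rows below are invented, and the (prev, curr, type, n) column layout is an assumption about the dataset's TSV format that should be checked against the release actually being used. The function name `transition_probabilities` is likewise just illustrative.

```python
from collections import defaultdict

# Hypothetical rows in an ASSUMED (prev, curr, type, n) layout, where n is the
# number of observed transitions from article `prev` to article `curr`.
sample_rows = [
    ("Statistics", "Probability", "link", 30),
    ("Statistics", "Data_science", "link", 10),
    ("Probability", "Statistics", "link", 5),
]


def transition_probabilities(rows):
    """Estimate P(curr | prev) as n(prev -> curr) / sum of n over all curr."""
    pair_counts = defaultdict(int)
    totals = defaultdict(int)
    for prev, curr, _type, n in rows:
        pair_counts[(prev, curr)] += n  # merge rows that repeat a pair
        totals[prev] += n               # all recorded transitions out of prev
    return {pair: count / totals[pair[0]] for pair, count in pair_counts.items()}


probs = transition_probabilities(sample_rows)
print(probs[("Statistics", "Probability")])  # 0.75
```

Note that this normalizes only over the transitions recorded in the rows, so it estimates where readers go next given that they clicked through, not the chance that they leave the article at all.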
Best, Leila
[1] All programs Research has committed to are listed below. The specific objectives within each program that Research has signed up for are at https://phabricator.wikimedia.org/tag/research-programs/
Program 4 https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
Program 7 https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
Program 9 https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
Program 11 https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
Program 12 https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
CD - Community Health https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
CD - Structured Data https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/F...
[2] https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
-- Leila Zia Senior Research Scientist Wikimedia Foundation
On Mon, Jul 24, 2017 at 9:24 AM, Leila Zia leila@wikimedia.org wrote: