We would like to have the URIs of requests over some period of usage, let's say 1 month.
According to the data format https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest, the Webrequest attributes we need are the following:
http_method, uri_host, uri_path, uri_query, ts, access_method, agent_type, pageview_info, page_id
Do we need to go through the NDA process, or is it possible to get the data right away from a public dataset?
Thank you,
M.
Can you be more specific about what you need, Michal? If you truly need access to the private data that we keep in wmf.webrequest for a limited time, then you'd have to go through a process to sign an NDA. But if you tell us what you need, there may be a public dataset that you can use.
On Thu, Mar 3, 2016 at 2:48 PM, Michal Bystricky <michal.bystricky@stuba.sk> wrote:
Hello Analytics Team,
We would like to have one-time access to wmf.webrequest data. What is the correct way of accessing the data?
In our research group, we want to simulate the requests for a specific version of WikiMedia.
Thanks, Michal Bystricky
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Michal,
It seems that what you want is a data set very similar to the one I recently requested; see this Phabricator item: https://phabricator.wikimedia.org/T128132
There is a data set for the year 2007, part of which is publicly available [1]. See [2] for a study using the 2007 data set.
My focus has been on simulating the performance of WMF's caching servers, for which the 2007 data set is insufficient. However, a different research domain might require a slightly different focus of capturing the data set.
The 2007 data set was captured with a sampling rate of 1:10. For my project, such a high sampling rate would be perfect (1:100 might also work). However, I learned that the current request rate is much higher so we'd have to narrow the scope of the data set (e.g., by focussing on specific WMF projects, like the English Wikipedia). You can find a discussion on the phabricator page linked above.
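As a rough illustration of the scoping idea above, the following Python sketch restricts a stream of request records to one project and then keeps one record in ten. It is only a sketch under stated assumptions: the helper names (restrict_to_host, sample_one_in_n) and the toy records are hypothetical, the uri_host and uri_path keys are borrowed from the webrequest field list, and systematic every-n-th sampling is assumed here. How the 2007 trace was actually sampled is not stated in this thread.

```python
from itertools import islice


def restrict_to_host(records, host="en.wikipedia.org"):
    """Keep only records for one project (hypothetical helper).

    Assumes each record is a dict with a 'uri_host' key.
    """
    return (r for r in records if r.get("uri_host") == host)


def sample_one_in_n(records, n=10, offset=0):
    """Systematic 1:n sampling: yield every n-th record, starting at `offset`.

    This is an assumed sampling scheme, chosen for simplicity.
    """
    return islice(records, offset, None, n)


# Toy input: 100 requests alternating between two hosts (made-up data).
requests = [
    {
        "uri_host": "en.wikipedia.org" if i % 2 == 0 else "de.wikipedia.org",
        "uri_path": f"/wiki/Page_{i}",
    }
    for i in range(100)
]

# Narrow the scope first, then sample the remaining stream at 1:10.
sampled = list(sample_one_in_n(restrict_to_host(requests), n=10))
print(len(sampled))  # 50 matching requests sampled at 1:10 gives 5
```

Filtering before sampling keeps the effective rate at 1:10 for the project of interest; sampling first and filtering afterwards would give a less predictable number of records per project.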
What would be the lowest sampling rate allowable for your project? I assume the publicly available hourly access data [3], [4] would be insufficient?
Feel free to comment on the Phabricator item; maybe we can compile a single data set that works for both of our research domains and also helps other people.
Best, Daniel
[1] http://www.wikibench.eu/?page_id=60
[2] http://www.distributed-systems.net/papers/2009.comnet-wiki.pdf
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
[4] http://dumps.wikimedia.org/other/pagecounts-all-sites/
then yes you do need to go through an NDA process because you are asking to see raw user agent strings, and that's among the data that we guard very carefully.
To clarify, raw user agents are only available for the last 60 days. Data is aggregated after that period.
On Tue, Mar 22, 2016 at 5:42 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Michal, if what Daniel is saying is not sufficient, then yes you do need to go through an NDA process because you are asking to see raw user agent strings, and that's among the data that we guard very carefully.
I'd appreciate a clarification about the purpose of this request if Wikimedia private data is involved. If I am understanding correctly, the purpose of this request is for access to Wikimedia private data for assistance with 3rd party performance testing. If that is the case, I believe that the access request for private data should simply be denied.
Pine
Pine, there are actually two separate requests and they shouldn't be mixed. The performance-related one is research as far as I understand, and for the other one we have no details yet. I welcome a public discussion of either, and of course would respect any opinions held by the analytics community at large. We have every intention to be good stewards of this data and for what it's worth, I'm very skeptical of allowing access to private data, unless for obviously beneficial purposes like flu forecasting, etc.
Hi Dan,
Agreed, I think it makes sense to consider a subject-specific request for pages that are within the scope of epidemiology, such as influenza, where we have reason to think that there could be public health benefits in analyzing the data and there are reasonable safeguards to protect user anonymity.
A request for 1 month of the private data requested here, which appears to be for all pages on all projects, is far too broadly scoped. Also, in general, my instinct would be to deny external requests for WMF private data for purposes of performance testing. It seems to me that the risks far outweigh the benefits to Wikimedia, and that processing requests like these would be a suboptimal use of WMF staff time.
Pine
Hi everyone,
As the one who requested data for performance research/testing, I'm happy to participate in the discussion.
The second request, by Michal, might not be about performance. I believe Michal hasn't provided any details, as yet. I thought I could help Michal by pointing out similarities to my request, but I now see that the two requests might be quite different.
It is my goal to compile a dataset that does not include any private data. My request essentially asks for a higher-resolution version of the publicly available pagecounts data, and for an update to a dataset that was made public in 2007 [1].
Specifically, the data set would hold the same fields as the pagecounts data, but at a higher resolution: requests sampled at 1:10 instead of hourly counts. In addition to the pagecounts fields, the public 2007 dataset has one additional field, "save_flag", which indicates whether the request changed a web page. In order to compile this save_flag, three other webrequest fields need to be accessed, as pointed out in Tim Starling's email [2]. Tim was the one who helped compile the 2007 dataset.
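To make the relationship between a 1:10 sampled trace and the hourly pagecounts data more concrete, here is a minimal Python sketch that rolls a sample back up into estimated hourly counts per page. It rests on assumptions that are not confirmed anywhere in this thread: the function name estimate_hourly_counts is hypothetical, records are assumed to be dicts with a ts field holding an ISO 8601 timestamp string and a uri_path field, and each sampled request is assumed to stand in for `rate` real requests. The actual file formats may differ.

```python
from collections import Counter
from datetime import datetime


def estimate_hourly_counts(sampled_records, rate=10):
    """Aggregate a 1:rate sample into estimated per-hour, per-page counts.

    Each sampled record contributes `rate` to its (hour, page) bucket,
    which scales the sample back up to an estimate of the full traffic.
    """
    counts = Counter()
    for rec in sampled_records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(rec["ts"]).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour.isoformat(), rec["uri_path"])] += rate
    return counts


# Made-up sample: two requests for one page in the 12:00 hour, one in 13:00.
sample = [
    {"ts": "2016-03-01T12:05:00", "uri_path": "/wiki/A"},
    {"ts": "2016-03-01T12:40:30", "uri_path": "/wiki/A"},
    {"ts": "2016-03-01T13:01:00", "uri_path": "/wiki/A"},
]

hourly = estimate_hourly_counts(sample, rate=10)
print(hourly[("2016-03-01T12:00:00", "/wiki/A")])  # 2 sampled requests * 10 = 20
```

The reverse direction is not possible: hourly counts cannot be turned back into individual request timings, which is why a sampled trace carries more information for cache simulation than the aggregated data.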
In my understanding these fields do not include any "personal information" as per the WMF privacy policy. Please correct me if I'm wrong here.
I would also like to point out that I'm asking to make this dataset public (as opposed to giving it only to my research group). If helpful, I'd be willing to host this dataset on my institution's web server or in a public AWS S3 bucket to facilitate access by the community.
I made a few updates to clarify these points in the Phabricator item, where you can find further information: https://phabricator.wikimedia.org/T128132 The comments on that page discuss how we can restrict the scope to only the English Wikipedia and to individual WMF caching servers to scale down the dataset size.
Let me know what you think.
Best, Daniel
[1] http://www.wikibench.eu/?page_id=60
[2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
In my understanding these fields do not include any "personal information" as per the WMF privacy policy. Please correct me if I'm wrong here.
This is correct for the data requested here: https://phabricator.wikimedia.org/T128132
Hi Daniel, I think your request is probably one that can be worked out in such a way that private information is sufficiently protected. The request from Michal, at least as I understand its current form, is of a much different scope. Thanks for following up.
Pine