Hi all,
I am writing this email on the public list hoping that the discussion may be of interest to more people.
I am working with a student on scientific citations on Wikipedia and, put very simply, we would like to use the pageview dataset to get a rough measure of how many times a paper was viewed thanks to Wikipedia.[*]
The full dataset is, as of now, ~ 4.7TB in size.
I have two questions:

* If we download this dataset, a first estimate gives ~30 days of continuous download, assuming an average download speed of ~2 MB/s, which is what we measured while downloading one month of data (~64 GB); a quick back-of-the-envelope check is sketched after this list. Here at my University (Trento, Italy) this kind of download has to be reported to the IT department. I was wondering if this would be useful information for the WMF, too.

* Given the estimate above, I was wondering if it is possible to obtain this data over "FedEx bandwidth" (cit. [1]), i.e. by shipping a physical disk. I know that in some fields (e.g. neuroscience) this is the standard way to exchange big datasets (on the order of TBs).
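For reference, here is a minimal Python sketch of the arithmetic behind the ~30-day figure, assuming only the two numbers quoted above (the ~4.7 TB total size and the ~2 MB/s sustained speed we measured):

    # Back-of-the-envelope check of the download-time estimate.
    # Assumptions: full dataset ~4.7 TB, sustained speed ~2 MB/s
    # (measured while downloading one month of data, ~64 GB).

    TB = 1e12  # bytes
    GB = 1e9
    MB = 1e6

    dataset_size = 4.7 * TB
    speed = 2 * MB  # bytes per second

    seconds = dataset_size / speed
    print(f"Full dataset: ~{seconds / 86400:.1f} days")      # ~27.2 days

    month_seconds = 64 * GB / speed
    print(f"One month (~64 GB): ~{month_seconds / 3600:.1f} hours")  # ~8.9 hours

So ~27 days at a constant 2 MB/s, i.e. roughly a month once interruptions and retries are factored in.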
Thanks in advance for your help.
Cristian

[*] I know these are pageviews and not unique visitors, and furthermore there is no guarantee that viewing a citation means anything. I am approaching this data the same way "impressions" versus "click-throughs" are treated in the online advertising world.

[1] https://what-if.xkcd.com/31/