Hi all,
I'm writing this email to the public list in the hope that the
discussion may be of interest to more people.
I am working with a student on scientific citations on Wikipedia and,
very simply put, we would like to use the pageview dataset to get a
rough measure of how many times a paper was viewed thanks to
Wikipedia.[*]
The full dataset is, as of now, ~ 4.7TB in size.
I have two questions:
* if we download this dataset, a first estimate puts us at ~ 30 days
of continuous download, assuming an average download speed of
~ 2MB/s, which is what we measured while downloading one month of
data (~ 64GB); see the quick calculation after this list. Here at my
university (Trento, Italy) downloads of this kind have to be reported
to the IT department. I was wondering whether this would be useful
information for the WMF, too.
* given the estimate above, I was wondering whether it is possible to
obtain this data over FedEx Bandwidth (see [1]), i.e. by shipping a
physical disk. I know that in some fields (e.g. neuroscience) this is
the standard way to exchange big datasets (on the order of TBs).
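
For reference, here is the back-of-the-envelope calculation behind
the ~ 30 days figure (a minimal Python sketch; I'm assuming decimal
units, 1TB = 10^12 bytes, so binary TiB/MiB would shift the result
slightly):

    # Rough download-time estimate for the full pageview dataset.
    DATASET_BYTES = 4.7e12   # ~ 4.7TB total size
    SPEED_BPS = 2e6          # ~ 2MB/s measured average speed

    seconds = DATASET_BYTES / SPEED_BPS
    days = seconds / 86400   # 86400 seconds per day
    print(f"{days:.1f} days of continuous download")
    # -> 27.2 days, i.e. ~ 30 once retries/overhead are factored in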
Thanks in advance for your help.
Cristian
[*] I know these are pageviews and not unique visitors; furthermore,
there is no guarantee that viewing a citation means anything. I am
approaching this data the same way "impressions" versus
"clickthroughs" are treated in the online advertising world.
[1]
https://what-if.xkcd.com/31/