Cristian Consonni, 11/11/2015 15:09:
I am working with a student on scientific citations on Wikipedia;
very simply put, we would like to use the pageview dataset to have a
rough measure of how many times a paper was viewed thanks to
Wikipedia [*].
The full dataset is, as of now, ~ 4.7TB in size.
I have two questions:
* if we download this dataset, it would entail, from a first
estimate, ~30 days of continuous download (assuming an average
download speed of ~2 MB/s, which is what we measured while
downloading one month of data (~64 GB)). Here at my University
(Trento, Italy) this kind of download has to be reported to the IT
department. I was wondering if this would be useful information for
the WMF, too.
No need to notify such small downloads.
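For what it's worth, the ~30-day figure checks out. A quick sanity check of the arithmetic (using decimal units, 1 TB = 10^12 bytes, as dataset sizes usually do):

```python
# Rough sanity check of the download-time estimate from the thread.
dataset_bytes = 4.7e12        # ~4.7 TB full dataset
speed_bytes_per_s = 2e6       # ~2 MB/s measured average download speed

seconds = dataset_bytes / speed_bytes_per_s
days = seconds / 86400        # 86,400 seconds per day
print(f"{days:.1f} days")    # ~27 days of continuous download
```

So roughly 27 days at a sustained 2 MB/s, i.e. about a month once stalls and restarts are factored in.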
* given the estimate above, I was wondering if it is possible to
obtain this data over FedEx bandwidth (cit.), i.e. by shipping a
physical disk. I know that in some fields (e.g. neuroscience) this
is the standard way to exchange big datasets (on the order of TBs).
This assumes that some point of the network can download faster from
that machine. The server is very slow for pretty much anyone (with
rare exceptions), possibly even inside the cluster. Copying to a hard
drive might take many days.
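Even so, shipping can come out ahead. A back-of-the-envelope comparison, where the copy speed and courier transit time are assumptions for illustration, not measurements:

```python
# "FedEx bandwidth" vs. direct download, with made-up numbers:
# copy_speed and transit_s below are assumptions, not measured values.
dataset_bytes = 4.7e12          # ~4.7 TB
net_speed = 2e6                 # ~2 MB/s direct download (measured in thread)
copy_speed = 20e6               # assumed 20 MB/s copy off the slow server
transit_s = 2 * 86400           # assumed 2-day courier shipment

ship_total_s = dataset_bytes / copy_speed + transit_s
effective_bw = dataset_bytes / ship_total_s   # bytes/s over the whole shipment

print(f"{effective_bw / 1e6:.1f} MB/s")       # ~11.5 MB/s vs 2 MB/s direct
```

Under these assumptions the shipped disk is still ~5x faster than the network, but the advantage shrinks quickly if the copy onto the disk itself takes many days.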
You have two more alternatives:
* scp from Labs, /public/dumps/pagecounts-all-sites/ (sometimes reaches
3-4 MB/s for me);
* start downloading all months at once over torrent; this will
hopefully saturate your bandwidth because you will download from
dozens of servers rather than one.
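The "all months at once" idea is just concurrent fetching: several transfers in flight so the aggregate rate is not capped by one slow connection. A minimal sketch, where the month names and the fetch function are placeholders rather than real endpoints:

```python
# Sketch of downloading several monthly archives concurrently.
# `fetch` is a placeholder: a real version would stream each file to disk
# (e.g. with urllib.request); here it just returns the name it was given.
from concurrent.futures import ThreadPoolExecutor

def fetch(name):
    # placeholder for a real download; returns the name as a marker
    return name

def fetch_all(names, fetcher=fetch, workers=8):
    # up to `workers` downloads in flight at a time, results in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetcher, names))

# hypothetical month identifiers, just to show the call shape
months = [f"pagecounts-2015-{m:02d}" for m in range(1, 13)]
results = fetch_all(months)
print(len(results))  # 12
```

With torrents the client does this for you, pulling chunks from many peers at once, which is why it tends to saturate the receiving end's bandwidth.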
Thanks in advance for your help.
[*] I know these are pageviews and not unique visitors; furthermore,
there is no guarantee that viewing a citation means anything. I am
approaching this data the same way "impressions" versus
"click-throughs" are treated in the online advertising world.
Wiki-research-l mailing list