Hi all,
I am writing to the public list in the hope that this discussion may be of interest to others.
I am working with a student on scientific citations on Wikipedia and, very simply put, we would like to use the pageview dataset to get a rough measure of how many times a paper was viewed thanks to Wikipedia.[*]
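To make the measurement concrete, here is a minimal sketch (in Python) of the aggregation we have in mind. It assumes the hourly pagecounts line format (project, page title, view count, bytes transferred); the citing_pages set and the file name are hypothetical placeholders:

    import gzip
    from collections import Counter

    # Hypothetical placeholder: (project, page_title) pairs for the
    # Wikipedia articles known to cite the paper under study.
    citing_pages = {("en", "CRISPR"), ("en", "Genome_editing")}

    def count_views(pagecounts_file):
        """Sum hourly view counts for the citing pages from one
        pagecounts file (assumed format: project, title, views, bytes)."""
        totals = Counter()
        with gzip.open(pagecounts_file, "rt", encoding="utf-8",
                       errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(" ")
                if len(fields) != 4:
                    continue  # skip malformed lines
                project, title, views, _size = fields
                if (project, title) in citing_pages:
                    totals[(project, title)] += int(views)
        return totals

    # e.g. totals = count_views("pagecounts-20151101-000000.gz")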
The full dataset is, as of now, ~4.7 TB in size.
I have two questions:
- If we download this dataset, a first estimate suggests ~30 days of continuous downloading (assuming an average download speed of ~2 MB/s, which is what we measured while downloading one month of data, ~64 GB); see the quick check sketched below. Here at my university (Trento, Italy) downloads of this kind have to be reported to the IT department. I was wondering whether this would also be useful information for the WMF.
- Given the estimate above, I was wondering whether it is possible to obtain this data via "FedEx bandwidth" [1], i.e. by shipping a physical disk. I know that in some fields (e.g. neuroscience) this is the standard way to exchange large datasets (on the order of terabytes).
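As a quick sanity check on that estimate, using only the figures above (~4.7 TB at ~2 MB/s):

    # Back-of-the-envelope download-time estimate from the figures above.
    DATASET_BYTES = 4.7e12  # ~4.7 TB
    SPEED_BPS = 2e6         # ~2 MB/s measured average download speed

    seconds = DATASET_BYTES / SPEED_BPS
    days = seconds / 86400  # seconds per day
    print(f"~{days:.0f} days of continuous download")  # ~27 days, roughly a month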
Thanks in advance for your help.
Cristian

[*] I know these are pageviews and not unique visitors; furthermore, there is no guarantee that viewing a citation means anything. I am approaching this data the same way "impressions" versus "click-throughs" are treated in the online advertising world.
[1] https://what-if.xkcd.com/31/
Cristian Consonni, 11/11/2015 15:09:
> I am working with a student on scientific citations on Wikipedia and, very simply put, we would like to use the pageview dataset to get a rough measure of how many times a paper was viewed thanks to Wikipedia.[*]
> The full dataset is, as of now, ~4.7 TB in size.
> I have two questions:
> - If we download this dataset, a first estimate suggests ~30 days of continuous downloading (assuming an average download speed of ~2 MB/s, which is what we measured while downloading one month of data, ~64 GB). Here at my university (Trento, Italy) downloads of this kind have to be reported to the IT department. I was wondering whether this would also be useful information for the WMF.
No need to notify such small downloads.
> - Given the estimate above, I was wondering whether it is possible to obtain this data via "FedEx bandwidth" [1], i.e. by shipping a physical disk. I know that in some fields (e.g. neuroscience) this is the standard way to exchange large datasets (on the order of terabytes).
This assumes that some point in the network can download from that machine quickly. In practice the server is very slow for pretty much everyone, with rare exceptions (https://phabricator.wikimedia.org/T45647), possibly even from inside the cluster, so copying to a hard drive might itself take many days.
You have two more alternatives:
- scp from Labs, /public/dumps/pagecounts-all-sites/ (sometimes reaches 3-4 MB/s for me);
- archive.org for pagecounts-raw, https://archive.org/search.php?query=wikipedia_visitor_stats (you can start downloading all months at once and use torrents, which will hopefully saturate your bandwidth because you will be downloading from dozens of servers rather than one); see the sketch below.
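For the archive.org route, here is a minimal sketch using the internetarchive Python package (pip install internetarchive). The search query is the one from the link above, but exactly which items and files it returns is an assumption you should verify before a bulk download:

    from internetarchive import search_items, download

    QUERY = "wikipedia_visitor_stats"  # the archive.org search above

    for result in search_items(QUERY):
        identifier = result["identifier"]
        # Fetch only each item's .torrent file; feeding these to a torrent
        # client starts all months at once, downloading from many peers.
        download(identifier, glob_pattern="*.torrent", destdir="torrents",
                 verbose=True, retries=5)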
Nemo
> Thanks in advance for your help.
> Cristian
>
> [*] I know these are pageviews and not unique visitors; furthermore, there is no guarantee that viewing a citation means anything. I am approaching this data the same way "impressions" versus "click-throughs" are treated in the online advertising world.
> [1] https://what-if.xkcd.com/31/