Thanks for the release, Alex. I am sorry to see this resource go but agree
the data will be of great interest to researchers / app developers.
In terms of how to best store the data and metadata for long-term
preservation and discoverability, my recommendation is to use an open data
registry where you can describe the dataset, make it citable and
discoverable, add metadata and assign the entry a unique and persistent
identifier.
Services like Zenodo <https://zenodo.org/> or figshare
<https://figshare.com/> (the one we've used for our data releases at WMF,
see for example the clickstream dataset
<https://dx.doi.org/10.6084/m9.figshare.1305770.v21>) are good options to
do this.
Dario
On Sun, Dec 11, 2016 at 11:53 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
Alex Druk, 12/12/2016 08:32:
For a few years I have maintained a web site, wikipediatrends.com
<http://wikipediatrends.com>. For a variety of reasons I cannot do it any
more and the site will be closed in January.
However, our DB of English Wikipedia pageviews from 2007 can be used for
other projects. Anyone who wishes to get it, please see the info below.
Thanks. Can you please upload those files to the Internet Archive? You can
use the CLI
<https://internetarchive.readthedocs.io/en/latest/cli.html#upload>
with mediatype "data", collection "opensource" and subject
"Wikipedia; enwiki".
Nemo
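For anyone who would rather script the upload than use the CLI, here is a minimal sketch using the `internetarchive` Python library that ships alongside it. The metadata values are the ones suggested above; the identifier and file paths are placeholders you would choose yourself, and the sketch assumes the library is installed and credentials were already set up with `ia configure`.

```python
def build_metadata() -> dict:
    """Metadata fields suggested in the message above."""
    return {
        "mediatype": "data",
        "collection": "opensource",
        "subject": "Wikipedia; enwiki",
    }


def upload_files(identifier: str, paths: list):
    """Upload local files to one Internet Archive item.

    `identifier` is a hypothetical item name picked by the uploader.
    """
    # Imported here so the metadata helper works without the library installed.
    from internetarchive import upload  # third-party: pip install internetarchive

    return upload(identifier, files=list(paths), metadata=build_metadata())
```

A call might look like `upload_files("my-item-identifier", ["rdd112016_1.tar.gz"])`, where the identifier is only an illustration.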
A few words about the DB. We keep the data in a separate file for each page.
Each file is a CSV in which every line starts with a year, followed by the
pageviews for each day. The file name is the MD5 hash of the page name. Page
names are stored in a separate Berkeley DB file. The total size of the DB is
about 30 GB. It is in 3 archive files of about 10 GB each.
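As a rough illustration of the layout described above, here is a sketch of how one page file might be located and read. Several details are assumptions, not confirmed by the description: that the file name is the hex MD5 digest of the UTF-8 page name, that the values after the year are comma-separated, and that they run consecutively from January 1 of that year. The function names are made up for this example.

```python
import csv
import hashlib
import io
from datetime import date, timedelta


def page_file_name(page_title: str) -> str:
    """Assumed mapping: hex MD5 digest of the UTF-8 encoded page name."""
    return hashlib.md5(page_title.encode("utf-8")).hexdigest()


def parse_pageviews(csv_text: str) -> dict:
    """Return {date: views}, assuming each row is year,day1,day2,...
    with day1 standing for January 1 of that year."""
    views = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        year = int(row[0])
        for offset, cell in enumerate(row[1:]):
            cell = cell.strip()
            if cell == "":
                continue  # treat empty cells as missing days
            day = date(year, 1, 1) + timedelta(days=offset)
            views[day] = int(cell)
    return views
```

If the real files use a different delimiter or day alignment, only `parse_pageviews` would need to change.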
You can download the DB as of 12/03/2016 from:
https://s3-us-west-2.amazonaws.com/adrouk/november2016/rdd112016_1.tar.gz
https://s3-us-west-2.amazonaws.com/adrouk/november2016/rdd112016_2.tar.gz
https://s3-us-west-2.amazonaws.com/adrouk/november2016/articles112016.db
As of June 2015:
https://s3-us-west-2.amazonaws.com/adrouk/june2015/rdd62015_1.tar.gz
https://s3-us-west-2.amazonaws.com/adrouk/june2015/rdd62015_2.tar.gz
https://s3-us-west-2.amazonaws.com/adrouk/june2015/articles62015.db
Please do not hesitate to ask any questions about the DB. If by any chance
you are interested in the site as well, please contact me off the list.
Enjoy!
---
Thank you.
Alex Druk, PhD
wikipediatrends.com <http://wikipediatrends.com/>
alex.druk@gmail.com
(775) 237-8550 (Google Voice)
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
*Dario Taraborelli*
Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>