[Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

emijrp emijrp at gmail.com
Tue Jun 28 17:21:22 UTC 2011


Hi;

@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia
Foundation either. They can't and/or they don't want to provide image dumps
(which is worse?). The community donates images to Commons, the community
donates money every year, and now the community needs to develop software to
extract all the images, pack them, and of course host them in a permanent
way. Crazy, right?

@Milos: Instead of splitting the image dump by the first letter of the
filenames, I thought about splitting it by upload date (YYYY-MM-DD). So the
first chunks (2005-01-01) will be tiny, and recent ones will be several GB
each (a single day).
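
A minimal sketch of the date-based split described above, not an actual
downloader. It assumes each file comes with an upload timestamp in ISO 8601
UTC form (e.g. "2005-01-01T12:34:56Z"); the function name
chunk_by_upload_date and the sample filenames are hypothetical, used only for
illustration.

```python
from collections import defaultdict
from datetime import datetime


def chunk_by_upload_date(files):
    """Group (filename, upload_timestamp) pairs into one chunk per day.

    Assumption: timestamps are ISO 8601 UTC strings such as
    "2005-01-01T12:34:56Z". Returns a dict mapping "YYYY-MM-DD" to the
    list of filenames uploaded that day, with keys in chronological order.
    """
    chunks = defaultdict(list)
    for name, timestamp in files:
        uploaded = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        chunks[uploaded.strftime("%Y-%m-%d")].append(name)
    # "YYYY-MM-DD" strings sort chronologically, so a plain sort is enough.
    return dict(sorted(chunks.items()))


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("Example_b.jpg", "2011-06-27T09:15:00Z"),
        ("Example_a.jpg", "2005-01-01T12:34:56Z"),
        ("Example_c.png", "2011-06-27T23:59:59Z"),
    ]
    for day, names in chunk_by_upload_date(sample).items():
        print(day, names)
```

Each resulting chunk could then be packed into its own archive named after
the day, so that a chunk for a past date never changes once it has been built.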

Regards,
emijrp

2011/6/28 Derrick Coetzee <dcoetzee at eecs.berkeley.edu>

> As a Commons admin I've thought a lot about the problem of
> distributing Commons dumps. As for distribution, I believe BitTorrent
> is absolutely the way to go, but the Torrent will require a small
> network of dedicated permaseeds (servers that seed indefinitely).
> These can easily be set up at low cost on Amazon EC2 "small" instances
> - the disk storage for the archives is free, since small instances
> include a large (~120 GB) ephemeral storage volume at no additional
> cost, and the cost of bandwidth can be controlled by configuring the
> BitTorrent client with either a bandwidth throttle or a transfer cap
> (or both). In fact, I think all Wikimedia dumps should be available
> through such a distribution solution, just as all Linux installation
> media are today.
>
> Additionally, it will be necessary to construct (and maintain) useful
> subsets of Commons media, such as "all media used on the English
> Wikipedia", or "thumbnails of all images on Wikimedia Commons", of
> particular interest to certain content reusers, since the full set is
> far too large to be of interest to most reusers. It's on this latter
> point that I want your feedback: what useful subsets of Wikimedia
> Commons does the research community want? Thanks for your feedback.
>
> --
> Derrick Coetzee
> User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
> http://www.eecs.berkeley.edu/~dcoetzee/
>
>

