Hi,
@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation either. They can't provide image dumps, or they don't want to (which is worse?). The community donates images to Commons, the community donates money every year, and now the community has to develop software to extract all the images, pack them, and of course host them in a permanent way. Crazy, right?
@Milos: Instead of splitting the image dump by the first letter of the filenames, I thought about splitting it by upload date (YYYY-MM-DD). That way the first chunks (2005-01-01) would be tiny, and the recent ones (a single day) several GB.
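A minimal sketch of what I have in mind, assuming we can get (filename, upload timestamp) pairs out of the image table; the function name is just illustrative:

    from collections import defaultdict

    def chunk_by_upload_date(entries):
        """Group (filename, timestamp) pairs into one chunk per upload day.

        `entries` is assumed to be an iterable of
        (filename, 'YYYY-MM-DDThh:mm:ssZ') tuples.
        """
        chunks = defaultdict(list)
        for filename, timestamp in entries:
            day = timestamp[:10]          # 'YYYY-MM-DD'
            chunks[day].append(filename)
        return chunks

    # Each chunk could then be packed separately,
    # e.g. 2005-01-01.tar, 2005-01-02.tar, ...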
Regards, emijrp
2011/6/28 Derrick Coetzee dcoetzee@eecs.berkeley.edu
As a Commons admin I've thought a lot about the problem of distributing Commons dumps. For distribution, I believe BitTorrent is absolutely the way to go, but the torrent will require a small network of dedicated permaseeds (servers that seed indefinitely). These can easily be set up at low cost on Amazon EC2 "small" instances: the disk storage for the archives is free, since small instances include a large (~120 GB) ephemeral storage volume at no additional cost, and the cost of bandwidth can be controlled by configuring the BitTorrent client with a bandwidth throttle, a transfer cap, or both. In fact, I think all Wikimedia dumps should be available through such a distribution solution, just as all Linux installation media are today.
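As a rough illustration of that trade-off (the per-GB price and the monthly cap below are assumptions for the sake of the arithmetic, not actual EC2 pricing):

    # How a monthly transfer cap translates into a client-side throttle.
    PRICE_PER_GB_OUT = 0.10      # USD per GB of outbound transfer (assumed)
    MONTHLY_CAP_GB   = 300       # transfer we are willing to pay for (assumed)

    SECONDS_PER_MONTH = 30 * 24 * 3600

    # Average upload rate that keeps one seed under the monthly cap
    throttle_kib_s = MONTHLY_CAP_GB * 1024 * 1024 / SECONDS_PER_MONTH
    monthly_cost   = MONTHLY_CAP_GB * PRICE_PER_GB_OUT

    print("throttle ~%.0f KiB/s, cost ~$%.2f/month"
          % (throttle_kib_s, monthly_cost))
    # -> roughly 121 KiB/s and $30/month under these assumptions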
Additionally, it will be necessary to construct (and maintain) useful subsets of Commons media, such as "all media used on the English Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are of particular interest to certain content reusers, since the full set is far too large to be of interest to most of them. It's on this latter point that I want your feedback: what useful subsets of Wikimedia Commons does the research community want? Thanks for your feedback.
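For example, a thumbnail subset could be generated along these lines; the directory names and the 256px bound are placeholders, not a settled format:

    import os
    from PIL import Image

    SRC = "commons-full"      # assumed directory of full-resolution files
    DST = "commons-thumbs"    # output directory for the thumbnail subset

    os.makedirs(DST, exist_ok=True)
    for name in os.listdir(SRC):
        try:
            im = Image.open(os.path.join(SRC, name))
            im.thumbnail((256, 256))              # keeps aspect ratio
            im.convert("RGB").save(os.path.join(DST, name + ".jpg"), "JPEG")
        except (IOError, OSError):
            pass                                  # skip non-raster files (SVG, OGG, ...)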
--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/