Let me retitle one of the topics nobody seems to touch.
On Fri, Aug 12, 2011 at 13:44, Brion Vibber brion@pobox.com wrote:
- media files -- these are freely copyable but I'm not sure of the state of
easily obtaining them in bulk. As the data set moved into the TB range it became impractical to just build .tar dumps. There are batch downloader tools available, and the metadata's all in the dumps and the API.
Right now it is basically locked down: there is no way to bulk copy the media files, not even to simply back up one Wikipedia, or Commons. I've tried, I've asked, and the answer was basically to contact a dev and arrange it. That could obviously be done (I know many of the folks), but that isn't the point.
Some explanations were offered, mostly that the media and its metadata are quite detached, which makes it hard to enforce licensing quirks like attribution and special licenses. I can see this is a relevant concern, since the text corpus is uniformly licensed under CC/GFDL, while the media files are at best non-homogeneous (like Commons, where everything is free in one way or another) and complete chaos at worst (individual Wikipedias, where there may be anything from leftover fair use to material copyrighted by various entities to images to be deleted "soon").
Still, I do not believe that making it close to impossible to bulk copy the data is a good approach. I am not sure which technical means is best, as there are several competing ones.
We could, for example, open up an API which would serve each media file together with its metadata, possibly supporting mass operations. Still, that's pretty inefficient.
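To make that concrete, here is a rough sketch (not a finished tool, and not an existing bulk endpoint) of what "file plus metadata in one pass" could look like on top of the current api.php; the parameter names are the real ones, but the batch size and the pagination loop are only illustrative:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"   # Commons as the example wiki

    def iter_files_with_metadata(batch=50):
        # One generator pass: file URLs and their metadata come back together,
        # which is roughly what a combined media+metadata API would have to do.
        params = {
            "action": "query",
            "format": "json",
            "generator": "allimages",
            "gailimit": batch,
            "prop": "imageinfo",
            "iiprop": "url|sha1|mime|size|user|timestamp",
        }
        while True:
            data = requests.get(API, params=params).json()
            for page in data.get("query", {}).get("pages", {}).values():
                info = page["imageinfo"][0]
                # descriptionurl points at the file description page,
                # which is where the license/attribution actually lives
                yield page["title"], info["url"], info["descriptionurl"], info["sha1"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    if __name__ == "__main__":
        for i, (title, url, desc, sha1) in enumerate(iter_files_with_metadata()):
            print(title, sha1, desc)
            if i >= 9:   # just show the first few results
                break

Pulling terabytes of binaries through HTTP one request at a time is exactly the inefficiency I mean, but at least the licensing context travels with each file.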
Or we could support zsync, rsync and the like (and I again recommend examining zsync's several interesting abilities to offload work to the client), but then there ought to be some pointer to the image metadata as well, at the very least a one-liner entry for every image linking to its license page.
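As a sketch of that one-liner-per-image companion file (the local mirror path and the File: URL pattern below are assumptions for illustration, not an existing layout): one tab-separated line per file with its name, sha1 and description-page link, which an rsync or zsync consumer would fetch alongside the files themselves.

    import hashlib, os, urllib.parse

    MIRROR_ROOT = "/srv/commons-mirror"          # hypothetical local mirror path
    DESC_BASE = "https://commons.wikimedia.org/wiki/File:"

    def sha1_of(path, bufsize=1 << 20):
        # Checksum the file in chunks so large media don't blow up memory.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    with open("media-manifest.tsv", "w", encoding="utf-8") as out:
        for dirpath, _dirs, files in os.walk(MIRROR_ROOT):
            for name in files:
                path = os.path.join(dirpath, name)
                desc = DESC_BASE + urllib.parse.quote(name)
                # name <tab> sha1 <tab> description/license page
                out.write(f"{name}\t{sha1_of(path)}\t{desc}\n")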
Or we could tie bulk access to established editor accounts, so we would have at least some assurance that s/he knows what s/he's doing.
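The plumbing for that check already exists in the API; only the policy of gating bulk downloads on it would be new. Something like the following, where the edit-count threshold is entirely made up:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def is_established(username, min_edits=500):
        # Ask the API for the account's edit count and registration date;
        # a bulk-download service could refuse anyone below some threshold.
        r = requests.get(API, params={
            "action": "query",
            "format": "json",
            "list": "users",
            "ususers": username,
            "usprop": "editcount|registration",
        }).json()
        user = r["query"]["users"][0]
        return "missing" not in user and user.get("editcount", 0) >= min_edits

    print(is_established("Example"))  # hypothetical username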