The problem is that 1) the files are bulky, 2) there are many of them, 3) they are in constant flux, and 4) it's likely that your connection would close for whatever reason part-way through the download. Even taking a snapshot of the filenames is dicey. By the time you finish, it's likely that there will be new ones, and possible that some will have been deleted. Probably the best way to make this work is to 1) take a snapshot of the file list periodically, and 2) create an API which returns a tarball built from that snapshot and which also supports HTTP Range requests. The snapshot would have to include file sizes so the server would know where to restart. Once a snapshot had not been accessed in a week, it would be deleted. As a snapshot got older and older it would become less and less accurate, but hey, life is tough that way.
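Roughly, the restart bookkeeping on the server side might look like this (a Python sketch; the snapshot format, the function names, and the assumption of a plain uncompressed ustar tarball with 512-byte blocks are all mine):

    # Sketch: given the byte offset from a Range request, work out which file
    # in the snapshot the tar stream should resume at.  Assumes an uncompressed
    # ustar tarball: a 512-byte header per entry, file data padded up to a
    # multiple of 512 bytes.  "snapshot" is the list of (name, size) pairs
    # recorded when the snapshot was taken.

    BLOCK = 512

    def padded(size):
        """Length of a file's data section in the tar stream."""
        return ((size + BLOCK - 1) // BLOCK) * BLOCK

    def resume_point(snapshot, range_start):
        """Return (index, skip): the snapshot entry whose tar entry contains
        byte range_start, and how many bytes into that entry it falls."""
        pos = 0
        for i, (name, size) in enumerate(snapshot):
            entry_len = BLOCK + padded(size)
            if range_start < pos + entry_len:
                return i, range_start - pos
            pos += entry_len
        raise ValueError("offset is past the end of the snapshot")

    # e.g. resume_point([("a.jpg", 1000), ("b.png", 3000)], 1536) -> (1, 0),
    # i.e. restart the stream at the header of b.png.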
Of course, this would result in a 12-terabyte file on the recipient's host. That wouldn't work very well. I'm pretty sure that the recipient would need an HTTP client which would 1) keep track of its place in the bytestream and 2) split out the files and write them to disk as separate files as they arrive. It's possible that a program like getbot already implements this.
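In outline, such a client might look like this (Python, using the requests library and tarfile's streaming mode; the URL and the resume handling are purely illustrative, not an existing tool):

    # Unpack the tarball as it streams in, instead of landing a 12 TB file on
    # disk first.  tarfile's "r|*" mode reads the archive strictly front to
    # back, so each member can be written out as soon as it arrives.

    import tarfile
    import requests

    def stream_and_unpack(url, dest, offset=0):
        headers = {"Range": "bytes=%d-" % offset} if offset else {}
        resp = requests.get(url, stream=True, headers=headers)
        resp.raise_for_status()
        with tarfile.open(fileobj=resp.raw, mode="r|*") as tar:
            for member in tar:
                tar.extract(member, path=dest)

    # If the connection drops, note how many bytes were consumed and call
    # stream_and_unpack() again with that offset -- which only works if the
    # server restarts the stream on a tar entry boundary, as sketched above.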
________________________________________
From: wikitech-l-bounces@lists.wikimedia.org [wikitech-l-bounces@lists.wikimedia.org] on behalf of Peter Gervai [grinapo@gmail.com]
Sent: Monday, August 15, 2011 4:45 AM
To: Wikimedia developers
Subject: [Wikitech-l] forking media files
Let me retitle one of the topics nobody seems to touch.
On Fri, Aug 12, 2011 at 13:44, Brion Vibber <brion@pobox.com> wrote:
- media files -- these are freely copiable, but I'm not sure of the state of easily obtaining them in bulk. As the data set moved into the terabyte range, it became impractical to just build .tar dumps. There are batch downloader tools available, and the metadata's all in the dumps and the API.
Right now it is basically locked down: there is no way to bulk copy the media files, not even to simply back up a single Wikipedia, or Commons. I've tried, I've asked, and the answer was basically to contact a dev and arrange it, which obviously could be done (I know many of the folks), but that isn't the point.
Some explanations were offered, mostly that media and its metadata are quite detached, and thus it's hard to enforce licensing quirks like attribution, special licenses and such. I can see this is a relevant point, since the text corpus is uniformly licensed under CC/GFDL, while the media files are non-homogeneous at best (like Commons, where everything is free in one way or another) and complete chaos at worst (individual Wikipedias, where there may be anything from leftover fair use, to material copyrighted by various entities, to images to be deleted "soon").
Still, I do not believe that making it close to impossible to bulk copy the data is a good approach. I am not sure which technical means would be best, as there are many competing ones.
We could, for example, open up an API which would serve each media file together with its metadata, possibly supporting mass operations. Still, that's pretty inefficient.
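For illustration, a mass query of that shape against the existing api.php might look roughly like this (a Python sketch using the standard allimages generator and imageinfo props; continuation and error handling are left out):

    # Rough sketch: fetch a batch of file URLs plus basic metadata from
    # Commons via api.php.  Only illustrates the "file together with its
    # metadata" shape; pagination ("continue") and the downloads themselves
    # are omitted.

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def batch_of_files(limit=50):
        params = {
            "action": "query",
            "format": "json",
            "generator": "allimages",
            "gailimit": limit,
            "prop": "imageinfo",
            "iiprop": "url|size|sha1|mime|user|comment",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        for page in pages.values():
            info = page["imageinfo"][0]
            yield page["title"], info["url"], info  # metadata travels with the file

    for title, url, meta in batch_of_files():
        print(title, meta["size"], meta["sha1"], url)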
Or we could support zsync, rsync and the like (and I again recommend examining zsync's several interesting abilities to offload work to the client), but there ought to be some pointer to image metadata, at least a one-liner file for every image linking to its license page.
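The one-liner manifest could be as simple as this (a Python sketch; the tab-separated fields and the hard-coded Commons base URL are just illustrative choices):

    # One line per media file: name, size, and the description page where the
    # licensing and attribution live.  Shipped next to the media tree so
    # rsync/zsync users always have a pointer back to the license.

    import os
    import urllib.parse

    FILE_PAGE = "https://commons.wikimedia.org/wiki/File:"

    def manifest_line(path):
        name = os.path.basename(path)
        page = FILE_PAGE + urllib.parse.quote(name.replace(" ", "_"))
        return "%s\t%d\t%s" % (name, os.path.getsize(path), page)

    # e.g. Foo_bar.jpg<TAB>123456<TAB>https://commons.wikimedia.org/wiki/File:Foo_bar.jpg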
Or we could tie bulk access to established editor accounts, so we would have at least a bit of assurance that s/he knows what s/he's doing.
-- byte-byte, grin
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l