I made a quick runthrough of the upload directories to see how big the total file set for each wiki is, with an eye towards getting bulk dumps of uploads ready again.
A pretty pie chart and raw data are here: http://meta.wikimedia.org/wiki/Upload_distribution%2C_June_2006
Altogether, current versions of files (excluding thumbnails) total about 372 gigabytes. Commons makes up the vast majority, with over 245 gigs. English Wikipedia squeaks in with nearly another 60 gigs and German Wikipedia with just shy of 20 GB; sizes drop off rapidly from there.
Giant tarballs are a rather unwieldy way to distribute file dumps at the larger sizes: they require 2x the disk space (for staging complete and in-progress builds) and of course if anyone downloads them it all comes out of our central bandwidth.
Wegge's doing some testing with BitTorrent; it might or might not be feasible to build torrent files that simply reference all the individual files, so we can use hardlinks to maintain a snapshot without eating up the full disk space on the server. This also avoids the need to keep or extract a large archive file for the downloader.
Given the number of files (about 650k in Commons now) and their wild and crazy filenames this might not be totally feasible, but we can hope.
-- brion vibber (brion @ pobox.com)
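The hardlink idea above can be sketched roughly as follows. This is a minimal illustration, not the tooling Wegge is testing: the function name `hardlink_snapshot`, the `upload`/`snapshot` directory names and the sample file are all made up, and it assumes the snapshot lives on the same filesystem as the upload tree (hard links cannot cross filesystems). Because each link shares its data with the live file, the snapshot costs directory entries rather than a second copy; it only stays frozen if live files are replaced rather than modified in place.

```python
import os
import tempfile


def hardlink_snapshot(src_root, dst_root):
    """Mirror src_root into dst_root using hard links instead of copies.

    Returns the number of files linked. Both roots must be on the same
    filesystem, since hard links cannot span filesystems.
    """
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))
            linked += 1
    return linked


def demo():
    """Build a tiny hypothetical upload tree, snapshot it, compare inodes."""
    with tempfile.TemporaryDirectory() as base:
        src = os.path.join(base, "upload")
        dst = os.path.join(base, "snapshot")
        os.makedirs(os.path.join(src, "a", "ab"))
        original = os.path.join(src, "a", "ab", "Foo.JPG")
        with open(original, "wb") as fh:
            fh.write(b"not really a jpeg")
        count = hardlink_snapshot(src, dst)
        linked = os.path.join(dst, "a", "ab", "Foo.JPG")
        same_inode = os.stat(original).st_ino == os.stat(linked).st_ino
        return count, same_inode
```

A torrent built over the snapshot directory would then reference the individual files, with no tarball to stage or extract.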
On 6/17/06, Brion Vibber brion@pobox.com wrote:
> Given the number of files (about 650k in Commons now) and their wild and crazy filenames this might not be totally feasible, but we can hope.
Speaking of wild and crazy filenames, would it be possible to force some basic naming scheme on incoming files, such as at least making the extensions lower case?
Steve
On 6/17/06, Steve Bennett stevage@gmail.com wrote:
> On 6/17/06, Brion Vibber brion@pobox.com wrote:
>> Given the number of files (about 650k in Commons now) and their wild and crazy filenames this might not be totally feasible, but we can hope.
> Speaking of wild and crazy filenames, would it be possible to force some basic naming scheme on incoming files, such as at least making the extensions lower case?
This would stop one of the primary causes of duplicate uploads we see on enwiki, where people upload Foo.JPG, find that it doesn't work when they use [[Image:Foo.jpg]], and so click on the redlink and upload Foo.jpg.
I'm close to beginning a campaign on enwiki to advocate removing the upload link entirely, forcing people to upload by following a redlink, because even if we only count fair-use-tagged images there are around a thousand never-used images uploaded per week.
On 17/06/06, Steve Bennett stevage@gmail.com wrote:
> On 6/17/06, Brion Vibber brion@pobox.com wrote:
>> Given the number of files (about 650k in Commons now) and their wild and crazy filenames this might not be totally feasible, but we can hope.
> Speaking of wild and crazy filenames, would it be possible to force some basic naming scheme on incoming files, such as at least making the extensions lower case?
It's quite a sensible idea, but we'd need to watch for collisions with existing files where the lowercase extension version pre-exists. The current collision handling would be fine.
There's an open bug which requests, among other things, to have file extensions be insignificant with respect to using the images in pages.
Rob Church
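To get a feel for the collision concern, one could scan the existing names for groups that differ only in extension case. The sketch below is an illustration under assumptions: `find_extension_case_collisions` is a hypothetical helper, not MediaWiki code, and the sample names in the usage note are invented.

```python
import os
from collections import defaultdict


def find_extension_case_collisions(names):
    """Group file names that would collide if extensions were lowercased.

    Returns a dict mapping the lowercased-extension name to the sorted list
    of existing names that map onto it, for groups with more than one member.
    """
    groups = defaultdict(list)
    for name in names:
        stem, ext = os.path.splitext(name)
        groups[stem + ext.lower()].append(name)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len(members) > 1
    }
```

For example, given `["Foo.JPG", "Foo.jpg", "Bar.png", "Baz.PNG"]`, only the two Foo files are reported, since Baz.PNG has no lowercase twin yet.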
Rob Church wrote:
> On 17/06/06, Steve Bennett stevage@gmail.com wrote:
>> On 6/17/06, Brion Vibber brion@pobox.com wrote:
>>> Given the number of files (about 650k in Commons now) and their wild and crazy filenames this might not be totally feasible, but we can hope.
>> Speaking of wild and crazy filenames, would it be possible to force some basic naming scheme on incoming files, such as at least making the extensions lower case?
> It's quite a sensible idea, but we'd need to watch for collisions with existing files where the lowercase extension version pre-exists. The current collision handling would be fine.
> There's an open bug which requests, among other things, to have file extensions be insignificant with respect to using the images in pages.
Here I should plug my notes on the way files are being stored for the undeletion archive: http://www.mediawiki.org/wiki/FileStore
This is a prototype for the new file storage organization for all uploads; after some shakedown in deletion-land we can see about migrating the public-facing files too.
Since the actual on-disk filenames will be independent of the in-wiki names, that'll be the time to implement full extension-independence, file renaming, image redirects, etc.
Normalizing extensions on upload, though, would be relatively easy. Potentially it could be added now. (Note I put in a function to do some normalization in the Image class; it's used for normalizing the extension used for the FileStore name.)
Things to be careful of: 1) Uploading a new version of an existing file should still work. 2) uh... probably something else
-- brion vibber (brion @ pobox.com)
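A rough sketch of normalizing the extension on upload while keeping caution 1) in mind: if the requested name already exists exactly as typed, it is treated as a new version and left alone; otherwise the extension is lowercased. Both function names are hypothetical, and this is not the normalization function in the Image class mentioned above, which may well do more than lowercase.

```python
import os


def normalize_extension(name):
    """Lowercase only the extension, leaving the rest of the name untouched."""
    stem, ext = os.path.splitext(name)
    return stem + ext.lower()


def resolve_upload_name(requested, existing_names):
    """Pick the name an incoming upload should be stored under.

    An exact match with an existing file is kept as-is so that uploading a
    new version of, say, an old Foo.JPG still lands on Foo.JPG. Anything
    else gets its extension normalized.
    """
    if requested in existing_names:
        return requested
    return normalize_extension(requested)
```

With `existing_names = {"Old.JPG"}`, a re-upload of `Old.JPG` keeps its name, while a fresh `New.JPG` becomes `New.jpg`. If `New.jpg` already exists, that is an ordinary name collision for the existing handling to deal with.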
On 6/17/06, Brion Vibber brion@pobox.com wrote:
> Here I should plug my notes on the way files are being stored for the undeletion archive: http://www.mediawiki.org/wiki/FileStore
> This is a prototype for the new file storage organization for all uploads; after some shakedown in deletion-land we can see about migrating the public-facing files too.
Oh! I recognize some of that system. Great to see it becoming reality.
BTW, there are a lot of uploads in our existing spools which are exact duplicates. Have you been logging dups on delete? It would be interesting to watch.
Gregory Maxwell wrote:
> BTW, there are a lot of uploads in our existing spools which are exact duplicates. Have you been logging dups on delete? It would be interesting to watch.
Of 12,209 files deleted from Commons so far, 176 appear multiple times, mostly just twice each.
The winner so far is a goatse-style image which appears in the history of 23 deleted satellite weather maps. Most others look like images which were reverted (probably mostly from those goatse images).
Should one be curious, you can poke in the filearchive table to look for dupes; fa_storage_key contains the hash + file extension. This ought to be available on the toolserver (though it may need a view created for it).
-- brion vibber (brion @ pobox.com)
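Poking in filearchive for dupes amounts to grouping on fa_storage_key. The sketch below runs that query against an in-memory SQLite stand-in so it is self-contained. The two-column schema is a cut-down assumption (the real table lives in MySQL and has more columns), and the rows and key values are invented placeholders, not real hashes.

```python
import sqlite3


def find_duplicate_keys(conn):
    """Return (fa_storage_key, copies) for keys that appear more than once."""
    query = """
        SELECT fa_storage_key, COUNT(*) AS copies
        FROM filearchive
        GROUP BY fa_storage_key
        HAVING COUNT(*) > 1
        ORDER BY copies DESC, fa_storage_key
    """
    return conn.execute(query).fetchall()


def demo():
    """Load a few made-up rows into a stand-in table and look for dupes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE filearchive (fa_name TEXT, fa_storage_key TEXT)")
    rows = [
        ("Map_1.png", "aaa111.png"),
        ("Map_2.png", "aaa111.png"),
        ("Map_3.png", "aaa111.png"),
        ("Photo.jpg", "bbb222.jpg"),
        ("Photo_copy.jpg", "bbb222.jpg"),
        ("Unique.gif", "ccc333.gif"),
    ]
    conn.executemany("INSERT INTO filearchive VALUES (?, ?)", rows)
    return find_duplicate_keys(conn)
```

Since the key is the hash plus the extension, identical content stored under different extensions would not group together in this query.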
wikitech-l@lists.wikimedia.org