Hi,
For the backups (such as at http://download.wikimedia.org/wikipedia/en/ ), can I please make a suggestion?
Can the intermediate backup files please be placed in a separate directory (e.g. http://download.wikimedia.org/wikipedia/en/in-progress/ ), and only be moved to the real directory once they are complete and there were no errors (such as running out of disk space), so as to avoid confusion about what is and what is not a complete and valid dump?
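A minimal sketch of the staging idea suggested above, assuming the dump job is a Python callable and that the "in-progress" directory sits on the same filesystem as the public directory, so the final rename is atomic. The names publish_dump, write_dump and staging_subdir are hypothetical and are not part of any existing Wikimedia dump script.

```python
import os
from pathlib import Path


def publish_dump(write_dump, final_dir, name, staging_subdir="in-progress"):
    """Write a dump into a staging directory; move it to final_dir only on success.

    write_dump is a hypothetical callable that takes a binary file object
    and writes the complete dump to it, raising on any error.
    """
    final_dir = Path(final_dir)
    staging_dir = final_dir / staging_subdir
    staging_dir.mkdir(parents=True, exist_ok=True)
    staging_path = staging_dir / name

    try:
        with open(staging_path, "wb") as out:
            write_dump(out)
            out.flush()
            os.fsync(out.fileno())  # make sure the data is on disk before publishing
    except BaseException:
        # Any failure (e.g. running out of disk space) leaves nothing published.
        staging_path.unlink(missing_ok=True)
        raise

    final_path = final_dir / name
    # os.replace is an atomic rename when both paths are on the same filesystem,
    # so readers of final_dir never see a partially written file.
    os.replace(staging_path, final_path)
    return final_path
```

With this arrangement, anything listed directly in the public directory is a dump whose writer finished without raising; anything under "in-progress" is by definition not yet complete.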
The reason I ask is that the file listing for this URL currently is very confusing - for example, here are the compressed "full.xml" dump files at this location, and their dates and file sizes:
=========================================================
20050713_pages_full.xml.gz   2005-Jul-16 16:25:28  29.9G
20050924_pages_full.xml.bz2  2005-Oct-01 03:52:03  11.3G
20051002_pages_full.xml.bz2  2005-Oct-03 19:35:09   1.7G
=========================================================
I have to presume that the "20051002_pages_full.xml.bz2" is an in-progress dump, because it is one tenth of the size of the file of the week before, despite using the same compression.
Then, once you start second-guessing whether something is a complete dump or not, it opens a can of worms: you then have to ask whether the 20050924_pages_full.xml.bz2 dump is complete, given that the previous dump is nearly 3 times the size. Or maybe it's because of the difference between gzip and bzip2 ... who can say for sure?
All the best, Nick.
Nick Jenkins wrote:
Can the intermediate backup files please be placed in a separate directory (e.g. http://download.wikimedia.org/wikipedia/en/in-progress/ ), and only be moved to the real directory once they are complete and there were no errors (such as running out of disk space), so as to avoid confusion about what is and what is not a complete and valid dump?
I don't think there's any point in that; the only reason anyone would look in the directories is because the new front-end interface to download.wikimedia.org hasn't been written yet and we took down the old interface temporarily because it was pointing at the wrong files.
=========================================================
20050713_pages_full.xml.gz   2005-Jul-16 16:25:28  29.9G
20050924_pages_full.xml.bz2  2005-Oct-01 03:52:03  11.3G
20051002_pages_full.xml.bz2  2005-Oct-03 19:35:09   1.7G
=========================================================
I have to presume that the "20051002_pages_full.xml.bz2" is an in-progress dump, because it is one tenth of the size of the file of the week before, despite using the same compression.
Indeed, in-progress.
Then, once you start second-guessing whether something is a complete dump or not, it opens a can of worms: you then have to ask whether the 20050924_pages_full.xml.bz2 dump is complete, given that the previous dump is nearly 3 times the size. Or maybe it's because of the difference between gzip and bzip2 ... who can say for sure?
bzip2 does about 3-5 times better on the full-history dumps than gzip does; 7zip does another 3-5 times better than bzip2.
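File sizes only hint at completeness. A downloader can at least check whether a .bz2 file was cut off mid-stream, because a truncated bzip2 stream lacks its end-of-stream marker. The following is a rough sketch using Python's standard bz2 module; the function name bz2_is_complete is hypothetical. It only detects truncated or corrupt compression streams, not a dump whose writer stopped early but closed the stream cleanly.

```python
import bz2


def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if the bzip2 file at path decompresses to its end without error."""
    try:
        with bz2.open(path, "rb") as f:
            # Read and discard the decompressed data in chunks to keep memory flat.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError):
        # EOFError: the file ended before the end-of-stream marker (truncated).
        # OSError: the compressed data is invalid.
        return False
    return True
```

The command-line equivalent is "bzip2 -t FILE", which tests integrity without writing any output. Either way the whole file has to be decompressed, which takes a while on a multi-gigabyte dump.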
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org