I'm doing a test run of the new data dump script on our Korean cluster; currently jawiki (ja.wikipedia.org) is in progress: http://amaryllis.yaseo.wikimedia.org/backup/jawiki/20060118/
Any comments on the page layout and information included in the progress page?
A couple of notes:
* File names have been changed to include the database name and dump date. This should make it easier to figure out what the hell you just downloaded.
* The directory structure is different: database names are used instead of the weird mix of sites, languages, and database names that made it hard to get the scripts to run reliably. Each database has a subdirectory for each day it was dumped, plus a 'latest' subdirectory with symbolic links to the files from the last completed dump. (There's a rough sketch of the layout just after this list.)
* I renamed 'pages_current' and 'pages_full' to 'pages-meta-current' and 'pages-meta-history'. In addition to the big explanatory labels, this should emphasize that these dumps contain metapages such as discussion and user pages, distancing them from the pages-articles dump.
* I've discontinued 7-Zip compression for the current-versions dumps, since it doesn't do better than bzip2 there. 7-Zip files are still generated for the history dump, where it compresses significantly better (about 3 GB vs 11 GB for enwiki). That choice is also sketched just after this list.
* Upload tarballs are still not included.
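In case the layout description above is hard to picture, here's a rough sketch of how the dated directories and the 'latest' links could fit together. This isn't the actual runner code; the helper and the 'latest' link names are made up for the example.

    import os

    def publish_dump_dir(base, db_name, date, file_suffixes):
        """Lay out <base>/<db>/<date>/ and refresh <base>/<db>/latest/ links.

        Simplified sketch of the naming and layout described above; the real
        runner does a lot more (status page, checksums, error handling).
        """
        dump_dir = os.path.join(base, db_name, date)
        latest_dir = os.path.join(base, db_name, 'latest')
        os.makedirs(dump_dir, exist_ok=True)
        os.makedirs(latest_dir, exist_ok=True)

        for suffix in file_suffixes:
            # File names carry the database name and date, e.g.
            # jawiki-20060118-pages-articles.xml.bz2
            name = '%s-%s-%s' % (db_name, date, suffix)
            link = os.path.join(latest_dir, '%s-latest-%s' % (db_name, suffix))
            if os.path.islink(link):
                os.remove(link)
            # Relative link into the dated directory of the last completed dump
            os.symlink(os.path.join('..', date, name), link)

    # e.g. publish_dump_dir('/backup', 'jawiki', '20060118',
    #                       ['pages-articles.xml.bz2', 'pages-meta-current.xml.bz2'])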
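And for the compression note, conceptually it boils down to picking compressors per dump type, something like the sketch below. The command strings are illustrative only, not the runner's real invocations.

    def compression_commands(dump_type, base_name):
        """Choose how a dump stream gets compressed on disk (sketch only).

        Current-revision dumps get bzip2 only, since 7-Zip doesn't beat
        bzip2 there; the full-history dump also gets a 7z archive, where
        long-range matching on near-duplicate revisions pays off
        (roughly 3 GB vs 11 GB for enwiki history).
        """
        commands = ['bzip2 > %s.xml.bz2' % base_name]
        if dump_type == 'pages-meta-history':
            # 7za's -si reads the XML stream from stdin
            commands.append('7za a -si %s.xml.7z' % base_name)
        return commands

    print(compression_commands('pages-meta-current',
                               'jawiki-20060118-pages-meta-current'))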
The backup runner script is written in Python and lives in our CVS in the 'backup' module, should anyone feel like laughing at my code.
A few more things need to be fixed up before I start running it on the main cluster, but it's pretty close! (A list of databases in progress, some locking, emailing me on error, and finding the prior XML dump to speed up dump generation; a rough sketch of that last bit follows.)
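The "prior dump" lookup is basically just a scan of the per-database directory for an older date; a hypothetical helper might look like this (the function and its checks are made up for illustration).

    import os
    import re

    def find_prior_dump(base, db_name, current_date):
        """Return the newest dump date older than the run in progress, or None.

        Hypothetical helper: scans <base>/<db>/ for YYYYMMDD directories so the
        runner can prefetch page text from the previous XML dump instead of
        pulling everything from the database again. Whether that earlier run
        actually completed would still need checking.
        """
        db_dir = os.path.join(base, db_name)
        if not os.path.isdir(db_dir):
            return None
        dates = [d for d in os.listdir(db_dir)
                 if re.match(r'\d{8}$', d) and d < current_date]
        return max(dates) if dates else None

    # find_prior_dump('/backup', 'jawiki', '20060118') -> e.g. '20060111' or None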
-- brion vibber (brion @ pobox.com)
Just a comment: why not add the user_groups table to the public dump? It could be useful for statistical purposes, and I can't see any risk in publishing this information.
Regards,
Emilio Gonzalez
Emilio Gonzalez wrote:
Just a comment: why not add the user_groups table to the public dump? It could be useful for statistical purposes, and I can't see any risk in publishing this information.
Ummm, because I forgot? :)
I'll switch it.
(The info it contains is already available through Special:Listusers, so it's not terribly secret.)
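For anyone wondering what "switching it" amounts to: the SQL table dumps are driven by a list of public tables, so it's essentially a one-line change. The snippet below is just an illustration with a made-up table list and helper, not the actual script.

    import subprocess

    # Illustrative subset of the publicly dumped tables; the real list is
    # longer. Adding user_groups just means adding it here.
    PUBLIC_TABLES = ['site_stats', 'image', 'imagelinks', 'categorylinks',
                     'user_groups']

    def dump_table(db_name, table, out_path):
        """Dump one table with mysqldump and gzip it (sketch only)."""
        with open(out_path, 'wb') as out:
            dump = subprocess.Popen(['mysqldump', db_name, table],
                                    stdout=subprocess.PIPE)
            subprocess.check_call(['gzip', '-c'], stdin=dump.stdout, stdout=out)
            dump.stdout.close()
            dump.wait()

    # e.g. for table in PUBLIC_TABLES:
    #          dump_table('jawiki', table, 'jawiki-20060118-%s.sql.gz' % table)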
-- brion vibber (brion @ pobox.com)