Dear Ariel,
Happy New Year. Thank you for your email of 2012-12-31.
1) MEDIA TARBALLS
I looked at the media tarballs on http://ftpmirror.your.org/ and am quite impressed. I also walked the directory tree /pub/wikimedia/images/wikipedia/[language-code]/[0-9a-f]/. WP-MIRROR 0.5 and prior place images under $wgScriptPath/images/[0-9a-f]. I think that I should insert two more directory layers to better match what you have done: $wgScriptPath/images/wikipedia/[language-code]/[0-9a-f]/.
Action Item: WP-MIRROR 0.6 will make use of media tarballs and will reorganize the images directory tree to match your.org.
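To make the proposed reorganization concrete, here is a minimal sketch of the path mapping. It is not WP-MIRROR code. The function name `new_image_path` and the default project name are hypothetical, and it assumes the only change is inserting the two layers directly after the `images` directory.

```python
from pathlib import PurePosixPath


def new_image_path(old_path: str, language_code: str,
                   project: str = "wikipedia") -> str:
    """Insert <project>/<language-code> right after the 'images' component.

    Hypothetical helper: maps the WP-MIRROR 0.5 layout
    (.../images/[0-9a-f]/...) to the proposed layout
    (.../images/wikipedia/[language-code]/[0-9a-f]/...).
    """
    parts = PurePosixPath(old_path).parts
    if "images" not in parts:
        raise ValueError("no 'images' component in path: " + old_path)
    i = parts.index("images")  # first occurrence of the images directory
    new_parts = parts[: i + 1] + (project, language_code) + parts[i + 1:]
    return str(PurePosixPath(*new_parts))


if __name__ == "__main__":
    print(new_image_path("/var/www/images/a/Example.png", "en"))
```

The mapping is purely textual, so it can be run over a list of existing file names before anything is moved on disk.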
2) ACCESSIBILITY
Browsing http://ftpmirror.your.org/ turns out to be almost impossible for me. The contrast is too low (cyan on white). I have to bring up the source code (Ctrl-U) and browse that. Please ask someone at your.org to edit the style sheet /pub/misc/lighttpd-white-dir.css. Black on white works for me.
3) MULTISTREAM BZ2
Very interesting. For WP-MIRROR 0.5 and prior, the first steps in processing a dump are: download, verify, decompress, and then split into xchunks (of 1000 pages each). The xchunks are then scraped for image file names, fed into importDump.php, and so on. I introduced xchunks for reasons of robustness. Every aspect of mirror building has failure modes that prevent processing enwiki in one pass. Failure to process a few xchunks is, however, quite tolerable.
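For readers unfamiliar with the xchunk idea, the splitting step can be sketched roughly as follows. This is an illustration only, not the WP-MIRROR implementation. The name `split_pages` is hypothetical, and the sketch assumes that each `<page>` and `</page>` tag sits on a line of its own in the decompressed dump.

```python
from typing import Iterable, Iterator, List


def split_pages(lines: Iterable[str],
                pages_per_chunk: int = 1000) -> Iterator[List[str]]:
    """Group the lines of a decompressed XML dump into chunks of whole pages.

    Assumption: <page> and </page> each appear alone on a line.
    Lines outside any page (header, siteinfo, footer) are dropped.
    """
    chunk: List[str] = []
    pages = 0
    in_page = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            in_page = True
        if in_page:
            chunk.append(line)
        if stripped == "</page>":
            in_page = False
            pages += 1
            if pages == pages_per_chunk:
                yield chunk
                chunk, pages = [], 0
    if chunk:
        # Final, possibly short, chunk.
        yield chunk
```

Because each chunk holds only whole pages, a chunk that fails in a later step can be skipped or retried without touching the others.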
I am not yet clear as to the ramifications of using your tool. At one end of the scale it might obviate decompressing the dump. At the other end, it might obviate the use of xchunks entirely.
Action Item: Study multistream bz2 for possible use.
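As a starting point for that study, here is a minimal sketch of how a single stream might be read out of a multistream file without decompressing the whole dump. It assumes the index has one `offset:page-id:title` entry per line, where the offset is the byte position at which a bz2 stream begins. The function names are hypothetical.

```python
import bz2
from typing import Tuple


def parse_index_line(line: str) -> Tuple[int, int, str]:
    """Split one index entry into (offset, page_id, title).

    Titles may themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path: str, offset: int, block_size: int = 64 * 1024) -> bytes:
    """Decompress exactly one bz2 stream that starts at byte `offset`."""
    decompressor = bz2.BZ2Decompressor()
    pieces = []
    with open(path, "rb") as f:
        f.seek(offset)
        # BZ2Decompressor sets .eof when the current stream ends,
        # so the streams that follow are never touched.
        while not decompressor.eof:
            block = f.read(block_size)
            if not block:
                raise EOFError("file ended before the bz2 stream was complete")
            pieces.append(decompressor.decompress(block))
    return b"".join(pieces)
```

Since each stream decompresses independently, several workers could each take a different offset from the index, which may be what makes the separate xchunk files unnecessary.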
4) MWDUMPER
I have run experiments with MWdumper.jar, and cannot conclude that MWdumper.jar is usable. My lab notes can be found in the WP-MIRROR 0.5 Reference Manual (http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf). Several sections may be of interest to mirror builders:
Appendix E.7 Experiments with InnoDB (especially Figure E.1)
Appendix E.9 Experiments with MWdumper.jar -- Round 2
Appendix E.11 Experiments with wikix
Appendix E.12 Experiments with Downloading Images
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages (this is a collection of error messages that I have seen)
No part of mirror building is easy. The vision for WP-MIRROR is to automate all the steps so that anyone with enough disk space can build his own mirror.
5) MWIMPORT
Thanks for bringing this to my attention.
Action Item: I will study mwimport for possible use.
6) POTY
I noticed, on this list, several e-mails regarding POTY collections. I downloaded them all. Very sweet. Many thanks to whoever is making this happen. I look forward to seeing the 2012 collection.
I know that producing dump files is a big task. Thanks for all the hard work.
Sincerely Yours,
Kent
On 12/31/12, Ariel T. Glenn <ariel@wikimedia.org> wrote:
Hello wp mirror dev,
If you are not already downloading media tarball bundles for the initial mirror setup, you should be. See
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current...
for these, available once a month and I hope to make incrementals regularly available mid-month as well.
Additionally, you can use the 'multistream bz2' files with their indexes to manage concurrency, rather than needing to write out a pile of separate xml files.
Lastly, if you are using importDump.php, I strongly recommend you use mwdumper or mwimport to create an SQL file which you can feed to MySQL, and then stuff in the various link tables as well. This will cut down the setup time immensely.
Ariel