Re: [Xmldatadumps-l] wpmirror optimizations (was: Re: [WP-MIRROR] Questions regarding Metalink and SPDY)

2 Jan 2013


Dear Ariel,
Happy New Year.  Thank you for your email of 2012-12-31.
1) MEDIA TARBALLS
I looked at the media tarballs on http://ftpmirror.your.org/ and am
quite impressed.  I also walked the directory tree
/pub/wikimedia/images/wikipedia/[language-code]/[0-9a-f]/.
WP-MIRROR 0.5 and prior place images under
$wgScriptPath/images/[0-9a-f].  I think that I should insert two more
directory layers to better match what you have done:
$wgScriptPath/images/wikipedia/[language-code]/[0-9a-f]/.
Action Item:  WP-MIRROR 0.6 will make use of media tarballs and
reorganize the images directory tree to match your.org.
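
To illustrate the proposed layout, here is a minimal Python sketch that
maps a file name to a path under the reorganized tree.  It assumes
MediaWiki's default hashed upload layout, in which the first directory
level is the first hex digit of the MD5 of the file name and a second
level uses the first two hex digits.  The function name
mirror_image_path and its parameters are hypothetical and are not part
of WP-MIRROR.

```python
import hashlib


def mirror_image_path(filename, language_code,
                      project="wikipedia", root="images"):
    """Return a relative path such as images/wikipedia/en/<h>/<hh>/<name>.

    Assumption: MediaWiki's default hashed layout, where <h> and <hh> are
    the first one and first two hex digits of the MD5 of the file name.
    The caller is expected to pass an already normalized name; this
    sketch only replaces spaces with underscores.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "/".join([root, project, language_code,
                     digest[0], digest[:2], name])


if __name__ == "__main__":
    print(mirror_image_path("Example image.jpg", "en"))
```

Prefixing the result with $wgScriptPath would give the full location on
the mirror.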
2) ACCESSIBILITY
Browsing http://ftpmirror.your.org/ turns out to be almost
impossible for me.  The contrast is too low (cyan on white).  I have
to bring up the source code (Ctrl-U) and browse that.  Please ask
someone at your.org to edit the style sheet
/pub/misc/lighttpd-white-dir.css.  Black on white works for me.
3) MULTISTREAM BZ2
Very interesting.  For WP-MIRROR 0.5 and prior the first steps in
processing a dump are: download, verify, decompress, and then split
into xchunks (of 1000 pages each).  The xchunks are then scraped for
image file names, fed into importDump.php, etc.  I introduced
xchunks for reasons of robustness.  Every aspect of mirror building
has failure modes that prevent processing enwiki in one pass.  Failure
to process a few xchunks is however quite tolerable.
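
The splitting step can be sketched roughly as follows.  This is not
the WP-MIRROR implementation, only a minimal Python illustration.  It
assumes that each <page> and </page> tag sits on its own line of the
decompressed dump; the names iter_pages and iter_xchunks are
hypothetical.  A real xchunk would also need the <mediawiki> wrapper
and <siteinfo> header before importDump.php could read it, which this
sketch omits.

```python
from itertools import islice


def iter_pages(lines):
    """Yield each <page>...</page> element as one string."""
    buf = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            buf = []                      # start collecting a new page
        if buf is not None:
            buf.append(line)
            if stripped.endswith("</page>"):
                yield "".join(buf)
                buf = None                # ignore text between pages


def iter_xchunks(lines, pages_per_chunk=1000):
    """Group pages into lists of at most pages_per_chunk pages."""
    pages = iter_pages(lines)
    while True:
        chunk = list(islice(pages, pages_per_chunk))
        if not chunk:
            return
        yield chunk
```

Because each chunk is produced independently, a failure while
processing one chunk can be logged and skipped without restarting the
whole dump.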
I am not yet clear as to the ramifications of using your tool.  At one
end of the scale it might obviate decompressing the dump.  At the
other end, it might obviate the use of xchunks entirely.
Action Item:  Study multistream bz2 for possible use.
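
As a starting point for that study, here is a minimal Python sketch of
reading one stream out of a multistream file without decompressing the
rest.  It assumes that the file is a concatenation of independent bz2
streams and that each index line has the form offset:page_id:title,
where offset is the byte position at which a stream begins.  The
function names read_stream and parse_index_line are hypothetical.

```python
import bz2


def parse_index_line(line):
    """Split one index line into (offset, page_id, title).

    Assumption: the line looks like 'offset:page_id:title'.  Titles may
    themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path, offset, block_size=64 * 1024):
    """Decompress the single bz2 stream that starts at byte offset."""
    decomp = bz2.BZ2Decompressor()
    out = []
    with open(path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:             # stop at the end of this stream
            block = f.read(block_size)
            if not block:
                break                     # truncated file
            out.append(decomp.decompress(block))
    return b"".join(out)
```

Under these assumptions each distinct offset in the index could play
the role of an xchunk, and several workers could each process a
different offset concurrently.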
4) MWDUMPER
I have run experiments with MWdumper.jar, and cannot conclude that
MWdumper.jar is usable.  My lab notes can be found in the WP-MIRROR
0.5 Reference Manual
(http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf).  Several
sections may be of interest to mirror builders:
Appendix E.7 Experiments with InnoDB (especially Figure E.1)
Appendix E.9 Experiments with MWdumper.jar -- Round 2.
Appendix E.11 Experiments with wikix
Appendix E.12 Experiments with Downloading Images
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages (this is a collection of error messages that I have seen)
No part of mirror building is easy.  The vision for WP-MIRROR is to
automate all the steps so that anyone with enough disk space can build
his own mirror.
5) MWIMPORT
Thanks for bringing this to my attention.
Action Item:  I will study mwimport for possible use.
6) POTY
I noticed, on this list, several e-mails regarding POTY collections.
I downloaded them all.  Very sweet.  Many thanks to whoever is making
this happen.  I look forward to seeing the 2012 collection.
I know that producing dump files is a big task.  Thanks for all the hard work.
Sincerely Yours,
Kent
On 12/31/12, Ariel T. Glenn <ariel@wikimedia.org> wrote:
...
Hello wp mirror dev,
If you are not already downloading media tarball bundles for the initial
mirror setup, you should be. See
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current...
for these, available once a month and I hope to make incrementals
regularly available mid-month as well.
Additionally, you can use the 'multistream bz2' files with their indexes
to manage concurrency, rather than needing to write out a pile of
separate xml files.
Lastly, if you are using importDump.php, I strongly recommend you use
mwdumper or mwimport to create an sql file which you can feed to MySQL,
and then stuff in the various link tables as well.  This will cut down
the setup time immensely.
Ariel
