Dear List Members,
Does anyone know if the WikiMedia Foundation plans to support Metalink or SPDY for its dump files and/or image files? See RFC references below.
WP-MIRROR downloads dump and image files to build a mirror of a set of wikipedias. WP-MIRROR 0.5 is feature complete. I am now looking for ways to optimize performance (i.e. reduce mirror build time). Were the WMF to support the above two protocols, downloads would be faster and require less time spent on validation.
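To make the validation step concrete, here is a minimal sketch of checking a downloaded file against a published digest. It assumes the expected digest is available as a hex string (for example from a checksum file, or from a hash entry in a Metalink description). The function name `digest_matches` and its parameters are made up for illustration and are not part of WP-MIRROR.

```python
import hashlib

def digest_matches(fileobj, expected_hex, algorithm="sha256", block_size=1 << 20):
    """Return True if the stream's digest equals expected_hex (hypothetical helper)."""
    h = hashlib.new(algorithm)
    # Read in fixed-size blocks so a multi-gigabyte dump never sits in memory at once.
    for block in iter(lambda: fileobj.read(block_size), b""):
        h.update(block)
    # Published digests may differ in case or carry trailing whitespace.
    return h.hexdigest() == expected_hex.strip().lower()
```

A caller would open the dump in binary mode and pass the file object in. A mismatch would normally mean the file should be downloaded again.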
Sincerely Yours, Kent
On 12/29/12, Sumana Harihareswara sumanah@wikimedia.org wrote:
Hello! I'm sorry, but I don't know the answer to these questions; perhaps you could email the dumps mailing list https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l ? My apologies.
Sumana Harihareswara Engineering Community Manager Wikimedia Foundation
On Sun, Dec 16, 2012 at 6:14 AM, wp mirror wpmirrordev@gmail.com wrote:
Dear Sumana,
- Metalink. Does the Wikimedia Foundation have any plans to support
metalink for either its dump files or its image files?
Documentation:
http://tools.ietf.org/html/rfc5854, "The Metalink Download Description Format"
http://tools.ietf.org/html/rfc6249, "Metalink/HTTP: Mirrors and Hashes"
- SPDY. Does the Wikimedia Foundation have any plans to support SPDY?
Documentation: http://www.chromium.org/spdy
- WP-MIRROR. We last communicated 2012-01-06 in regards to WP-MIRROR.
Status: WP-MIRROR 0.5 is `feature complete', and works `out-of-the-box' for the GNU/Linux distributions: Debian 7.0 (wheezy) and Ubuntu 12.10 (quantal).
Future: Attention is turning towards performance enhancement and porting to other distributions.
Homepage: http://www.nongnu.org/wp-mirror/
Please give it a try. Feedback is most welcome.
Sincerely Yours, Kent
(Sumana to BCC. I don't think we need to keep her on CC (unless she wants to be?), but she can still follow along in the archives or by subscribing. Also, if you want your mail to show up on the list in a timely fashion, you should subscribe before sending mail to it.)
Hi,
On Sun, Dec 30, 2012 at 6:43 PM, wp mirror wpmirrordev@gmail.com wrote:
Does anyone know if the WikiMedia Foundation
"Wikimedia" please. Not WikiMedia. Your nongnu project page needs to fix that too.
Also, it doesn't matter so much if the foundation plans to support it. There's very little chance that either of these things would be done on foundation time and specifically for the dumps any time soon. SPDY has been a topic that's come up several times and if that's ever implemented for the main sites then it's possible dumps could get it too at the same time. But last I checked dumps uses a webserver (lighttpd) that's not used for any other part of the WMF infra. So more than likely it wouldn't get SPDY just because some other part of the infra got it.
The bottom line is the same answer as for creating bittorrents for the dumps: If someone volunteers to do the legwork to get it done then it might be done. If no one takes it and runs with it then it won't happen.
plans to support Metalink or SPDY for its dump files and/or image files? See RFC references below.
[...]
Were the WMF to support the above two protocols, downloads would be faster and require less time spent on validation.
Can you spell out specifically what benefits you would derive from using those protocols? Maybe there are other ways to get the same information that you're looking for?
Also, you know all of these files are available by rsync too right?
-Jeremy
Hello wp mirror dev,
If you are not already downloading media tarball bundles for the initial mirror setup, you should be. See
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current...
for these, available once a month and I hope to make incrementals regularly available mid-month as well.
Additionally, you can use the 'multistream bz2' files with their indexes to manage concurrency, rather than needing to write out a pile of separate xml files.
Lastly, if you are using importDump.php, I strongly recommend you use mwdumper or mwimport to create an sql file which you can feed to MySQL, and then stuff in the various link tables as well. This will cut down the setup time immensely.
Ariel
Dear Ariel,
Happy New Year. Thank you for your email of 2012-12-31.
1) MEDIA TARBALLS
I looked at the media tarballs on http://ftpmirror.your.org/ and am quite impressed. I also walked the directory tree /pub/wikimedia/images/wikipedia/[language-code]/[0-9a-f]/. WP-MIRROR 0.5 and prior place images under $wgScriptPath/images/[0-9a-f]. I think that I should insert two more directory layers to better match what you have done: $wgScriptPath/images/wikipedia/[language-code]/[0-9a-f]/.
Action Item: WP-MIRROR 0.6 will make use of media tarballs; and reorganize the images directory tree to match your.org.
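As a rough illustration of the reorganized tree, the sketch below maps a file name to a path under images/wikipedia/[language-code]/. It assumes MediaWiki's usual hashed upload layout, in which the directory levels are taken from the MD5 of the file name with spaces replaced by underscores (one hex digit, then two). The tree described above only shows the first [0-9a-f] level, so the second level is an assumption here, as are the function name and its default arguments.

```python
import hashlib
import posixpath

def hashed_image_path(filename, root="images/wikipedia", language="en"):
    """Build root/language/h/hh/name for a media file (hypothetical helper)."""
    # Assumption: the name is already normalized apart from spaces.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First directory level is one hex digit, second level is the first two digits.
    return posixpath.join(root, language, digest[0], digest[:2], name)
```

Calling `hashed_image_path("Example image.jpg", language="simple")` would give a path of the form images/wikipedia/simple/x/xy/Example_image.jpg, where x and xy come from the digest.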
2) ACCESSIBILITY
Browsing http://ftpmirror.your.org/ turns out to be almost impossible for me. The contrast is too low (cyan on white). I have to bring up the source code (Ctrl-U) and browse that. Please ask someone at your.org to edit the style sheet /pub/misc/lighttpd-white-dir.css. Black on white works for me.
3) MULTISTREAM BZ2
Very interesting. For WP-MIRROR 0.5 and prior the first steps in processing a dump are: download, verify, decompress, and then split into xchunks (of 1000 pages each). xchunks are then scraped for image file names, are fed into importDump.php, etc. I introduced xchunks for reasons of robustness. Every aspect of mirror building has failure modes that prevent processing enwiki in one pass. Failure to process a few xchunks is however quite tolerable.
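For readers unfamiliar with the idea, the splitting step can be sketched roughly as follows. This is not WP-MIRROR's own code. It assumes the decompressed dump is an XML stream whose articles are `<page>` elements (possibly namespaced), and the name `iter_xchunks` is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_xchunks(stream, pages_per_chunk=1000):
    """Yield lists of serialized <page> elements, pages_per_chunk at a time."""
    chunk = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump elements may carry an XML namespace, so compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # drop the page's children once it has been serialized
            if len(chunk) == pages_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly short, chunk

# Tiny in-memory stand-in for a dump: five pages, split two per chunk.
sample = ("<mediawiki>"
          + "".join("<page><title>P%d</title></page>" % i for i in range(5))
          + "</mediawiki>").encode("utf-8")
chunks = list(iter_xchunks(io.BytesIO(sample), pages_per_chunk=2))
```

Because each chunk is independent, a failure while processing one chunk leaves the others unaffected, which is the robustness property described above.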
I am not yet clear as to the ramifications of using your tool. At one end of the scale it might obviate decompressing the dump. At the other end, it might obviate the use of xchunks entirely.
Action Item: Study multistream bz2 for possible use.
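As a starting point for that study, here is a minimal sketch of pulling one stream out of a multistream file. It assumes the file is a concatenation of independent bz2 streams and that each index line has the form offset:page_id:title, where offset is the byte position of the stream holding that page. The helper name `read_stream` and the toy data are invented for the example.

```python
import bz2
import io

def read_stream(fileobj, offset, next_offset=None):
    """Decompress the single bz2 stream that starts at byte `offset`."""
    fileobj.seek(offset)
    size = -1 if next_offset is None else next_offset - offset
    data = fileobj.read(size)
    # A BZ2Decompressor stops at the end of one stream, so any bytes that
    # belong to later streams are left untouched in its unused_data.
    return bz2.BZ2Decompressor().decompress(data)

# Toy stand-in for a multistream dump: two streams and a matching index.
first = bz2.compress(b"<page>A</page>")
second = bz2.compress(b"<page>B</page>")
archive = io.BytesIO(first + second)
index_lines = ["0:1:A", "%d:2:B" % len(first)]
offsets = sorted({int(line.split(":", 1)[0]) for line in index_lines})
page_b = read_stream(archive, offsets[1])
```

Since each stream can be decompressed on its own, separate workers could each take a range of offsets. That would cover the concurrency use without first writing out the fully decompressed dump.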
4) MWDUMPER
I have run experiments with MWdumper.jar, and cannot conclude that MWdumper.jar is usable. My lab notes can be found in the WP-MIRROR 0.5 Reference Manual. http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf. Several sections may be of interest to mirror builders:
Appendix E.7 Experiments with InnoDB (especially Figure E.1)
Appendix E.9 Experiments with MWdumper.jar -- Round 2
Appendix E.11 Experiments with wikix
Appendix E.12 Experiments with Downloading Images
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages (this is a collection of error messages that I have seen)
No part of mirror building is easy. The vision for WP-MIRROR is to automate all the steps so that anyone with enough disk space can build his own mirror.
5) MWIMPORT
Thanks for bringing this to my attention.
Action Item: I will study mwimport for possible use.
6) POTY
I noticed, on this list, several e-mails regarding POTY collections. I downloaded them all. Very sweet. Many thanks to whoever is making this happen. I look forward to seeing the 2012 collection.
I know that producing dump files is a big task. Thanks for all the hard work.
Sincerely Yours, Kent
On Jan 2, 2013, at 12:00 PM, wp mirror wpmirrordev@gmail.com wrote:
- ACCESSIBILITY
Browsing http://ftpmirror.your.org/ turns out to be almost impossible for me. The contrast is too low (cyan on white). I have to bring up the source code (Ctrl-U) and browse that. Please ask someone at your.org to edit the style sheet /pub/misc/lighttpd-white-dir.css. Black on white works for me.
This was changed to those colors because others complained that the default was too ugly. I changed the CSS again to be a bit higher contrast; you may need to empty your browser's cache to see the change immediately.
-- Kevin @ your.org
Dear Kevin,
Much improved. Thank you.
Sincerely Yours, Kent