Dear Christian,
Thank you very much for your help with FTP.
0) SITES
WP-MIRROR 0.6 is now able to pull dump files from any of the following:
  rsync://ftpmirror.your.org/wikimedia-dumps/   <-- default
  rsync://wikipedia.c3sl.ufpr.br/wikipedia/
  rsync://ftp.fi.muni.cz/pub/wikimedia/
  ftp://ftpmirror.your.org/pub/wikimedia/dumps/
  ftp://wikipedia.c3sl.ufpr.br/wikipedia/
  ftp://ftp.fi.muni.cz/pub/wikimedia/
  http://dumps.wikimedia.your.org/
  http://wikipedia.c3sl.ufpr.br/
  http://ftp.fi.muni.cz/pub/wikimedia/
  http://dumps.wikimedia.org/
1) DUMP MIRROR INCOMPATIBILITIES
The following remarks may be useful if WMF wishes to encourage many dump mirrors, as Debian did. Currently the four dump sites (WMF, YOUR, C3SL, MUNI) have a number of minor incompatibilities that cost the programmer extra effort.
1.1) `LATEST' DIRECTORY
Issue: WMF and YOUR provide a `latest' directory for each wiki. C3SL and MUNI do not.
Workaround: WP-MIRROR 0.6 ignores the `latest' directory. Instead it obtains a directory listing, sorts it, and uses the most recent `yyyymmdd' directory.
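To illustrate the idea (this is only a sketch in Python, not WP-MIRROR's actual code; the function name and the sample listing are made up for the example): because `yyyymmdd' names sort chronologically when sorted as plain strings, picking the most recent dump reduces to filtering and sorting.

```python
import re

# Matches directory names made of exactly eight digits (yyyymmdd).
DATE_DIR = re.compile(r"^\d{8}$")

def latest_dump_dir(entries):
    """Return the most recent yyyymmdd entry, or None if there is none.

    `entries` is an iterable of names from a directory listing; a
    trailing slash is tolerated. Non-date entries such as `latest'
    are ignored.
    """
    names = (entry.strip("/") for entry in entries)
    dated = sorted(name for name in names if DATE_DIR.match(name))
    return dated[-1] if dated else None

# Hypothetical listing for one wiki.
listing = ["20121201/", "latest/", "20130102/", "20121215/"]
print(latest_dump_dir(listing))
```

This behaves the same on all four sites, whether or not a `latest' directory is present.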
1.2) DIRECTORY LISTING
Issue: When using HTTP, WMF and YOUR have `index.html' files that prevent the user from getting a directory listing of wikis in the usual way. C3SL and MUNI do offer a directory listing of wikis.
Workaround: WP-MIRROR 0.6, as a fallback, looks for `rsync-dirlist-last-1-good.txt', and parses that to make a list of wikis.
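A rough sketch of the fallback parse, in Python. I am assuming here, without having checked every site, that each useful line of `rsync-dirlist-last-1-good.txt' contains a path of the form <wiki>/<yyyymmdd>/...; the function name and sample text are hypothetical.

```python
import re

# Assumed shape: a path component <wiki> followed by a <yyyymmdd>
# component, e.g. "/enwiki/20130102/enwiki-20130102-pages-articles.xml.bz2".
WIKI_PATH = re.compile(r"(?:^|[\s/])([a-z0-9_]+)/(\d{8})(?:/|$)")

def wikis_from_dirlist(text):
    """Return the sorted, de-duplicated wiki names found in the listing."""
    wikis = set()
    for line in text.splitlines():
        match = WIKI_PATH.search(line)
        if match:
            wikis.add(match.group(1))
    return sorted(wikis)

# Hypothetical excerpt of the listing file.
sample = (
    "/enwiki/20130102/enwiki-20130102-pages-articles.xml.bz2\n"
    "/simplewiki/20130105/\n"
    "/other/readme.txt\n"
)
print(wikis_from_dirlist(sample))
```

Lines that do not contain a dated dump directory are simply skipped, so unrelated files in the listing do no harm.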
1.3) ENCODING
Issue: When using HTTP, MUNI provides a directory listing of wikis. However, while the HTTP header says "Content-Type: text/html;charset=UTF-8", the content is actually ISO-8859-1. The parser throws a fatal error when it reads non-UTF-8 bytes.
Workaround: WP-MIRROR 0.6 assumes ISO-8859-1 encoding.
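A slightly more forgiving variant of this workaround (a sketch in Python, not what WP-MIRROR does; the function name is made up) would try the declared charset first and fall back to ISO-8859-1 only when decoding fails. ISO-8859-1 maps every byte to a character, so the fallback itself cannot fail.

```python
def decode_listing(raw: bytes, declared: str = "utf-8") -> str:
    """Decode a directory listing, distrusting the declared charset.

    Try the charset named in the HTTP header; if the bytes are not
    valid in it (or the charset name is unknown), decode as ISO-8859-1.
    """
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("iso-8859-1")

# The same text served in either encoding decodes to the same string.
print(decode_listing("Čeština? No: café".encode("utf-8")))
print(decode_listing("café".encode("iso-8859-1")))
```

The drawback is that a byte sequence that happens to be valid UTF-8 is never reinterpreted, so always assuming ISO-8859-1 for a site known to be mislabeled remains the simpler choice.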
1.4) SYNC
Issue: MUNI has not synced since early November 2012. Not all dump mirrors provide the same set of wikis. MUNI has no `enwiki'. By comparison, Debian mirrors in `push-mode' sync every six hours.
Workaround: WP-MIRROR 0.6 obtains the directory listing of wikis (as above), and then pulls the most recent dumps (as above). The user must notice the lack of sync and configure another dump site. The default site is rsync://ftpmirror.your.org/wikimedia-dumps/.
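WP-MIRROR does not currently detect a stale mirror, but a check along the following lines could warn the user. This is only a sketch in Python; the function name and the 45-day threshold are my own assumptions, not anything the mirrors promise.

```python
from datetime import date, datetime, timedelta

def is_stale(latest_dir: str, today: date, max_age_days: int = 45) -> bool:
    """Return True if the newest yyyymmdd dump is older than max_age_days.

    `latest_dir` is the most recent dump directory name on the mirror;
    `max_age_days` is an arbitrary threshold chosen for illustration.
    """
    latest = datetime.strptime(latest_dir, "%Y%m%d").date()
    return today - latest > timedelta(days=max_age_days)

# Hypothetical dates: a mirror last synced in early November 2012,
# checked in mid-January 2013, versus one synced two weeks earlier.
print(is_stale("20121105", date(2013, 1, 14)))
print(is_stale("20130102", date(2013, 1, 14)))
```

A warning of this kind would still leave the choice of an alternative site to the user.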
2) IMAGE TARBALLS
I noticed that these are offered by HTTP and FTP, but not RSYNC. Will that change?
3) PROGRESS REPORT
WP-MIRROR 0.6 can now mirror any WMF wiki, not just those of the Wikipedia project.
Action items: OPEN
1) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will make use of image tarballs.
4) Opened: 2013-01-03
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000661.html
   Description: WP-MIRROR 0.6 shall use SHA1 digests to validate image files.
7) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: Study mwimport for possible use.
Action items: CLOSED
2) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will incorporate list of dump/tarball mirror sites as a configurable parameter.
   Disposition: Done.
   Closed: 2013-01-14
3) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will make use of rsync.
   Disposition: Done.
   Closed: 2013-01-14
5) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: WP-MIRROR 0.6 will reorganize images directory tree to match your.org.
   Disposition: Done.
   Closed: 2013-01-14
6) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: Study multistream bz2 for possible use.
   Disposition: Done.
   Closed: 2013-01-14
Sincerely Yours,
Kent