Dear Christian,
Thank you very much for your help with FTP.
0) SITES
WP-MIRROR 0.6 is now able to pull dump files from any of the following:
<rsync://ftpmirror.your.org/wikimedia-dumps/> <-- default
<rsync://wikipedia.c3sl.ufpr.br/wikipedia/>
<rsync://ftp.fi.muni.cz/pub/wikimedia/>
<ftp://ftpmirror.your.org/pub/wikimedia/dumps/>
<ftp://wikipedia.c3sl.ufpr.br/wikipedia/>
<ftp://ftp.fi.muni.cz/pub/wikimedia/>
<http://dumps.wikimedia.your.org/>
<http://wikipedia.c3sl.ufpr.br/>
<http://ftp.fi.muni.cz/pub/wikimedia/>
<http://dumps.wikimedia.org/>
1) DUMP MIRROR INCOMPATIBILITIES
The following remarks may be useful if WMF wishes to encourage many
dump mirrors, as Debian did. Currently the four WMF dump mirrors
have a number of minor incompatibilities that cost the programmer
extra effort.
1.1) `LATEST' DIRECTORY
Issue: WMF and YOUR provide a `latest' directory for each wiki. C3SL
and MUNI do not.
Workaround: WP-MIRROR 0.6 ignores the `latest' directory, and instead
obtains a directory listing, sorts it, and uses the most recent
`yyyymmdd' directory.
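The workaround above can be sketched roughly as follows. This is only
an illustration in Python, not WP-MIRROR's actual code; the function
name `latest_dump_dir' and the sample listing are hypothetical.

```python
import re

def latest_dump_dir(names):
    """Return the most recent `yyyymmdd' entry in a listing, or None.

    Hypothetical helper: entries such as `latest/' are skipped because
    they do not consist of exactly eight digits.
    """
    dated = [n.rstrip("/") for n in names if re.fullmatch(r"[0-9]{8}/?", n)]
    # Fixed-width yyyymmdd strings sort chronologically as plain text.
    return max(dated) if dated else None

# Made-up listing for illustration.
listing = ["20121201/", "20130102/", "latest/", "20121115/"]
print(latest_dump_dir(listing))
```

Because the names are fixed-width dates, a plain string comparison is
enough and no date parsing is needed.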
1.2) DIRECTORY LISTING
Issue: When using HTTP, WMF and YOUR have `index.html' files that
prevent the user from getting a directory listing of wikis in the
usual way. C3SL and MUNI do offer a directory listing of wikis.
Workaround: WP-MIRROR 0.6, as a fallback, looks for
`rsync-dirlist-last-1-good.txt', and parses that to make a list of
wikis.
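A rough sketch of that fallback, in Python for illustration only. It
ASSUMES that each line of `rsync-dirlist-last-1-good.txt' ends in a
path whose first component is the wiki name (for example
`enwiki/20130102/...'); the real file format may differ, and the
function name `wikis_from_dirlist' is hypothetical.

```python
def wikis_from_dirlist(text):
    """Collect wiki names from a directory-list file.

    Assumption (not confirmed here): the last whitespace-separated
    field of each line is a path of the form wiki/yyyymmdd/...
    """
    wikis = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Drop empty components and a leading `.' before taking the first one.
        components = [c for c in fields[-1].split("/") if c not in ("", ".")]
        if len(components) >= 2:
            wikis.add(components[0])
    return sorted(wikis)

# Made-up sample content.
sample = "enwiki/20130102/enwiki-20130102-pages-articles.xml.bz2\n./simplewiki/20130105/\n"
print(wikis_from_dirlist(sample))
```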
1.3) ENCODING
Issue: When using HTTP, MUNI provides a directory listing of wikis.
However, while the HTTP header says "Content-Type:
text/html;charset=UTF-8", the content is actually ISO-8859-1. The
parser throws a fatal error when it reads non-UTF-8 bytes.
Workaround: WP-MIRROR 0.6 assumes ISO-8859-1 encoding.
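The effect can be illustrated with a short Python sketch (not
WP-MIRROR's own code). The function name `decode_listing' and the
sample bytes are made up for the example.

```python
def decode_listing(raw_bytes):
    """Decode a listing page as ISO-8859-1, ignoring the declared charset."""
    # ISO-8859-1 maps every byte value to a character, so this never raises.
    return raw_bytes.decode("iso-8859-1")

# Made-up page fragment: 0xE7 is c-cedilla in ISO-8859-1 but is not
# valid UTF-8 when followed by an ASCII letter.
page = b'<a href="frwiki/">fran\xe7ais</a>'

try:
    page.decode("utf-8")
except UnicodeDecodeError as err:
    print("UTF-8 decoding fails:", err.reason)

print(decode_listing(page))
```

One caveat of this approach: a page that really is UTF-8 would decode
without error but show mangled non-ASCII characters.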
1.4) SYNC
Issue: MUNI has not synced since early November 2012. Not all dump
mirrors provide the same set of wikis. MUNI has no `enwiki'. By
comparison, Debian mirrors in `push-mode' sync every six hours.
Workaround: WP-MIRROR 0.6 obtains the directory listing of wikis (as
above), and then pulls the most recent dumps (as above). The user must
notice the lack of sync and configure another dump site. The default
site is <rsync://ftpmirror.your.org/wikimedia-dumps/>.
2) IMAGE TARBALLS
I noticed that these are offered by HTTP and FTP, but not RSYNC. Will
that change?
3) PROGRESS REPORT
WP-MIRROR 0.6 can now mirror any WMF wiki, not just those from the
Wikipedia project.
Action items: OPEN
1) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html>
Description: WP-MIRROR 0.6 will make use of image tarballs.
4) Opened: 2013-01-03
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000661.html>
Description: WP-MIRROR 0.6 shall use SHA1 digests to validate image files.
7) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html>
Description: Study mwimport for possible use.
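For open item 4, the SHA1 check might look roughly like the Python
sketch below. It is only an illustration: the function names are
hypothetical, and it ASSUMES the expected digest is available as a
hexadecimal string from some trusted source.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def image_is_valid(path, expected_hex):
    """Compare a file's SHA1 with an expected hex digest (assumed input)."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```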
Action items: CLOSED
2) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html>
Description: WP-MIRROR 0.6 will incorporate list of dump/tarball
mirror sites as a configurable parameter.
Disposition: Done.
Closed: 2013-01-14
3) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html>
Description: WP-MIRROR 0.6 will make use of rsync.
Disposition: Done.
Closed: 2013-01-14
5) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html>
Description: WP-MIRROR 0.6 will reorganize images directory tree to match
your.org.
Disposition: Done.
Closed: 2013-01-14
6) Opened: 2013-01-02
<http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html>
Description: Study multistream bz2 for possible use.
Disposition: Done.
Closed: 2013-01-14
Sincerely Yours,
Kent