Dear Ariel,
0) WP-MIRROR
WP-MIRROR 0.6 now works with dumps from your.org. I am turning my attention to the other mirror sites.
1) LATEST
I read with interest the thread about `latest' directories that began with http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000610.html. I have some additional questions.
The mirror sites at C3SL and Masaryk Univ. do not have a `latest' directory in the project directories that I looked at. Compare for example:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/ | tail -n 2
drwxr-xr-x 242 2013/01/04 07:52:13 20130102
drwxr-xr-x 1101 2013/01/03 18:48:34 latest
(shell)$ rsync wikipedia.c3sl.ufpr.br::wikipedia/enwiki/ | tail -n 2
drwxr-xr-x 61440 2012/11/10 10:47:05 20121101
drwxr-xr-x 61440 2012/12/10 09:21:34 20121201
WP-MIRROR looks for the `latest' directory on the assumption that any links found there point to complete files (i.e. no partials), whereas files found in dated directories may be partials. For example, consider the most recent `imagelinks' dumps:
This file is complete:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20121201/ | grep imagelinks
-rw-r--r-- 356437362 2012/12/01 07:08:54 enwiki-20121201-imagelinks.sql.gz
This file is a partial:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20130102/ | grep imagelinks
-rw-r--r-- 20 2013/01/02 07:47:35 enwiki-20130102-imagelinks.sql.gz
The `latest' link points to the complete file:
(shell)$ rsync -a dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/ | grep image
lrwxrwxrwx 40 2013/01/02 03:52:49 enwiki-latest-image.sql.gz -> ../20130102/enwiki-20130102-image.sql.gz
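To make the complete-versus-partial distinction concrete, here is a minimal Python sketch that parses rsync listing lines like the ones above and flags regular `.gz' files that are implausibly small. It is only an illustration: the line format is inferred from the samples in this message, and the names `parse_listing', `looks_partial' and the threshold `MIN_PLAUSIBLE_BYTES' are hypothetical, not part of WP-MIRROR or rsync. A small size is a hint that a file is a partial, not proof.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of one rsync listing line, inferred from the samples above:
#   perms  size  YYYY/MM/DD  HH:MM:SS  name [-> link target]
LISTING_RE = re.compile(
    r"^(?P<perms>[-dl][-rwxsStT]{9})\s+"
    r"(?P<size>[\d,]+)\s+"
    r"(?P<date>\d{4}/\d{2}/\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<name>.+?)(?: -> (?P<target>.+))?$"
)

# Hypothetical threshold: any .gz dump smaller than this is treated as suspect.
MIN_PLAUSIBLE_BYTES = 1024


class Entry(NamedTuple):
    perms: str
    size: int
    date: str
    time: str
    name: str
    target: Optional[str]  # set only for symbolic links


def parse_listing(text: str) -> List[Entry]:
    """Parse rsync listing output, silently skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        m = LISTING_RE.match(line.strip())
        if m is None:
            continue
        size = int(m["size"].replace(",", ""))
        entries.append(
            Entry(m["perms"], size, m["date"], m["time"], m["name"], m["target"])
        )
    return entries


def looks_partial(entry: Entry) -> bool:
    """Heuristic only: a regular .gz file below the size threshold is suspect."""
    is_regular_file = entry.perms.startswith("-")
    return (
        is_regular_file
        and entry.name.endswith(".gz")
        and entry.size < MIN_PLAUSIBLE_BYTES
    )


if __name__ == "__main__":
    sample = (
        "-rw-r--r-- 356437362 2012/12/01 07:08:54 enwiki-20121201-imagelinks.sql.gz\n"
        "-rw-r--r-- 20 2013/01/02 07:47:35 enwiki-20130102-imagelinks.sql.gz\n"
    )
    for entry in parse_listing(sample):
        status = "suspect" if looks_partial(entry) else "plausible"
        print(entry.name, entry.size, status)
```

Symbolic links are never flagged, because the size shown for a link is the length of the link itself rather than of the file it points to.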
So I am wondering what algorithm I should use if I want WP-MIRROR to pull dump files from C3SL and Masaryk U. Can you help with the following questions?
2) C3SL
In the absence of a `latest' directory, can I be sure that all the files found there are complete files (i.e. not partials)? Is the mirroring process atomic?
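If the answer turns out to be "no", one possible fallback is to walk the dated directories from newest to oldest and take the first one in which the wanted file exceeds a size threshold. The Python sketch below illustrates that idea only; it does not settle whether mirroring is atomic. The function name `newest_complete_dump', the threshold `MIN_PLAUSIBLE_BYTES', and the dictionary layout are hypothetical, and the sample sizes are the your.org figures quoted earlier in this message, not data from C3SL.

```python
import re
from typing import Dict, Optional

# Dated dump directories are named YYYYMMDD, so string order is date order.
DATED_DIR_RE = re.compile(r"^\d{8}$")

# Hypothetical threshold below which a dump file is treated as a partial.
MIN_PLAUSIBLE_BYTES = 1024


def newest_complete_dump(
    sizes_by_dir: Dict[str, Dict[str, int]], wiki: str, table: str
) -> Optional[str]:
    """Return 'YYYYMMDD/filename' for the newest plausible dump, or None.

    sizes_by_dir maps a directory name to {filename: size in bytes}, as one
    might collect from per-directory rsync listings (assumed input format).
    """
    dated = sorted((d for d in sizes_by_dir if DATED_DIR_RE.match(d)), reverse=True)
    for dump_date in dated:
        filename = f"{wiki}-{dump_date}-{table}.sql.gz"
        size = sizes_by_dir[dump_date].get(filename)
        if size is not None and size >= MIN_PLAUSIBLE_BYTES:
            return f"{dump_date}/{filename}"
    return None


if __name__ == "__main__":
    # Sizes taken from the your.org listings quoted in this message.
    listings = {
        "20121201": {"enwiki-20121201-imagelinks.sql.gz": 356437362},
        "20130102": {"enwiki-20130102-imagelinks.sql.gz": 20},
    }
    print(newest_complete_dump(listings, "enwiki", "imagelinks"))
```

A size threshold cannot detect a large file that is still being written, so this would remain a heuristic unless the mirror guarantees atomic updates.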
3) Masaryk Univ.
Several issues: a) no `latest' directories; b) no `enwiki'; and c) the most recent dumps date from November:
(shell)$ rsync ftp.fi.muni.cz::pub/wikimedia/zuwiki/ | tail -n 2
drwxr-xr-x 4096 2012/10/23 14:04:02 20121023
drwxr-xr-x 4096 2012/11/05 15:02:33 20121105
Will they be catching up?
Sincerely Yours,
Kent