Dear Christian,
Thank you. That was helpful.
1) PARTIALS
In order to pull dumps from C3SL and 'muni.cz', which do not offer 'latest' directories, I need to make some assumptions. Please let me know if I got all of the following right:
a) Listing. I see from http://wikitech.wikimedia.org/view/Dumps/Mirror_status that the mirror sites are built with rsync. I assume this means that, when a new file is uploaded, the name of the partial is prefixed with a dot '.' and also given a suffix, like this one:
.enwiki-20130102-pages-meta-history9.xml-p000832942p000885722.bz2.sUFHUB
The partial is then renamed upon completion.
b) Failed runs. A dump file could be the output of a failed process, but 'dumpruninfo.txt' will say so, like this:
name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37
c) Rsync options. I did not see any '.~tmp~' directories, so I assume you do not use the --delay-updates option.
Please let me know if I got all of that right.
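Assumptions (a) and (b) could be sketched as follows. This is only an illustration, not WP-MIRROR code: the function names are hypothetical, the six-character suffix length is inferred from the single example above, and the 'dumpruninfo.txt' field layout is taken from the one line quoted in (b).

```python
import re

# Assumption (a): an rsync partial is named ".<final name>.<six alphanumerics>",
# which is the shape of the one example seen on the mirror.
_PARTIAL = re.compile(r"^\.(?P<final>.+)\.[A-Za-z0-9]{6}$")


def is_partial(name):
    """Return True if name looks like an in-progress rsync temporary file."""
    return _PARTIAL.match(name) is not None


def completed_files(listing):
    """Drop names that look like partials from a directory listing."""
    return [name for name in listing if not is_partial(name)]


def parse_dumpruninfo(text):
    """Assumption (b): map each job name to its status.

    Expects lines such as
    'name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37'.
    """
    statuses = {}
    for line in text.splitlines():
        fields = {}
        for part in line.split(";"):
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = part.strip().partition(":")
            if sep:
                fields[key] = value.strip()
        if "name" in fields and "status" in fields:
            statuses[fields["name"]] = fields["status"]
    return statuses


def failed_jobs(text):
    """Return the sorted names of jobs whose status is 'failed'."""
    return sorted(n for n, s in parse_dumpruninfo(text).items() if s == "failed")
```

With these, a listing would first be passed through completed_files(), and any file belonging to a job returned by failed_jobs() would be skipped.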
2) FTP
I would like to get WP-MIRROR to work with all mirror sites/protocols given in the tables on http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps. Here are the results of my attempts to gain access to dumps (Y=success, N=failure):
WMF:  http(Y)
C3SL: http(Y), ftp(Y), rsync(Y)
MUNI: http(Y), ftp(N), rsync(Y)
YOUR: http(Y), ftp(N), rsync(Y)
With FTP, I can access C3SL, but not 'muni.cz' or 'your.org':
(shell)$ ftp
ftp> open wikipedia.c3sl.ufpr.br
Connected to sagres.c3sl.ufpr.br.
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 44 of 2000 allowed.
220-Local time is now 02:48. Server port: 21.
220-Only anonymous FTP is allowed here
220 You will be disconnected after 15 minutes of inactivity.
Name (wikipedia.c3sl.ufpr.br:kmiller):
230 Anonymous user logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> close
221-Goodbye. You uploaded 0 and downloaded 0 kbytes.
221 Logout.

ftp> open ftp.fi.muni.cz
Connected to odysseus.fi.muni.cz.
220 ProFTPD 1.3.4b Server (Faculty of Informatics) [2001:718:801:230::cd]
Name (ftp.fi.muni.cz:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.

ftp> open ftpmirror.your.org
ftp: connect to address 2001:4978:1:420::cc09:3752: Connection timed out
Trying 204.9.55.82...
Connected to ftpmirror.your.org.
220 ProFTPD 1.3.4a Server (Your.Org FTP Archive) [::ffff:204.9.55.82]
Name (ftpmirror.your.org:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.
I have network connectivity (see below) and my firewall permits outbound FTP connections.
Does either 'muni.cz' or 'your.org' require a password?
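One thing I can still rule out on my side: in the transcripts above, the 'Name' prompt defaulted to my local user 'kmiller', and only C3SL logged me in as anonymous regardless. A minimal sketch of a check that sends 'anonymous' explicitly, using Python's standard ftplib, might look like this. The function name and the 'ftp_factory' parameter are hypothetical (the latter exists only so the check can be exercised without a network), and I have not confirmed that either site accepts anonymous FTP.

```python
import ftplib


def anonymous_login_ok(host, ftp_factory=ftplib.FTP, timeout=30):
    """Try an explicit anonymous login; return (True, None) or (False, reason)."""
    try:
        ftp = ftp_factory(host, timeout=timeout)  # connects immediately
    except ftplib.all_errors as exc:
        return False, "connect: %s" % exc
    try:
        # Send the user name explicitly instead of the local account name.
        ftp.login("anonymous", "anonymous@")
        return True, None
    except ftplib.all_errors as exc:
        return False, "login: %s" % exc
    finally:
        try:
            ftp.quit()
        except ftplib.all_errors:
            ftp.close()
```

Calling anonymous_login_ok("ftp.fi.muni.cz") and anonymous_login_ok("ftpmirror.your.org") would show whether the 530 replies depend on the user name that was sent.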
3) SITE CHOICE
In terms of hops and latency, there is a great difference between the nearest and the farthest mirror site. The choice of mirror site will therefore be configurable. I hope to get WP-MIRROR to access each of them.
(shell)$ mtr -r -c 1 dumps.wikimedia.org
HOST: darkstar-7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  2.|-- ???                       100.0     1    0.0   0.0   0.0   0.0   0.0
(shell)$ traceroute dumps.wikimedia.org
11  dataset2.wikimedia.org (208.80.152.185)  43.632 ms  43.085 ms  40.850 ms

(shell)$ mtr -r -c 1 wikipedia.c3sl.ufpr.br
HOST: darkstar-7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 13.|-- sagres.c3sl.ufpr.br        0.0%     1  165.3 165.3 165.3 165.3   0.0

(shell)$ mtr -r -c 1 ftp.fi.muni.cz
HOST: darkstar-7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 12.|-- odysseus.ip6.fi.muni.cz    0.0%     1  119.4 119.4 119.4 119.4   0.0

(shell)$ mtr -r -c 1 ftpmirror.your.org
HOST: darkstar-7                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  7.|-- ftpmirror.your.org         0.0%     1   33.4  33.4  33.4  33.4   0.0
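As a rough sketch of how a configurable choice could fall back on latency, the measurements above can be ranked like this. The function names are hypothetical, and the figures are single probes from my own host (for WMF, the first traceroute probe, since mtr got no reply), so they only illustrate the idea.

```python
# Last-hop latency in milliseconds, copied from the mtr/traceroute output above.
MEASURED_MS = {
    "dumps.wikimedia.org": 43.632,     # first traceroute probe
    "wikipedia.c3sl.ufpr.br": 165.3,
    "ftp.fi.muni.cz": 119.4,
    "ftpmirror.your.org": 33.4,
}


def rank_mirrors(latency_ms):
    """Return mirror host names ordered from lowest to highest latency."""
    return sorted(latency_ms, key=latency_ms.get)


def choose_mirror(latency_ms, configured=None):
    """Prefer the configured mirror if one is set; otherwise take the nearest."""
    if configured is not None:
        return configured
    return rank_mirrors(latency_ms)[0]
```

An explicit setting always wins here; latency is used only when the user has not configured a site.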
Sincerely Yours, Kent
On 1/5/13, Christian Aistleitner christian@quelltextlich.at wrote:
Hi Kent,
On Fri, Jan 04, 2013 at 01:46:43PM -0500, wp mirror wrote:
Dear Ariel,
allow me to chime in although I'm not Ariel.
Ariel, please bite my head off, if I say anything wrong.
Correction: What I meant to say is that I rely on `enwiki-latest-md5sums.txt' referencing complete files:
(shell)$ rsync --copy-links dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/enwiki-latest-md5sums.txt .
(shell)$ cat enwiki-latest-md5sums.txt | grep imagelinks
8b5524d36a795b020a6423127c269610  enwiki-20121201-imagelinks.sql.gz
Is this assumption valid?
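To make the reliance concrete, here is a minimal sketch of how a downloaded file could be checked against the md5sums listing. The function names are hypothetical, and the parser assumes the usual "digest, whitespace, file name" layout seen in the grep output above.

```python
import hashlib


def parse_md5sums(text):
    """Map each file name in an md5sums listing to its hex digest."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # digest, then the rest of the line
        if len(parts) == 2:
            digest, name = parts
            sums[name.strip()] = digest.lower()
    return sums


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def is_intact(path, name, sums):
    """True only if name is listed in sums and the file's digest matches."""
    expected = sums.get(name)
    return expected is not None and md5_of(path) == expected
```

A file that is missing from the listing is treated as not verified, which matches the idea that only listed files are known to be complete.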
The assumption that *-*-md5sums.txt only references completed files looks sound to me.
The temporary md5sums files get updated in Runner.runUpdateItemFileInfo, which is only called if a Run's item has status "done" (compare dumpruninfo.txt). So all files referenced in the temporary md5sums file should be completed files.
Best regards, Christian
P.S.: However (if that was your intention), I would not rely on *-latest-md5sums.txt listing all files that are necessary for a full, complete dump, if the dump's corresponding dumpruninfo.txt contains jobs that are not marked as "done".
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a
4040 Linz, Austria
Email: christian@quelltextlich.at
Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/