Dear Christian,
Thank you. That was helpful.
1) PARTIALS
In order to pull dumps from C3SL and 'muni.cz', which do not offer
'latest' directories, I need to make some assumptions. Please let me
know if I got all of the following right:
a) Listing. I see from
<http://wikitech.wikimedia.org/view/Dumps/Mirror_status> that the
mirror sites are built with rsync. I assume this means that, when a
new file is uploaded, the name of the partial is prefixed with a dot
'.' and also given a suffix, like this one:
.enwiki-20130102-pages-meta-history9.xml-p000832942p000885722.bz2.sUFHUB
The partial is then renamed upon completion.
b) Failed runs. A dump file could be the output of a failed process,
but 'dumpruninfo.txt' will say so, like this:
name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37
c) Rsync options. I did not see any '.~tmp~' directories, so I assume
you do not use the --delay-updates option.
Please let me know if I got all of that right.
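To make assumptions (a) and (b) concrete, here is a minimal Python sketch, for illustration only. The function names are hypothetical, and the six-character suffix in the pattern is inferred from the single example above rather than from anything the mirror operators have confirmed.

```python
import re

# Assumption: while rsync is still transferring a file, the listing shows
# ".<final name>.<6 random characters>", as in the example above.
_PARTIAL_RE = re.compile(r"^\..+\.[A-Za-z0-9]{6}$")


def split_listing(names):
    """Separate apparently finished files from apparent rsync partials."""
    complete, partial = [], []
    for name in names:
        (partial if _PARTIAL_RE.match(name) else complete).append(name)
    return complete, partial


def parse_dumpruninfo(text):
    """Map each job name in dumpruninfo.txt to its status string.

    Each line is expected to look like
    "name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37".
    """
    jobs = {}
    for line in text.splitlines():
        # Split on the first ':' only, because the timestamp contains colons.
        fields = dict(
            part.strip().partition(":")[::2]
            for part in line.split(";")
            if ":" in part
        )
        if "name" in fields:
            jobs[fields["name"]] = fields.get("status", "")
    return jobs
```

A file would then be fetched only if it is in the 'complete' list and its job is marked "done" in 'dumpruninfo.txt'.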
2) FTP
I would like to get WP-MIRROR to work with all mirror sites/protocols
given in the tables on
<http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps>.
Here are the results of my attempts to gain access to dumps
(Y=success, N=failure):
WMF: http(Y)
C3SL: http(Y), ftp(Y), rsync(Y)
MUNI: http(Y), ftp(N), rsync(Y)
YOUR: http(Y), ftp(N), rsync(Y)
With FTP, I can access C3SL, but not 'muni.cz' or 'your.org':
(shell)$ ftp
ftp> open wikipedia.c3sl.ufpr.br
Connected to sagres.c3sl.ufpr.br.
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 44 of 2000 allowed.
220-Local time is now 02:48. Server port: 21.
220-Only anonymous FTP is allowed here
220 You will be disconnected after 15 minutes of inactivity.
Name (wikipedia.c3sl.ufpr.br:kmiller):
230 Anonymous user logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> close
221-Goodbye. You uploaded 0 and downloaded 0 kbytes.
221 Logout.
ftp> open ftp.fi.muni.cz
Connected to odysseus.fi.muni.cz.
220 ProFTPD 1.3.4b Server (Faculty of Informatics) [2001:718:801:230::cd]
Name (ftp.fi.muni.cz:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> open ftpmirror.your.org
ftp: connect to address 2001:4978:1:420::cc09:3752: Connection timed out
Trying 204.9.55.82...
Connected to ftpmirror.your.org.
220 ProFTPD 1.3.4a Server (Your.Org FTP Archive) [::ffff:204.9.55.82]
Name (ftpmirror.your.org:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.
I have network connectivity (see below) and my firewall permits
outbound FTP connections.
Does either 'muni.cz' or 'your.org' require a password?
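In the two failed sessions the client offered my local user name ('kmiller'), not 'anonymous'. The following Python sketch retries with the conventional anonymous user. It assumes, without confirmation, that these servers accept anonymous logins; the function name and the 'ftp_class' parameter (there only so the logic can be exercised without a network) are hypothetical.

```python
import ftplib


def anonymous_login_ok(host, timeout=30, ftp_class=ftplib.FTP):
    """Try an anonymous FTP login and report the outcome.

    Returns (True, server reply) on success, (False, error text) otherwise.
    """
    try:
        with ftp_class(host, timeout=timeout) as ftp:
            # login() with no arguments sends user "anonymous"
            # and password "anonymous@".
            return True, ftp.login()
    except ftplib.all_errors as exc:
        return False, str(exc)
```

Calling anonymous_login_ok("ftp.fi.muni.cz") and anonymous_login_ok("ftpmirror.your.org") would show whether the 530 replies above came only from the user name.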
3) SITE CHOICE
In terms of hops and latency, there is a great difference between the
nearest and the farthest mirror site. The choice of mirror site will
therefore be configurable. I hope to get WP-MIRROR to work with each of them.
(shell)$ mtr -r -c 1 dumps.wikimedia.org
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
2.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
(shell)$ traceroute dumps.wikimedia.org
11 dataset2.wikimedia.org (208.80.152.185) 43.632 ms 43.085 ms 40.850 ms
(shell)$ mtr -r -c 1 wikipedia.c3sl.ufpr.br
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
13.|-- sagres.c3sl.ufpr.br 0.0% 1 165.3 165.3 165.3 165.3 0.0
(shell)$ mtr -r -c 1 ftp.fi.muni.cz
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
12.|-- odysseus.ip6.fi.muni.cz 0.0% 1 119.4 119.4 119.4 119.4 0.0
(shell)$ mtr -r -c 1 ftpmirror.your.org
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
7.|-- ftpmirror.your.org 0.0% 1 33.4 33.4 33.4 33.4 0.0
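As a rough sketch of what "configurable" could mean, the Python below picks the mirror with the lowest measured latency unless a preferred site is set explicitly. The names 'MEASURED_MS' and 'choose_mirror' are hypothetical, and the numbers are simply the single samples from the runs above, not a proper benchmark.

```python
# Round-trip times in ms to the last hop, copied from the runs above.
# mtr reported one sample per site; for traceroute the middle of the
# three probes is used.
MEASURED_MS = {
    "dumps.wikimedia.org": 43.085,
    "wikipedia.c3sl.ufpr.br": 165.3,
    "ftp.fi.muni.cz": 119.4,
    "ftpmirror.your.org": 33.4,
}


def choose_mirror(latency_ms, preferred=None):
    """Return the configured mirror, or the lowest-latency one by default."""
    if preferred is not None:
        if preferred not in latency_ms:
            raise ValueError(f"unknown mirror site: {preferred}")
        return preferred
    return min(latency_ms, key=latency_ms.get)
```

With these figures the default would be 'ftpmirror.your.org', while a user could still override it, for example with preferred="ftp.fi.muni.cz".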
Sincerely Yours,
Kent
On 1/5/13, Christian Aistleitner <christian(a)quelltextlich.at> wrote:
Hi Kent,
On Fri, Jan 04, 2013 at 01:46:43PM -0500, wp mirror wrote:
Dear Ariel,
allow me to chime in although I'm not Ariel.
Ariel, please bite my head off if I say anything wrong.
Correction: What I meant to say is that I rely on
`enwiki-latest-md5sums.txt' referencing complete files:
(shell)$ rsync --copy-links dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/enwiki-latest-md5sums.txt .
(shell)$ cat enwiki-latest-md5sums.txt | grep imagelinks
8b5524d36a795b020a6423127c269610 enwiki-20121201-imagelinks.sql.gz
Is this assumption valid?
The assumption that *-*-md5sums.txt only references completed files
looks sound to me.
The temporary md5sums files get updated in
Runner.runUpdateItemFileInfo, which is only called if a Run's item has
status "done" (compare dumpruninfo.txt). So every file referenced in
the temporary md5sums file should be a completed file.
Best regards,
Christian
P.S.: However (if that was your intention), I would not rely on
*-latest-md5sums.txt listing all files that are necessary for a full,
complete dump, if the dump's corresponding dumpruninfo.txt contains
jobs that are not marked as "done".
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------