Dear Ariel,
0) WP-MIRROR
WP-MIRROR 0.6 now works with dumps from your.org. I am turning my attention to the other mirror sites.
1) LATEST
I read with interest the thread about `latest' directories that began with http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000610.html. I have some additional questions.
The mirror sites at C3SL and Masaryk Univ. do not have a `latest' directory in the project directories that I looked at. Compare for example:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/ | tail -n 2
drwxr-xr-x 242 2013/01/04 07:52:13 20130102
drwxr-xr-x 1101 2013/01/03 18:48:34 latest
(shell)$ rsync wikipedia.c3sl.ufpr.br::wikipedia/enwiki/ | tail -n 2
drwxr-xr-x 61440 2012/11/10 10:47:05 20121101
drwxr-xr-x 61440 2012/12/10 09:21:34 20121201
WP-MIRROR looks for the `latest' directory on the assumption that any links found there point to complete files (i.e. no partials), whereas files found in dated directories may be partials. For example, the most recent `imagelinks':
This file is complete:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20121201/ | grep imagelinks
-rw-r--r-- 356437362 2012/12/01 07:08:54 enwiki-20121201-imagelinks.sql.gz
This file is a partial:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20130102/ | grep imagelinks
-rw-r--r-- 20 2013/01/02 07:47:35 enwiki-20130102-imagelinks.sql.gz
The `latest' link points to the complete file:
(shell)$ rsync -a dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/ | grep image
lrwxrwxrwx 40 2013/01/02 03:52:49 enwiki-latest-image.sql.gz -> ../20130102/enwiki-20130102-image.sql.gz
So I am wondering what algorithm I should use if I want WP-MIRROR to pull dump files from C3SL and Masaryk U. Can you help with the following questions?
2) C3SL
In the absence of a `latest' directory, can I be sure that all the files found there are complete files (i.e. not partials)? Is the mirroring process atomic?
3) Masaryk Univ.
Several issues: a) No `latest' directories; b) no `enwiki'; and c) most recent dumps date from November:
(shell)$ rsync ftp.fi.muni.cz::pub/wikimedia/zuwiki/ | tail -n 2
drwxr-xr-x 4096 2012/10/23 14:04:02 20121023
drwxr-xr-x 4096 2012/11/05 15:02:33 20121105
Will they be catching up?
Sincerely Yours, Kent
Dear Ariel,
On 1/4/13, wp mirror wpmirrordev@gmail.com wrote: -----snip----
- LATEST
-----snip-----
The `latest' link points to the complete file:
(shell)$ rsync -a dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/ | grep image
lrwxrwxrwx 40 2013/01/02 03:52:49 enwiki-latest-image.sql.gz -> ../20130102/enwiki-20130102-image.sql.gz
Correction: What I meant to say, is that I rely on `enwiki-latest-md5sums.txt' referencing complete files:
(shell)$ rsync --copy-links dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/enwiki-latest-md5sums.txt .
(shell)$ cat enwiki-latest-md5sums.txt | grep imagelinks
8b5524d36a795b020a6423127c269610 enwiki-20121201-imagelinks.sql.gz
Is this assumption valid?
Sincerely Yours, Kent
Hi Kent,
On Fri, Jan 04, 2013 at 01:46:43PM -0500, wp mirror wrote:
Dear Ariel,
allow me to chime in although I'm not Ariel.
Ariel, please bite my head off, if I say anything wrong.
Correction: What I meant to say, is that I rely on `enwiki-latest-md5sums.txt' referencing complete files:
(shell)$ rsync --copy-links dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/enwiki-latest-md5sums.txt .
(shell)$ cat enwiki-latest-md5sums.txt | grep imagelinks
8b5524d36a795b020a6423127c269610 enwiki-20121201-imagelinks.sql.gz
Is this assumption valid?
The assumption that *-*-md5sums.txt only references completed files, looks sound to me.
The temporary md5sums files get updated in Runner.runUpdateItemFileInfo, which is only called if a Run's item has status "done" (compare dumpruninfo.txt). So all files referenced in the temporary md5sums file should reference only completed files.
Best regards, Christian
P.S.: However (if that was your intention), I would not rely on *-latest-md5sums.txt listing all files that are necessary for a full, complete dump, if the dump's corresponding dumpruninfo.txt contains jobs that are not marked as "done".
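The check discussed above could be sketched roughly as follows. This is only an illustration in Python, not WP-MIRROR's actual code: it assumes each line of the md5sums file holds a hex digest followed by a file name, as in the `grep' output quoted in this thread, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def parse_md5sums(text):
    """Map each file name listed in an md5sums file to its hex digest."""
    digests = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed layout: "<hex digest> <file name>", whitespace separated.
        if len(parts) == 2:
            digest, name = parts
            digests[name] = digest.lower()
    return digests


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def is_listed_and_intact(path, digests):
    """True only if the file is listed in the md5sums data and its digest matches."""
    expected = digests.get(Path(path).name)
    return expected is not None and file_md5(path) == expected
```

A file that is absent from the md5sums listing is treated as not verified, which matches the caution in the P.S.: the listing says which files are finished, not that the whole dump is.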
Dear Christian,
Thank you. That was helpful.
1) PARTIALS
In order to pull dumps from C3SL and 'muni.cz', which do not offer 'latest' directories, I need to make some assumptions. Please let me know if I got all of the following right:
a) Listing. I see from http://wikitech.wikimedia.org/view/Dumps/Mirror_status that the mirror sites are built with rsync. I assume this means that, when a new file is uploaded, the name of the partial is prefixed with a dot '.' and also given a suffix, like this one:
.enwiki-20130102-pages-meta-history9.xml-p000832942p000885722.bz2.sUFHUB
The partial is then renamed upon completion.
b) Failed runs. A dump file could be the output of a failed process, but 'dumpruninfo.txt' will say so, like this:
name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37
c) Rsync options. I did not see any '.~tmp~' directories, so I assume you do not use the --delay-updates option.
Please let me know if I got all of that right.
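Assumption a) can be turned into a small filter. The sketch below is in Python and rests on the naming seen in the example above (a leading dot, the final file name, then a dot and six characters); whether every mirror's rsync setup produces names of exactly this shape is not confirmed in this thread, and the helper name is invented for illustration.

```python
import re

# Assumed pattern: ".<final name>.<six alphanumeric characters>", as in
# ".enwiki-20130102-pages-meta-history9.xml-p000832942p000885722.bz2.sUFHUB".
RSYNC_TEMP = re.compile(r"^\.(?P<final>.+)\.[A-Za-z0-9]{6}$")


def final_name_if_partial(name):
    """Return the eventual file name if `name` looks like an rsync temporary
    file, otherwise None."""
    match = RSYNC_TEMP.match(name)
    return match.group("final") if match else None
```

A caller would skip any entry for which this returns a name and retry on a later pass, once the transfer has been renamed into place.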
2) FTP
I would like to get WP-MIRROR to work with all mirror sites/protocols given in the tables on http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps. Here are the results of my attempts to gain access to dumps (Y=success, N=failure):
WMF:  http(Y)
C3SL: http(Y), ftp(Y), rsync(Y)
MUNI: http(Y), ftp(N), rsync(Y)
YOUR: http(Y), ftp(N), rsync(Y)
With FTP, I can access C3SL, but not 'muni.cz' or 'your.org':
(shell)$ ftp
ftp> open wikipedia.c3sl.ufpr.br
Connected to sagres.c3sl.ufpr.br.
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 44 of 2000 allowed.
220-Local time is now 02:48. Server port: 21.
220-Only anonymous FTP is allowed here
220 You will be disconnected after 15 minutes of inactivity.
Name (wikipedia.c3sl.ufpr.br:kmiller):
230 Anonymous user logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> close
221-Goodbye. You uploaded 0 and downloaded 0 kbytes.
221 Logout.
ftp> open ftp.fi.muni.cz
Connected to odysseus.fi.muni.cz.
220 ProFTPD 1.3.4b Server (Faculty of Informatics) [2001:718:801:230::cd]
Name (ftp.fi.muni.cz:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> open ftpmirror.your.org
ftp: connect to address 2001:4978:1:420::cc09:3752: Connection timed out
Trying 204.9.55.82...
Connected to ftpmirror.your.org.
220 ProFTPD 1.3.4a Server (Your.Org FTP Archive) [::ffff:204.9.55.82]
Name (ftpmirror.your.org:kmiller):
331 Password required for kmiller
Password:
530 Login incorrect.
Login failed.
Remote system type is UNIX.
Using binary mode to transfer files.
I have network connectivity (see below) and my firewall permits outbound FTP connections.
Does either 'muni.cz' or 'your.org' require a password?
3) SITE CHOICE
In terms of hops and latency, there is a great difference between the nearest and the farthest mirror site. Therefore the choice of mirror site will be configurable. I hope to get WP-MIRROR to access each of them.
(shell)$ mtr -r -c 1 dumps.wikimedia.org
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
2.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
(shell)$ traceroute dumps.wikimedia.org
11 dataset2.wikimedia.org (208.80.152.185) 43.632 ms 43.085 ms 40.850 ms
(shell)$ mtr -r -c 1 wikipedia.c3sl.ufpr.br
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
13.|-- sagres.c3sl.ufpr.br 0.0% 1 165.3 165.3 165.3 165.3 0.0
(shell)$ mtr -r -c 1 ftp.fi.muni.cz
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
12.|-- odysseus.ip6.fi.muni.cz 0.0% 1 119.4 119.4 119.4 119.4 0.0
(shell)$ mtr -r -c 1 ftpmirror.your.org
HOST: darkstar-7 Loss% Snt Last Avg Best Wrst StDev
7.|-- ftpmirror.your.org 0.0% 1 33.4 33.4 33.4 33.4 0.0
Sincerely Yours, Kent
On 1/5/13, Christian Aistleitner christian@quelltextlich.at wrote:
-----snip-----
Hi Kent,
On Sun, Jan 06, 2013 at 01:11:52AM -0500, wp mirror wrote:
- PARTIALS
In order to pull dumps from C3SL and 'muni.cz', which do not offer 'latest' directories, I need to make some assumptions. Please let me know if I got all of the following right:
[...]
b) Failed runs. A dump file could be the output of a failed process, but 'dumpruninfo.txt' will say so, like this:
name:imagelinkstable; status:failed; updated:2013-01-02 12:47:37
Yes. But as there are more statuses than just “failed” and “done” (markers for waiting, in-progress, etc., IIRC), I'd rather check for status “done” (instead of dropping only “failed” jobs) to get all the jobs that passed without problems.
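The advice to accept only “done” (rather than reject only “failed”) might look like this in Python. This is a sketch under assumptions: it takes the line format from the single `dumpruninfo.txt' line quoted above (semicolon-separated `key:value' fields), and the function names are made up for the example.

```python
def parse_dumpruninfo(text):
    """Map each job name to its status, given the contents of dumpruninfo.txt."""
    statuses = {}
    for line in text.splitlines():
        fields = {}
        for part in line.split(";"):
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = part.strip().partition(":")
            if sep:
                fields[key] = value
        if "name" in fields and "status" in fields:
            statuses[fields["name"]] = fields["status"]
    return statuses


def done_jobs(text):
    """Names of jobs whose status is exactly "done"; every other status
    (failed, waiting, in progress, ...) is left out."""
    return sorted(
        name for name, status in parse_dumpruninfo(text).items() if status == "done"
    )
```

Filtering on the positive status means an unknown or newly introduced status is treated as not usable, which is the safer default for a mirroring tool.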
[ Mirror specific questions ]
On http://wikitech.wikimedia.org/view/Dumps/Mirror_status the mirrors are marked as not being run by Wikimedia. So if no one provides the information you requested in time, it might well be that no one knows :-) If you need information on how the mirrors are set up, I guess it's best to nicely ask the mirror providers directly.
- FTP
[...] With FTP, I can access C3SL, but not 'muni.cz' or 'your.org':
Use anonymous ftp access.
ftp> open ftp.fi.muni.cz
[...]
Name (ftp.fi.muni.cz:kmiller):
                     ^^^^^^^
331 Password required for kmiller
[...]
Login failed.
[...]
Anonymous FTP access works for me.
-----8<-----Terminal transcript-----8<-----Begin------
ncftp ftp.fi.muni.cz
NcFTP 3.2.5 (Feb 02, 2011) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 147.251.48.205...
ProFTPD 1.3.4b Server (Faculty of Informatics) [::ffff:147.251.48.205]
Logging in...
Hello, UNKNOWN at **********!

Vitejte na FTP serveru          Welcome to the FTP server of
Fakulty informatiky             Faculty of Informatics
Masarykovy univerzity v Brne    Masaryk University, Brno
This FTP site is in Brno, Czech Republic, Europe. The local time is Mon Jan 07 10:50:09 2013. You are user number 5 out of maximium 1200. All transfers to and from archive are logged. If you do not like this policy, disconnect now!
We serve as the ftp.fi.muni.cz, ftp.linux.cz, and ftp.cstug.cz archive, and we have lot of Linux-, UNIX-, and TeX-related stuff here. Look at the /pub/ROADMAP (or /pub/ROADMAP.html) for details. The file /pub/README.uploads states the rules for uploading data to this server. The server is avaliable via rsync and HTTP protocols. Use the following URLs: rsync://ftp.fi.muni.cz/pub and http://ftp.fi.muni.cz/pub/. The server is available via FTP over IPv6 at ftp://ftp6.linux.cz/ as well. Look at http://www.linux.cz/stats/ for the hardware configuration and statistics of this server.
-System Administrator ftp-admin@fi.muni.cz
Anonymous access granted, restrictions apply
Logged in to ftp.fi.muni.cz.
ncftp / > ll
drw-r--r-- 0 0 24 Feb 1 2011 etc
drw-r--r-- 0 0 4096 Feb 1 2011 http
drw-r--r-- 0 0 0 Jän 7 10:14 mount
drw-r--r-- 0 108 4096 Jän 7 06:56 pub
ncftp / > quit
-----8<-----Terminal transcript-----8<-----End------
ftp> open ftpmirror.your.org
[...]
Name (ftpmirror.your.org:kmiller):
                         ^^^^^^^
331 Password required for kmiller
[...]
Login failed.
[...]
Anonymous FTP access works for me.
-----8<-----Terminal transcript-----8<-----Begin------
ncftp ftpmirror.your.org
NcFTP 3.2.5 (Feb 02, 2011) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 204.9.55.82...
ProFTPD 1.3.4a Server (Your.Org FTP Archive) [::ffff:204.9.55.82]
Logging in...
This mirror is sponsored by Your.Org.
This data is available by IPv4 and IPv6, by FTP, HTTP and rsync.
/pub/FreeBSD          Primary FreeBSD FTP Archive
/pub/FreeBSD-Archive  Historical archive of past releases
/pub/FreeBSD-CVS      Full CVS Repository
/pub/deft             DEFT Linux live cd
/pub/wikimedia        Archives from the Wikimedia foundation
/pub/wikimedia/dumps  Wikimedia database dumps
The above directories are also accessible via anonymous rsync. Use "rsync ftpmirror.your.org::" to see a list of rsyncable modules.
This host is also a full FreeBSD cvsup mirror.
Anonymous access granted, restrictions apply
Logged in to ftpmirror.your.org.
ncftp / > ll
drwxr-xr-x 0 14 Feb 28 2012 pub
-rw-r--r-- 0 14 597 Mai 10 2012 README.txt
ncftp / > quit
-----8<-----Terminal transcript-----8<-----End------
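For a scripted client, the same anonymous login can be sketched in Python with the standard `ftplib' module. This is an illustration only; the function name and the default timeout are arbitrary choices for the example, and nothing here is taken from WP-MIRROR itself.

```python
from ftplib import FTP


def anonymous_listing(host, path="/", timeout=30):
    """Log in anonymously to an FTP server and return the names found in `path`."""
    with FTP(host, timeout=timeout) as ftp:
        # login() with no arguments sends the user name "anonymous"
        # instead of the local account name.
        ftp.login()
        ftp.cwd(path)
        return ftp.nlst()
```

For instance, `anonymous_listing("ftp.fi.muni.cz", "/pub/wikimedia")` would be expected to list the wiki directories on that mirror, provided the server is reachable. With the interactive `ftp' client, the equivalent is to type `anonymous' at the `Name' prompt rather than accepting the default local user name.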
Best regards, Christian
Dear Christian,
Thank you very much for your help with FTP.
0) SITES
WP-MIRROR 0.6 is now able to pull dump files from any of the following:
rsync://ftpmirror.your.org/wikimedia-dumps/ <-- default
rsync://wikipedia.c3sl.ufpr.br/wikipedia/
rsync://ftp.fi.muni.cz/pub/wikimedia/
ftp://ftpmirror.your.org/pub/wikimedia/dumps/
ftp://wikipedia.c3sl.ufpr.br/wikipedia/
ftp://ftp.fi.muni.cz/pub/wikimedia/
http://dumps.wikimedia.your.org/
http://wikipedia.c3sl.ufpr.br/
http://ftp.fi.muni.cz/pub/wikimedia/
http://dumps.wikimedia.org/
1) DUMP MIRROR INCOMPATIBILITIES
The following remarks may be useful if WMF wishes to encourage many dump mirrors like Debian did. Currently the four WMF dump mirrors have a number of minor incompatibilities that cause the programmer extra effort.
1.1) `LATEST' DIRECTORY
Issue: WMF and YOUR provide a `latest' directory for each wiki. C3SL and MUNI do not.
Workaround: WP-MIRROR 0.6 ignores the `latest' directory, and instead obtains a directory listing, sorts it, and uses the most recent `yyyymmdd' directory.
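A minimal sketch of that selection step, in Python rather than WP-MIRROR's own code, could look like the following. It assumes the listing has the shape of the rsync output shown earlier in this thread (permissions first, name last); the function name is invented for the example.

```python
import re

# Directory names of interest consist of exactly eight digits (yyyymmdd).
DATED = re.compile(r"^\d{8}$")


def latest_dump_dir(listing):
    """Return the most recent yyyymmdd directory name in an rsync listing,
    or None if the listing contains no dated directory."""
    dated = []
    for line in listing.splitlines():
        fields = line.split()
        # Directory entries start with 'd'; the name is the last field.
        if fields and fields[0].startswith("d") and DATED.match(fields[-1]):
            dated.append(fields[-1])
    # yyyymmdd strings sort chronologically when compared as text.
    return max(dated) if dated else None
```

Picking the newest dated directory says nothing about whether that dump run has finished, so the files inside it still need a completeness check.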
1.2) DIRECTORY LISTING
Issue: When using HTTP, WMF and YOUR have `index.html' files that prevent the user from getting a directory listing of wikis in the usual way. C3SL and MUNI do offer a directory listing of wikis.
Workaround: WP-MIRROR 0.6, as a fallback, looks for `rsync-dirlist-last-1-good.txt', and parses that to make a list of wikis.
1.3) ENCODING
Issue: When using HTTP, MUNI provides a directory listing of wikis. However, while the HTTP header says "Content-Type: text/html;charset=UTF-8", the content is actually ISO-8859-1. The parser throws a fatal error when it reads non-UTF-8 bytes.
Workaround: WP-MIRROR 0.6 assumes ISO-8859-1 encoding.
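One alternative to assuming ISO-8859-1 outright is to try the declared charset first and fall back only when decoding fails. The Python sketch below shows that variant; it is not what WP-MIRROR 0.6 does (which, per the workaround above, simply assumes ISO-8859-1), and the function and parameter names are made up for the example.

```python
def decode_listing(body, declared="utf-8", fallback="iso-8859-1"):
    """Decode an HTTP response body, falling back to another charset when
    the one declared in the Content-Type header turns out to be wrong."""
    try:
        return body.decode(declared)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so decoding
        # with it cannot raise.
        return body.decode(fallback)
```

The fallback keeps correctly labelled UTF-8 pages intact while still accepting a mislabelled ISO-8859-1 page, at the cost of one failed decode attempt on the latter.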
1.4) SYNC
Issue: MUNI has not synced since early 2012-Nov. Not all dump mirrors provide the same set of wikis. MUNI has no `enwiki'. By comparison, Debian mirrors in `push-mode' sync every six hours.
Workaround: WP-MIRROR 0.6 obtains the directory listing of wikis (as above), and then pulls the most recent dumps (as above). The user must notice the lack of sync and configure another dump site. The default site is rsync://ftpmirror.your.org/wikimedia-dumps/.
2) IMAGE TARBALLS
I noticed that these are offered by HTTP and FTP, but not RSYNC. Will that change?
3) PROGRESS REPORT
WP-MIRROR 0.6 can now mirror any WMF wiki; not just those from its wikipedia project.
Action items: OPEN
1) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will make use of image tarballs.
4) Opened: 2013-01-03
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000661.html
   Description: WP-MIRROR 0.6 shall use SHA1 digests to validate image files.
7) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: Study mwimport for possible use.
Action items: CLOSED
2) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will incorporate list of dump/tarball mirror sites as a configurable parameter.
   Disposition: Done.
   Closed: 2013-01-14
3) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000651.html
   Description: WP-MIRROR 0.6 will make use of rsync.
   Disposition: Done.
   Closed: 2013-01-14
5) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: WP-MIRROR 0.6 will reorganize images directory tree to match your.org.
   Disposition: Done.
   Closed: 2013-01-14
6) Opened: 2013-01-02
   http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000654.html
   Description: Study multistream bz2 for possible use.
   Disposition: Done.
   Closed: 2013-01-14
Sincerely Yours, Kent
On Jan 14, 2013, at 8:00 PM, wp mirror wpmirrordev@gmail.com wrote:
- IMAGE TARBALLS
I noticed that these are offered by HTTP and FTP, but not RSYNC. Will that change?
rsync://ftpmirror.your.org/wikimedia-imagedumps has been added to our rsyncd.
-- Kevin
Dear Kevin,
Outstanding! Thank you very much.
Sincerely Yours, Kent
On 1/14/13, Kevin Day kevin@your.org wrote:
-----snip-----
xmldatadumps-l@lists.wikimedia.org