Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page.
I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions obtain image files from
<http://upload.wikimedia.org/>. SPDY would reduce latency. WP-MIRROR
0.6 (not yet released) uses HTTP/1.1 persistent connections.
WP-MIRROR 0.6 has built-in profiling, and the image downloading
process now uses 64% less (wall clock) time. Therefore SPDY may not
provide much advantage. Thanks also for informing me of the image
tarballs.
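The persistent-connection approach can be sketched like this; the host
and paths are hypothetical placeholders, and this illustrates only the
reuse of one TCP connection for many requests, not WP-MIRROR's actual
downloader:

```python
import http.client

def fetch_many(host, paths):
    """Fetch several resources over a single persistent HTTP/1.1
    connection instead of opening one connection per file."""
    conn = http.client.HTTPConnection(host, timeout=60)
    results = {}
    try:
        for path in paths:
            conn.request("GET", path)    # reuses the same TCP connection
            resp = conn.getresponse()
            results[path] = resp.read()  # drain fully before next request
    finally:
        conn.close()
    return results
```

Skipping the TCP and TLS handshake per file is where the wall-clock
savings come from when fetching many small images from one host.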
Conclusion: I will not pursue SPDY, for lack of a requirement.
Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt
image files. Most of these were partial downloads: cURL would time
out and leave corrupt files behind. I currently deal with that by
validating the images. Validation, however, consumes a lot of time,
so I am looking for ways to improve the reliability of downloading.
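The validation step can be sketched as follows. This is only an
illustration of hash-based file validation, not WP-MIRROR's actual
code; the SHA-1 choice and the function names are my own assumptions:

```python
import hashlib
from pathlib import Path

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 digest of a file, reading in 1 MiB chunks
    so large images do not have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_download(path, expected_sha1):
    """True if the file exists, is non-empty, and matches the hash;
    a partial download fails the hash check and can be re-fetched."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    return sha1_of_file(p) == expected_sha1
```

The expense is clear from the sketch: every byte of every image must be
re-read and hashed, which is why more reliable downloads would be such
a win.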
Metalink was brought to my attention by Jason Skomorowski. The
relevant documents are RFC 5854 and RFC 6249. From the latter:
"This document describes a mechanism by which the benefit of mirrors
can be automatically and more effectively realized. All the
information about a download, including mirrors, cryptographic
hashes, digital signatures, and more can be transferred in
coordinated HTTP header fields, hereafter referred to as a
"Metalink". This Metalink transfers the knowledge of the download
server (and mirror database) to the client. Clients can fall back to
other mirrors if the current one has an issue. With this knowledge,
the client is enabled to work its way to a successful download even
under adverse circumstances. All this can be done without
complicated user interaction, and the download can be much more
reliable and efficient. In contrast, a traditional HTTP redirect to
a mirror conveys only minimal information -- one link to one server
-- and there is no provision in the HTTP protocol to handle failures.
Furthermore, in order to provide better load distribution across
servers and potentially faster downloads to users, Metalink/HTTP
facilitates multi-source downloads, where portions of a file are
downloaded from multiple mirrors (and, optionally, Peer-to-Peer)
simultaneously.
Upon connection to a Metalink/HTTP server, a client will receive
information about other sources of the same resource and a
cryptographic hash of the whole resource. The client will then be
able to request chunks of the file from the various sources,
scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, which
would obviate the file validation process.
The benefits to folks on this e-mail list are: a) your mirror sites
would get more traffic (Ariel mentioned that they are getting very
little); b) the download process (for Metalink-capable clients) would
be robust against the outage of any one mirror; and c) Metalink-capable
clients are now common (cURL, KGet, ...).
I understand that the idea for Metalink originated with those who
posted GNU/Linux distributions in .iso format. With each new .iso
release, there would be a surge of downloading, causing many partial
downloads (i.e. much wasted bandwidth). Metalink helped spread the
load and, by transporting hashes, improved download integrity.
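The client-side fallback idea from RFC 6249 can be sketched as follows.
This is not a full Metalink/HTTP client (which reads mirrors and digests
from Link: and Digest: response headers); it only illustrates the
try-next-mirror-and-verify behavior, and all names here are hypothetical:

```python
import hashlib

def fetch_with_fallback(urls, fetch, expected_sha256=None):
    """Try each mirror URL in turn; return the payload from the first
    mirror that responds and (if a hash is given) verifies. `fetch`
    is any callable mapping a URL to bytes, e.g. a urllib wrapper."""
    last_error = None
    for url in urls:
        try:
            data = fetch(url)
        except OSError as err:       # refused, timed out, reset, ...
            last_error = err
            continue
        if expected_sha256 and hashlib.sha256(data).hexdigest() != expected_sha256:
            last_error = ValueError("hash mismatch from " + url)
            continue                 # partial/corrupt copy: try the next mirror
        return data
    raise RuntimeError("all mirrors failed; last error: %r" % (last_error,))
```

Because the hash travels with the mirror list, a corrupt partial
download is detected immediately and retried elsewhere, rather than
being discovered later by a separate validation pass.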
Conclusion: I will table the issue of Metalink, for lack of an
immediate requirement.
Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball
mirror sites as a configurable parameter.
3) RSYNC
Thanks for letting me know that dumps and tarballs are available via
rsync. I much prefer rsync over HTTP and FTP. I mirror the Debian
archive, and recently switched from apt-mirror (which uses wget) to
ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
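A minimal sketch of what "make use of rsync" could look like from a
wrapper. The rsync source path is hypothetical (substitute whatever
module/path the mirror operator publishes), and the flag set is just a
sensible starting point:

```python
import subprocess

# Hypothetical rsync source; substitute the actual module/path
# published by the dump mirror operator.
SOURCE = "rsync://mirror.example.org/wikimedia-dumps/enwiki/"

def build_rsync_cmd(source, dest):
    """-a preserves attributes; --partial keeps interrupted transfers
    so rsync can resume them; --timeout guards against stalled links."""
    return ["rsync", "-av", "--partial", "--timeout=600", source, dest]

def sync_dumps(source=SOURCE, dest="/var/mirror/enwiki/"):
    """Run rsync; only new or changed files are transferred."""
    return subprocess.run(build_rsync_cmd(source, dest), check=True)
```

The appeal over plain HTTP is exactly the partial-download problem
above: --partial lets an interrupted transfer resume instead of being
thrown away.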
Ariel raised some other points, which I shall address in a separate email.
Sincerely Yours,
Kent
Dear List Members,
Does anyone know if the Wikimedia Foundation plans to support Metalink
or SPDY for its dump files and/or image files? See the RFC references
below.
WP-MIRROR downloads dump and image files to build a mirror of a set of
wikipedias. WP-MIRROR 0.5 is feature complete. I am now looking for
ways to optimize performance (i.e. reduce mirror build time). Were
the WMF to support the above two protocols, downloads would be faster
and require less time spent on validation.
Sincerely Yours,
Kent
On 12/29/12, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:
> Hello! I'm sorry, but I don't know the answer to these questions;
> perhaps you could email the dumps mailing list
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l ? My
> apologies.
>
>
> Sumana Harihareswara
> Engineering Community Manager
> Wikimedia Foundation
>
>
> On Sun, Dec 16, 2012 at 6:14 AM, wp mirror <wpmirrordev(a)gmail.com> wrote:
>> Dear Sumana,
>>
>> 1) Metalink. Does the Wikimedia Foundation have any plans to support
>> metalink for either its dump files or its image files?
>>
>> Documentation:
>> <http://tools.ietf.org/html/rfc5854>, "The Metalink Download Description
>> Format"
>> <http://tools.ietf.org/html/rfc6249>, "Metalink/HTTP: Mirrors and Hashes"
>>
>> 2) SPDY. Does the Wikimedia Foundation have any plans to support SPDY?
>>
>> Documentation: <http://www.chromium.org/spdy>
>>
>> 3) WP-MIRROR. We last communicated 2012-01-06 in regards to WP-MIRROR.
>>
>> Status: WP-MIRROR 0.5 is `feature complete', and works
>> `out-of-the-box' for the GNU/Linux distributions: Debian 7.0 (wheezy)
>> and Ubuntu 12.10 (quantal).
>>
>> Future: Attention is turning towards performance enhancement and
>> porting to other distributions.
>>
>> Homepage: <http://www.nongnu.org/wp-mirror/>
>>
>> Please give it a try. Feedback is most welcome.
>>
>> Sincerely Yours,
>> Kent
>
I had an email exchange with one of the folks at our mirror sites about
the low volume of traffic they are getting. Clearly we need to
publicize this list better, bearing in mind that files on our mirrors
may be a day behind the live site. I wouldn't think that a day's delay
is very important in the grand scheme of things though.
So I'm looking for suggestions on how best to make the list of mirrors
visible to dumps users/downloaders. This includes changes to [1] and
[2] among other things. Bear in mind that 'best' also implies 'easy to
do' or 'here is a patch' :-D
Ariel
[1]
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu…
(download page for all dumps, showing each dump in order of completion)
[2]
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu…
(download page for a given dump)
Snapshot1, which was running several dumps for 'big' wikis, fell over
due to swapdeath today. While we investigate the issue, those jobs will
be stalled. I'll send an update as soon as we have more info.
Ariel
Hi,
I am trying to extract translations from Wiktionaries in different languages.
Currently I use the "All pages, current versions only" dump. Is there a way
to find out the language template tags (is that the correct term?) for each
Wiktionary and each language?
For example:
This is the Hungarian page 'karcsu' (slim, slender)
http://hu.wiktionary.org/wiki/karcs%C3%BA (the edit page:
http://hu.wiktionary.org/w/index.php?title=karcs%C3%BA&action=edit)
The translation table always (?) starts like this:
{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}
Where {{-ford-}} comes from the word forditas (translation in
Hungarian; I skipped the accents). The translations look like the
third row and (hopefully) contain the other languages' wiki codes
(en, fr, de).
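A rough sketch of how such rows could be parsed; the regex is my own
assumption, tuned only to the hu.wiktionary examples above, and other
Wiktionaries will likely need different patterns:

```python
import re

# Matches {{t|<lang>|<word>}} translation templates as seen on
# hu.wiktionary; other Wiktionaries may use {{t+|...}} or other forms.
T_TEMPLATE = re.compile(r"\{\{t\|([a-z-]+)\|([^}|]+)")

def parse_translation_row(line):
    """Return (language code, [words]) for a row such as
    '*{{en}}: {{t|en|slim}}, {{t|en|slender}}', or None if the line
    carries no translation templates."""
    pairs = T_TEMPLATE.findall(line)
    if not pairs:
        return None
    return pairs[0][0], [word for _lang, word in pairs]
```

Running it on the example row above yields the language code "en" with
the words "slim" and "slender".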
Also, on the page 'slim' in the Hungarian Wiktionary there are some
tags which nobody would understand unless they are Hungarian and have
learned some Hungarian grammar.
http://hu.wiktionary.org/wiki/slim and
http://hu.wiktionary.org/w/index.php?title=slim&action=edit
The first line is:
{{engmell|comp=slimmer|sup=slimmest|pron=/slɪm/|audio=us}}
Where 'engmell' is derived from 'english melleknev', melleknev meaning
adjective in Hungarian. The rest is similarly confusing.
It gets even more confusing if I look at other Wiktionaries. It seems that
there are no standards that all Wiktionaries follow.
Is this meta-information available somewhere?
I hope I managed to explain it clearly and I am asking on the right list.
Thank you in advance,
Judit Acs
Hello,
I am new to this list and have a question about importing XML dumps from
Wikipedia (http://dumps.wikimedia.org/enwiki/20121101/) into an offline
MediaWiki database. I have locally installed XAMPP on Windows 8 and replaced
the included 32-bit MySQL version with the latest 64-bit version. I then
installed MediaWiki 1.20.0 with an empty database.
When trying to import an XML dump (the 2012-11-01 dump) with
importDump.php in the maintenance folder of the MediaWiki
installation, I get the following error after about 2 seconds:
"WikiRevision given a null title in import. You may need to adjust
$wgLegalTitleChars." which is thrown at line 1032 in Import.php, because
some $title seems to be null. Replacing the exception with "$this->title =
null" (evil ^^) leads to other errors.
xml2sql and mwdumper seem to be outdated as I cannot get them working with
the current dumps. Special:Import is not an option due to the size of the
XML files.
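For reference, one way to narrow this down before rerunning
importDump.php is to scan the dump for pages whose <title> element is
missing or empty. A minimal sketch (the schema namespace below matches
1.20-era dumps; adjust it to the xmlns on your file's root element):

```python
import xml.etree.ElementTree as ET

# Namespace for 1.20-era dumps; check the xmlns attribute on the
# <mediawiki> root element of your file and adjust if it differs.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def find_bad_titles(path):
    """Yield the <id> of every page whose <title> is missing or empty,
    so the offending records can be inspected before re-importing."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if not title or not title.strip():
                yield elem.findtext(NS + "id")
            elem.clear()  # keep memory flat on multi-gigabyte dumps
```

iterparse streams the file, so this works even on dumps far too large
for Special:Import or for loading the whole tree into memory.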
Any help would be appreciated :)
P.S. The cause is not the missing + in $wgLegalTitleChars that a
Google search on that error suggests.
Best Regards
Chris
Yay, the network (or NFS) performance issues on your.org seem to have
been straightened out, and last month's full dump is available; this
month's is running now.
Ariel
This dump is failing and due to our MediaWiki config setup on the
production cluster we don't get the exception message so I have no idea
what the problem is. I'll do some live hacks and look at this tomorrow.
Thanks for your patience.
Ariel