Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose, that pbzip2 (https://launchpad.net/pbzip2) is used
to compress the xml dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug bunzip2 hasn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results done even some few hours before that.
The results indicate that:
bzip2 and pbzip2 are vice-versa compatible each one can create
archives, the other one can read. But if it is for uncomressing, only
pbzip2 compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems this could lead to a
better usage of system ressources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: It's a no loose and two win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Hi all,
I am currently planning to process the last french dump. I would like to
ask if somebody has already found or used a good OpenNLP french sentence
detection model. If yes please let me know where to find one.
Thanks in advance,
Best regards,
Benoit.
As you may have heard, we're going to be switching to the Ashburn
datacenter as the primary data center this week. The first window is
due to start in a few hours. Databases may be read only, things may
die.
I am not going to do anything to the dumps during the transition except
shoot them if needed. Once things have settled down, by the end of the
week, if we have had issues I'll be able to address them in an orderly
fashion, but not while craziness is going on.
Hold on tight, the ride starts soon :-)
Ariel
Hi, everyone.
I am doing a research project about wikipedia searching. I downloaded wiki
dumps from this page: http://dumps.wikimedia.org/enwiki/20121001/
But I have a quick question about a symbol in the dumps. I am wondering the
meaning of " ''' " in all wiki pages. For example, '''Port St. Lucie''' is
a city in St. Lucie county, Florida. I thought the phrase between ''' is
the title of the wiki page. But I saw, in the same page, several other
phrases, like "city council" and "city Manager" are also quoted by '''. So
could you help on this?
Thanks in advance.
Chong Wang
I'd like to mirror just the category structure of the English Wikipedia, and
I'm wondering which of the dump files I need to start with.
I don't need the page content, just the page names, and only for the most
current revision. I need the categories and category members, and I'd like
to exclude hidden categories. I also need to distinguish redirects, because
I don't want to treat them as separate pages. As much as possible I'd like
to work with SQL files, but I can crunch through XML if necessary.
So which files do I need to download? I may also need some help in
understanding the schemas.
Thanks,
Robert
Dear Ariel,
0) WP-MIRROR
WP-MIRROR 0.6 now works with dumps from your.org. I am turning my
attention to the other mirror sites.
1) LATEST
I read with interest the thread about `latest' directories that began
with <http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000610.html>.
I have some additional questions.
The mirror sites at C3SL and Masaryk Univ. do not have a `latest'
directory in the project directories that I looked at. Compare for
example:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/ | tail -n 2
drwxr-xr-x 242 2013/01/04 07:52:13 20130102
drwxr-xr-x 1101 2013/01/03 18:48:34 latest
(shell)$ rsync wikipedia.c3sl.ufpr.br::wikipedia/enwiki/ | tail -n 2
drwxr-xr-x 61440 2012/11/10 10:47:05 20121101
drwxr-xr-x 61440 2012/12/10 09:21:34 20121201
WP-MIRROR looks for the `latest' directory on the assumption that any
links found there point to complete files (i.e. no partials). Whereas
files found in dated directories may be partials. For example, the
most recent `imagelinks':
This file is complete:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20121201/
| grep imagelinks
-rw-r--r-- 356437362 2012/12/01 07:08:54 enwiki-20121201-imagelinks.sql.gz
This file is a partial:
(shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20130102/
| grep imagelinks
-rw-r--r-- 20 2013/01/02 07:47:35 enwiki-20130102-imagelinks.sql.gz
The `latest' link points to the complete file:
(shell)$ rsync -a
dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/ | grep image
lrwxrwxrwx 40 2013/01/02 03:52:49 enwiki-latest-image.sql.gz
-> ../20130102/enwiki-20130102-image.sql.gz
So I am wondering what algorythm I should use if I want WP-MIRROR to
pull dump files from C3SL and Masaryk U. Can you help with the
following questions?
2) C3SL
In the absence of a `latest' directory, can I be sure that all the
files found there are complete files (i.e. not partials)? Is the
mirroring process atomic?
3) Masaryk Univ.
Several issues: a) No `latest' directories; b) no `enwiki'; and c)
most recent dumps date from November:
(shell)$ rsync ftp.fi.muni.cz::pub/wikimedia/zuwiki/ | tail -n 2
drwxr-xr-x 4096 2012/10/23 14:04:02 20121023
drwxr-xr-x 4096 2012/11/05 15:02:33 20121105
Will they be catching up?
Sincerely Yours,
Kent
All,
I've been struggling to track this for a few hours. This file is a SQL dump, the headers says itf UTF-8.
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
00000000 64 69 61 3a 43 6f f6 72 64 69 6e 61 74 69 65 20 |dia:Co.rdinatie |
00000010 65 78 74 65 72 6e 65 20 70 75 62 6c 69 63 69 74 |externe publicit|
00000020 65 69 74 2f 69 6e 74 65 72 6e 61 74 69 6f 6e 61 |eit/internationa|
00000030 61 6c |al|
00000032
There might be other occurencies, but one is enough to make my import scripts crash, so... you guys are warned.
--
K.
Well it's time for more alpha code, and I'm a bit behind on my mail from
the weekend, so if there is stuff I should be replying to, that will
happen tomorrow. As of MW 1.19 we ise interwiki.cdb on the projects,
instead of the sql table. This makes life harder for folks setting up
their own copies. So here's some docs and a tool, not vetted by anyone
at all yet:
http://www.mediawiki.org/wiki/Interwiki_cache#Setting_this_up_on_your_own_w…
Please follow the link to the 'cheap(er) way' if you are willing to be a
guineau pig. Also if you see errors or know something that was left
out, feel free to edit. Hey, it's a wiki!
Thanks,
Ariel
Don't start downloading them yet even if you see them showing up; some
of them may be corrupt. I'll send an update when they are ready to go,
there was a hardware issue at our hoster's end. Thanks.
Ariel
Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page.
I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions, obtains image files from
<http://upload.wikimedia.org/>. SPDY would reduce latency. WP-MIRROR
0.6 (not yet released) uses HTTP/1.1 persistent connections.
WP-MIRROR 0.6 has built-in profiling, and the image downloading
process now uses 64% less (wall clock) time. Therefore SPDY may not
provide much advantage. Thanks also for informing me of the image
tarballs.
Conclusion: I will not pursue SPDY, for lack of a requirement.
Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior version, had to deal with thousands of corrupt
image files. Most of these were partial downloads. cURL would
time-out and leave corrupt files. I currently deal with that by
validating the images. Validation, however, consumes a lot of time.
So I am looking for ways to improve the reliability of downloading.
Metalink was brought to my attention by Jason Skomorowski. Relevant
documents are RFC 5854, RFC 6249. From the later we have:
"This document describes a mechanism by which the benefit of mirrors
can be automatically and more effectively realized. All the
information about a download, including mirrors, cryptographic
hashes, digital signatures, and more can be transferred in
coordinated HTTP header fields, hereafter referred to as a
"Metalink". This Metalink transfers the knowledge of the download
server (and mirror database) to the client. Clients can fall back to
other mirrors if the current one has an issue. With this knowledge,
the client is enabled to work its way to a successful download even
under adverse circumstances. All this can be done without
complicated user interaction, and the download can be much more
reliable and efficient. In contrast, a traditional HTTP redirect to
a mirror conveys only minimal information -- one link to one server
-- and there is no provision in the HTTP protocol to handle failures.
Furthermore, in order to provide better load distribution across
servers and potentially faster downloads to users, Metalink/HTTP
facilitates multi-source downloads, where portions of a file are
downloaded from multiple mirrors (and, optionally, Peer-to-Peer)
simultaneously.
Upon connection to a Metalink/HTTP server, a client will receive
information about other sources of the same resource and a
cryptographic hash of the whole resource. The client will then be
able to request chunks of the file from the various sources,
scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, that
would obviate the file validation process.
The benefit to folks on this e-main list are: a) Your mirror sites
would get more traffic (Ariel mentioned that they are getting very
little); b) the download process (for metalink capable clients) would
be robust against the outage of any one mirror; and c) metalink
capable clients are now common (cURL, kget, ...).
I understand that the idea for metalink originated in those who posted
GNU/Linux distributions in .iso format. With each new .iso release,
there would be a surge of downloading, causing many partial downloads
(i.e. much wasted bandwidth). Metalink helped spread the load; and,
by transporting hashes, improved download integrity.
Conclusion: I will table the issue of metalink, for lack of an
immediate requirement.
Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball
mirror sites as a configurable parameter.
3) RSYNC
Thanks for letting me know that dumps and tarballs are available using
rsync. I much prefer rsync over http and ftp. I mirror the Debian
archive, and recently switched from apt-mirror which uses wget, to
ftpsync which uses rsync; and am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
Ariel raised some other points with I shall address in a separate email.
Sincerely Yours,
Kent