I don't know if this issue has come up already - if it has and was
dismissed, I beg your pardon. In case it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can read archives
the other one creates. But when it comes to uncompressing, only
pbzip2-compressed archives are good for pbunzip2.
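
A minimal Python sketch of why such cross-compatibility is plausible. It assumes that pbzip2 compresses fixed-size chunks independently and concatenates the resulting bzip2 streams, which is my understanding of its design and is not stated in the bug report. The names compress_multistream and decompress_streams are hypothetical, and the standard-library bz2 module stands in for the command-line tools.

```python
import bz2


def compress_multistream(data: bytes, chunk_size: int) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate them.

    Each chunk is independent, so in principle the chunks could be
    compressed on separate CPUs (done sequentially here for brevity).
    """
    return b"".join(
        bz2.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    )


def decompress_streams(archive: bytes) -> bytes:
    """Decompress a concatenation of bzip2 streams one stream at a time."""
    parts = []
    while archive:
        decompressor = bz2.BZ2Decompressor()
        parts.append(decompressor.decompress(archive))
        # Bytes after the end of the current stream belong to the next one.
        archive = decompressor.unused_data
    return b"".join(parts)
```

A multi-stream archive built this way is still readable by an ordinary decompressor (bz2.decompress accepts concatenated streams), which is the property point 2 below relies on.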
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Chinese Wikipedia supports a few variants (zh-cn, zh-tw, zh-hk), and the
same wikitext is rendered differently under each variant, e.g. "software"
in zh-cn and "software" in zh-tw.
But it seems that no HTML is included in the zhwiki dump file.
Do you know where I can get the HTML version of articles on Chinese
Is there any way to distinguish between categories like History, or
Literature for example, and what I would think of as categories that are
used for internal housekeeping like "Unprintworthy_redirects" or
"Nonindexed_pages"? They're not hidden categories, but conceptually there
is a clear difference between housekeeping categories and categories that
define fields of knowledge. But is there anything in the tables that
I am wondering if the NOTE on the manual page of the redirect table 
still applies and, if this is the case, how much data is missing and
what would be the best way to incorporate the pagelinks table, since it
seems to lack the information about whether a link is a redirect or not.
Giovanni Luca Ciampaglia
✆ (812) 287-3471
I am currently planning to process the latest French dump. I would like to
ask if somebody has already found or used a good OpenNLP French sentence
detection model. If so, please let me know where to find one.
Thanks in advance,
So I was thinking about things I can't undertake, and one of those
things is the 'dumps 2.0' which has been rolling around in the back of
my mind. The TL;DR version is: sparse compressed archive format that
allows folks to add/subtract changes to it random-access (including
What do folks think? Workable? Nuts? Low priority? Interested?
We've once again been notified that our mirror of the Wikimedia images is "hosting malware". A quick check appears to mostly be more newly uploaded PDFs with one or more exploits in them, but there are also a few other media types that seem to be similarly damaged.
I'm personally okay with ignoring it, since it's not hurting us any, but ideally I'd like to see things like this get removed. Many of the infected PDFs appear to be Arabic-language documents that would be of interest to people critical of their government, so the implications of what's going on here are probably bigger than just random viruses getting added to files.
I'm happy to scan everything again and post a list of things. I'm also willing to automate this if it would help (periodic scans and uploading a list of all questionable images to a wiki page somewhere?) Anyone have any suggestions on what to do here?
Folks will have noticed that the de Wikipedia dumps failed after getting
about 2/3 of the way through the meta history dump step (100GB written).
I'm in the process of setting up for the completion of that job; it will
take a few days and there won't be a progress report visible on the
regular HTML page.
By the way, the cause of the breakage was that the database server at
the other end of the connection went away in the middle of the run, and
no new connection to a server could be obtained, for long enough that
the program ran out of retries and gave up.
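
For illustration only, a bounded retry loop of the kind described might look like the sketch below. This is not the dump scripts' actual code: with_retries, its parameters and the use of ConnectionError are assumptions made for the example.

```python
import time


def with_retries(connect, max_retries=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_retries attempts have failed.

    connect is any callable that raises ConnectionError on failure
    (a hypothetical stand-in for obtaining a database connection).
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            if attempt < max_retries:
                sleep(delay)  # wait before the next attempt
    # Out of retries: give up, as the dump run did.
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

If the server stays away longer than max_retries * delay, the loop gives up, which matches the failure described above.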
In my continued quest to Make Imports Suck Less (tm), I've written a
little Perl script to shovel data from a tab-delimited escaped file to a
fifo in pieces while forking off mysql to LOAD DATA INFILE from the fifo
for each chunk. It's only been tested on Linux, specifically my laptop,
but I did run it using current article content dumps and all the
auxiliary tables for a wiki of a few hundred thousand articles, and it
You can find it in the xmlfileutils directory of my branch of the git
You'll notice that all my tools are Linux + MySQL, and that's because
that is what I use. If folks want similar tools for other platforms
they'll have to write them; I don't have the expertise for that.
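
For anyone considering a port, here is a rough Python sketch of the chunk-through-a-fifo idea described above. It is not the Perl script itself: iter_chunks and feed_in_chunks are hypothetical names, and the consume callback is an assumed stand-in for forking mysql to run LOAD DATA INFILE against the fifo. Like the original, it is POSIX only (os.mkfifo).

```python
import os
import tempfile
import threading
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def feed_in_chunks(lines, chunk_size, consume):
    """Write each chunk to a fifo while consume(fifo_path) reads it.

    consume is a hypothetical stand-in for launching mysql with
    LOAD DATA INFILE pointed at the fifo; one consumer runs per chunk.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "chunk.fifo")
    os.mkfifo(fifo_path)
    try:
        for chunk in iter_chunks(lines, chunk_size):
            reader = threading.Thread(target=consume, args=(fifo_path,))
            reader.start()
            # open() blocks until the reader has opened the other end.
            with open(fifo_path, "w", encoding="utf-8") as fifo:
                fifo.writelines(chunk)
            reader.join()  # wait for this chunk to be loaded
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
```

Loading in pieces keeps each load bounded in size, and the fifo avoids writing an intermediate file per chunk to disk.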
Ah also the docs on Meta about dumps have been reorganized and
rewritten, not that they are either error-free or complete but they
should be in much better shape now:
And lastly, the uncompressed en wp meta history dumps are now over 10T.
As always, feedback, edits, patches welcome.
P.S. Sorry Platonides but if you were going to rework a script of yours
you were too slow ;-) (However if you have such a script with
different/better features I'll still take it.)