I don't know if this issue has come up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results done a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible: each
one can create archives the other can read. But when it comes to
decompressing, only pbzip2-compressed archives are a good match for
pbunzip2.
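Roughly, the kind of round trip the tests covered (a sketch only; the file
name is a placeholder):

  # compress the same file with both tools
  bzip2  -c sample.xml > sample-bzip2.xml.bz2
  pbzip2 -c sample.xml > sample-pbzip2.xml.bz2

  # classic bunzip2 happily reads both archives
  bunzip2 -tv sample-bzip2.xml.bz2
  bunzip2 -tv sample-pbzip2.xml.bz2

  # pbunzip2 also reads both, but only the pbzip2-made archive really
  # suits it (see the bug report above)
  pbunzip2 -c sample-bzip2.xml.bz2  > /dev/null
  pbunzip2 -c sample-pbzip2.xml.bz2 > /dev/null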
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the system.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com - Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek, Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
A discussion on the toolserver list brought up the question, once again,
of what would be needed to fork the projects. Because some data is
private we aren't going to be able to provide data for perfect copies,
but the content can be preserved. The question is how close we can get.
In particular I would like folks to think about how we can manage the
user account issue.
It would be very nice indeed if users could reclaim their accounts on a
copy of the project, and yet we cannot provide any outside project a
copy of the user table (which has email addresses and other useful bits
in it). And many many users don't give an email address anyways.
I'd like to hear proposals for how this could be handled. Wouldn't it
be awesome if this could be done today, and Wikipedia editors could have
editing privileges on copies of the project around the globe that
provided different experimental features? Assuming of course that there
were groups or organizations that wanted to run such copies of the projects.
I am planning to decompress an XML-formatted bzip2 file by downloading the
file using the Java URL class. Then I plan to decompress it on the fly using
Apache Ant, and then parse it and store it in a MySQL database. I am not sure
if there are better ways to do that. Also, is there a way to update my
database without having to go through the whole process every time?
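For what it's worth, here is a rough shell sketch of one existing route to
the same end; the URL, file names, database name and credentials are all
placeholders, and mwdumper is just one example of an importer that turns
the XML dump into SQL:

  # one-time, resumable download of the dump
  wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  # mwdumper reads the .bz2 directly, converts pages to SQL and pipes it
  # into MySQL (the target database must already have the MediaWiki tables)
  java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
    | mysql -u wikiuser -p wikidb

That still reprocesses the whole dump each time, though, so it doesn't by
itself answer the incremental-update question.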
On Sun, Sep 9, 2012 at 6:34 PM, Roberto Flores <f.roberto.isc(a)gmail.com> wrote:
> I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for
> the iPhone, which does a somewhat decent job at interpreting the wiki
> markup into HTML.
> However, there are too many templates for me to program (not to mention,
> it's a moving target).
> Without converting these templates, many articles are simply unreadable and
Templates are dumped just like all other pages are. Have you found
them in the dumps? Which dump are you looking at right now?
> Could you please provide HTML dumps (I mean, with the templates
> pre-processed into HTML, everything else the same as now) every 3 or 4
> months?
A 3- or 4-month frequency seems unlikely to be useful to many people.
Otherwise no comment.
> Or alternatively, could you make the template API available so I could
> import it in my program?
How would this template API function? What does import mean?
Can anybody with access execute these 3 SQL queries and provide me with the
results?
-- 1
SELECT user_id, user_name, user_registration
FROM user
INNER JOIN logging ON log_user = user_id
WHERE LEFT(user_registration, 4) = 2012
  AND user_id NOT IN (SELECT ipb_user FROM ipblocks)
  AND log_type = 'newusers'
  AND log_action = 'create';

-- 2
SELECT page_id, page_title, page_namespace, page_is_redirect
FROM page;

-- 3
INSERT INTO u_hoo.dbq189
SELECT user_name
FROM user
INNER JOIN logging ON log_user = user_id
WHERE LEFT(user_registration, 4) = 2012
  AND user_id NOT IN (SELECT ipb_user FROM ipblocks)
  AND log_type = 'newusers'
  AND log_action = 'create';

SELECT rev_id, rev_page, rev_comment, rev_user_text, rev_user, rev_timestamp
FROM revision
INNER JOIN u_hoo.dbq189 ON rev_user_text = dbq189.user_name
WHERE rev_deleted = 0
  AND rev_user != 0;
I need it for a study by a friend of mine; I really appreciate your help.
I also made a ticket, but nobody has reacted so far and it's a bit urgent:
Hi dump producers,
I know there's more to the choice of compression format than the size of
the resulting dumps (e.g. time, memory, portability, existing code
investment), and I read that you looked at LZMA and found it to be of
insignificant benefit [1], but I noticed over at the Large Text
Compression Benchmark site that they use 7-zip in PPMd mode, so I did
some experiments myself.
The bzip2 dumps use 900k blocks, and according to the bzip2.org
implementation's manual that takes around 7600k of memory while compressing
and around 3700k while decompressing. Like LZMA, PPMd apparently uses the
same amount of memory for decompression as it used during compression,
so I recompressed the XML dump with various amounts of memory so you can
make your own comparisons.
Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
for MEM in 3700k 7600k 16m 512m; do
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
      | 7z a -si -m0=PPMd:mem=$MEM \
        enwiki-20120802-pages-articles.xml.PPMd-$MEM.7z
done
bzcat enwiki-20120802-pages-articles.xml.bz2 \
  | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
for the following resulting file sizes in bytes (% of .bz2 version):
original bz2: 9143865996
$MEM=3700k : 8648303296 (94.6%)
$MEM=7600k : 8043626528 (88.0%)
$MEM=16m : 7910637814 (86.5%) (the default for both PPMd & LZMA)
LZMA: 7705327210 (84.3%)
$MEM=512m : 7076755355 (77.4%)
I wasn't looking to compare running times, and absolute values wouldn't
compare to your servers anyway, but for what it's worth I noticed that LZMA
took over twice as long as any PPMd run. I was expecting PPMd to beat LZMA,
hence the several PPMd runs.
There's probably some value in experimenting with PPMd's "model order"
too, which I didn't try. Google "model order for PPMd" or see
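If anyone wants to try it, 7zip's PPMd codec takes the model order as an
"o" parameter alongside "mem"; for example (the output file name is just
illustrative):

  bzcat enwiki-20120802-pages-articles.xml.bz2 \
    | 7z a -si -m0=PPMd:mem=512m:o=16 \
      enwiki-20120802-pages-articles.xml.PPMd-512m-o16.7z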
As the dump servers only have to do it once to save that bandwidth for
every download from every mirror that month, perhaps it's worth giving
7zip more memory than bzip or even more than the default, although I
appreciate that you drive some users out of the market if compression
memory requirements equal decompression requirements and you start using
a few gig to compress. Also while you can (with a little effort) seek
around bz2s and extract individual blocks, PPMd's seekability isn't
something I've explored.
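For example, bzip2recover from the bzip2 package will split an archive into
one small .bz2 per 900k block, each of which decompresses on its own (shown
here on a placeholder file rather than the full dump):

  # writes one rec*-prefixed .bz2 file per block of the input
  bzip2recover sample.xml.bz2
  bzcat rec*sample.xml.bz2 | head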
Just some thoughts.
: "7-Zip's LZMA compression produces significantly smaller files for
the full-history dumps, but doesn't do better than bzip2 for our other
files." --- http://meta.wikimedia.org/wiki/Data_dumps