Hi,
I don't know whether this issue has come up before - in case it did and
was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other one can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
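For what it's worth, a minimal sketch of what that workflow could look
like (the file name and thread count are just placeholders):

  # compress using 8 worker threads
  pbzip2 -p8 enwiki-pages-meta-history.xml
  # the result is still a valid .bz2, so plain bunzip2 can verify it
  bunzip2 -tv enwiki-pages-meta-history.xml.bz2
  # and pbzip2 -d (pbunzip2) decompresses the same archive in parallel
  pbzip2 -d -p8 enwiki-pages-meta-history.xml.bz2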
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).
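For concreteness, a hedged sketch of those invocations (file names are
placeholders):

  # xz preset 3 (~4 MB dictionary), streaming to stdout
  xz -3 -c pages-meta-history.xml > pages-meta-history.xml.xz
  # or roughly the same settings via 7-Zip
  7za a -mx=3 pages-meta-history.xml.7z pages-meta-history.xml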
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
On Tue, Jan 21, 2014 at 2:19 PM, Randall Farmer <randall(a)wawd.com> wrote:
> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Re: trying to get long-range matching into LZMA, first, I
> couldn't confidently hack on liblzma. Second, Igor might
> not want to do anything as niche-specific as this (but who
> knows!). Third, even with a faster matching strategy, the
> LZMA *format* seems to require some intricate stuff (range
> coding) that may be a blocker to getting the ideal speeds
> (honestly not sure).
>
> In any case, I left a note on the 7-Zip boards as folks have
> suggested: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
>
>
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the
same avg compression ratio as 7zip. Can anyone help me test more or
experimentally deploy?
As I understand, compressing full-history dumps for English Wikipedia and
other big wikis takes a lot of resources: enwiki history is about 10TB
unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
over a day of server time. There's been talk about ways to speed that up in
the past.[1]
It turns out that for history dumps in particular, you can compress many
times faster if you do a first pass that just trims the long chunks of text
that didn't change between revisions. A program called rzip[2] does this
(and rzip's _very_ cool, but fatally for us it can't stream input or
output). The general approach is sometimes called Bentley-McIlroy
compression.[3]
So I wrote something I'm calling histzip.[4] It compresses long repeated
sections using a history buffer of a few MB. If you pipe history XML
through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're
talking an hour or three to pack enwiki on a big box. While it compresses,
it also self-tests by unpacking its output and comparing checksums against
the original. I've done a couple test runs on last month's fullhist dumps
without checksum errors or crashes. Last full run I did, the whole dump
compressed to about 1% smaller than 7zip's output; the exact ratios varied
file to file (I think it's relatively better at pages with many revisions)
but were +/- 10% of 7zip's in general.
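Concretely, the first pass is just a filter in a pipe; something like
the following (the file name is a placeholder; see the histzip README
for the exact invocation, including decompression):

  # long-range pass with histzip, then ordinary bzip2 for the final squeeze
  histzip < pages-meta-history.xml | bzip2 > pages-meta-history.xml.hz.bz2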
Also, less exciting, but histzip is also a reasonably cheap way to make the
daily incremental dumps about 30% smaller.
Technical data dump aside: *How could I get this more thoroughly tested,
then maybe added to the dump process, perhaps with an eye to eventually
replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk
to to get started? (I'd dealt with Ariel Glenn before, but haven't seen
activity from Ariel lately, and in any case maybe playing with a new tool
falls under Labs or some other heading than dumps devops.) Am I nuts to be
even asking about this? Are there things that would definitely need to
change for integration to be possible? Basically, I'm trying to get this
from a tech demo to something with real-world utility.
Best,
Randall
[1] Some past discussion/experiments are captured at
http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at
https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb…
[2] http://rzip.samba.org/
[3]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&t…
[4] https://github.com/twotwotwo/histzip
Hi,
I am trying to set up a local Wikipedia mirror. I have been reading up on how to import XML dumps and install extensions manually, but I find it hard to identify and install all of the required extensions properly. I have been testing this with the Simple Wikipedia.
What is the best and easiest way to install a local mirror?
Best,
Bastian
Dear Ariel,
Happy New Year. I am gearing up for wp-mirror-0.7. To that end, I would
like to list some issues that I see; and I would like to offer my help in
solving them.
0) Problem Statements
0.1) Page Rendering. Wp-mirror-0.6 works well in the sense that it builds
a faithful mirror of any of your wikis. However, during 2013 the rendering
of pages eroded materially. For example,
o interlanguage links have vanished both from rendered pages and from
dump files;
o infoboxes are no longer rendered;
o most transclusions now render as redlinks even though the templates
are easily found in the underlying database; etc.
I understand that this erosion occurred because wp-mirror-0.6 still uses
mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23. For example, I
understand that:
o interlanguage links have been moved to the Wikidata project, the
rendering of which requires mediawiki-1.21+;
o infoboxes now require the scribunto extension which requires
mediawiki-1.20+
0.2) Database Schema. Some differences in database schema have appeared.
o category - dump files now have 5 fields, whereas the database schema
has 6 fields;
o externallinks - dump files now have 4 fields, whereas the database
schema has 3 fields.
Loading either of these two tables generates the error message: ``Column
count doesn't match value count at row 1.''
0.3) Version Lifecycle. According to <
http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is
slated for May 2014. However, the Debian packaging team is silent as to
their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.
0.4) Image Dumps. The large image dump tarballs are now a year old. This
means that, while wp-mirror still downloads the bulk of its images from
these tarballs, there are a growing number that must be downloaded
individually from WMF.
0.5) Thumbs. One person has asked me if dump files of thumbs could be made
available. We are beginning to see thumb dumps from the xowa project.
0.6) IPv6. I am glad to see that <gerrit.wikimedia.org> has an IPv6
address. However, <bastion.wmflabs.org> still does not. My internal
network is IPv6 only.
1) mwxml2sql
This utility from Ariel Glenn has proved invaluable to the wp-mirror
project. Together with MySQL 5.5 fast index creation, it allows
wp-mirror to build mirrors much faster than before (80% less time).
1.1) Need for update. According to its version information, mwxml2sql may
only be valid through mediawiki-1.21.
(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.
Since I am looking forward to mediawiki-1.23 LTS (see below), I would
like to know whether mwxml2sql should be updated.
1.2) Help Offer. If mwxml2sql does need updating, I would be happy to help
with this; and to package it for Debian as I have done before. Perhaps we
could call it mwxml2sql-0.0.3.
2) mediawiki-1.23 LTS.
2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that
serves pages that look no different than those served by WMF.
2.2) DEB package. To that end, I am thinking of packaging mediawiki-1.23
together with the extensions needed for rendering WMF wikis with wikidata
content, infoboxes, math, transclusions, etc. Given WMF's ``continuous
integration'' development model, I would like to be able to automatically
generate a tarball and DEB package each time WMF pushes an update to its
servers.
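Roughly what I have in mind is a script along these lines (purely a
sketch on my part; the clone URLs, branch name, and extension list are
assumptions that I still need to verify against gerrit):

  # fetch MediaWiki core plus one example extension needed for rendering
  git clone https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki-1.23
  (cd mediawiki-1.23 && git checkout REL1_23)   # branch name is an assumption
  git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Scribunto.git \
      mediawiki-1.23/extensions/Scribunto
  # roll a snapshot tarball that the DEB packaging can start from
  tar czf mediawiki-1.23-snapshot.tar.gz mediawiki-1.23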
2.3) Debian package repository. Such a DEB package would be distributed
with wp-mirror. In preparation for this, I have set up a Debian package
repository at <http://download.savannah.gnu.org/releases/wp-mirror/>. It
is currently used to distribute wp-mirror-0.6 and an unstable version of
wp-mirror-0.7. Home page <http://www.nongnu.org/wp-mirror/>.
2.4) Help Offer. I am happy to do most of this work myself. However, I
will need some guidance on interacting with the appropriate GIT
repositories. I hope that you can put me in touch with someone involved in
the ``continuous integration'' process.
3) Media dumps
I am thinking that updating the image dumps annually would be adequate.
Including thumbs in those dumps would materially assist the off-line
community. I could easily update wp-mirror-0.7 to give the user a choice
(no media files, thumbs only, full size media files).
Sincerely Yours,
Kent