Hi,
I don't know whether this issue has come up already. If it has and was
dismissed, I beg your pardon. If it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can read the
archives the other one creates. But when it comes to uncompressing, only
pbzip2-compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
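The compatibility claims above can be illustrated with a small Python sketch. It assumes, as is commonly understood of pbzip2, that the archive is a series of independently compressed blocks written as concatenated bzip2 streams. The function names and the chunk size are illustrative only and are not taken from pbzip2 itself.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_chunks(data: bytes, chunk_size: int = 900_000) -> List[bytes]:
    """Compress fixed-size chunks independently; each result is a complete bzip2 stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(bz2.compress, chunks))


def decompress_streams(streams: List[bytes]) -> bytes:
    """Decompress independent streams in parallel and join them in order."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(bz2.decompress, streams))


data = b"<page>example</page>\n" * 100_000
streams = compress_chunks(data)
archive = b"".join(streams)  # concatenated streams form one valid .bz2 file

# A serial decompressor reads the multi-stream archive as usual.
assert bz2.decompress(archive) == data
# A parallel decompressor can work on each stream separately.
assert decompress_streams(streams) == data
```

An archive written as one single stream offers no such stream boundaries to split the work on, which may be why only multi-stream archives benefit from parallel decompression.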
So to sum up: It's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
Dear Asaf,
0) Use cases
Use cases for this project are simple (motivation stated in parentheses):
a) Originally I just wanted a mirror of the `enwiki' on my desktop
that I could browse when my Internet service provider went down
(availability);
b) Later I wanted a mirror of the `simplewiki' and `simplewiktionary'
on my laptop so I could move about (mobility);
c) Then came unhappy disclosures about domestic surveillance which
make it prudent to browse offline (privacy);
d) Still later I wanted mirrors of other projects, such as
`enwikisource' and `enwikiversity', because I like reading books offline,
usually keeping them open for days (availability); and
e) Now I want to generate ZIM files and all other dump files from
these mirrors to create a `WMF in a microcosm.' This is for use by the
offline community, and for archiving, experimenting, etc. For example:
o You can have a desktop where the `enwikinews' mirror updates
daily and generates a ZIM file daily. This ZIM file can be synced to your
handheld device at your convenience. (availability, mobility);
o You can periodically generate and archive an image thumbs
tarball (durability); and
o You can dump your mirror, conduct experiments that may trash
your database, and then rebuild your mirror (durability).
The recent release WP-MIRROR 0.7.4 delivers all but use case (e).
1) Road map
Dump file generating capability is planned for the next version series,
WP-MIRROR 0.8.x, which will be packaged for Debian 8 (jessie) and Ubuntu
14.04 (trusty).
Sincerely Yours,
Kent
On Sun, Nov 30, 2014 at 12:02 AM, Asaf Bartov <asaf.bartov(a)gmail.com> wrote:
> Thanks! (and \o/ LISP!)
>
> Could you tell us a little about the use case that drove you to develop
> this?
>
> A.
>
> On Sat, Nov 29, 2014 at 6:42 PM, wp mirror <wpmirrordev(a)gmail.com> wrote:
Dear list members,
WP-MIRROR 0.7.4 is now available.
0) Features
Configuration of MediaWiki has been greatly improved.
Incremental XML data dump files are now used.
SSL enabled so that wikis are now protocol independent (may access via
HTTPS).
URL fallback list increases reliability of downloading XML data dump files.
Wiki `talk' pages now installed.
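The URL fallback idea mentioned in the feature list can be sketched in Python as follows. This is only an illustration of the general technique, not WP-MIRROR's actual code. The function names are hypothetical, and the mirror URLs in the usage note are placeholders.

```python
import urllib.request
from typing import Callable, Iterable, Tuple


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return its body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> Tuple[str, bytes]:
    """Try each URL in order; return the first URL that works and its content."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            errors.append("%s: %s" % (url, exc))
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A caller would pass an ordered list such as ["https://mirror-a.example/dump.xml.bz2", "https://mirror-b.example/dump.xml.bz2"] (placeholder addresses) and get back the first mirror that answered.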
1) Updates
Dependencies have been brought up to date:
MediaWiki updated to 1.24.22.
MediaWiki extensions updated to 1.24.22.
XML Data Dump Schema updated to 0.10.
2) MediaWiki extensions
Many extensions have been added for use with various WMF projects:
Wikinews: DynamicPageList;
Wikipedia: CommonsMetadata, JsonConfig, Mantle, MultimediaViewer,
PageImages, Popups;
Wikisource: DoubleWiki, Proofreadpage, RandomRootPage;
Wikiversity: Quiz; and
Wikivoyage: CustomData, GeoCrumbs, MapSources.
3) Home pages updated
<https://www.mediawiki.org/wiki/Wp-mirror>
<http://www.nongnu.org/wp-mirror/>
4) Thanks
I would especially like to thank the following contributors:
Luiz Augusto for submitting bug reports, and for requesting features of
importance to the wikisource and wikiversity projects.
Guy Catagnoli for reading the WP-MIRROR code and submitting many comments;
for submitting bug reports with log files containing valuable debug info,
and for feature requests.
Sincerely Yours,
Kent
Dear list members,
1) XML Data Dump Schema
On Wed 2014-Nov-05, dump files using XML Data Dump Schema 0.10 will appear
on WMF sites.
Proposal: <https://lists.wikimedia.org/pipermail/wikitech-l/2014-October/079163.html>
Patch: <https://gerrit.wikimedia.org/r/#/c/168583/>
Announcement: <https://bugzilla.wikimedia.org/show_bug.cgi?id=66663>, Comment 11.
2) XML Dump Utilities
`mwxml2sql' is a utility for converting XML dumps into SQL files for
`page', `revision' and `text' tables. For `mwxml2sql' this is a breaking
change.
WP-MIRROR 0.7.3 depends on `mwxml2sql', and so, for WP-MIRROR 0.7.3, this
is a breaking change.
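A tool that consumes dumps could guard against this kind of change by reading the schema version before converting anything. The Python sketch below assumes the root `<mediawiki>` element carries a `version` attribute, as MediaWiki XML exports do. The function name is hypothetical, and this is not how `mwxml2sql' itself is implemented.

```python
import io
import xml.etree.ElementTree as ET


def dump_schema_version(stream) -> str:
    """Return the 'version' attribute of the root element of an XML dump."""
    # Only the first 'start' event (the root element) is needed, so the
    # rest of a possibly very large dump is never parsed.
    for _event, elem in ET.iterparse(stream, events=("start",)):
        return elem.get("version", "")
    return ""


sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" '
    b'version="0.10"><siteinfo/></mediawiki>'
)
print(dump_schema_version(io.BytesIO(sample)))  # prints 0.10
```

A converter could compare the returned string against the versions it knows how to handle and stop with a clear message instead of producing wrong SQL.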
3) Patches
Soon after XML dumps using schema 0.10 appear, I shall submit a patch for
`mwxml2sql'.
WP-MIRROR 0.7.4 will be released immediately thereafter.
Sincerely Yours,
Kent