Hi,
I don't know if this issue has come up already - if it has and was
dismissed, I beg your pardon. If it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are compatible in both directions: each one can create
archives that the other one can read. But when it comes to decompressing
with pbunzip2, only pbzip2-compressed archives work well.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
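To illustrate why point 2 holds: a bzip2 file may consist of several
streams written back to back, and standard decompressors read them in
sequence. The sketch below is not pbzip2's code. It only assumes that
splitting the input into chunks, compressing each chunk as an
independent stream in parallel, and concatenating the results is a fair
approximation of the idea. The function name, chunk size and worker
count are made up for illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate the streams.

    Illustrative only: chunk_size and workers are arbitrary assumptions.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    original = b"<page><title>Example</title></page>\n" * 100_000
    archive = compress_in_chunks(original)
    # The ordinary single-threaded decompressor handles the multi-stream file.
    assert bz2.decompress(archive) == original
```

Because every stream is independent, a decompressor that knows where
the streams start can also hand them to separate CPUs, which is where
the speedup in point 3 would come from.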
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Hi all,
Has anybody already tried to compute the Hindi Wikipedia dump? If so, or
if somebody intends to do it, I am ready to collaborate.
All the best,
Benoit.
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).
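For anyone scripting this rather than calling the command-line tools,
here is a minimal sketch using Python's standard lzma module, whose
preset argument takes the same 0-9 levels as xz. The helper name is
made up, and no timings are claimed; speed and ratio depend on the
input and the machine.

```python
import lzma


def compress_xz(data: bytes, preset: int = 3) -> bytes:
    """Compress data into an .xz container at the given preset (default 3)."""
    return lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)


if __name__ == "__main__":
    sample = b"<revision><text>Some wiki text.</text></revision>\n" * 50_000
    fast = compress_xz(sample)               # preset 3, as suggested above
    default = compress_xz(sample, preset=6)  # the xz default level
    print(len(sample), len(fast), len(default))
    # Both outputs decompress to the same bytes.
    assert lzma.decompress(fast) == sample
    assert lzma.decompress(default) == sample
```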
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
Dear Ariel,
Thank you for your guidance. I pushed another change to gerrit for
review that should address the issue of the new `page_links_updated'
field.
Sincerely Yours,
Kent
On 2/7/14, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> Last reply: I double-checked the content/format model stuff, and the
> only nagging question I have remaining is how well it works with
> non-text handlers. But that would not be a new issue, and the code for
> the base case is certainly correct. So I think we are down to just the
> page_links_updated variable for > 1.22 and that would do it.
>
> Ariel
>
> On 05-02-2014 (Wed) at 02:43 -0500, wp mirror wrote:
>> Dear Ariel,
>>
>> I have been reading your code for `mwxml2sql-0.0.2' with a view
>> towards updating it for mediawiki-1.23 LTS.
>>
>> 0) Support status
>>
>> Currently, the version info for `mwxml2sql' states the following:
>>
>> (shell)$ mwxml2sql --version
>> mwxml2sql 0.0.2
>> Supported input schema versions: 0.4 through 0.8.
>> Supported output MediaWiki versions: 1.5 through 1.21.
>>
>> 1) Current input schema version
>>
>> Currently, your XML dump files have the following header:
>>
>> (shell)$ head -n 1 zuwiki-20140121-pages-articles.xml
>> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/
>> http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8"
>> xml:lang="zu">
>>
>> From this I gather that XML schema is still 0.8, and that `mwxml2sql'
>> needs no update on that head.
>>
>> 2) Current output MediaWiki version
>>
>> I reviewed the database schema for the `page', `revision', and `text'
>> tables:
>>
>> <https://www.mediawiki.org/wiki/Manual:Page_table>,
>> <https://www.mediawiki.org/wiki/Manual:Revision_table>, and
>> <https://www.mediawiki.org/wiki/Manual:Text_table>
>>
>> It appears that the most recent changes to the schema for these three
>> tables occurred for mediawiki versions 1.21, 1.21, and 1.19,
>> respectively.
>>
>> From this I gather that the database schema used for mediawiki 1.23
>> LTS is the same as that used for mediawiki 1.21; and therefore
>> `mwxml2sql' needs no update on that head.
>>
>> 3) Recommended updates
>>
>> From a review of your code, I concluded that two minor changes would be
>> useful.
>>
>> 3.1) mwxml2sql.c
>>
>> The following three lines:
>>
>> (shell)$ grep 21 mwxml2sql.c
>> fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.21.\n\n");
>> /* we know MW 1.5 through MW 1.21 even though there is no MW 1.21 yet */
>> if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 21) {
>>
>> should read
>>
>> fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.23.\n\n");
>> /* we know MW 1.5 through MW 1.23 */
>> if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 23) {
>>
>> 3.2) mwxmlelts.c
>>
>> The following line:
>>
>> (shell)$ grep 21 mwxmlelts.c
>> <generator>MediaWiki 1.21wmf6</generator>
>>
>> should read
>>
>> <generator>MediaWiki 1.23wmf10</generator>
>>
>> 4) Request
>>
>> Please let me know if you agree with the above assessment. If you do,
>> I would be happy to submit the changes to
>> <https://gerrit.wikimedia.org/> for review.
>>
>> Sincerely Yours,
>> Kent
>
>
>
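The two range checks discussed in the quoted message (input schema 0.4
through 0.8, output MediaWiki 1.5 through 1.23) can be sketched
outside the C code as follows. This is an illustration, not a port of
mwxml2sql: all function and constant names are hypothetical, and the
upper bound of 1.23 is the value proposed in section 3.1, not something
taken from a released version.

```python
import re

# Ranges as stated in the thread; (major, minor) tuples compare correctly
# even where a float comparison would not (e.g. 0.10 versus 0.8).
SCHEMA_MIN, SCHEMA_MAX = (0, 4), (0, 8)
MEDIAWIKI_MIN, MEDIAWIKI_MAX = (1, 5), (1, 23)  # 1.23 is the proposed bound

_VERSION_RE = re.compile(r"(\d+)\.(\d+)")
_SCHEMA_ATTR_RE = re.compile(r'<mediawiki\b[^>]*\bversion="(\d+\.\d+)"')


def parse_version(text: str) -> tuple[int, int]:
    """Return (major, minor) from strings such as '1.23' or '1.23wmf10'."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def schema_version_from_header(header: str) -> tuple[int, int]:
    """Extract the version attribute from the opening <mediawiki> tag."""
    match = _SCHEMA_ATTR_RE.search(header)
    if match is None:
        raise ValueError("no <mediawiki ... version=\"x.y\"> tag found")
    return parse_version(match.group(1))


def is_supported_schema(version: tuple[int, int]) -> bool:
    return SCHEMA_MIN <= version <= SCHEMA_MAX


def is_supported_mediawiki(text: str) -> bool:
    return MEDIAWIKI_MIN <= parse_version(text) <= MEDIAWIKI_MAX


if __name__ == "__main__":
    header = ('<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" '
              'version="0.8" xml:lang="zu">')
    print(is_supported_schema(schema_version_from_header(header)))
    print(is_supported_mediawiki("1.23"), is_supported_mediawiki("1.24"))
```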
Dear list members,
I am pleased to announce the release of WP-MIRROR 0.7.
0) SUMMARY
The main design objective was this: RENDERING.
A page rendered by the mirror is now very similar to the same page
rendered by the WMF server. Indeed, not only do pages look almost the
same, they now *behave* almost the same (e.g. editing, searching,
user account creation, beta features).
Most of this improvement was accomplished by packaging `mediawiki
1.23' and dozens of its extensions including: EasyTimeline, Math,
Mobile Frontend, Score, Scribunto, Timed Media Handler, Titlekey,
Universal Language Selector, Visual Editor, and Wikidata.
A mirror of the `wikidata wiki' is now built `out-of-the-box', and
provides data to the other wikis (e.g. for populating infoboxes).
Mirrors of `simplewiki' and `simplewiktionary' are also built
`out-of-the-box' as before.
1) PACKAGING
Dependencies: Four new DEB packages were prepared as dependencies for
WP-MIRROR 0.7:
o mediawiki-mwxml2sql_0.0.2-2_amd64.deb: contains an upgrade of
`mwxml2sql' suitable for processing XML dumps for mediawiki 1.23.
o wp-mirror-mediawiki_1.23-1_all.deb: contains git branch wmf/1.23wmf14.
o wp-mirror-mediawiki-extensions_1.23-1_all.deb: contains 47
extensions from git branch wmf/1.23wmf14 (except for the `Wikidata'
extension which is from git branch mw1.23-wmf11)
o wp-mirror-mediawiki-extensions-math-texvc_1.23-1_amd64.deb: contains
programs needed to render MathJax.
Testing: The DEB package for WP-MIRROR 0.7 works `out-of-the-box'
with no user configuration for the following distributions:
o Debian GNU/Linux 7.4 (wheezy) with backports. This has been tested
on both a host machine, and on a virtual machine.
2) INSTALLATION
WP-MIRROR 0.7 is now available in the form of a Debian package repository.
3) USE
Virtual Hosts: Browsing of mirrored wikis is done via virtual hosts
with names like <http://simple.wikipedia.site/>,
<http://simple.wiktionary.site/>, and <http://www.wikidata.site/>.
Simply take the URL that WMF offers, and replace `.org' with `.site'.
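The renaming rule can be written down as a small helper. This is only a
sketch of the rule as stated above; the function name is made up, and
it assumes the host name ends in `.org'.

```python
from urllib.parse import urlsplit, urlunsplit


def mirror_url(wmf_url: str) -> str:
    """Map a WMF URL to the mirror's virtual host by swapping .org for .site."""
    parts = urlsplit(wmf_url)
    host = parts.hostname or ""
    if not host.endswith(".org"):
        raise ValueError(f"host does not end in .org: {host!r}")
    mirror_host = host[: -len(".org")] + ".site"
    # Keep an explicit port if one was given; any user info is dropped.
    netloc = mirror_host if parts.port is None else f"{mirror_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    for url in ("http://simple.wikipedia.org/",
                "http://simple.wiktionary.org/",
                "http://www.wikidata.org/"):
        print(url, "->", mirror_url(url))
```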
4) FURTHER INFO
Project Home Page: <http://www.nongnu.org/wp-mirror/> has been
updated. Please browse there if you are interested in trying
WP-MIRROR.
Documentation has been updated. There is a new section on virtual
machines; and there is a new section on the `mediawiki' extensions
that were packaged for this release.
Feedback is welcome.
5) ACKNOWLEDGEMENTS
I would like to thank several people for providing valuable assistance.
o Jason Cooper - for recommendations on DEB packaging, repository
design, and hosting; and for getting me interested in virtual
machines, which are now part of my tool chain (this is how I test that
WP-MIRROR 0.7 works `out-of-the-box' on a clean install of Debian
7.4).
o Kevin Day - for providing IPv6 support at <ftpmirror.your.org>,
which is highly appreciated by those of us on IPv6 only networks.
o Federico Leva (Nemo) - for advice that helped track down a bug with
XML dump importation.
o Gnosygnu - for many pointers on Wikidata, Scribunto, database schema
changes, and XOWA thumb dumps.
o Guy Castagnoli - for performing a code review of WP-MIRROR 0.6,
which led to numerous improvements in WP-MIRROR 0.7; and for spending
an evening with me discussing a wide range of topics including: news
of other offline projects, showing me how pages are rendered on his
mobile devices, and starting a list of feature requests for a future
WP-MIRROR 0.8 (requests for features including parsoid, importation of
daily dumps, generation of thumb dumps and ZIM files).
Sincerely Yours,
Dr. Kent L. Miller