Hi,
I don't know if this issue has come up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results done a few hours before that.
The results indicate the following:
bzip2 and pbzip2 are mutually compatible - each one can create archives
the other can read. When it comes to decompressing, however, only
pbzip2-compressed archives are handled properly by pbunzip2.
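For reference, here is a minimal sketch of the kind of round-trip check
behind those results (not the exact test script; the sample file name is
just a placeholder):

    import subprocess
    from pathlib import Path

    SAMPLE = Path("sample.xml")  # placeholder test file

    def roundtrip(compress_cmd, decompress_cmd):
        """Compress SAMPLE with one tool, decompress the result with the
        other, and return True if the original bytes survive."""
        subprocess.run(compress_cmd + ["-k", "-f", str(SAMPLE)],
                       check=True)                       # writes sample.xml.bz2
        out = subprocess.run(decompress_cmd + ["-c", str(SAMPLE) + ".bz2"],
                             check=True, capture_output=True).stdout
        return out == SAMPLE.read_bytes()

    # cross-check both directions
    print("bzip2  -> pbzip2 -d:", roundtrip(["bzip2"], ["pbzip2", "-d"]))
    print("pbzip2 -> bunzip2  :", roundtrip(["pbzip2"], ["bunzip2"]))

Both directions use only the -k/-f/-c/-d flags that the two tools share.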
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to better
usage of system resources, i.e. faster compression (see the sketch after
this list).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
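As a concrete illustration of point 1: pbzip2 shares bzip2's basic
command-line interface, so swapping it into an existing dump pipeline
should be close to a one-word change. A minimal sketch, assuming the dump
is compressed from a plain XML file (path and thread count are
placeholders):

    import subprocess

    DUMP_XML = "somewiki-pages-articles.xml"   # placeholder path
    THREADS = 8                                # placeholder CPU count

    # pbzip2 accepts the same basic flags as bzip2; -p# sets the number of
    # worker threads. The resulting .bz2 is readable by bunzip2 and pbunzip2.
    subprocess.run(["pbzip2", "-f", "-p" + str(THREADS), DUMP_XML],
                   check=True)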
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com          Managing director: Richard Jelinek
Human Language Technology Experts       Registered office: Fürth
69216618 Mind Units                     Commercial register: AG Fürth, HRB-9201
Hello everybody,
This is my first thread on the list; I'm relatively new to
Wikipedia/MediaWiki usage (1 month).
I followed the documentation (RTFM as usual), and every step seems to end
at a wall.
My simple question: how do I correctly install a Wikipedia mirror from the
dumps in MediaWiki?
My goal:
Create an offline MediaWiki server from the FR, EN or PT dumps (one
language only).
I did some tests with wowiki, which is small enough for testing purposes,
but I encountered problems:
Most of the tools mentioned in the documentation are 404 or outdated, for
example xml2sql or mwdumper.
The original MediaWiki importer takes 20 minutes to import the 1.6 MB
wowiki dump, which means approximately a decade for the FR one...
I tried the latest MWDumper found on GitHub; it can quickly generate a
usable SQL file, but it seems that special characters in the page text are
not escaped ("slashed"), so sequences like \n cause SQL errors...
Some columns are missing from the schema created by the original MediaWiki
installer, for example "page_counter" in the "page" table. Are there
extensions that need to be installed to import the dumps?
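From what I can tell, page_counter existed in older MediaWiki schemas but
was later dropped from core, while the import tools still emit it. If that
is indeed the mismatch, I suppose re-adding the column before loading the
SQL would work - a rough sketch (connection details are placeholders, and
the column definition is my best guess at the old one):

    import pymysql

    db = pymysql.connect(host="localhost", user="wiki", password="wiki",
                         database="wikimirror")
    cur = db.cursor()
    cur.execute("SHOW COLUMNS FROM page LIKE 'page_counter'")
    if cur.fetchone() is None:
        # legacy column, re-added only so the generated INSERTs load
        cur.execute("ALTER TABLE page "
                    "ADD COLUMN page_counter BIGINT UNSIGNED NOT NULL DEFAULT 0")
    db.commit()
    db.close()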
How can I find/get the media (low-resolution images are enough)?
How should interwiki media/links/pages be handled?
The next step after that will be the update procedure, incremental or
complete...
It seems the French Wikimedia team is stuck on my questions...
I'm not opposed to doing development (Python, PHP or JEE), but I need an
entry point to start from if nobody is working on it...
Thank you for your work. I know you are quite busy with dump generation,
but our project is quite serious, lots of people want it, and we need
quick answers on its feasibility.
Best regards,
Yoni