Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible when
compressing: each one can create archives the other can read. But when
it comes to decompressing, only pbzip2-compressed archives are handled
correctly by pbunzip2.
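For the record, the round-trip check behind those results can be
scripted roughly like this (a sketch only; it assumes both tools are on
the PATH, and the filenames are made up):

    <?php
    // Cross-compatibility matrix: compress with each tool, decompress
    // with each tool, then compare checksums of the round-tripped file.
    $src = 'sample.xml';
    foreach ( array( 'bzip2', 'pbzip2' ) as $comp ) {
        foreach ( array( 'bunzip2', 'pbunzip2' ) as $decomp ) {
            shell_exec( "$comp -c " . escapeshellarg( $src ) . " > test.bz2" );
            shell_exec( "$decomp -c test.bz2 > roundtrip.xml" );
            $ok = ( md5_file( $src ) === md5_file( 'roundtrip.xml' ) );
            echo "$comp -> $decomp: " . ( $ok ? 'OK' : 'MISMATCH' ) . "\n";
        }
    }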
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for those people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com      Managing Director: Richard Jelinek
Human Language Technology Experts   Registered office: Fürth
69216618 Mind Units                 Commercial register: AG Fürth, HRB-9201
It's amazing there are already so many years available for download.
Especially the larger zips must have been quite time-consuming to
compile! It would be great if the 2008 or pre-2006 packages became
available in the near future. It is a really interesting development to
see http://dumps.wikimedia.org compiling not only large packaged
databases of the various wikis' current content, but also more historic
content. In the past, the Internet Archive was the sole distributor of
older (historic) wiki packages.
Eventually, in the far future, there will have to be some sort of
viable mechanism for cloning all the images stored on Wikimedia, though
for now the Picture of the Year packages are very interesting for those
more drawn to the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!
As usual there have been a couple of hiccups after the rollout of
MediaWiki 1.19wmf1 everywhere. In the expectation that there may be
other surprises, I'm running just one process over the weekend and will
check on it from time to time. Of course, if anyone notices something
awry, please bring it up.
Thanks,
Ariel
Hello,
the dewiki dump 20120221 is completely broken; see
http://dumps.wikimedia.org/dewiki/20120221/ . In my opinion, the best
thing to do is to stop this dump process and start a fresh dump. I
think it makes no sense to restart the 20120221 dump.
Best regards
Andreas
Hello,
while working on the xmldumps-backup test suite, I came across ExternalStorageHttp.
While it's easy to set up (read-only), testing it is rather a burden.
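For context, the read-only side amounts to something like this in
LocalSettings.php (a sketch under my assumptions; the example URL is
invented):

    # Enable the 'http' protocol for external revision text storage.
    $wgExternalStores = array( 'http' );
    # A text row then carries 'external' in old_flags and a URL in
    # old_text, e.g. 'http://textstore.example.org/fetch?id=12345';
    # on read, ExternalStorageHttp simply fetches that URL over HTTP.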
Is ExternalStorageHttp actually used for some MediaWiki installation at Wikimedia?
Kind regards,
Christian
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasse 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello,
I am currently developing a test suite for the XML dumps, and I am curious about the specification of text.old_flags in MediaWiki's maintenance/tables.sql.
The file describes the 'object' flag as

    text field contained a serialized PHP object.
    object either contains multiple versions compressed
    to achieve a better compression ratio, or it refers
    to another row where the text can be found.
Is the "multiple versions" part still used in some project?
If so, how should this be set up [1]?
Kind regards,
Christian
P.S.: In #wikimedia-dev I was told to bring up the question on this list. If there are other lists where I should ask, please let me know.
[1] Before r6138 (back then still in Article.php, not Revision.php), it seems the text was obtained by

    $object = unserialize( $text );
    $text = $object->getItem( $hash );

There it is somewhat obvious how a single object may return different texts. However, beginning with r6138, it seems the text is simply fetched by

    $obj = unserialize( $text );
    [...]
    $text = $obj->getText();

If a single object is supposed to return different texts, how does it determine which text to return?
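To make the question concrete, here is a toy class (purely
illustrative, not MediaWiki's actual HistoryBlob code) showing how one
serialized object could hold several texts keyed by hash, and why a
getText() without a hash argument needs some notion of a default:

    <?php
    // Toy illustration only. One object stores several texts, keyed by
    // content hash; getItem( $hash ) picks the right one after
    // unserialize(), as in the pre-r6138 code path.
    class ToyHistoryBlob {
        private $items = array();     // hash => text
        private $defaultHash = null;

        public function addItem( $text ) {
            $hash = md5( $text );
            $this->items[$hash] = $text;
            return $hash;             // callers keep this for later lookups
        }

        public function getItem( $hash ) {
            return isset( $this->items[$hash] ) ? $this->items[$hash] : false;
        }

        public function setDefaultHash( $hash ) {
            $this->defaultHash = $hash;
        }

        // Mirrors the post-r6138 call pattern: getText() takes no hash,
        // so the object itself must know which of its texts to return.
        public function getText() {
            return $this->getItem( $this->defaultHash );
        }
    }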
Good morning campers! :-)
At the POTY collection link http://dumps.wikimedia.org/other/poty/
you'll notice that the 2009 files have been added.
For people who have wanted older (2002 through 2006) dumps, one or two
dumps of the projects for most of that period are now online at our new
archive link: http://dumps.wikimedia.org/archive/
I'd love to get a couple of dumps of each of the projects for the years
2005, 2007 and 2008. If you have an old mirror on a hard drive
gathering dust in your closet, give me a shout. Thanks!
Hmm, and one other thing while I'm at it: I've been cleaning up our
dumps documentation on wikitech, and there's now a rough outline of the
dumps history here: http://wikitech.wikimedia.org/view/Dumps/History
If people remember any milestones that should be added there, or if you
see any glaring errors, please edit the page or send me mail with the
corrections.
Ariel