Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) based on some test
results obtained a few hours before that.
The results indicate the following:
bzip2 and pbzip2 are mutually compatible - each one can create
archives the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives work well with pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for these people (see the sketch
below).
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
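To illustrate point 2): pbzip2 writes its output as a series of
concatenated bz2 streams, which ordinary decompressors read back to
back. A minimal sketch in Python (the file name is a placeholder;
Python's bz2 module handles such multi-stream files transparently):

import bz2

# Placeholder name; substitute any pbzip2-compressed dump.
DUMP = "enwiki-pages-articles.xml.bz2"

# The bz2 module reads the concatenated streams that pbzip2 emits,
# just like bunzip2 does; peek at the first few lines to verify.
with bz2.open(DUMP, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line.rstrip())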
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek
Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
Hi,
after a month of work on my GSoC project Incremental Dumps [1], I think I
now have something worth sharing and talking about, though it's still far
from complete.
What the code can do now is read a pages-history XML dump and create the
various kinds of dumps (pages/stub, current/history) in the new format
from it.
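(For anyone curious about the input side: reading a pages-history XML
dump in a streaming fashion can be sketched in a few lines of Python.
This is just an illustration of the kind of processing involved, not
the project's actual code; the file name is a placeholder and the tag
names follow the standard export schema.)

import bz2
import xml.etree.ElementTree as ET

DUMP = "dewiki-pages-meta-history.xml.bz2"  # placeholder

def local(tag):
    # Strip the XML namespace prefix from a tag name.
    return tag.rsplit("}", 1)[-1]

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f):
        if local(elem.tag) == "page":
            title = next(c.text for c in elem if local(c.tag) == "title")
            revs = sum(1 for c in elem if local(c.tag) == "revision")
            print(title, revs)
            elem.clear()  # free memory as we stream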
It can then convert a dump in the new format back to XML.
The XML output is almost the same as the existing XML dumps, but there are
some differences [2].
The new format also now has a detailed specification [3] (it describes the
current version; the format is still in flux and can change daily).
If you want, you can also try running the code. [4]
It's not production-quality yet (e.g. it doesn't report errors properly),
but it should work.
Compilation instructions are in the README file.
Any comments or questions are welcome.
Petr Onderka
User:Svick
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
[2]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_…
[3]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Spec…
[4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc
Hello,
there is a problem with the extracted page abstracts for Yahoo on the
big wikis moved to the new infrastructure. During generation everything
seemed to be fine, but it ended with a 159 KB file.
Another question: why is this step not parallelized?
Best regards
Andreas Meier
Hello XML dump users,
The workers that handle the rest of the wikis will finish what they are
working on (whichever runs are in progress now) and then will terminate.
Once they have all stopped, I will start these processes up on hosts in
the other data center.
I'll be running 6 workers at a time, which should put us in the time
frame of 8-9 days between runs per wiki. We'll see how that goes.
Ariel
Hi,
I was wondering about the ordering of revisions inside the pages-meta-history dumps, especially with respect to namespaces. Does the order of revisions in the dumps account for namespaces (e.g. are revisions from the Template namespace located towards the end of the dump?), or is the order bound to some other parameter which potentially influences the location of revisions from certain namespaces?
I'm currently processing the (March 2013) dewiki dump.
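One way to check this empirically would be to stream over the dump and
track the running namespace distribution in document order; a minimal
sketch in Python (the file name is a placeholder, and the <ns> element
is the page's namespace in the standard dump schema):

import bz2
from collections import Counter

DUMP = "dewiki-pages-meta-history.xml.bz2"  # placeholder

seen = Counter()
pages = 0
with bz2.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        s = line.strip()
        # Dumps are line-formatted; each page has one <ns>N</ns> line.
        if s.startswith("<ns>") and s.endswith("</ns>"):
            seen[s[4:-5]] += 1
            pages += 1
            if pages % 100000 == 0:
                # If namespaces cluster, this distribution shifts
                # noticeably as the scan progresses.
                print(pages, dict(seen))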
Regards,
Johannes
---
Johannes Daxenberger
Doctoral Researcher | IT Administration
Ubiquitous Knowledge Processing (UKP Lab)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
email: daxenberger(at)ukp.informatik.tu-darmstadt.de
phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC)
www.werc.tu-darmstadt.de
Hi all,
I've imported the data dumps into MySQL, and after running some queries
I've noticed that the column "rev_len" in the revision table is empty and
"page_len" in the page table is equal to 0 for every row. Can anybody tell
me anything about this? I was really hoping to use this information.
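If the import tool simply didn't populate those columns, one option is
to backfill them from the revision text. A hedged sketch, assuming the
classic (pre-MCR) schema where text lives uncompressed in the text
table - the connection parameters are placeholders:

import pymysql  # any MySQL client works; this choice is an assumption

conn = pymysql.connect(host="localhost", user="wiki",
                       password="secret", database="wikidb")  # placeholders
with conn.cursor() as cur:
    # rev_len is the byte length of the revision text. This assumes
    # old_text is stored uncompressed (no gzip flag in old_flags).
    cur.execute("""
        UPDATE revision JOIN text ON rev_text_id = old_id
        SET rev_len = LENGTH(old_text)
        WHERE rev_len IS NULL OR rev_len = 0
    """)
    # page_len mirrors rev_len of the page's latest revision.
    cur.execute("""
        UPDATE page JOIN revision ON page_latest = rev_id
        SET page_len = rev_len
    """)
conn.commit()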
Best,
Xavi
--
Xavier Vinyals-Mirabent
-----------------------------------------------------
Mail: 1925, 4th Street South
4-101 Hanson Hall
Minneapolis, MN 55455-0462
Email: vinya002(a)umn.edu
Office: 3-157 Hanson Hall
Phone: +1 6126257837
Homepage: www.econ.umn.edu/~vinya002
Hi, is there a file somewhere with a list of all namespace names and
numbers for all the Wikimedia wikis? The list should at least have the
canonical names, but preferably also any aliases. This is useful for
finding out which namespaces interwiki links and language links go to.
The canonical names are in the siteinfo section of the XML dumps of each
wiki, but not the aliases - and it is not practical to download complete
dumps for all projects just to get namespace names.
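In the meantime, the siteinfo API exposes both the namespaces and their
aliases per wiki without any dump download. A minimal sketch in Python
(the wiki URL is just an example; the list of wikis to iterate over
would still have to come from elsewhere, e.g. the sitematrix):

import json
import urllib.request

API = "https://de.wikipedia.org/w/api.php"  # example wiki
url = (API + "?action=query&meta=siteinfo"
             "&siprop=namespaces%7Cnamespacealiases&format=json")

# Wikimedia asks for a descriptive User-Agent on API requests.
req = urllib.request.Request(url, headers={"User-Agent": "ns-list-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)["query"]

for ns in data["namespaces"].values():
    print(ns["id"], ns.get("canonical", "(main)"), ns["*"])
for alias in data["namespacealiases"]:
    print(alias["id"], "alias:", alias["*"])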
Regards,
- Byrial
The following wikis are now running out of our datacenter in Ashburn:
eswiki, ptwiki, plwiki, ruwiki, jawiki, dewiki, frwiki, nlwiki, itwiki
Two worker processes handle these, but each wiki is dumped in 4 parallel
jobs. This means that we'll be back to shorter and more frequent run
times for them all. A reminder that we will not be recombining the full
history dumps into one file, as that would undercut the speed gains we
make by parallelizing.
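For consumers of the full history dumps this means processing several
part files instead of one. A minimal sketch in Python (the glob pattern
is a placeholder; match it to the actual part names of a given run):

import bz2
import glob

# Placeholder pattern for the un-recombined history parts.
parts = sorted(glob.glob("dewiki-*-pages-meta-history*.xml*.bz2"))

total = 0
for part in parts:
    # Each part is a complete XML document with its own <mediawiki>
    # root, so hand the parts to a parser one at a time rather than
    # concatenating the raw streams.
    with bz2.open(part, "rt", encoding="utf-8") as f:
        total += sum(1 for line in f if "<page>" in line)
print(total, "pages across", len(parts), "parts")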
As with the enwiki move, you won't see up-to-the-minute updates for
these wikis on the HTML page, since the data must be synced over before
updates can show up.
In a few days we'll start working on the move of the rest of the wikis.
Ariel