Hi,
I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results produced a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create archives
the other one can read. But when it comes to decompressing with
pbunzip2, only pbzip2-compressed archives work well.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e., faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working as usual for them.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
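For illustration, here is a minimal sketch of that round trip in Python
(calling the command-line tools via subprocess). The input file name is
made up, and the exact flags (-p for processor count, -k to keep the
input, -t to test integrity) may vary slightly between pbzip2 versions:

import multiprocessing
import subprocess

dump = "pages-articles.xml"  # hypothetical input file
cores = multiprocessing.cpu_count()

# Parallel compression: keeps the original (-k) and writes dump + ".bz2".
subprocess.check_call(["pbzip2", "-k", "-p%d" % cores, dump])

# Integrity check with the stock serial tool (what bunzip2 users will do)...
subprocess.check_call(["bzip2", "-t", dump + ".bz2"])
# ...and with the parallel tool (pbunzip2 is the same binary in decompress mode).
subprocess.check_call(["pbzip2", "-t", "-p%d" % cores, dump + ".bz2"])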
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director (Geschäftsführer): Richard Jelinek
Registered office: Fürth
Commercial register: AG Fürth, HRB-9201
Hello Wikiteam!
Just in time to wish you a good vacation and a happy 2016 :)
Well, I am also here about corrupted files :)
I downloaded this file three times, from different wifi networks and
using Firefox download managers:
enwiki-20151201-pages-articles-multistream.xml.bz2
and this one twice:
enwiki-latest-pages-articles-multistream.xml.bz2
The MD5 checksum is correct (the *-latest-* file has the checksum of the
*-20151201-* one), but the file is corrupted.
I cannot use bzip2recover, because the file is so large that I would have
to recompile it to raise its compiled-in limit on the number of blocks it
can handle... and I think it is way better to get a fixed file :D
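In case it helps to reproduce this, here is roughly what I mean by
"checksum OK but file corrupted", as a small Python 3 sketch
(bz2.BZ2File reads multistream archives on Python 3.3+; the file name is
just the one above):

import bz2
import hashlib

name = "enwiki-20151201-pages-articles-multistream.xml.bz2"

# 1) Hash the compressed file as downloaded and compare against the
#    published MD5 checksum.
md5 = hashlib.md5()
with open(name, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print("md5:", md5.hexdigest())

# 2) Try to stream-decompress the whole file and report where it breaks.
done = 0
try:
    with bz2.BZ2File(name) as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            done += len(chunk)
except (OSError, EOFError) as exc:
    print("decompression failed after %d bytes: %s" % (done, exc))
else:
    print("decompressed %d bytes without error" % done)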
Could you please check if I am the only one having this issue?
Dumps of other languages have worked fine for me; en-* is problematic.
I see that almost all the dumps have "Dump complete" next to them and the
data has been transferred to labs. Problem is, the dumps are not
complete. Is this the new paradigm?... After each stage of the dump, label
them done and then transfer what files were generated? Wash, rinse and
repeat?
Bryan
I assume you have all seen https://phabricator.wikimedia.org/T116907
"Explore the possibility of splitting dewiki and frwiki into smaller
chunks"
If not, and you ever use frwiki or dewiki page content dumps, go read
it now. Or if you know of anyone who uses them, please nag them to go
read it.
The upshot is that, most likely starting January 1st 2016, we will do all
further dump runs of frwiki and dewiki with so-called 'checkpointing'.
This change is being made so that if one of these jobs is interrupted
for whatever reason, it can be rerun with only the missing page ranges
dumped on the second run, saving quite a lot of time. A second reason
is to ease the burden on downloaders, who generally prefer downloading
several smaller files rather than one large 90 GB file (example taken
from the dewiki history dumps).
What does this mean in practice for you, users of the dumps? It means
that filenames for the page content (articles, meta-current and
meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the
first page id in the file and YYYY is the last page id in the file. For
examples of this you can look at the enwiki page content dumps, which
have been running that way for a few years now.
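If it helps with the conversion, here is a rough sketch of how a
download script might pull the page range out of the new names; the full
example file names below are invented, only the pXXXXpYYYY part is what
matters:

import re

PAGE_RANGE = re.compile(r"p(\d+)p(\d+)")

def page_range(filename):
    """Return (first page id, last page id), or None for unsplit files."""
    match = PAGE_RANGE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Checkpointed file: yields (1, 12345).
print(page_range("dewiki-20160101-pages-meta-history1.xml-p000000001p000012345.bz2"))
# Old-style unsplit file: yields None.
print(page_range("dewiki-20151201-pages-articles.xml.bz2"))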
This notice should give you plenty of time to convert your tools to use
the new naming scheme. I encourage you to forward this message to
other appropriate people or groups.
Thanks,
Ariel
Also if you are a dumps user or have thoughts about how you would redo
them from scratch, get your ideas in now. We're not waiting for the
Dev Summit to get the work started. See
https://phabricator.wikimedia.org/T114019 for details, especially the
document linked at the end of the task description under 'FOR CURRENT
DISCUSSION'.
We need: comments on the strawman model proposed in the document
mentioned above; proposals for code we can reuse for any of those
pieces, especially the job queue/management piece. (Celery? Something
else?) What should the object store be based on, if we have one? Is
Ceph a dead end? Will Swift be deadly slow or should we use it since we
have it in house already? Etc. All comments on the ticket please so we
have them all in one place.
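To make the job queue question a little more concrete, here is a purely
illustrative sketch of what one page-range job might look like if we
went with Celery; the broker URL, task name and arguments are all
invented for the example and imply nothing about the eventual design:

from celery import Celery

app = Celery("dumps", broker="redis://localhost:6379/0")

def run_page_range_dump(wiki, first_page_id, last_page_id):
    # Placeholder for the real dump machinery.
    pass

@app.task(bind=True, max_retries=3)
def dump_page_range(self, wiki, first_page_id, last_page_id):
    """One checkpointed page-range dump as an independent, retryable job."""
    try:
        run_page_range_dump(wiki, first_page_id, last_page_id)
    except Exception as exc:
        # Retry the failed range only, instead of rerunning the whole wiki.
        raise self.retry(exc=exc, countdown=60)

# A scheduler would then queue e.g. dump_page_range.delay("dewiki", 1, 12345).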
Expect that most of the code written here at WMF will be in Python
unless someone else volunteers to write some. Anyone interested in
doing some of the development can step right up too of course.
Please forward this on to other fora as appropriate.
Ariel