Hi,
I don't know whether this issue has come up already - in case it has
and was dismissed, I beg your pardon. In case it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some
test results obtained a few hours earlier.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives that the other can read. When it comes
to decompression, however, only pbzip2-compressed archives work well
with pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2 - and that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
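For what it's worth, the compatibility claim above can be illustrated without the C tools. To my understanding (an assumption, not verified against the pbzip2 source here), pbzip2 writes one independent bz2 stream per input block and simply concatenates them, which is exactly the multi-stream format that stock bunzip2 already accepts. A minimal Python sketch of that layout:

```python
import bz2

def compress_multistream(chunks):
    """Emit one independent bz2 stream per chunk and concatenate them,
    mimicking (under the assumption above) pbzip2's block-parallel output."""
    return b"".join(bz2.compress(chunk) for chunk in chunks)

def decompress_multistream(data):
    """Decompress every concatenated stream, as bunzip2 does: start a fresh
    decompressor at each stream boundary until no bytes remain."""
    out = []
    while data:
        d = bz2.BZ2Decompressor()
        out.append(d.decompress(data))
        data = d.unused_data  # bytes left over after the current stream ends
    return b"".join(out)

archive = compress_multistream([b"hello ", b"world"])
assert decompress_multistream(archive) == b"hello world"
```

The names here are my own, purely for illustration; the point is only that concatenated streams round-trip cleanly, so pbzip2-compressed dumps remain readable for everyone.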
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
Hi data dumpers,
Starting today, some of the URLs I've been using to find the latest dumps
for current article revisions have begun 404ing.
Japanese (failing):
$ curl -I "http://dumps.wikimedia.org/enwiki/latest/jawiki-latest-pages-articles.xml.b…"
HTTP/1.1 404 Not Found
Server: nginx/1.1.19
Date: Tue, 07 Jul 2015 23:40:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 169
Connection: keep-alive
English (working):
$ curl -I "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.b…"
HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 07 Jul 2015 23:39:36 GMT
Content-Type: application/octet-stream
Content-Length: 11984805689
Last-Modified: Fri, 05 Jun 2015 23:45:33 GMT
Connection: keep-alive
Accept-Ranges: bytes
Are these particular dump files going away, or is the "latest" symlink
being updated before all dumps have completed?
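In the meantime, one workaround is to stop depending on the "latest" symlink and build the URL from an explicit dump date instead. A sketch, assuming the per-date directory layout used elsewhere on dumps.wikimedia.org (wiki/date/wiki-date-suffix); the date below is just an example:

```python
def dump_url(wiki, date, suffix="pages-articles.xml.bz2"):
    """Build a dated dump URL instead of relying on the 'latest' symlink,
    which may be repointed while a run is still in progress."""
    return ("http://dumps.wikimedia.org/{w}/{d}/{w}-{d}-{s}"
            .format(w=wiki, d=date, s=suffix))

print(dump_url("jawiki", "20150602"))
# http://dumps.wikimedia.org/jawiki/20150602/jawiki-20150602-pages-articles.xml.bz2
```

A dated URL stays stable once the file exists, so a 404 then means "not yet dumped" rather than "symlink moved under me".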
--
Devesh
Dear Sir or Madam,
I would like to see what impact Flow will have on dump files.
I see that LiquidThreads still appear in the XML dumps of enwikinews (as of
20150602).
Will the tables containing Flow records be dumped in SQL or XML format?
Are there now any SQL or XML dumps that contain Flow records?
If not, can you forecast when Flow will appear in the dumps?
Sincerely yours,
Kent
Hi All,
Sorry for the spam!
I need an old wikipedia dump (Oct 2008) for some research work. I can't
seem to find it anywhere.
Can someone please suggest how I can procure the dump?
If someone has the dump lying around on their machine(s), it would be
great if they could share it.
The dump file is named enwiki-20081008-pages-articles.xml.bz2
Thanks and Regards,
Saurabh
Hi,
I would like to know if there is a way to obtain dumps that are not
listed on the
website https://dumps.wikimedia.org/.
In particular, I am looking for enwiki-20081008-pages-articles.xml.bz2.
Thanks & Regards,
Saurabh