I'm running a basic check (parsing every page in the wiki) using the toolserver's 6-1 copy. I'll let you know if I see any issues.
John
On Thu, Jun 7, 2012 at 2:45 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
From: Platonides platonides@gmail.com
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: "xmldatadumps-l@lists.wikimedia.org" <xmldatadumps-l@lists.wikimedia.org>
Sent: Thursday, June 7, 2012 18:52
Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
On 06/06/12 20:22, Felipe Ortega wrote:
Hello.
I'm finding strange issues when trying to decompress the 7z version of
this dump for the French Wikipedia:
http://dumps.wikimedia.org/frwiki/20120430/
At some point around 3M revisions the 7z process stalls. After a long
time (a few hours) it resumes normal execution, but then it stalls again
around 55M revisions and never recovers.
Maybe there are some issues with the frwiki dumps, since I can see that
subsequent dump runs are experiencing failures (in May and June).
I'm now checking with the previous dump
(http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in
case I find any more problems.
Best, Felipe.
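One way to narrow down where a stall like this happens is to time the gaps between chunks arriving from the decompressor and record the byte offset at which each long gap occurs. The following is only a rough sketch under stated assumptions: the names `read_chunks`, `find_stalls` and `monitor_archive`, the 1 MiB chunk size and the 60-second threshold are all hypothetical, and only the `7z e -so` invocation comes from this thread. It reports byte offsets, not revision counts.

```python
import subprocess
import time


def read_chunks(stream, size=1 << 20):
    """Yield successive chunks from a binary stream until EOF."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk


def find_stalls(chunks, threshold_s=60.0, clock=time.monotonic):
    """Yield (byte_offset, gap_seconds) for every chunk that took longer
    than threshold_s to arrive. Only the wait for the producer is timed,
    not the time the consumer spends between chunks."""
    offset = 0
    it = iter(chunks)
    while True:
        start = clock()
        try:
            chunk = next(it)
        except StopIteration:
            return
        gap = clock() - start
        if gap > threshold_s:
            yield offset, gap
        offset += len(chunk)


def monitor_archive(archive_path, threshold_s=60.0):
    """Hypothetical driver: stream `7z e -so` output and print any stalls."""
    proc = subprocess.Popen(
        ["7z", "e", "-so", archive_path], stdout=subprocess.PIPE
    )
    try:
        for offset, gap in find_stalls(read_chunks(proc.stdout), threshold_s):
            print(f"stall of {gap:.0f}s before byte offset {offset}")
    finally:
        proc.stdout.close()
        proc.wait()
```

Calling `monitor_archive("frwiki-20120430-pages-meta-history.xml.7z")` would then show whether the gaps come from 7z itself, independently of whatever parser normally consumes the stream.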
It apparently decompresses ok.
time md5sum frwiki-20120430-pages-meta-history.xml.7z && ( time 7z e -so frwiki-20120430-pages-meta-history.xml.7z > /dev/null )
78eda06a57ea738a2e21697e31e52128  frwiki-20120430-pages-meta-history.xml.7z
real 25m55.503s
user 0m28.549s
sys 0m19.489s
7-Zip 4.55 beta Copyright (c) 1999-2007 Igor Pavlov 2007-09-05
p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: frwiki-20120430-pages-meta-history.xml.7z
Extracting frwiki-20120430-pages-meta-history.xml
Everything is Ok
Thanks, Platonides.
That's strange. It might then be something related to process scheduling (on Ubuntu Server 12.04), but I haven't had any issues with other languages (including the many files for English).
So the last alternative would be to decompress it first and then parse the XML (I see the uncompressed size is ~1.25 TB).
Best, Felipe.
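For the decompress-then-parse route, an incremental parser keeps memory bounded even on a file this large. This is only a minimal sketch, assuming the dump follows the usual MediaWiki export layout of `<page>` elements containing `<revision>` elements. The function name `count_pages_and_revisions` is hypothetical, and the XML namespace is stripped rather than hard-coded because the export schema version is not stated in this thread.

```python
import xml.etree.ElementTree as ET


def count_pages_and_revisions(source):
    """Stream through a MediaWiki XML dump and count pages and revisions.

    `source` is a filename or a binary file object. Elements are cleared
    as soon as they are complete so the tree never grows with the file.
    """
    pages = revisions = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if tag == "revision":
            revisions += 1
            elem.clear()  # free the revision text immediately
        elif tag == "page":
            pages += 1
            root.clear()  # detach finished pages from the root
    return pages, revisions
```

The same function accepts a pipe, so it could also be pointed at the standard output of `7z e -so` to compare behaviour between the streamed and the decompressed input.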
Total:
Folders: 0
Files: 1
Size: 1249323572065
Compressed: 7526979951
real 163m59.124s
user 138m30.290s
sys 0m29.328s
The content might be completely bogus, though. It'd need further checks.
----- Original message -----
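One inexpensive further check is to compare the archive's MD5 against a reference value, which at least rules out a corrupted download. Note that it says nothing about whether the XML inside is sensible. Below is a small sketch of that comparison. The helper names `md5_of_file` and `matches_published` are hypothetical, and where the reference checksum comes from is an assumption left to the caller.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's MD5 with a reference value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For the archive discussed here, the value to compare would be the `78eda06a57ea738a2e21697e31e52128` printed by `md5sum` above.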
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l