Hello.
I'm finding strange issues when trying to decompress the 7z version of this dump for the French Wikipedia:
http://dumps.wikimedia.org/frwiki/20120430/
At some point around 3M revisions the 7z process stalls. After a long time (a few hours) it resumes normal execution, but then it stalls again around 55M revisions and never recovers.
Maybe there are some issues with the frwiki dumps, since I can see that subsequent processes are experiencing failures (in May and June).
I'm now checking with the previous dump (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in case I find any more problems.
Best, Felipe.
It apparently decompresses ok.
$ time md5sum frwiki-20120430-pages-meta-history.xml.7z && ( time 7z e -so frwiki-20120430-pages-meta-history.xml.7z > /dev/null )
78eda06a57ea738a2e21697e31e52128 frwiki-20120430-pages-meta-history.xml.7z
real 25m55.503s
user 0m28.549s
sys 0m19.489s
7-Zip 4.55 beta Copyright (c) 1999-2007 Igor Pavlov 2007-09-05
p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: frwiki-20120430-pages-meta-history.xml.7z
Extracting frwiki-20120430-pages-meta-history.xml
Everything is Ok
Total: Folders: 0 Files: 1 Size: 1249323572065 Compressed: 7526979951
real 163m59.124s
user 138m30.290s
sys 0m29.328s
The content might be completely bogus, though. It'd need further checks.
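One possible further check, sketched here in Python as an illustration rather than anything that was run in this thread: hash the data in fixed-size chunks, so the same routine works on the .7z file on disk or on the decompressed stream coming out of 7z e -so, without holding the data in memory. The function name stream_digest and the 1 MiB chunk size are assumptions made for this example.

```python
import hashlib
import io


def stream_digest(stream, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a binary stream, read chunk by chunk."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Small in-memory demonstration; a real run would pass an open file
# (mode "rb") or the stdout pipe of a decompressor process instead.
demo = io.BytesIO(b"x" * 3_000_000 + b"tail")
print(stream_digest(demo, "sha256"))
```

Comparing the digest of the downloaded archive with a published checksum only shows that the download is intact. It says nothing about whether the XML inside is sane, which still needs a parse.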
From: Platonides platonides@gmail.com
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org
Sent: Thursday, 7 June 2012 18:52
Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
Thanks, Platonides.
That's strange. Then it might be something related to process scheduling (on Ubuntu Server 12.04), but I haven't had any issues with other languages (including the many English files).
So the last alternative would be to decompress it first and then parse the XML (I see the size is ~125 GB).
Best, Felipe.
I'm running a basic check (parsing every page in the wiki) using the toolserver's 6-1 copy. I'll let you know if I see any issues.
John
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
From: John phoenixoverride@gmail.com
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: Platonides platonides@gmail.com; "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org
Sent: Thursday, 7 June 2012 21:22
Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
Thanks, John. That will be very useful. I'm getting a fresh copy of the frwiki 20120430 dump, and I will try to decompress it first and then parse the plain XML.
In any case, it is also a good chance to check for possible delays caused by using pipes to redirect the output of 7z e -so, rather than reading the plain file directly.
Felipe.
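For anyone wanting to compare the two approaches, here is a minimal sketch (not the parser used by anyone in this thread) of a streaming counter that accepts any binary stream, so the same code can read either the stdout pipe of 7z e -so or the plain decompressed file. The function names, the sample document and the namespace URI in it are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def count_pages_and_revisions(stream):
    """Count <page> and <revision> elements in a binary XML stream."""
    pages = revisions = 0
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "revision":
            revisions += 1
            elem.clear()  # drop the revision text once it has been counted
        elif name == "page":
            pages += 1
            root.clear()  # detach finished pages so memory use stays flat
    return pages, revisions


# Tiny hypothetical sample standing in for a real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page><title>A</title>
    <revision><id>1</id><text>x</text></revision>
    <revision><id>2</id><text>y</text></revision>
  </page>
  <page><title>B</title>
    <revision><id>3</id><text>z</text></revision>
  </page>
</mediawiki>"""

print(count_pages_and_revisions(io.BytesIO(SAMPLE)))
```

To read from the pipe, one would pass the stdout of subprocess.Popen(["7z", "e", "-so", path], stdout=subprocess.PIPE). For the plain file, open(path, "rb") is enough. Timing both runs with the same counter should show whether the pipe itself accounts for any delay.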
I needed to change the dump I was working on (I was only doing the current version dump) but you can track the status of the parse at http://tinyurl.com/frstatus
From: John phoenixoverride@gmail.com
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: Platonides platonides@gmail.com; "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org
Sent: Thursday, 7 June 2012 23:26
Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
Thanks for your help, John. I really appreciate it. It looks like the new copy is working fine, now. I will check again tomorrow morning. It should finish overnight.
Best, Felipe.
$ time 7z e -so frwiki-20120430-pages-meta-history.xml.7z | sha256sum
7-Zip 4.55 beta Copyright (c) 1999-2007 Igor Pavlov 2007-09-05
p7zip Version 4.55 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: frwiki-20120430-pages-meta-history.xml.7z
Extracting frwiki-20120430-pages-meta-history.xml
Everything is Ok
Total: Folders: 0 Files: 1 Size: 1249323572065 Compressed: 7526979951
5349850c5e9a7c03f0dee071a9143660a7d12847948bcb0b1564060c8a21e8c0 -
real 436m29.562s
user 357m41.568s
sys 49m58.540s
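As a rough comparison of the two runs posted in this thread, the wall-clock times can be turned into throughput figures. This is only a back-of-the-envelope sketch: the helper names are made up for the example, and the second run also pays for the SHA-256 hashing, so the gap is not purely pipe overhead.

```python
import re

SIZE_BYTES = 1249323572065  # "Size" reported by 7z in the thread


def real_seconds(time_field):
    """Convert a bash `time` field such as '163m59.124s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", time_field)
    if match is None:
        raise ValueError(f"unrecognised time field: {time_field!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def throughput_mb_per_s(size_bytes, time_field):
    """Decompressed bytes per wall-clock second, in decimal megabytes."""
    return size_bytes / real_seconds(time_field) / 1e6


to_dev_null = throughput_mb_per_s(SIZE_BYTES, "163m59.124s")
to_sha256sum = throughput_mb_per_s(SIZE_BYTES, "436m29.562s")
print(f"> /dev/null : {to_dev_null:.0f} MB/s")
print(f"| sha256sum : {to_sha256sum:.0f} MB/s")
```

That works out to roughly 127 MB/s when discarding the output and roughly 48 MB/s when piping into sha256sum, both measured on the same machine.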
----- Original message -----
From: Felipe Ortega glimmer_phoenix@yahoo.es
To: Platonides platonides@gmail.com
CC: "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org
Sent: Thursday, 7 June 2012 20:45
Subject: Re: [Xmldatadumps-l] Problems with frwiki dumps
Oops. Not so fast. It's 1.25 TB. OK, trying again from the compressed file.
Felipe.
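The correction is easy to verify from the byte counts that 7z printed. A 13-digit byte count is easy to misread by a factor of ten. The snippet below is just that arithmetic, with both constants copied from the 7z output earlier in the thread.

```python
SIZE_BYTES = 1249323572065      # "Size" line from the 7z output
COMPRESSED_BYTES = 7526979951   # "Compressed" line from the 7z output

size_gb = SIZE_BYTES / 10**9    # decimal gigabytes
size_tb = SIZE_BYTES / 10**12   # decimal terabytes
size_tib = SIZE_BYTES / 2**40   # binary tebibytes
ratio = SIZE_BYTES / COMPRESSED_BYTES

print(f"{size_gb:.0f} GB = {size_tb:.2f} TB ({size_tib:.2f} TiB)")
print(f"compressed: {COMPRESSED_BYTES / 10**9:.2f} GB, ratio about {ratio:.0f}:1")
```

So the decompressed XML is about 1.25 TB, not ~125 GB, and the archive compresses it by a factor of roughly 166. Decompressing to disk first needs well over a terabyte of free space.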