Hi Tomasz,<br><br>I compared the -20100312- [31.9 GB] and -20100130- [15.8 GB] archives.<br><br>0) The -20100312- [31.9 GB] archive contains the declared number of revisions, 313797035.<br><br>The -20100130- [15.8 GB] archive contains only 184777888 revisions.<br>
Last pages/revisions were:<br>R 184777820 ETA 75.6 : 5137501 53106624 Cote-des-Neiges (Montreal Metro)<br>R 184777822 ETA 75.6 : 5137502 53106677 Duchcov<br>R 184777866 ETA 75.6 : 5137504 53106706 Cote-Des-Neiges (Montreal Metro)<br>
R 184777867 ETA 75.6 : 5137506 53106711 Lynn Haney<br>R 184777888 ETA 75.6 : 5137507 9882553 Wikipedia:Administrators' noticeboard/Incidents<br>The XML stream seems to be broken at that point:<br>SyntaxError: no element found: line 36473988846, column 522<br>
<br><br>1) For many pages in the -20100312- [31.9 GB] archive, revisions between 2005-01-14 and 2005-05-14 have an empty text field.<br>The new -20100130- [15.8 GB] archive doesn't seem to have that problem. I couldn't identify any revisions with missing text in the [15.8 GB] archive (aside from blanked pages).<br>
<br>Some statistics on empty-text revisions:<br>[31.9 GB] Revisions 313797035. Empty Revisions 1524837.<br>[15.8 GB] Revisions 184986173. Empty Revisions 370982.<br>[31.9 GB] Revisions 185000000. Empty Revisions 1158890. (same position in the archive)<br>
<br>2) I analyzed the first 500000 revisions (in archive enumeration order) and could not find any revisions in the [31.9 GB] archive that are missing from the [15.8 GB] archive.<br>3) In the first 500000 revisions, the texts seem to match exactly (except for the missing texts, see 1).<br>
4) In the first 500000 revisions, the comments seem to match exactly.<br><br>-- Regards, Dmitry<br><br><br>P.S. <br>After I patched pywikipedia.xmlparser to include .7z support and fixed the memory leaks, it seems to work fine with en.wiki archives. You can actually parse 5 TB of text in Python :)<br>
It only takes ~36 hours :) Here is a code snippet that prints revisions with empty texts: <a href="http://wrdese.googlecode.com/svn/trunk/b/verify-wiki-dump-print-empty.py">http://wrdese.googlecode.com/svn/trunk/b/verify-wiki-dump-print-empty.py</a><br>
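<br>
For readers without access to that script, here is a minimal sketch of the same kind of check. It is not the linked script and it does not use pywikipedia.xmlparser; it uses the standard-library xml.etree.ElementTree instead. The function name count_empty_revisions and the SAMPLE data are made up for illustration. It assumes the usual page/revision/text nesting of a MediaWiki XML export and takes any file-like object, so the decompression step is left out. A cut-off stream like the one in 0) shows up as a ParseError (a SyntaxError subclass), which the sketch reports as a flag rather than crashing. Note that it cannot tell a legitimately blanked page from a revision whose text went missing.<br>

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}revision'."""
    return tag.rsplit("}", 1)[-1]


def count_empty_revisions(stream):
    """Return (revisions, empty_revisions, truncated) for a dump stream."""
    revisions = empty = 0
    truncated = False
    root = None
    try:
        for event, elem in ET.iterparse(stream, events=("start", "end")):
            if event == "start":
                if root is None:
                    root = elem  # remember the document root
                continue
            name = local(elem.tag)
            if name == "revision":
                revisions += 1
                text = next((c for c in elem if local(c.tag) == "text"), None)
                if text is None or not (text.text or "").strip():
                    empty += 1
                elem.clear()  # free the revision text right away
            elif name == "page":
                root.clear()  # drop finished pages so memory stays flat
    except ET.ParseError:
        truncated = True  # e.g. "no element found" on a cut-off stream
    return revisions, empty, truncated


# Tiny made-up sample: three revisions, one of them with an empty text.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
<page><title>A</title>
<revision><id>1</id><text>hello</text></revision>
<revision><id>2</id><text></text></revision>
</page>
<page><title>B</title>
<revision><id>3</id><text>world</text></revision>
</page>
</mediawiki>"""

if __name__ == "__main__":
    print(count_empty_revisions(io.BytesIO(SAMPLE)))
```

Clearing the root element after every page is what keeps memory use flat on a long stream; without it the parsed tree keeps growing.<br>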
<br>
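The comparison described in points 2) to 4) could be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the numbers above: the names iter_revisions and compare_prefix are hypothetical, revisions are matched by their id, and only the first `limit` revisions of each archive are considered. Texts are stored as SHA-1 digests so the index of the second archive stays small, and a text mismatch is only reported when both sides are non-empty, mirroring the "except for missing texts" caveat in 3).<br>

```python
import hashlib
import xml.etree.ElementTree as ET
from itertools import islice


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}revision'."""
    return tag.rsplit("}", 1)[-1]


def digest(text):
    """SHA-1 of a revision text, or None when the text is empty."""
    return hashlib.sha1(text.encode("utf-8")).digest() if text.strip() else None


def iter_revisions(stream):
    """Yield (revision_id, comment, text_digest) in archive order."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        # Direct children only, so <contributor><id> is not confused
        # with the revision's own <id>.
        fields = {local(c.tag): (c.text or "") for c in elem}
        yield int(fields["id"]), fields.get("comment", ""), digest(fields.get("text", ""))
        elem.clear()


def compare_prefix(stream_a, stream_b, limit):
    """Compare the first `limit` revisions of A with the first `limit` of B.

    Returns (missing, text_diff, comment_diff) as lists of revision ids:
    ids present in A but not in B, ids whose non-empty texts differ,
    and ids whose comments differ.
    """
    index_b = {rid: (comment, dig)
               for rid, comment, dig in islice(iter_revisions(stream_b), limit)}
    missing, text_diff, comment_diff = [], [], []
    for rid, comment, dig in islice(iter_revisions(stream_a), limit):
        if rid not in index_b:
            missing.append(rid)
            continue
        b_comment, b_dig = index_b[rid]
        if comment != b_comment:
            comment_diff.append(rid)
        if dig is not None and b_dig is not None and dig != b_dig:
            text_diff.append(rid)
    return missing, text_diff, comment_diff
```

Matching by id rather than by position means the check still works if one archive has extra revisions interleaved, as long as both prefixes cover the same pages.<br>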