Hi,
Nice graphs! Pages in the dump files are in order of page id (more or less in order of article creation date), but most of the missing data is from revisions that occurred in the timeframe [2005-01-14T - 2005-05-14]. Damage at the compression step would more likely hit a contiguous range of pages rather than revisions from one date range, so the data was probably lost during the conversion from the SQL database format to XML and not in the bzip or 7z compression step. My guess is an intermittent SQL database read timeout/error.
cheers, Jamie
----- Original Message -----
From: Dmitry Chichkov dchichkov@gmail.com
Date: Monday, May 17, 2010 11:22 pm
Subject: Re: [Xmldatadumps-admin-l] FYI: comparison between enwiki-20100130-pages-meta-history.xml.7z and enwiki-20100312-pages-meta-history.xml.7z
To: Jamie Morken jmorken@shaw.ca, xmldatadumps-admin-l@lists.wikimedia.org
I've tried filtering and plotting empty-text revisions using the following criteria: the comment starts with '/*' (section edits) AND the edit is not an IP edit. The idea is that section edits generally do not result in deletion of the complete article text, and registered users tend to vandalize less. Consequently, we can get a rough picture of which revisions' text was lost in the backup.
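For anyone who wants to reproduce the filter, here is a minimal Python sketch of the criteria above. It is not the script actually used for the plots. The function names and the way a revision is represented (plain `text`, `comment`, and `contributor` strings, with `contributor` holding either a username or an IP address) are assumptions made for illustration, as is treating whitespace-only text as empty.

```python
import ipaddress


def is_ip_edit(contributor):
    """Return True if the contributor string parses as an IPv4/IPv6 address.

    Assumption: anonymous edits are identified by an IP address in place
    of a username.
    """
    try:
        ipaddress.ip_address(contributor)
        return True
    except ValueError:
        return False


def is_suspect_revision(text, comment, contributor):
    """Return True for an empty-text section edit by a registered user.

    Mirrors the criteria in the message: empty text, comment starting
    with '/*', and not an IP edit. Treating None or whitespace-only
    text as "empty" is an assumption.
    """
    text_is_empty = not (text or "").strip()
    is_section_edit = (comment or "").startswith("/*")
    return text_is_empty and is_section_edit and not is_ip_edit(contributor)


# Hypothetical example revisions, not taken from the dump.
print(is_suspect_revision("", "/* History */ fix typo", "ExampleUser"))
print(is_suspect_revision("", "/* History */ fix typo", "192.0.2.1"))
```

Revisions that pass this check are candidates for lost text, not proof of it, since a registered user can still blank a page through a section edit.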
Resulting plots are attached for both [enwiki-20100130 31.9 GB] and [enwiki-20100312 15.8 GB] files.
If anybody is interested in the raw filtered data, here's a link to the zipped .csv(s): http://76.126.237.67/tmp/missing.revisions.enwiki-20100xx.7z The .csv files have the following format: 'pageid, revisionid, unixtime, pagetitle'.
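A rough Python sketch for bucketing those rows by month, which makes any clustering in time easy to see. It assumes the four-column layout given above, no header row, UTC timestamps, and that an unquoted comma inside a page title simply produces extra fields; the function name `count_by_month` is hypothetical.

```python
import csv
from collections import Counter
from datetime import datetime, timezone


def count_by_month(lines):
    """Count rows per 'YYYY-MM' month from 'pageid, revisionid, unixtime, pagetitle' rows.

    `lines` is any iterable of CSV text lines (an open file works).
    Rows with fewer than four fields or a non-numeric unixtime are skipped.
    """
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) < 4:
            continue
        try:
            unixtime = int(row[2])  # int() tolerates surrounding spaces
        except ValueError:
            continue  # e.g. an unexpected header line
        stamp = datetime.fromtimestamp(unixtime, tz=timezone.utc)
        counts[stamp.strftime("%Y-%m")] += 1
    return counts


# Hypothetical sample rows, not taken from the real .csv files.
sample = [
    "12,345,1107216000,Example",
    "12,346,1107302400,Example",
    "13,400,1112313600,Other, title",
]
for month, n in sorted(count_by_month(sample).items()):
    print(month, n)
```

With a real file, something like `count_by_month(open(path, newline="", encoding="utf-8"))` should work, where `path` points to one of the extracted .csv files.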
-- Cheers, Dmitry