Yes. It is fairly easy to produce the list limited to a time period, or any other custom stats (e.g. 'reverted edit ratios' for anonymous users, etc). It takes just several hours of processing. But it is limited by the time frame of the most recent database dump, which for en-wiki is 2010/01/30. Send your complaints to xmldatadumps-l (xmldatadumps-l@lists.wikimedia.org) ;) .
By the way, I've posted a (somewhat cleaned-up) Python script that I used to calculate that list. It's available here: http://code.google.com/p/pymwdat/
Processing the en-wiki dump requires:
* a 31 GB download (enwiki-20100130-pages-meta-history.xml.7z);
* 250 GB of free disk space (for intermediate data & the dump);
* about a week to pre-process the dump (on a modern desktop);
* about 3 hours to do a simple run (e.g. calculate a list like I did).
Dump preprocessing is basically extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs and pickling the results (along with other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (regular diff is way too slow and doesn't handle copy editing well).
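To illustrate why per-revision MD5s are useful, here is a minimal sketch of counting "identity reverts" per page, i.e. revisions whose text hashes to the same MD5 as an earlier revision of that page. This is not the pymwdat code. It assumes Python 3, and the function name `count_identity_reverts`, the `(title, timestamp, text)` tuple layout and the `since` parameter are made up for illustration. It also assumes that this definition of a revert is good enough for rough stats, and that timestamps are ISO 8601 strings as found in the dump, so plain string comparison orders them correctly.

```python
import hashlib
from collections import Counter


def count_identity_reverts(revisions, since=None):
    """Count revisions that restore an earlier text of the same page.

    revisions: iterable of (title, timestamp, text) tuples, in chronological
               order within each page (hypothetical layout).
    since:     optional ISO 8601 timestamp; only reverts made at or after it
               are counted, while earlier revisions still feed the history.
    """
    seen = {}   # title -> set of MD5 digests seen so far
    last = {}   # title -> digest of the previous revision
    reverts = Counter()
    for title, timestamp, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        digests = seen.setdefault(title, set())
        # Same digest as the previous revision is a null edit, not a revert.
        if digest in digests and last.get(title) != digest:
            if since is None or timestamp >= since:
                reverts[title] += 1
        digests.add(digest)
        last[title] = digest
    return reverts


sample = [
    ("A", "2009-01-01T00:00:00Z", "good"),
    ("A", "2009-01-02T00:00:00Z", "vandalism"),
    ("A", "2009-01-03T00:00:00Z", "good"),   # restores the first revision
    ("A", "2010-01-01T00:00:00Z", "spam"),
    ("A", "2010-01-02T00:00:00Z", "good"),   # restores it again
]
print(count_identity_reverts(sample).most_common())
print(count_identity_reverts(sample, since="2010-01-01T00:00:00Z").most_common())
```

Passing `since` is one way to limit the list to reverts within a period. Hashing only catches exact restorations; partial reverts need the diffs.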
It needs a lot of memory if one wants to calculate/hold stats for every editor/page (4 GB minimum, 8 GB recommended, 24 GB+ preferred). But obviously one can filter out a data subset or even work on a single page.
System/library requirements:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia/Trunk ( http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ );
* OrderedDict (available in Python 2.7 or from http://pypi.python.org/pypi/ordereddict/);
* 7-Zip (command line 7za).
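For anyone who just wants to peek at the dump without the full toolchain, here is a rough sketch of streaming revisions straight out of the .7z archive. It is not how pymwdat does it. It assumes Python 3, that `7za` is on the PATH, and that the dump uses the usual `<page>`/`<title>`/`<revision>`/`<timestamp>`/`<text>` layout. The helper names `open_dump`, `iter_revisions` and `local_name` are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def open_dump(path):
    """Return a byte stream of the decompressed dump ('7za e -so' writes to stdout)."""
    proc = subprocess.Popen(["7za", "e", "-so", path], stdout=subprocess.PIPE)
    return proc.stdout


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (title, timestamp, text) for each revision, without loading the whole file."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            timestamp, text = None, ""
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "timestamp":
                    timestamp = child.text
                elif child_name == "text":
                    text = child.text or ""
            yield title, timestamp, text
            elem.clear()  # drop the revision text once it has been consumed
        elif name == "page":
            # Emptied elements stay attached to the root; a production
            # version would prune the root as well.
            elem.clear()


# Usage (needs the real archive and 7za):
# for title, timestamp, text in iter_revisions(open_dump("enwiki-20100130-pages-meta-history.xml.7z")):
#     ...
```

Because `iter_revisions` takes any file-like object, it can be tried on a small uncompressed XML export first.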
-- Dmitry
On Thu, Aug 19, 2010 at 8:46 AM, John Vandenberg jayvdb@gmail.com wrote:
On Sat, Aug 14, 2010 at 6:12 AM, Dmitry Chichkov dchichkov@gmail.com wrote:
If anybody is interested, I've made a list of 'most reverted pages' in the english wikipedia based on the analysis of the enwiki-20100130 dump. Here is the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt
Lovely!
This could be used to add semi-protection or pending-changes to reduce the amount of unnecessary work.
Is it easy to limit this to reverts within a period, such as the last 12 months?
It would also be useful to filter out irregular edit-wars, or pages which were subject to frequent reverts, but have become stable.
-- John Vandenberg
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l