* 250 Gb free disk space (for intermediate data & dump);
* ~a week to pre-process the dump (on a modern desktop);
* ~3 hours to do a simple run (e.g. calculate the list like I did).
Dump preprocessing basically consists of extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs and pickling the results (along with other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (a regular diff is way too slow and doesn't handle copy editing well).
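In rough terms the per-page loop looks something like this (a simplified sketch, not the actual code; the names are made up and difflib only stands in for the custom diff algorithm):

    import difflib
    import hashlib
    import cPickle as pickle

    def preprocess_page(title, revisions, out_file):
        # revisions: (rev_id, editor, text) tuples, oldest first
        prev_text = u''
        records = []
        for rev_id, editor, text in revisions:
            md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
            # difflib is only a stand-in for the custom diff here
            diff = list(difflib.unified_diff(prev_text.splitlines(),
                                             text.splitlines()))
            records.append({'rev_id': rev_id, 'editor': editor,
                            'md5': md5, 'diff': diff})
            prev_text = text
        pickle.dump({'title': title, 'revisions': records},
                    out_file, pickle.HIGHEST_PROTOCOL)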
It needs a lot of memory if one wants to calculate/hold stats for every editor/page (4Gb minimum, 8Gb recommended, 24Gb+ preferred).
But obviously one can filter out a data subset or even work on a
single page.
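For instance, crunching a single preprocessed page could look roughly like this (assuming the one-pickle-per-page layout from the sketch above; the path and field names are made up):

    import cPickle as pickle

    def load_page(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # hypothetical path, one pickle per page
    page = load_page('preprocessed/Anarchism.pickle')
    edits_per_editor = {}
    for rev in page['revisions']:
        editor = rev['editor']
        edits_per_editor[editor] = edits_per_editor.get(editor, 0) + 1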
Required system/libraries:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia trunk
  (http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/);
* OrderedDict (built into Python 2.7, or
  http://pypi.python.org/pypi/ordereddict/ for 2.6);
* 7-Zip (the command-line 7za); a usage sketch for the last two items
  follows below.
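For those last two items the usual wiring looks roughly like this (a sketch; the import fallback is the standard pattern for the ordereddict backport, and the dump filename is just an example):

    # OrderedDict: stdlib on Python 2.7+, PyPI backport on 2.6
    try:
        from collections import OrderedDict
    except ImportError:
        from ordereddict import OrderedDict

    # 7za's -so switch streams the decompressed dump to stdout, so the
    # huge uncompressed XML never has to sit on disk in full
    # (the filename below is just an example)
    import subprocess
    proc = subprocess.Popen(
        ['7za', 'e', '-so', 'enwiki-pages-meta-history.xml.7z'],
        stdout=subprocess.PIPE)
    for line in proc.stdout:
        pass  # feed each line into the XML parser here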