* 250 Gb free disk space (for intermediate data & dump);
* ~a week to pre-process the dump (on a modern desktop);
* ~3 hours to do a simple run (e.g. calculate the list like I did).
Dump preprocessing basically consists of extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs and pickling the results (along with other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (a regular diff is way too slow and doesn't handle copy editing well).
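In rough terms the per-page loop looks something like this (a simplified sketch, not the actual code; the names are made up and difflib only stands in for the custom diff algorithm):

    import difflib
    import hashlib
    import cPickle as pickle

    def preprocess_page(title, revisions, out_file):
        # revisions: (rev_id, editor, text) tuples, oldest first
        prev_text = u''
        records = []
        for rev_id, editor, text in revisions:
            md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
            # difflib is only a stand-in for the custom diff here
            diff = list(difflib.unified_diff(prev_text.splitlines(),
                                             text.splitlines()))
            records.append({'rev_id': rev_id, 'editor': editor,
                            'md5': md5, 'diff': diff})
            prev_text = text
        pickle.dump({'title': title, 'revisions': records},
                    out_file, pickle.HIGHEST_PROTOCOL)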
It needs a lot of memory if one wants to calculate/hold stats for every editor/page (4Gb minimum, 8Gb recommended, 24Gb+ preferred).
But obviously one can filter out a data subset or even work on a
single page.
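For instance, crunching a single preprocessed page could look roughly like this (assuming the one-pickle-per-page layout from the sketch above; the path and field names are made up):

    import cPickle as pickle

    def load_page(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # hypothetical path, one pickle per page
    page = load_page('preprocessed/Anarchism.pickle')
    edits_per_editor = {}
    for rev in page['revisions']:
        editor = rev['editor']
        edits_per_editor[editor] = edits_per_editor.get(editor, 0) + 1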
Required system/libraries:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia trunk
  (http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/);
* OrderedDict (built into Python 2.7, or
  http://pypi.python.org/pypi/ordereddict/ for 2.6);
* 7-Zip (the command-line 7za); a usage sketch for the last two items
  follows below.
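For those last two items the usual wiring looks roughly like this (a sketch; the import fallback is the standard pattern for the ordereddict backport, and the dump filename is just an example):

    # OrderedDict: stdlib on Python 2.7+, PyPI backport on 2.6
    try:
        from collections import OrderedDict
    except ImportError:
        from ordereddict import OrderedDict

    # 7za's -so switch streams the decompressed dump to stdout, so the
    # huge uncompressed XML never has to sit on disk in full
    # (the filename below is just an example)
    import subprocess
    proc = subprocess.Popen(
        ['7za', 'e', '-so', 'enwiki-pages-meta-history.xml.7z'],
        stdout=subprocess.PIPE)
    for line in proc.stdout:
        pass  # feed each line into the XML parser here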