If anybody is interested, I've made a list of 'most reverted pages' in the English Wikipedia, based on an analysis of the enwiki-20100130 dump. Here is the list: http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz http://wpcvn.com/enwiki-20100130.most.reverted.txt
This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Kept only pages with more than 1000 revisions;
** Kept only pages with revert ratios > 0.3;
* Sorted in descending order of revert ratio.
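The selection criteria above can be sketched as follows. This is an illustration, not the script used for the list: the function name, the (title, revisions, reverts) input shape, and the reading of the revert ratio as reverts divided by revisions are all assumptions.

```python
def rank_most_reverted(pages, min_revisions=1000, min_ratio=0.3):
    """Filter and rank pages by revert ratio.

    pages: iterable of (title, n_revisions, n_reverts) tuples
           (a hypothetical input shape chosen for this sketch).
    Returns a list of (ratio, title), highest ratio first.
    """
    ranked = []
    for title, n_revisions, n_reverts in pages:
        # Keep only pages with more than min_revisions revisions.
        if n_revisions <= min_revisions:
            continue
        # Assumed definition: share of revisions that are reverts.
        ratio = n_reverts / n_revisions
        # Keep only pages whose ratio exceeds the threshold.
        if ratio > min_ratio:
            ranked.append((ratio, title))
    # Sort in descending order of revert ratio.
    ranked.sort(reverse=True)
    return ranked
```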
A page revision is considered to be a revert if there is a previous revision with a matching MD5 checksum. BTW, if anybody needs it, the Python code that identifies reverts, revert wars, self-reverts, etc. is available (LGPL).
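A minimal sketch of that revert rule, for illustration only. It is not the LGPL code mentioned above; it assumes the revisions of one page are available as plain strings in chronological order, and the function names are made up for this example.

```python
import hashlib


def find_reverts(revision_texts):
    """Return the indices of revisions whose MD5 matches an earlier revision."""
    seen = set()      # MD5 digests of all revisions seen so far
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # Identical content existed before, so this revision restores it.
            reverts.append(index)
        seen.add(digest)
    return reverts


def revert_ratio(revision_texts):
    """Share of revisions that are reverts (assumed definition of the ratio)."""
    if not revision_texts:
        return 0.0
    return len(find_reverts(revision_texts)) / len(revision_texts)
```

Keeping only a set of digests means memory grows with the number of distinct revisions, not with their text size.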
-- Regards, Dmitry
Thanks, this is great fun! As an Italian, let me quote:
(0.42525520906166969, (7151, 3041, 59, 514, 63, 2519, 955), 'Penis')
(0.42516069788797062, (1089, 463, 29, 27, 16, 470, 84), 'Inner core')
(0.42490272373540855, (1285, 546, 11, 64, 27, 515, 122), 'Stuff')
(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')
It is absolutely great to see that Italian Renaissance (with Incas) is one of the few cultural topics that make it as high in the list as the usual excrement-sex-infantile type of things!!
Luca
On Fri, Aug 13, 2010 at 1:12 PM, Dmitry Chichkov dchichkov@gmail.com wrote:
If anybody is interested, I've made a list of 'most reverted pages' in the english wikipedia based on the analysis of the enwiki-20100130 dump. Here is the list: http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz http://wpcvn.com/enwiki-20100130.most.reverted.txt
This list was calculated using the following sampling criteria:
- All pages from the enwiki-20100130 dump;
** Filtered pages with more than 1000 revisions;
** Filtered pages with revert ratios > 0.3;
- Sorted in descending revert ratios.
Page revision is considered to be a revert if there is a previous revision with a matching MD5 checksum; BTW, if anybody needs it, the python code that identifies reverts, revert wars, self-reverts, etc is available (LGPL).
-- Regards, Dmitry
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Yes, working with large data sets is fun. There are always surprises. For example, the top of the 'words most reverted by trusted users' list is not the expected infantile type of thing either. I haven't done the analysis on the full dump yet, but on a subset (the full histories of the articles in the PAN 10 LAB test set) the following words came out on top (sorted by chi-square; note that this is very preliminary and the tokenization/regularization might have been wrong):
token, chi-sq, regular-diff-tok-cnt, revert-diff-tok-cnt
Image:Example.jpg|Ca 87959701.9043 113 7568
[[Media:Example.ogg] 5549492.56549 62 2196
title]][http://www.e 606182.771025 0 363
aaaaaaaaaaaaaaaaaaaa 305908.640902 0 365
[http://youtube.com/ 253267.5237 189 407
pooooooooooooooooooo 214921.014803 0 375
you 154597.596739 18655 102007
ffffffffffffffffffff 129822.419517 1 238
value="transparent"> 129575.702482 1 168
____________________ 126503.143626 23 166
language|Macedonian] 123467.452157 121 164
hhhhhhhhhhhhhhhhhhhh 119613.359035 0 280
!!!!!!!!!!!!!!!!!!!! 118479.373501 5 686
AAAAAAAAAAAAAAAAAAAA 114581.582068 2 158
oooooooooooooooooooo 110263.074451 0 155
i 109590.406785 2620 55971
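The exact statistic behind the chi-sq column is not spelled out in this message. One common choice, shown here purely as an assumption, is Pearson's chi-square on a 2x2 table of token counts. The class totals (regular_total, revert_total) are hypothetical inputs that the table above does not give, so this sketch cannot be expected to reproduce the listed scores.

```python
def chi_square_2x2(regular_count, revert_count, regular_total, revert_total):
    """Pearson chi-square for one token against two diff classes.

    regular_count / revert_count: occurrences of the token in regular
        diffs and in revert diffs.
    regular_total / revert_total: total token occurrences in each class
        (hypothetical inputs for this sketch).
    """
    a = regular_count                    # token, regular diffs
    b = revert_count                     # token, revert diffs
    c = regular_total - regular_count    # other tokens, regular diffs
    d = revert_total - revert_count      # other tokens, revert diffs
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the statistic is undefined; report 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

A token that is equally frequent in both classes scores 0; the score grows as its share in one class departs from its share in the other.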
-- Cheers, Dmitry
On Fri, Aug 13, 2010 at 4:06 PM, Luca de Alfaro luca@dealfaro.org wrote:
Thanks, this is great fun! As an Italian, let me quote:
(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')
It is absolutely great to see that Italian Renaissance (with Incas) is one of the few cultural topics that makes it as high in the list as the usual excrement-sex-infantile type of things!!
Luca
On Sat, Aug 14, 2010 at 6:12 AM, Dmitry Chichkov dchichkov@gmail.com wrote:
If anybody is interested, I've made a list of 'most reverted pages' in the english wikipedia based on the analysis of the enwiki-20100130 dump. Here is the list: http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz http://wpcvn.com/enwiki-20100130.most.reverted.txt
Lovely!
This could be used to add semi-protection or pending-changes to reduce the amount of unnecessary work.
Is it easy to limit this to reverts within a period, such as the last 12 months?
It would also be useful to filter out irregular edit-wars, or pages which were subject to frequent reverts, but have become stable.
-- John Vandenberg
Yes. It is fairly easy to produce the list limited to a time period, or any other custom stats (e.g. 'reverted edits ratios' for anonymous users, etc.). It's just several hours of processing. But it is limited by the date of the most recent database dump. For en-wiki that is 2010/01/30. Send your complaints to xmldatadumps-l (xmldatadumps-l@lists.wikimedia.org) ;) .
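One way to restrict the count to a period, sketched under assumptions: revisions are given as (timestamp, text) pairs in chronological order, and the function name is hypothetical. The whole history is hashed, since a revert inside the window may restore a revision from before it, but only revisions dated inside the window are counted.

```python
import hashlib
from datetime import datetime


def reverts_in_window(revisions, start, end):
    """Count revisions and reverts with start <= timestamp < end.

    revisions: iterable of (timestamp, text) pairs, oldest first.
    Returns (n_revisions_in_window, n_reverts_in_window).
    """
    seen = set()   # digests of every earlier revision, inside or outside the window
    total = 0
    reverts = 0
    for timestamp, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if start <= timestamp < end:
            total += 1
            if digest in seen:
                reverts += 1
        seen.add(digest)
    return total, reverts
```

For "the last 12 months" of this dump, the window would end at the dump date, e.g. `datetime(2009, 1, 30)` to `datetime(2010, 1, 30)`.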
By the way, I've posted the (somewhat cleaned-up) Python script that I used to calculate that list. It's available here: http://code.google.com/p/pymwdat/
Processing the en-wiki dump requires:
* 31 GB enwiki-20100130-pages-meta-history.xml.7z download;
* 250 GB free disk space (for intermediate data & dump);
* ~a week to pre-process the dump (modern desktop);
* ~3 hours to do a simple run (e.g. calculate a list like the one I did).
Dump preprocessing is basically extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs and pickling the results (along with other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (a regular diff is way too slow and doesn't handle copy editing well).
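A rough sketch of the first steps of such a pipeline (stream the dump, hash each revision, pickle the result). It is not the pymwdat code: all function names are made up, the custom diff step is left out entirely, and it assumes 7za is on the PATH and that the dump uses the usual MediaWiki export elements (page, title, revision, id, text).

```python
import hashlib
import pickle
import subprocess
import xml.etree.ElementTree as ET


def open_dump(path):
    """Stream the decompressed XML from a .7z archive via 7za's stdout."""
    proc = subprocess.Popen(["7za", "e", "-so", path], stdout=subprocess.PIPE)
    return proc.stdout


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_digests(xml_stream):
    """Yield (page_title, revision_id, md5_hex) for every revision."""
    title = None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            for child in elem:          # direct children only
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            yield title, rev_id, digest
            elem.clear()                # drop the revision text from memory
        elif name == "page":
            elem.clear()


def pickle_digests(xml_stream, out_file):
    """Pickle one record per revision to out_file; return the record count."""
    count = 0
    for record in iter_revision_digests(xml_stream):
        pickle.dump(record, out_file)
        count += 1
    return count
```

Typical use would be `pickle_digests(open_dump("enwiki-20100130-pages-meta-history.xml.7z"), open("digests.pkl", "wb"))`; the output file name is arbitrary.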
It needs memory if one wants to calculate/hold stats for every editor/page (4 GB minimum, 8 GB recommended, 24 GB+ preferred). But obviously one can filter out a data subset or even work on a single page.
Required system/libraries:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia/Trunk ( http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ );
* OrderedDict (available in Python 2.7 or http://pypi.python.org/pypi/ordereddict/);
* 7-Zip (command line 7za).
-- Dmitry
On Thu, Aug 19, 2010 at 8:46 AM, John Vandenberg jayvdb@gmail.com wrote:
Lovely!
This could be used to add semi-protection or pending-changes to reduce the amount of unnecessary work.
Is it easy to limit this to reverts within a period, such as the last 12 months?
It would also be useful to filter out irregular edit-wars, or pages which were subject to frequent reverts, but have become stable.
-- John Vandenberg