Yes, working with large data sets is fun, and there are always surprises. For example, the top of the list of 'words most reverted by trusted users' is not the expected infantile type of thing either. I haven't done the analysis on the full dump yet, but on a subset (the full histories of the articles in the PAN 10 LAB test set) the following tokens came out on top. They are sorted by chi-square; note that this is very preliminary, and the tokenization/regularization might have been wrong:

token                 chi-sq         regular-diff-tok-cnt  revert-diff-tok-cnt
Image:Example.jpg|Ca  87959701.9043  113                   7568
[[Media:Example.ogg]  5549492.56549  62                    2196
title]][http://www.e  606182.771025  0                     363
aaaaaaaaaaaaaaaaaaaa  305908.640902  0                     365
[http://youtube.com/  253267.5237    189                   407
pooooooooooooooooooo  214921.014803  0                     375
you                   154597.596739  18655                 102007
ffffffffffffffffffff  129822.419517  1                     238
value="transparent">  129575.702482  1                     168
____________________  126503.143626  23                    166
language|Macedonian]  123467.452157  121                   164
hhhhhhhhhhhhhhhhhhhh  119613.359035  0                     280
!!!!!!!!!!!!!!!!!!!!  118479.373501  5                     686
AAAAAAAAAAAAAAAAAAAA  114581.582068  2                     158
oooooooooooooooooooo  110263.074451  0                     155
i                     109590.406785  2620                  55971
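
For anyone who wants to try a similar ranking: the exact statistic behind the chi-sq column is not spelled out here, so the following is only a rough sketch of one common choice, a Pearson chi-square on a 2x2 table (this token vs. all other tokens, regular diffs vs. revert diffs). The function name and the corpus totals are assumptions made up for illustration, so the printed value is not expected to match the table.

```python
def chi_square_2x2(tok_regular, tok_revert, total_regular, total_revert):
    """Pearson chi-square (no continuity correction) for one token.

    Contingency table:
                      regular diffs   revert diffs
      this token           a               b
      other tokens         c               d
    """
    a, b = tok_regular, tok_revert
    c, d = total_regular - a, total_revert - b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (e.g. the token never occurs): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Token counts are taken from the "you" row of the table.
# The corpus totals below are HYPOTHETICAL placeholders, not real figures.
TOTAL_REGULAR_TOKENS = 10_000_000
TOTAL_REVERT_TOKENS = 5_000_000

print(chi_square_2x2(18655, 102007, TOTAL_REGULAR_TOKENS, TOTAL_REVERT_TOKENS))
```

A token that is equally frequent in both kinds of diffs scores 0; the more lopsided the split, the higher the score.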

-- Cheers, Dmitry




On Fri, Aug 13, 2010 at 4:06 PM, Luca de Alfaro <luca@dealfaro.org> wrote:
Thanks, this is great fun!  As an Italian, let me quote: 

(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')

It is absolutely great to see that Italian Renaissance (with Incas) is one of the few cultural topics that make it as high in the list as the usual excrement-sex-infantile type of things!!

Luca

On Fri, Aug 13, 2010 at 1:12 PM, Dmitry Chichkov <dchichkov@gmail.com> wrote:
If anybody is interested, I've made a list of the 'most reverted pages' in the English Wikipedia, based on an analysis of the enwiki-20100130 dump. Here is the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Kept only pages with more than 1000 revisions;
** Kept only pages with a revert ratio > 0.3;
* Sorted by revert ratio, descending.
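
The criteria above can be sketched roughly as follows. This is an illustration only, not the code that produced the list: the PageStats type and the most_reverted function are hypothetical names, and the revert ratio is assumed to be reverts divided by revisions. The counts for 'Gun' and 'Incas' come from the tuples quoted elsewhere in this thread, on the assumption that the first two fields are the revision and revert counts (their quotient matches the listed ratio); the other two pages are made up.

```python
from typing import NamedTuple


class PageStats(NamedTuple):
    title: str
    revisions: int
    reverts: int

    @property
    def revert_ratio(self) -> float:
        # Assumed definition: share of a page's revisions that are reverts.
        return self.reverts / self.revisions if self.revisions else 0.0


def most_reverted(pages, min_revisions=1000, min_ratio=0.3):
    """Keep heavily edited, heavily reverted pages; highest ratio first."""
    kept = [
        p for p in pages
        if p.revisions > min_revisions and p.revert_ratio > min_ratio
    ]
    return sorted(kept, key=lambda p: p.revert_ratio, reverse=True)


pages = [
    PageStats("Incas", 1105, 469),
    PageStats("Gun", 2745, 1166),
    PageStats("Hypothetical short page", 200, 150),   # made up: too few revisions
    PageStats("Hypothetical calm page", 5000, 100),   # made up: ratio too low
]

for page in most_reverted(pages):
    print(f"{page.revert_ratio:.4f}  {page.title}")
```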

A page revision is considered a revert if an earlier revision of the same page has a matching MD5 checksum.
BTW, if anybody needs it, the Python code that identifies reverts, revert wars, self-reverts, etc. is available (LGPL).
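
To make the MD5 criterion concrete, here is a minimal sketch. It is not the LGPL code mentioned above, just an illustration; the function name find_reverts and the toy history are assumptions. It takes the revision texts of a single page in chronological order and reports each revision whose checksum was already seen earlier in that page's history.

```python
import hashlib


def find_reverts(revision_texts):
    """Return (revert_index, original_index) pairs for one page's history.

    A revision counts as a revert when an earlier revision of the same
    page has the same MD5 checksum, i.e. identical text.
    """
    first_seen = {}   # MD5 hex digest -> index of first revision with that text
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            reverts.append((index, first_seen[digest]))
        else:
            first_seen[digest] = index
    return reverts


# Toy history: revision 3 restores revision 1, revision 5 restores revision 0.
history = ["A", "A B", "vandalism", "A B", "A B C", "A"]
print(find_reverts(history))  # [(3, 1), (5, 0)]
```

Note that, read literally, this criterion also counts two consecutive identical revisions as a revert; whether such cases should be excluded is a design choice this sketch leaves open.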

-- Regards, Dmitry

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


