Here is an experiment I did about two months ago with First Normal Form (yikes!) and GNU
tools, which I never got around to posting until now.
I collected all edits from all wikis, using recent full-history stub dumps and a Perl
script, which took some 30 hours.
The total file for all Wikimedia wikis is 240 GB uncompressed, 2.94 billion lines.
Each edit yields a record with wiki, timestamp, namespace, user name, article title and
then some.
Querying that file with grep, cut, sort, uniq and wc is pretty straightforward, remarkably
fast (30-40 min) and rather versatile.
A) Top editors in namespace 'Module' (=828) on English Wikipedia:
  grep -P "^enwiki," EditsTimestampsTitlesAll.csv | cut -d ',' -f 6,9 |
    grep -P "^828," | sort | uniq -c | sort -rn | head -n 500 \
    > top500_edits_enwiki_namespace_828.txt
This should yield figures similar to Aaron's Quarry query [2], but in fact mine (see [1])
are way higher.
For example, the top editor (according to Quarry) in namespace 'Module' (=828), user
'Mehmedsons', had 1251 edits two months ago instead of Quarry's 234, and the higher
figure is confirmed by [3].
[1] https://stats.wikimedia.org/archive/scan_edits/edits_namespace_828_enwiki.t…
[2] https://quarry.wmflabs.org/query/17556
[3] https://en.wikipedia.org/w/index.php?title=Special:Contributions&contri…
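To rerun query A for other wikis or namespaces, the pipeline can be wrapped in a small function. This is only a sketch: the name top_editors is made up for illustration, and it assumes (as the cut call in A does) that field 1 is the wiki code, field 6 the namespace number and field 9 the user name, with no literal commas in any field before the 9th.

```shell
# top_editors WIKI NAMESPACE FILE [N]
# Hypothetical helper, not part of the original scripts.
# Assumed CSV layout (taken from the cut -f 6,9 call in query A):
#   field 1 = wiki code, field 6 = namespace number, field 9 = user name.
top_editors() {
    wiki=$1
    ns=$2
    file=$3
    n=${4:-500}    # default to the top 500, as in query A
    # Exact match on wiki and namespace, then count edits per user name.
    awk -F, -v w="$wiki" -v ns="$ns" \
        '$1 == w && $6 == ns { print $9 }' "$file" |
        sort | uniq -c | sort -rn | head -n "$n"
}
```

Called as top_editors enwiki 828 EditsTimestampsTitlesAll.csv, this should give the same counts as query A, minus the constant "828," prefix on each line.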
B) Similarly, but for all wikis and namespaces: total edits per wiki, per namespace, per
user (filter as you like):
  cut -d ',' -f 1,6,9 EditsTimestampsTitlesAll.csv |
    sort -t\, -k 1,1 -k 2,2n -k 3,3 |
    uniq -c > edits_per_wiki_namespace_user.txt
[4] https://stats.wikimedia.org/archive/scan_edits/edits_per_wiki_namespace_use…
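As one way to "filter as you like", and closer to the original question in this thread (top editors in one namespace across sites), the output of B can be folded over all wikis. This is a rough sketch, not something I ran on the real file: the function name top_editors_all_wikis is hypothetical, it assumes every line looks like uniq -c output (a count, a space, then wiki,namespace,user), and it merges accounts purely by user name, which is only right where the same name is the same person on every wiki.

```shell
# top_editors_all_wikis NAMESPACE FILE [N]
# Hypothetical helper. Assumed input: lines as written by uniq -c in query B,
#   "<count> <wiki>,<namespace>,<user>"
top_editors_all_wikis() {
    ns=$1
    file=$2
    n=${3:-500}
    awk -v ns="$ns" '
    {
        count = $1
        rec = $0
        sub(/^ *[0-9]+ /, "", rec)   # drop the count that uniq -c prepended
        split(rec, f, ",")           # f[1]=wiki, f[2]=namespace, f[3]=user
        if (f[2] == ns) total[f[3]] += count
    }
    END { for (u in total) print total[u] "\t" u }' "$file" |
        sort -rn | head -n "$n"
}
```

For example, top_editors_all_wikis 828 edits_per_wiki_namespace_user.txt 100 would list the 100 user names with the most 'Module' edits summed over all wikis, as tab-separated count and name.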
C) Most edited articles across all wikis.
Here are the top 10; for the top 10,000 see [5]:
1259058 enwiki,4,Wikipedia,Administrator intervention against vandalism
955842 enwiki,4,Wikipedia,Administrators' noticeboard/Incidents
788061 enwiki,2,User,Cyde/List of candidates for speedy deletion/Subpage
654559 enwiki,4,Wikipedia,Sandbox
578992 metawiki,2,User,COIBot/LinkReports
446429 dewiki,4,Wikipedia,Vandalismusmeldung
434556 enwiki,4,Wikipedia,Requests for page protection
433781 enwiki,4,Wikipedia,Reference desk/Science
390821 commonswiki,4,Commons,Quality images candidates/candidate list
369557 enwiki,4,Wikipedia,Help desk
[5] https://stats.wikimedia.org/archive/scan_edits/top_10000_most_edited_articl…
-----Original Message-----
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Friday, March 24, 2017 3:13
To: Andre Klapper; analytics(a)lists.wikimedia.org
Subject: Re: [Analytics] Top editors in a certain namespace across sites?
We are working real hard to make cross-site querying easy from quarry, by pointing it to
the new data we're working on. So we hope to have that out as soon as the new labs db
servers have data for all projects. A quick question on this topic: how far back do you
all need to go? Whole history for most things or will you get a lot of value out of one or
two years, with further back just being nice to have?
Original Message
From: Andre Klapper
Sent: Thursday, March 23, 2017 15:55
To: analytics(a)lists.wikimedia.org
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest
in Wikipedia and analytics.
Subject: Re: [Analytics] Top editors in a certain namespace across sites?
On Wed, 2017-03-22 at 17:25 -0500, Aaron Halfaker wrote:
Thanks a lot everybody for your replies and explanations!
A welcome reminder that I have to learn more about Quarry (plus find a way to also query
cross-site and maybe restrict to "recent edits").
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics