Hi Erik,

My query only looks at the last 30 days.  See the "(in the last 30 days)" suffix on the title :)  This explains the discrepancy in our counts. 

-Aaron

On Thu, Jun 1, 2017 at 3:54 PM, Erik Zachte <ezachte@wikimedia.org> wrote:
Here is an experiment I did about two months ago with First Normal Form (yikes!) and GNU tools. I just never posted it.

I collected all edits from all wikis from recent full-history stub dumps with a Perl script, which took some 30 hours.
The total file for all Wikimedia wikis is 240 GB uncompressed, 2.94 billion lines.

Each edit yields a record with wiki, timestamp, namespace, user name, article title and then some.
Querying that file with grep, cut, sort, uniq and wc is pretty straightforward, remarkably fast (30-40 min) and rather versatile.

A) Top editors in namespace 'Module' (=828) on English Wikipedia:

grep -P "^enwiki," EditsTimestampsTitlesAll.csv | cut -d ',' -f 6,9 | grep -P "^828," | sort | uniq -c | sort -rn | head -n 500 > top500_edits_enwiki_namespace_828.txt

should yield figures similar to Aaron's Quarry query [2], but in fact they are way higher,
e.g. the top editor (according to Quarry) for namespace 'Module' (=828), user 'Mehmedsons', had 1251 edits as of two months ago (confirmed by [3]), against Quarry's 234.
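
For anyone who wants to check the field logic of the pipeline in A) before running it on the 240 GB file, here is a minimal sketch on a toy file. The sample rows, the user names Alice and Bob, and the filler columns f2..f8 are made up; the only things taken from the command above are that field 1 is the wiki, field 6 the namespace and field 9 the user name. It also assumes that no field contains a comma, which cut cannot detect.

```shell
# Toy stand-in for EditsTimestampsTitlesAll.csv (hypothetical rows, 9 columns).
sample=$(mktemp)
cat > "$sample" <<'EOF'
enwiki,f2,f3,f4,f5,828,f7,f8,Alice
enwiki,f2,f3,f4,f5,828,f7,f8,Bob
enwiki,f2,f3,f4,f5,828,f7,f8,Alice
enwiki,f2,f3,f4,f5,0,f7,f8,Alice
dewiki,f2,f3,f4,f5,828,f7,f8,Alice
EOF

# Same steps as in A); grep -E is used instead of -P because these
# patterns need no Perl regex features. sed only strips uniq's padding.
top_editors=$(grep -E '^enwiki,' "$sample" | cut -d ',' -f 6,9 | grep -E '^828,' \
  | sort | uniq -c | sort -rn | sed 's/^ *//')

echo "$top_editors"
rm -f "$sample"
```

On the toy file this prints "2 828,Alice" followed by "1 828,Bob": the namespace 0 edit and the dewiki edit are filtered out, as intended.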

[1] https://stats.wikimedia.org/archive/scan_edits/edits_namespace_828_enwiki.txt
[2] https://quarry.wmflabs.org/query/17556
[3] https://en.wikipedia.org/w/index.php?title=Special:Contributions&contribs=user&target=Mehmedsons&namespace=828&tagfilter

B) Similarly, but for all wikis and namespaces: total edits per wiki, per namespace, per user (filter as you like):

cut -d ',' -f 1,6,9 EditsTimestampsTitlesAll.csv | sort -t\, -k 1,1 -k 2,2n -k 3,3 | uniq -c > edits_per_wiki_namespace_user.txt

[4] https://stats.wikimedia.org/archive/scan_edits/edits_per_wiki_namespace_user.zip
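
The sort keys in B) can be tried out on a toy file as well. This is only a sketch: the rows and user names are invented, the column positions (1 = wiki, 6 = namespace, 9 = user) are read off the cut command above, and LC_ALL=C is my own addition to get plain byte-order sorting, not something the original command uses.

```shell
# Toy stand-in for EditsTimestampsTitlesAll.csv (hypothetical rows, 9 columns).
sample=$(mktemp)
cat > "$sample" <<'EOF'
enwiki,f2,f3,f4,f5,10,f7,f8,Bob
enwiki,f2,f3,f4,f5,2,f7,f8,Alice
dewiki,f2,f3,f4,f5,0,f7,f8,Carol
enwiki,f2,f3,f4,f5,2,f7,f8,Alice
EOF

# Sort by wiki, then namespace as a number, then user, and count duplicates.
# sed only strips the padding that uniq -c puts in front of each count.
per_user=$(cut -d ',' -f 1,6,9 "$sample" \
  | LC_ALL=C sort -t ',' -k 1,1 -k 2,2n -k 3,3 | uniq -c | sed 's/^ *//')

echo "$per_user"
rm -f "$sample"
```

The numeric key "-k 2,2n" is what puts namespace 2 before namespace 10 here; a plain text sort would order them the other way round.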

C) Most edited articles across all wikis:

Here are the top 10; for the top 10,000 see [5].

1259058 enwiki,4,Wikipedia,Administrator intervention against vandalism
 955842 enwiki,4,Wikipedia,Administrators' noticeboard/Incidents
 788061 enwiki,2,User,Cyde/List of candidates for speedy deletion/Subpage
 654559 enwiki,4,Wikipedia,Sandbox
 578992 metawiki,2,User,COIBot/LinkReports
 446429 dewiki,4,Wikipedia,Vandalismusmeldung
 434556 enwiki,4,Wikipedia,Requests for page protection
 433781 enwiki,4,Wikipedia,Reference desk/Science
 390821 commonswiki,4,Commons,Quality images candidates/candidate list
 369557 enwiki,4,Wikipedia,Help desk


[5] https://stats.wikimedia.org/archive/scan_edits/top_10000_most_edited_articles.txt
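
No command is given for C), so the following is a guess at how such a ranking could be produced, not the one actually used. It assumes (hypothetically) that fields 7 and 8 hold the namespace name and the page title, which the thread does not confirm, and that no field contains an unescaped comma. The rows are made up.

```shell
# Toy stand-in for EditsTimestampsTitlesAll.csv (hypothetical rows and layout).
sample=$(mktemp)
cat > "$sample" <<'EOF'
enwiki,f2,f3,f4,f5,4,Wikipedia,Sandbox,Alice
enwiki,f2,f3,f4,f5,4,Wikipedia,Sandbox,Bob
enwiki,f2,f3,f4,f5,4,Wikipedia,Help desk,Alice
dewiki,f2,f3,f4,f5,4,Wikipedia,Vandalismusmeldung,Carol
enwiki,f2,f3,f4,f5,4,Wikipedia,Sandbox,Carol
dewiki,f2,f3,f4,f5,4,Wikipedia,Vandalismusmeldung,Dave
EOF

# Keep wiki, namespace number, namespace name and title (assumed positions),
# count identical lines, and list the most edited pages first.
top_pages=$(cut -d ',' -f 1,6,7,8 "$sample" | sort | uniq -c | sort -rn \
  | head -n 2 | sed 's/^ *//')

echo "$top_pages"
rm -f "$sample"
```

With "head -n 10000" and the real field numbers this would give a list in the same "count wiki,namespace,name,title" shape as the top 10 above.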

-----Original Message-----
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Friday, March 24, 2017 3:13
To: Andre Klapper; analytics@lists.wikimedia.org
Subject: Re: [Analytics] Top editors in a certain namespace across sites?

We are working real hard to make cross-site querying easy from quarry, by pointing it to the new data we're working on. So we hope to have that out as soon as the new labs db servers have data for all projects. A quick question on this topic: how far back do you all need to go? Whole history for most things or will you get a lot of value out of one or two years, with further back just being nice to have?

  Original Message
From: Andre Klapper
Sent: Thursday, March 23, 2017 15:55
To: analytics@lists.wikimedia.org
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Top editors in a certain namespace across sites?

On Wed, 2017-03-22 at 17:25 -0500, Aaron Halfaker wrote:
> https://quarry.wmflabs.org/query/17556

Thanks a lot everybody for your replies and explanations!

A welcome reminder that I have to learn more about Quarry (plus find a way to also query cross-site and maybe restrict to "recent edits").

andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
