Here is an experiment I did about two months ago with First Normal Form (yikes!) and GNU tools. I just never posted it until now.
I collected all edits from all wikis using recent full-history stub dumps with a Perl script, which took some 30 hours. The total file for all Wikimedia wikis is 240 GB uncompressed, 2.94 billion lines.
Each edit yields a record with wiki, timestamp, namespace, user name, article title and then some. Querying that file with grep, cut, sort, uniq and wc is pretty straightforward, remarkably fast (30-40 min) and rather versatile.
A) Top editors in namespace 'Module' (=828) on English Wikipedia:
grep -P "^enwiki," EditsTimestampsTitlesAll.csv | cut -d ',' -f 6,9 | grep -P "^828," | sort | uniq -c | sort -rn | head -n 500 > top500_edits_enwiki_namespace_828.txt
This should yield figures similar to Aaron's quarry query [2], but in fact they are way higher. For example, the top editor (according to quarry) for namespace 'Module' (=828), user 'Mehmedsons', had 1251 edits two months ago instead of quarry's 234, which is confirmed by [3].
[1] https://stats.wikimedia.org/archive/scan_edits/edits_namespace_828_enwiki.tx...
[2] https://quarry.wmflabs.org/query/17556
[3] https://en.wikipedia.org/w/index.php?title=Special:Contributions&contrib...
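To make the stages of the pipeline in A) easier to follow, here is a minimal sketch that runs the same chain on a tiny made-up sample. Only columns 1 (wiki), 6 (namespace) and 9 (user) follow the layout implied by the command above. The other columns are 'x' placeholders, and the user names and the `sample` and `top_editors` variables are hypothetical. Plain grep is used instead of grep -P because the patterns here need no Perl syntax.

```shell
#!/bin/sh
# Hypothetical sample data: only columns 1, 6 and 9 are meaningful,
# all other columns are placeholders.
sample=$(mktemp)
cat > "$sample" <<'EOF'
enwiki,x,x,x,x,828,x,x,Alice
enwiki,x,x,x,x,828,x,x,Bob
enwiki,x,x,x,x,828,x,x,Alice
enwiki,x,x,x,x,0,x,x,Alice
dewiki,x,x,x,x,828,x,x,Carol
EOF

# 1. keep enwiki rows  2. keep namespace and user columns
# 3. keep namespace 828  4. count edits per user  5. rank by count
top_editors=$(grep "^enwiki," "$sample" \
  | cut -d ',' -f 6,9 \
  | grep "^828," \
  | sort | uniq -c | sort -rn \
  | sed 's/^ *//')   # strip the padding that uniq -c adds

rm -f "$sample"
printf '%s\n' "$top_editors"
```

Note that cut splits on every comma, so this approach presumably only works as long as none of the first nine fields contains a comma itself.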
B) Similarly, but for all wikis and namespaces: total edits per wiki per namespace per user (filter as you like):
cut -d ',' -f 1,6,9 EditsTimestampsTitlesAll.csv | sort -t, -k 1,1 -k 2,2n -k 3,3 | uniq -c > edits_per_wiki_namespace_user.txt
[4] https://stats.wikimedia.org/archive/scan_edits/edits_per_wiki_namespace_user...
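As a rough illustration of what the sort keys in B) do, the sketch below applies the same cut, sort and uniq chain to a few made-up rows. The sample rows, user names and the `sample` and `per_user` variables are hypothetical. LC_ALL=C is an addition here, only to keep the ordering independent of the locale.

```shell
#!/bin/sh
# Hypothetical sample data: columns 1 (wiki), 6 (namespace) and 9 (user)
# are meaningful, the rest are placeholders.
sample=$(mktemp)
cat > "$sample" <<'EOF'
enwiki,x,x,x,x,10,x,x,Alice
enwiki,x,x,x,x,2,x,x,Bob
dewiki,x,x,x,x,828,x,x,Carol
enwiki,x,x,x,x,2,x,x,Bob
EOF

# Sort by wiki (text), then namespace (numeric, so 2 comes before 10),
# then user (text); uniq -c then counts identical adjacent lines.
per_user=$(cut -d ',' -f 1,6,9 "$sample" \
  | LC_ALL=C sort -t, -k 1,1 -k 2,2n -k 3,3 \
  | uniq -c \
  | sed 's/^ *//')   # strip the padding that uniq -c adds

rm -f "$sample"
printf '%s\n' "$per_user"
```

Without the n modifier on the second key, namespace 10 would sort before namespace 2. The resulting file can then be filtered further, for example with grep on the wiki name.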
C) Most edited articles all over:
Here are the top 10; for the top 10,000 see [5].
1259058 enwiki,4,Wikipedia,Administrator intervention against vandalism
955842 enwiki,4,Wikipedia,Administrators' noticeboard/Incidents
788061 enwiki,2,User,Cyde/List of candidates for speedy deletion/Subpage
654559 enwiki,4,Wikipedia,Sandbox
578992 metawiki,2,User,COIBot/LinkReports
446429 dewiki,4,Wikipedia,Vandalismusmeldung
434556 enwiki,4,Wikipedia,Requests for page protection
433781 enwiki,4,Wikipedia,Reference desk/Science
390821 commonswiki,4,Commons,Quality images candidates/candidate list
369557 enwiki,4,Wikipedia,Help desk
[5] https://stats.wikimedia.org/archive/scan_edits/top_10000_most_edited_article...
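The command for C) is not shown above. As a sketch only, and not necessarily the command that was actually used, a ranking like this could be produced with the same count-and-rank idiom. It assumes the wiki, namespace and title columns have already been extracted into lines of the form shown in the list. The sample rows and the `pages` and `most_edited` variables are hypothetical.

```shell
#!/bin/sh
# Hypothetical input: one line per edit, already reduced to
# wiki,namespace number,namespace name,title.
pages=$(mktemp)
cat > "$pages" <<'EOF'
enwiki,4,Wikipedia,Sandbox
enwiki,4,Wikipedia,Help desk
enwiki,4,Wikipedia,Sandbox
dewiki,4,Wikipedia,Vandalismusmeldung
enwiki,4,Wikipedia,Sandbox
enwiki,4,Wikipedia,Help desk
EOF

# Count identical lines, rank by count, keep the top 2.
most_edited=$(sort "$pages" | uniq -c | sort -rn | head -n 2 \
  | sed 's/^ *//')   # strip the padding that uniq -c adds

rm -f "$pages"
printf '%s\n' "$most_edited"
```

Because whole lines are counted here, titles that contain spaces do not need special handling at this stage.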
-----Original Message-----
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Friday, March 24, 2017 3:13
To: Andre Klapper; analytics@lists.wikimedia.org
Subject: Re: [Analytics] Top editors in a certain namespace across sites?
We are working real hard to make cross-site querying easy from quarry, by pointing it to the new data we're working on. So we hope to have that out as soon as the new labs db servers have data for all projects. A quick question on this topic: how far back do you all need to go? Whole history for most things or will you get a lot of value out of one or two years, with further back just being nice to have?
Original Message
From: Andre Klapper
Sent: Thursday, March 23, 2017 15:55
To: analytics@lists.wikimedia.org
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Top editors in a certain namespace across sites?
On Wed, 2017-03-22 at 17:25 -0500, Aaron Halfaker wrote:
Thanks a lot everybody for your replies and explanations!
A welcome reminder that I have to learn more about Quarry (plus find a way to also query cross-site and maybe restrict to "recent edits").
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics