Hi,
Does anyone know of a way to look up the top editors for a certain namespace (like "Module") across all Wikimedia sites?
I'm asking as I'm wondering how to get more aware of developer activity outside of Wikimedia Git/Gerrit.
Thanks for any ideas (or pointing out a better place to ask)!
Cheers, andre
Andre Klapper, 22/03/2017 13:51:
Does anyone know of a way to look up the top editors for a certain namespace (like "Module") across all Wikimedia sites?
The easiest way is usually to run the relevant SELECT queries with a small bash script on Labs or with sql.php on tin (e.g. https://phabricator.wikimedia.org/T128326#3100126 ).
Nemo
I would discourage running long-running scripts in production (tin is reserved for production work); we have special analytics slaves for that. However, those stats probably do not require access to private data. A query like:
use <wiki you are interested in>;
SELECT rev_user_text, COUNT(*) AS edit_count
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = <desired numeric namespace>
GROUP BY rev_user_text
ORDER BY edit_count DESC
LIMIT <number of top users>;
(maybe revision_userindex instead of revision?)
...on labs would work (this is by heart, it may have mistakes)
If it is a large wiki or there are a lot of edits, the query may not finish, so you may have to do the analysis in small chunks and then aggregate the results. Maybe someone has pre-crunched stats if you need it faster?
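The "small chunks, then aggregate" idea can be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper names `id_ranges` and `merge_chunk_results` are hypothetical, and it assumes each chunk is one run of the query above restricted to a range of revision ids and run without the LIMIT clause (otherwise the partial counts cannot be summed correctly).

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # (user name, edit count) as returned by one chunked query


def id_ranges(max_id: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split 1..max_id into inclusive (start, end) ranges of at most chunk_size ids."""
    return [(start, min(start + chunk_size - 1, max_id))
            for start in range(1, max_id + 1, chunk_size)]


def merge_chunk_results(chunks: Iterable[Iterable[Row]], top_n: int) -> List[Row]:
    """Sum the per-user counts of all chunks and return the top_n users."""
    totals: Counter = Counter()
    for rows in chunks:
        for user, count in rows:
            totals[user] += count
    # most_common sorts by total count, highest first
    return totals.most_common(top_n)
```

Each (start, end) pair would go into an extra range condition on the revision id of the chunked query; only the merged totals are cut down to the top N.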
https://quarry.wmflabs.org/query/17556
On Wed, Mar 22, 2017 at 11:43 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
(I confirm my advice. I usually use Labs of course.)
Nemo
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Wed, 2017-03-22 at 17:25 -0500, Aaron Halfaker wrote:
Thanks a lot everybody for your replies and explanations!
A welcome reminder that I have to learn more about Quarry (plus find a way to also query cross-site and maybe restrict to "recent edits").
andre
The recentchanges table only contains edits from the past 30 days. And you can filter by namespace! - J
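A minimal sketch of such a recentchanges query, run here against a toy in-memory SQLite table rather than a real replica. The column names rc_user_text, rc_namespace and rc_type follow the MediaWiki schema as I remember it for 2017 and should be treated as assumptions; the function name and the sample rows are made up. The real table also records non-edit events such as log entries, which is why the sketch keeps only rc_type 0 (edit) and 1 (new page).

```python
import sqlite3

# Toy stand-in for a wiki replica; the real table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recentchanges (rc_user_text TEXT, rc_namespace INTEGER, rc_type INTEGER)"
)
conn.executemany(
    "INSERT INTO recentchanges VALUES (?, ?, ?)",
    [
        ("Alice", 828, 0),  # edit in the Module namespace
        ("Bob", 828, 1),    # page creation in the Module namespace
        ("Alice", 828, 0),
        ("Carol", 0, 0),    # edit in the main namespace
        ("Dave", 828, 3),   # log entry, not an edit
    ],
)


def top_recent_editors(connection, namespace, limit):
    """Count recent edits per user in one namespace, highest count first."""
    rows = connection.execute(
        "SELECT rc_user_text, COUNT(*) AS edits FROM recentchanges "
        "WHERE rc_namespace = ? AND rc_type IN (0, 1) "
        "GROUP BY rc_user_text ORDER BY edits DESC, rc_user_text LIMIT ?",
        (namespace, limit),
    )
    return rows.fetchall()
```

Calling `top_recent_editors(conn, 828, 10)` on the toy data counts Alice's two edits and Bob's page creation and ignores Dave's log entry.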
We are working real hard to make cross-site querying easy from quarry, by pointing it to the new data we're working on. So we hope to have that out as soon as the new labs db servers have data for all projects. A quick question on this topic: how far back do you all need to go? Whole history for most things or will you get a lot of value out of one or two years, with further back just being nice to have?
Here is an experiment I did about two months ago with First Normal Form (yikes!) and GNU. I just never posted this yet.
I collected all edits from all wikis using recent full history stub dumps, with a perl script, which took some 30 hours. The total file for all Wikimedia wikis is 240 GB uncompressed, 2.94 billion lines.
Each edit yields a record with wiki, timestamp, namespace, user name, article title and then some. Now querying that file with grep, cut, sort, uniq and wc is pretty straightforward and remarkably fast (30-40 min), and rather versatile.
A) Top editors in namespace 'Module' (=828) on English Wikipedia:
grep -P "^enwiki," EditsTimestampsTitlesAll.csv | cut -d ',' -f 6,9 | grep -P "^828," | sort | uniq -c | sort -rn | head -n 500 > top500_edits_enwiki_namespace_828.txt
This should yield figures similar to Aaron's Quarry query [2], but in fact they are way higher. For example, the top editor (according to Quarry) for namespace 'Module' (=828), user 'Mehmedsons', had 1251 edits two months ago instead of Quarry's 234, which is confirmed by [3].
[1] https://stats.wikimedia.org/archive/scan_edits/edits_namespace_828_enwiki.txt
[2] https://quarry.wmflabs.org/query/17556
[3] https://en.wikipedia.org/w/index.php?title=Special:Contributions&contribs=user&target=Mehmedsons&namespace=828&tagfilter
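For readers who prefer a script to a pipeline, the grep/cut/sort/uniq chain in A) can be approximated in Python along these lines. This is a sketch, not Erik's method: the function name is hypothetical, and the field positions (1 = wiki, 6 = namespace, 9 = user name) are inferred from the cut commands in this message, not from a documented file format.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed 1-based field positions, taken from the cut commands in this thread.
WIKI_FIELD, NAMESPACE_FIELD, USER_FIELD = 1, 6, 9


def top_editors(lines: Iterable[str], wiki: str, namespace: str,
                top_n: int) -> List[Tuple[str, int]]:
    """Count edits per user for one wiki and namespace, highest count first."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(",")  # naive split, like cut -d ','
        if len(fields) < USER_FIELD:
            continue  # skip short or malformed records
        if fields[WIKI_FIELD - 1] == wiki and fields[NAMESPACE_FIELD - 1] == namespace:
            counts[fields[USER_FIELD - 1]] += 1
    return counts.most_common(top_n)
```

Passing an open file handle for EditsTimestampsTitlesAll.csv as `lines`, with "enwiki", "828" and 500, would correspond to the command above. The file is read as a stream, so the 240 GB never has to fit in memory, only the per-user counters do.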
B) similarly, but for all wikis and namespaces: total edits per wiki per namespace per user (filter as you like)
cut -d ',' -f 1,6,9 EditsTimestampsTitlesAll.csv | sort -t, -k 1,1 -k 2,2n -k 3,3 | uniq -c > edits_per_wiki_namespace_user.txt
[4] https://stats.wikimedia.org/archive/scan_edits/edits_per_wiki_namespace_user.zip
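To get from this per-wiki file to the original question (top editors in one namespace across all sites), the `uniq -c` output can be summed per user. The sketch below assumes each line looks like a padded count followed by "wiki,namespace,user", which is what the command in B) should produce; the function name is hypothetical. It also assumes that the same user name on different wikis is the same person, which may not hold for every account.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cross_wiki_top_editors(lines: Iterable[str], namespace: str,
                           top_n: int) -> List[Tuple[str, int]]:
    """Sum edit counts per user over all wikis for one namespace."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split(None, 1)  # uniq -c: padded count, whitespace, record
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        count, record = parts
        fields = record.rstrip("\n").split(",", 2)  # wiki, namespace, user
        if len(fields) != 3:
            continue
        _wiki, ns, user = fields
        if ns == namespace:
            totals[user] += int(count)
    return totals.most_common(top_n)
```

Splitting the record at most twice keeps user names that themselves contain commas or spaces in one piece.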
C) Most edited articles all over:
here are top 10, for top 10,000 see [5]
1259058 enwiki,4,Wikipedia,Administrator intervention against vandalism
955842 enwiki,4,Wikipedia,Administrators' noticeboard/Incidents
788061 enwiki,2,User,Cyde/List of candidates for speedy deletion/Subpage
654559 enwiki,4,Wikipedia,Sandbox
578992 metawiki,2,User,COIBot/LinkReports
446429 dewiki,4,Wikipedia,Vandalismusmeldung
434556 enwiki,4,Wikipedia,Requests for page protection
433781 enwiki,4,Wikipedia,Reference desk/Science
390821 commonswiki,4,Commons,Quality images candidates/candidate list
369557 enwiki,4,Wikipedia,Help desk
[5] https://stats.wikimedia.org/archive/scan_edits/top_10000_most_edited_articles.txt
Hi Erik,
My query only looks at the last 30 days. See the "(in the last 30 days)" suffix on the title :) This explains the discrepancy in our counts.
-Aaron
Ah point taken, @Aaron.
Though in my defense, this is what the title reads on my screen:
" Top 100 editors in Module namespace in English Wikipedia (in th"
I use MS Unicode Sans as my default browser font, which is a bit larger than most, with this unexpected side effect. :)