Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Ironholds' looks more vulnerable to bots; this is easier to see in small wikis (though, kudos! many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo
Great work!
One way to analyze this kind of geolinguistic aggregate further is to normalize the data, for example geographically, as demonstrated in my previous work: http://www.opensym.org/os2014-files/proceedings/p611.pdf
Anyone is welcome to normalize the data using the geolinguistic size indicators here: https://github.com/hanteng/pyGeolinguisticSize/blob/master/size_geolinguistic.tsv
Currently, it has estimates of population (LP), Internet users (IPop), economy size (PPPGDP), etc. The estimates assume an "even distribution": each country's total is split across languages according to each language's percentage share of that country's population, taken from the Unicode CLDR 25 Territory-Language Information.
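As a minimal sketch of the "even distribution" idea, assuming made-up figures (the country totals, the language shares, and the function name below are hypothetical and are not taken from CLDR or from the TSV file):

```python
# Hypothetical inputs for illustration only; real values would come from
# CLDR Territory-Language Information and a country-level indicator source.
internet_users_by_country = {"CH": 7_000_000, "DE": 70_000_000}

# Assumed share of each country's population that uses a given language.
language_share = {
    ("gsw", "CH"): 0.60,
    ("gsw", "DE"): 0.01,
}


def even_distribution_estimate(indicator_by_country, share_by_language_country):
    """Split each country's indicator across languages by population share.

    This assumes the indicator (e.g. Internet users) is spread evenly over
    the population, so a language with 60% of the people gets 60% of it.
    """
    return {
        (language, country): indicator_by_country[country] * share
        for (language, country), share in share_by_language_country.items()
    }


estimates = even_distribution_estimate(internet_users_by_country, language_share)
for (language, country), value in sorted(estimates.items()):
    print(f"{language}-{country}: {value:,.0f}")
```

The same function would apply unchanged to population or PPPGDP totals, since only the country-level indicator changes.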
A simple linear regression can reveal, say, which geolinguistic, geographic, or linguistic category has a less-than-expected or more-than-expected proportion of viewing traffic, with the expected values generated from the sizes of population, Internet population, and economy.
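To make the regression idea concrete, here is a rough sketch under stated assumptions: the category labels and all numbers are invented, a single size indicator (share of Internet population) stands in for the expected value, and ordinary least squares is written out by hand to avoid any library dependency. It is an illustration, not the method used in the paper.

```python
# Hypothetical data: size share (%) and observed pageview share (%) for
# four made-up categories. None of these numbers are real measurements.
size_share = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}
traffic_share = {"A": 5.0, "B": 10.0, "C": 25.0, "D": 20.0}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


categories = sorted(size_share)
xs = [size_share[c] for c in categories]
ys = [traffic_share[c] for c in categories]
slope, intercept = fit_line(xs, ys)

# Residual = observed minus expected. Positive means more traffic than the
# size indicator predicts; negative means less than expected.
residual_by_category = {
    c: traffic_share[c] - (slope * size_share[c] + intercept) for c in categories
}

for c in categories:
    label = "more" if residual_by_category[c] > 0 else "less"
    print(f"{c}: residual {residual_by_category[c]:+.2f} ({label} than expected)")
```

With real data one would probably use several indicators at once and perhaps a log scale, but the reading of the residuals stays the same.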
I hope this great work by Nemo can be extended to cover
(1) time-series reports and data releases
(2) aggregates of edits
Together, the tools and datasets will be a major milestone in monitoring language and project development across Wikimedia projects. Congrats!
Best, han-teng liao
2015-02-26 8:31 GMT+01:00 Federico Leva (Nemo) nemowiki@gmail.com:
Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm
Ironholds' looks more vulnerable to bots; this is easier to see in small wikis (though, kudos! many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l